bioinformatics banner
slider

Guide for General User

Demo 1 - Fetch Host Name


To use Tomato, you will need

1. An account on any HCI servers (including hci-alta.hci.utah.edu,
hci-moab.hci.utah.edu)

and

2. An valid email address (e.g. u01234@utah.edu)


For demonstration, assuming your account name in HCI is u01234, your email address is u01234@utah.edu and you are a member of lab labX

Firstly, let us start with a "hello,world" style job which actually runs on clusters.


1. log into hci-moab (or hci-alta, hci-zion) server

$ssh u01234@hci-moab.hci.uath.edu

2. Change to the Tomato's job folder

$cd /tomato/job

3. Make a subfolder as your job folder

$mkdir foo
$cd foo

4. Create a "cmd.txt" script file which has one command "hostname" which will print the name of host machine.

The cmd.txt should like this:

#e u01234@utah.edu
hostname

The header line for email address "#e EMAIL_ADDRESS" is required for all jobs. It must be included in the "cmd.txt".

5. Create an empty "b" file (which means Begin).

$touch b

6. If the "b" file disappeared and comes out a new file "log.txt", that means your job is accepted and placed in the queue. Now watch the job progress. This may take awhile (hrs to days) if the queue has a number of preceding jobs.

$tail -qF log.txt

If everything goes well, you will see a new file "stdout.txt" under your job folder (/tomato/job/foo/). Check it out:

$cat stdout.txt

em241

Here "em241" is the output of command "hostname". It means that your job is processed on "em241" which is cluster node.

The above demo shows the most basic command without any input files.


The key to create a valid Tomato job is to write a text file named "cmd.txt". In the "cmd.txt", you tell Tomato what to do. In the following examples, unless explicitely specified, the text inside a box that startswith "#e u01234@utah.edu" is the "cmd.txt".



Demo 2 - Running Alignments


Now let us do a more realistic work: align reads file to reference genome. The demo data is a piece of human whole genome sequencing on illumina HiSeq2000 platform. It has two pair-end files and stored in GNomEx under experiment "8853R". So basically you need to download the data from GNomEx to your job folder then start the job.

1. Log into GNomEx, guest access is fine.

2. On the top-left corner of the main page you can find "Lookup by Experiment or Analysis #", type "8853R" and hit Enter


Image:tomato_gnomex_1.jpg

3. On the right panel , select tab "Files"

Image:8853_1.jpg

4. Check the "8853R" box and click the 'FDT Download' button, save the file into your desktop. It should be named something like "gnomex.jnlp"

Image:8853_2.jpg

5. Log into hci-moab in a terminal window

$ssh u01234@hci-moab.hci.utah.edu

6. Change to tomato's job directory

$cd /tomato/job/

7. Make a new job directory with your name

$mkdir u01234

8. Switch to new folder

$cd u01234

9. Download the "gnomex.jnlp" from your desktop to current directory with scp, lftp or any other available methods. Alternatively open the jnlp file in a text editor and follow the fdt.jar cmd line download instructions.

[on my computer] $scp genomex.jnlp u01234@moab.hci.utah.edu:.

10. Run the fdt wrapper app on moab to download the data

$fdt gnomex.jnlp

11. If everything goes well, you will see a new subfolder named "8853R"

$ls
8853R\

12. Now change to "8853R/bioanalysis"

$cd 8853R/bioanalysis

13. It should contain three files

$ls
A_1.txt.gz  A_2.txt.gz  cmd.txt

14. Create the "b" file to submit the job to Tomato

$touch b

15. Watch job progress. Depending on where the job is in the queue this may take some time to execute.

$tail -qF log.txt

16. Check out the alignment result file (A.sam.gz) when the job is finished.

$zcat A.sam.gz

or

$less A.sam.gz

Demo 3 - Call SNPs and Indels on Alignments


This time we want to find the SNPs in our read sequence files that were already used on Demo2. Please repeat step 1 - 13 in demo2 in a new job folder.

Now, before making the "b" file and submitting the job, we need modify the "cmd.txt" file in a text editor (e.g. emacs, vi, ...)

Original:

#e u01234@utah.edu
@align -g hg19 -i *.gz

Now:

#e u01234@utah.edu
@snpindel -g hg19 -i *.gz

So what you need to do is just changing "@align" to "@snp" in "cmd.txt"

$cat cmd.txt

#e u01234@utah.edu
@snpindel -g hg19 -i *.gz


At last, create the "b" file to get it started.

$touch b


Do not forget watching the job progress once the job is submitted.

$tail -qF log.txt

When this job is finished, you will see a variance file "A.vcf" in your job folder.

$less A.vcf

In conclusion, to run a job in Tomato, basically you need do following steps:

1. Create a sub folder under /tomato/job/ as the job folder

2. Copy (or soft link) your input files to job folder

3. Write the "cmd.txt" file

4. Start your job by "$touch b"

Demo 4 - Control/Case Analysis



To do a Control/Case Analysis, including tumor/normal mutation identification, RNASeq or ChIPSeq, you need to use a a parameter "-s SAMPLE_FILE".

Supposedly you have following pair-end sequencing files to analyze:

A1_1.txt.gz        A1_2.txt.gz     A2_1.txt.gz     A2_2.txt.gz
B1_1.txt.gz     B1_2.txt.gz     B2_1.txt.gz     B2_2.txt.gz

So in total you have 4 samples. The two samples start with ‘A’ are case samples, and the other two samples start with ‘B’ are control samples. The reference genome is hg19.

Find variants in Tumor/Normal samples

Typically you want to find the variants in the tumor samples while using the normal samples as background control.

#e u01234@utah.edu
@varscan -g hg19 -i *.gz -s sample.txt

Here you can see a new parameter "-s sample.txt". the sample.txt must have this format:

control1,control2,control3,...[TAB]case1,case2,case3,...

#cat sample.txt
B1_1.txt.gz,B1_2.txt.gz,B2_1.txt.gz,B2_2.txt.gz 
A1_1.txt.gz,A1_2.txt.gz,A2_1.txt.gz A2_2.txt.gz

sample.txt has two (or one column, for variants group calling) columns, separated by a TAB. The first column lists the control samples, separated by COMMA. The second column lists the case samples, separated by COMMA as well.

So TAB is used to separated control and case samples, COMMA is used to separate samples in the same group( control or case).

Find variants in Control samples only

cmd.txt

@haplotypecaller -g hg19 -i B*.gz -s sample.txt

#cat sample.txt

B1_1.txt.gz,B1_2.txt.gz,B2_1.txt.gz,B2_2.txt.gz

This time sample.txt has only one column.


So what is the difference if you did not use "-s sample.txt" ?

@haplotypecaller -g hg19 -i B*.gz

This command will output two VCF files, because there are two samples in the B group. It works like

HaplotypeCaller.jar -I B_1.bam -o B1.vcf
HaplotypeCaller.jar -I B_2.bam -o B2.vcf


With "-s sample.txt"

@haplotypecaller -g hg19 -i B*.gz -s sample.txt

This command will output one VCF files from two samples. It works like

HaplotypeCaller.jar -I B_1.bam -I B_2.bam -o B.vcf

Demo 5 - Perform an RNA-Seq primary analysis from raw sequencing fastq data


We're going to make use of Novoalign to align our reads to a special genome index that contains all of the standard chromosomes plus all known and theoretical extended splice junctions. This approach works well for well annotated genomes. See http://useq.sourceforge.net/usageRNASeq.html for details. Once aligned we'll run the USeq RNASeq wrapper application to perform the primary RNA-Seq analysis :

[u0028003@hci-alta GeneAnnotations]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/RNASeq

**************************************************************************************
**                                   RNASeq: July 2012                              **
**************************************************************************************
The RNASeq application is a wrapper for processing RNA-Seq data through a variety of
USeq applications. It uses the DESeq package for calling significant differential
expression.  3-4 biological replicas per condition are strongly recommended. See 
http://useq.sourceforge.net/usageRNASeq.html for details constructing splice indexes,
aligning your reads, and building a proper gene (NOT transcript) table.

The pipeline:
   1) Converts raw sam alignments containing splice junction coordinates into genome
          coordinates outputting sorted bam alignemnts.
   2) Makes relative read depth coverage tracks.
   3) Scores known genes for differential exonic and intronic expression using DESeq
         and alternative splicing with a chi-square test.
   4) Identifies unannotated differentially expressed transfrags using a window
         scan and DESeq.

Use this application as a starting point in your transcriptome analysis.

Options:
-s Save directory, full path.
-t Treatment alignment file directory, full path.  Contained within should be one
       directory per biological replica, each containing one or more raw
       SAM (.gz/.zip OK) files.
-c Control alignment file directory, ditto.  
-n Data is stranded. Only analyze reads from the same strand as the annotation.
-v Genome version (e.g. H_sapiens_Feb_2009, M_musculus_Jul_2007), see UCSC FAQ,
      http://genome.ucsc.edu/FAQ/FAQreleases.
-g UCSC RefFlat or RefSeq gene table file, full path. Tab delimited, see RefSeq Genes
       http://genome.ucsc.edu/cgi-bin/hgTables, (uniqueName1 name2(optional) chrom       
       strand txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
       (commaDelimited)exonEnds). Example: ENSG00000183888 C1orf64 chr1 + 16203317
       16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 . NOTE:
       this table should contain only ONE composite transcript per gene (e.g. use
       Ensembl genes NOT transcripts). Use the MergeUCSCGeneTable app to collapse
       transcripts to genes. See the RNASeq usage guide for details.
-r Full path to R, defaults to '/usr/bin/R'. Be sure to install Ander's DESeq
       (http://www-huber.embl.de/users/anders/DESeq/) R library.

Advanced Options:
-m Combine replicas and run single replica analysis using binomial based statistics,
       defaults to DESeq and a negative binomial test.
-a Maximum alignment score. Defaults to 120, smaller numbers are more stringent.
-d Minimum FDR threshold for filtering windows, defaults to 0.5
-o Don't delete overlapping exons from the gene table.
-e Print verbose output from each application.

Example: java -Xmx2G -jar pathTo/USeq/Apps/RNASeq -v D_rerio_Dec_2008 -t 
      /Data/PolIIMut/ -c /Data/PolIIWT/ -s
      /Data/Results/MutVsWT -g /Anno/zv8Genes.ucsc 

**************************************************************************************

Detailed Steps:

  • Open a terminal/ shell and login to moab or alta.
ssh u0028003@hci-moab.hci.utah.edu
  • Change into your tomato job directory. If you don't have one create it.
mkdir /tomato/job/Nix; cd /tomato/job/Nix
  • Create a new folder to launch the RNASeq job.
mkdir 9434R; cd 9434R
  • Move your raw fastq data into this RNASeq job directory. See Demo 2 - Running Alignments, for details in how to fetch files from GNomEx. Alternatively use the scp, ftp, or mv apps. Soft linking your files also works and is probably the best option.
  • Move in an appropriate UCSC RefFlat/RefSeq gene table. It is critical that this table be properly filtered. See http://useq.sourceforge.net/usageRNASeq.html . It MUST contain composite genes, NOT transcripts. Appropriate gene tables have been created for most genome builds and are publically available from GNomEx under the Transcriptome Analysis Projects folder. When in doubt ask!
cp /home/Genomes/Mouse/Mm10/GeneAnnotations/mm10EnsGenes_JustNormal.ucsc.gz  mm10EnsGenes_JustNormal.ucsc.gz
  • Create an appropriate cmd.txt file with a text editor. The following is an example cmd.txt file for processing standard 50bp polyA random primed TruSeq library RNA-Seq Illumina fastq data to mouse mm10. Things to reconfigure:
    • Set the #e to your email address
    • Set the #a to place the data directly in an existing GNomEx Analysis. This is optional but STRONGLY recommended. You'll need to upload a file into that Analysis so that the appropriate folders are created on the server.
    • Check that the adapter sequence is correct for your library type. See on the GNomEx home screen the "View experiment protocol descriptions" for a listing of all of the adapter primers.
    • Pick an appropriate novoindex to align with. It should be matched to the length of your reads and trimmed of the xxx..nov.illumina.nix extension. See /tomato/data/ for a listing of available indexes.
[u0028003@hci-moab 9434R]$ cat cmd.txt 
#e david.nix@hci.utah.edu
#a A797

#Align with novoalign to genome plus splices, TruSeq RNASeq adapter, one random repeats
@align -novoalign [-o SAM -r All 50 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ] -g mm10EnsTransRad46bpNum100kMin10SplicesChrNormPhiXAdaptr -i *txt.gz

#Delete the fastq data to create space on the cluster node
rm *txt.gz

#Move alignments to appropriate directories
mkdir T C
mkdir T/1 T/2 T/3 C/1 C/2 C/3
mv 9434X1* C/1/
mv 9434X2* C/2/
mv 9434X3* C/3/
mv 9434X4* T/1/
mv 9434X5* T/2/
mv 9434X6* T/3/

#Launch RNASeq app
RNASeq.jar -s RNASeq_8.3.9 -t T -c C -v M_musculus_Dec_2011 -g mm10EnsGenes_JustNormal.ucsc.gz -o -e
  • Here's what the final job folder looks like for this example:
[u0028003@hci-moab 9434R]$ ls
9434X1_120824_SN141_0518_AC11KJACXX_1.txt.gz  9434X4_120824_SN141_0518_AC11KJACXX_1.txt.gz  cmd.txt
9434X2_120824_SN141_0518_AC11KJACXX_1.txt.gz  9434X5_120824_SN141_0518_AC11KJACXX_1.txt.gz  mm10EnsGenes_JustNormal.ucsc.gz
9434X3_120824_SN141_0518_AC11KJACXX_1.txt.gz  9434X6_120824_SN141_0518_AC11KJACXX_1.txt.gz
  • Launch the tomato job, this b file should disappear and you should get an email indicating that the job was accepted.
touch b

Demo 6 - Perform a ChIP-Seq primary analysis from raw sequencing fastq data


We're going to make use of Novoalign to align our reads to a standard genome index. Once aligned we'll run the USeq ChIPSeq wrapper application to perform the primary ChIP-Seq analysis.

[u0028003@hci-alta ~]$ java -jar ~/AppsUSeq/ChIPSeq 

**************************************************************************************
**                                  ChIPSeq: April 2012                             **
**************************************************************************************
The ChIPSeq application is a wrapper for processing ChIP-Seq data through a variety of
USeq applications. It:
   1) Parses raw alignments (sam, eland, bed, or novoalign) into binary PointData
   2) Filters PointData for duplicate alignments
   3) Makes relative ReadCoverage tracks from the PointData (reads per million mapped)
   4) Runs the PeakShiftFinder to estimate the peak shift and optimal window size
   5) Runs the MultipleReplicaScanSeqs to window scan the genome generating enrichment
        tracks using DESeq's negative binomial pvalues and B&H's FDRs
   6) Runs the EnrichedRegionMaker to identify likely chIP peaks (FDR < 1%, >2x).

Note, given R's poor memory management, this app requires 64bit R and >6-8G RAM.

Options:
-s Save directory, full path.
-t Treatment alignment file directories, full path, comma delimited, no spaces, one
       for each biological replica. These should each contain one or more text
       alignment files (gz/zip OK) for a particular replica. Alternatively, provide
       one directory that contains multiple alignment file directories.
-c Control alignment file directories, ditto. 
-y Type of alignments, either novoalign, sam, bed, or eland (sorted or export).
-v Genome version (e.g. H_sapiens_Feb_2009, M_musculus_Jul_2007), see UCSC FAQ,
      http://genome.ucsc.edu/FAQ/FAQreleases.
-r Full path to 64bit R, defaults to '/usr/bin/R'. Be sure to install Ander's DESeq
       (http://www-huber.embl.de/users/anders/DESeq/) R library.

Advanced Options:
-m Combine any replicas and run single replica analysis (ScanSeqs), defaults to
      using DESeq.
-d Minimum FDR threshold for filtering windows, defaults to 0.9
-a Maximum alignment score. Defaults to 60, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
      This is a phred-scaled posterior probability that the mapping position of read
      is incorrect. Set to 0 for RNASeq data.
-p Peak shift, defaults to the PeakShiftFinder peak shift or 150bp. Set to 0 for
      RNASeq data.
-w Window size, defaults to the PeakShiftFinder peak shift + stnd dev or 250bp.
-i Minimum number reads in window, defaults to 10.
-u Convert bar graph folders to xxx.useq format.
-f Filter bed file (tab delimited: chr start stop) to use in excluding intersecting
      windows while making peaks, e.g. satelliteRepeats.bed .
-g Print verbose output from each application.
-e Don't look for reduced regions.
-b Bypass DESeq's variance outlier filtering. Recommended for first pass.

Example: java -Xmx2G -jar pathTo/USeq/Apps/ChIPSeq -y eland -v D_rerio_Dec_2008 -t 
      /Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/PolIINRep1/,/Data/PolIINRep2/ -s
      /Data/Results/WtVsNull -f /Anno/satelliteRepeats.bed

**************************************************************************************

Detailed Steps:

  • Open a terminal/ shell and login to moab or alta.
ssh u0028003@hci-moab.hci.utah.edu
  • Change into your tomato job directory. If you don't have one create it.
mkdir /tomato/job/Nix; cd /tomato/job/Nix
  • Create a new folder to launch the ChIPSeq job.
mkdir 9715R; cd 9715R
  • Move your raw fastq data into this job directory. See Demo 2 - Running Alignments, for details in how to fetch files from GNomEx. Alternatively use the scp, ftp, or mv apps. Soft linking your files also works and is probably the best option.
  • Create an appropriate cmd.txt file with a text editor. The following is an example cmd.txt file for processing standard 50bp TruSeq library ChIP-Seq Illumina fastq data to fly dm3. This contains four chIP replicas and four input replicas. Things to reconfigure:
    • Set the #e to your email address
    • Set the #a to place the data directly in an existing GNomEx Analysis. This is optional but STRONGLY recommended. You'll need to upload a file into the new GNomEx Analysis so that the appropriate folder is created on the server.
    • Check that the adapter sequence is correct for your library type. See on the GNomEx home screen the "View experiment protocol descriptions" for a listing of all of the adapter primers.
    • Pick an appropriate novoindex to align with. See /tomato/data/ for a listing of available indexes.
[u0028003@hci-alta 9715R]$ cat cmd.txt 
#e david.nix@hci.utah.edu
#a A1343

# Align fastq data with novoalign, no repeat matches, trim off adapters and poor quality bases
@align -novoalign [-r None -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -H -k] -g dm3PhiXAdaptr -i *.gz

# Remove fastq data to save space on the ember node
rm *txt.gz

# Make directories and move raw sam files from novoalign into them
mkdir T C
mkdir T/1 T/2 T/3 T/4 C/1 C/2 C/3 C/4
mv 9715X1_* T/1
mv 9715X2_* T/2
mv 9715X3_* T/3
mv 9715X4_* T/4
mv 9715X5_* C/1
mv 9715X6_* C/2
mv 9715X7_* C/3
mv 9715X8_* C/4

# Run USeq ChIPSeq app with and without merging replicas. Note this will only look for enriched regions and is appropriate for a ChIP vs Input comparision. For a ChIP1 vs ChIP2 comparison drop the -e option. 
ChIPSeq.jar -s ChIPSeq -t T -c C -y SAM -v D_melanogaster_Apr_2006 -r /uufs/chpc.utah.edu/common/home/hcibcore/tomato/app/R/lib64/R/bin/R -d 1 -u -e -b
ChIPSeq.jar -s ChIPSeq -t T -c C -y SAM -v D_melanogaster_Apr_2006 -r /uufs/chpc.utah.edu/common/home/hcibcore/tomato/app/R/lib64/R/bin/R -d 1 -u -e -b -m
  • Here's what the final job folder looks like for this example:
[u0028003@hci-alta 9715R]$ ls
9715X1_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz  9715X5_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz  cmd.txt
9715X2_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz  9715X6_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz  
9715X3_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz  9715X7_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz
9715X4_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz  9715X8_121221_SN1117_0137_BC1FV5ACXX_4.txt.gz
  • Launch the tomato job by creating a b file using the touch command. This b file should disappear and you should get an email indicating that the job was accepted.
touch b

Pipelines


Tomato is a pipeline-central framework which means you should use the predefined pipelines to for various tasks. Generally speaking, a predefined pipeline has only one word, and start with a special character @. In addition to the pipeline name, you will need two additional parameter: "-g REFERENCE_GENOME" (for example, "-g hg19") and "-i INPUTS" (for example, "-i *.fq.gz")

Here is a list of default pipelines (as of 03/01/2013).

@align

Example: @align -g hg19 -i *.gz
Description: This pipeline will use novoalign (default aligner) to align your raw sequencing data to a reference genome.

@recal

Example: @recal -g hg19 -i *.gz
Description: This pipeline will do alignment (use novoalign), remove duplicates, realign around known indels and recalibrate the Base Quality Score of alignments. 

Output is a clean BAM file, ready to call variants.

@snpindel

Example: @snpindel -g hg19 -i *.gz
Description: This pipeline will do 
@recal, then use GATK's 
HaplotypeCaller to call SNPs and Indels.

@annovar

Example: @annovar -g hg19 -i *.gz
Description: This pipeline will do 
@snpindel, then use 
ANNOVAR to annotate variants.

@vaast

Example: @snpindel -g hg19 -i *.gz -s [SAMPLE.txt]
Description: This pipeline will do 
@snpindel, then use 
VAAST to annotate variants in control/case analysis.

@varscan

Example: @varscan -g hg19 -i *.gz -s [SAMPLE.txt]
Description: This pipeline will do 
@snpindel, then use 
VarScan to annotate variants in control/case analysis.

@chipseq

Example: @chipseq -g hg19 -i *.gz
Description: This pipeline will do 
@align, then use USeq's 
ChIPSeq.

@rnaseq

Example: @snpindel -g hg19 -i *.gz
Description: This pipeline will do 
@align, then use USeq's 
RNASeq.