
Exome Variant Detection Analysis with TomatoFarmer

  • 1 Introduction
  • 2 Dataset Description
  • 3 Caveats
  • 4 Demo
    • 4.1 Alignment, QC and Variant Calling
      • 4.1.1 Running the pipeline
      • 4.1.2 Alignment Output
      • 4.1.3 QC Metrics Output
      • 4.1.4 Variant Calling Output
    • 4.2 Intersection
    • 4.3 Annotation
    • 4.4 Report

Introduction

This page details a demonstration of using the Bioinformatics Shared Resources tools to analyze a human exome disease study. The goal of the demo is to introduce moab/alta users to the available analysis tools. All of the files discussed in the demo can be downloaded from GNomEx: A1546. All alignment and variant files discussed in this demo can be streamed into IGV. See the data visualization wiki page for details.

Dataset Description

The samples used in this demo were originally analyzed in the publication: NMNAT1 mutations cause Leber congenital amaurosis (pubmed). Five members of a family were sequenced: two siblings with Leber congenital amaurosis (LCA), one unaffected sibling and their unaffected parents. The researchers decided to perform whole-exome sequencing on the family to try to identify the cause of the disease because none of the 17 known LCA-associated genes contained mutations.

Exomes were captured using the SureSelect 50Mb All Exon Targeted Enrichment kit. Libraries were sequenced on an Illumina HiSeq 2000 using the 2 x 101 bp paired-end format. One sample was sequenced per lane, with the goal of achieving > 10x coverage in 92-95% of the targeted regions.

  • SRR504516 - affected sister
  • SRR776598 - affected brother
  • SRR504483 - unaffected brother
  • SRR504515 - unaffected mother
  • SRR504517 - unaffected father

Caveats

  • This demo assumes you are running your analysis on the Moab/Alta servers. If you are running the USeq applications somewhere else, you will need to specify your tabix and annovar database locations as command-line arguments. Please read the USeq help menu before running the software (run with -h or no arguments to get help).
  • TomatoFarmer is currently only compatible with the ember server. In the future, we may add support for the kingspeak server.

Demo

Alignment, QC and Variant Calling

Running the pipeline

Alignment, postprocessing, QC metric generation and variant calling are run using the tomato framework and controlled by USeq's TomatoFarmer Application. Simply run TomatoFarmer in a directory containing a study's fastq files and you should have variant calls in a few days. TomatoFarmer also supports running individual components of the pipeline separately: 1) alignment/recalibration 2) QC metrics and 3) variant calling. Run TomatoFarmer with the -h argument to see a full list of pipelines.

  1. Log into alta or moab using ssh.
    1. If on a Mac, open the Terminal program by clicking on the magnifying glass in the upper right and typing terminal. In the terminal window, type the command ssh uID@servername, where servername is either hci-alta.hci.utah.edu or hci-moab.hci.utah.edu and uID is your standard UofU identifier.

ssh u0000000@hci-alta.hci.utah.edu

When prompted, type in your password and press enter.

Password:

    2. If on a PC, use the PuTTY application. Put either hci-alta.hci.utah.edu or hci-moab.hci.utah.edu into the Host Name field and press the Open button. When you get the Login as prompt, type in your uID.

Login as: u000000

When prompted, type in your password and press enter.

Password:

  2. Navigate to your tomato submission directory: /tomato/version/jobs/USER_DIRECTORY

cd /tomato/version/jobs/u000000

If you don't have a submission directory, contact the core to get one.

  3. Create a new directory for your analysis job.

mkdir demo

  4. Move to the newly created directory.

cd demo

  5. Copy/Upload fastq files (A1546) to the job directory. See this page for a guide on downloading files on the cmd line with fdt.jar. The fastq files must be named in a specific way. First, the read1 fastq file name must end in _1.*.gz and the read2 fastq file name must end in _2.*.gz, where '*' can be any character string without periods. The fastq file name prior to _1/2.*.gz must contain the sample name and the flowcell name separated by an underscore. The flowcell name can have internal underscores; the sample name cannot. If your file does not match this format, an error will be thrown. The sample and flowcell names are important if you select a pipeline that uses GATK recalibration methods, since de-duplication, recalibration and realignment are done at both the flowcell and sample levels. The sample and flowcell names will be added to the bam RG tag regardless of the pipeline you choose.
  6. Run TomatoFarmer in the job directory. There are four required arguments: -d (path to the job directory), -e (email address), -y (analysis pipeline) and -p (properties file). The optional arguments are described below. We currently suggest running the exome_best pipeline. The settings in this pipeline were determined by calling variants on NA12878 using a wide variety of applications and parameters and comparing the resulting calls to the NIST NA12878 highly confident genotype calls. The exome_best settings resulted in a 0.941 TPR at 0.055 FDR for SNPs and 0.832 TPR at 0.076 FDR for InDels.
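The naming rules above can be sanity-checked before launching a run with a short bash pattern match. This is a sketch only (the function name and example file names are hypothetical, not part of TomatoFarmer):

```shell
# Check a fastq file name against the convention described above:
# SAMPLE_FLOWCELL_1.*.gz or SAMPLE_FLOWCELL_2.*.gz, where SAMPLE has no
# underscores, FLOWCELL may contain underscores and '*' has no periods.
check_fastq_name() {
    if [[ "$1" =~ ^[^_]+_.+_[12]\.[^.]+\.gz$ ]]; then
        echo "OK: $1"
    else
        echo "BAD: $1"
        return 1
    fi
}

check_fastq_name "9484X1_C0PEDACXX_1.txt.gz"     # sample 9484X1, flowcell C0PEDACXX
check_fastq_name "SRR504516_1.fastq.gz" || true  # rejected: no flowcell name
```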

java -Xmx10g -jar /home/BioApps/USeq/Apps/TomatoFarmer \
    -d /tomato/version/jobs/u000000/demo \
    -e tim.mosbruger@hci.utah.edu \
    -y exome_best \
    -t AgilentAllExon50MB \
    -p /home/BioApps/tomatoFarmer_testing/configFiles/tomatoFarmerConfigEmber.txt \
    -s DEMO

  • TomatoFarmer will run for a long time, so launch the command with screen or nohup so the run isn't interrupted if your connection drops.
    • nohup
      1. Prefix the TomatoFarmer command with nohup and suffix it with &.

nohup java -Xmx10g -jar /home/BioApps/USeq/Apps/TomatoFarmer \
    -d /tomato/version/jobs/u000000/demo \
    -e tim.mosbruger@hci.utah.edu \
    -y exome_best \
    -t AgilentAllExon50MB \
    -p /home/BioApps/tomatoFarmer_testing/configFiles/tomatoFarmerConfigEmber.txt \
    -s DEMO &> tf.log &

      2. Monitor the run log by opening tf.log

tail -n 1000 -F tf.log

Type control-C to exit tail.

      3. If you need to stop the run and aren't familiar with killing background processes, contact the core.
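For those comfortable on the command line, the basic pattern for stopping a backgrounded run is to send the process a TERM signal so the program can shut down cleanly. The sketch below uses a stand-in sleep process rather than the real java command; with nohup you would instead find the PID with something like pgrep -f TomatoFarmer:

```shell
# Stand-in for the nohup'd TomatoFarmer job; $! captures its PID.
sleep 300 &
pid=$!

kill "$pid"                       # SIGTERM, so the program can exit gracefully
wait "$pid" 2>/dev/null || true   # reap the terminated process
```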
  • screen
    1. Create new screen session using the command screen -S NAME, where NAME is an alphanumeric string.

screen -S DEMO

  2. Run your TomatoFarmer command.
  3. Detach from the screen by pressing ctrl-a followed by d. You should see the message [detached] at the bottom of your screen.
  4. When you want to re-enter your screen, use the command screen -r NAME, where NAME is the string used at creation.

screen -r DEMO

  5. If you forget the name of your screen, use the command screen -ls to see all your available screens.
  6. When the command is finished, you can exit the screen using the exit command. You should see the message [screen is terminating] at the bottom of the screen.
  7. Make sure you detach from the screen before moving on to other tasks. It's easy to find yourself in a screen within a screen within a screen, which will make you feel lost, confused and scared. Feel free to contact the core if you have questions/problems.
  8. TomatoFarmer writes log information to the screen and to a log file in the job directory.

java -Xmx10g -jar /home/BioApps/USeq/Apps/TomatoFarmer \
    -d /tomato/version/jobs/u0000000/demo \
    -e tim.mosbruger@hci.utah.edu \
    -y exome_best \
    -p /home/BioApps/tomatoFarmer_testing/configFiles/tomatoFarmerConfigEmber.txt \
    -t AgilentAllExon50MB \
    -s DEMO

    • TomatoFarmer has several optional arguments. This wiki page covers the most important ones; run TomatoFarmer with the -h option to see the full list.
      • -t: Target regions. Use this option to specify the target capture kit used in the analysis. Variants and per-base coverage metrics will only be called within the targeted regions. TomatoFarmer has support for several capture kits; run TomatoFarmer with the -h option to see the list of keywords. You can also use this option to specify your own regions in bed format, which is useful if you are analyzing old data or only care about a small subset of genes. Note: use b37 chromosome naming conventions (e.g. 1,2,3,…). If a capture kit you think should be supported is missing, contact the core.
      • -s: Name of your study. Many of the output files and metrics data are prefixed with the study name. If you don't specify your own, 'STUDY' will be used.
      • -n: Run variant detection on the full genome instead of splitting it into chunks. This option is more efficient if you have a small capture region.
      • -f: Manually set the failure rate. If this option is set, you will be prompted to override the default number of allowed failures for each analysis step before the analysis begins. If, for example, exome_align is set to 3, everything is shut down once four failures are reached across all spawned threads. Consider changing the defaults if you're running lots of samples. Note that outright tomato failures (the job never actually starts) don't count against the cap. Must be between 1 and 20.
      • -j: Number of jobs at a time. Don't abuse this, big brother is watching you. Default: 5 jobs
      • -g: Spike 200 1K Genomes samples into the variant calling step. There are pros and cons to using the spike-ins. Pros: slightly improved SNP calling and improved VAAST calls. Cons: worse InDel calling and much longer variant calling. This option is currently limited to members of the core facility.
      • -x: Unsuppress tomato emails. By default, you won't get tomato emails for every single step of the analysis. Use this flag if you wish to receive those emails.
    • TomatoFarmer exits gracefully: if you stop the program using ctrl-c or it hits too many errors, it halts existing tomato jobs before exiting.
  7. Once the run finishes, you should receive an email. If the run failed, check the logs to determine the problem. If the final error message is prefixed with ERROR (internal), please contact the core. If the final error message is prefixed with ERROR (user), fix the problem and resubmit the job.

Alignment Output

  • A successful alignment run will generate a directory named Alignments. Within this directory, there will be a ProcessedAlignments directory containing bam files for each sample and a Jobs directory containing the tomato logs.
  • Job directories will be in the format JOB_SAMPLE_align, where SAMPLE is the name of your sample. These directories will contain the tomato cmd.txt, log.txt, stderr.txt and stdout.txt files, which can be used to help troubleshoot problems. If the align job failed and was resubmitted by TomatoFarmer, the previous tomato log.txt, stderr.txt and stdout.txt files will be stored in the sub-directory logs_TIMESTAMP. If the run was successful, these directories can safely be deleted.
  • The ProcessedAlignments directory contains alignments in bam format and their indexes for each sample in the project.
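Once you have verified a run, clearing out the Jobs logs while keeping the processed bams is a one-line delete. The sketch below rehearses that cleanup on a throwaway directory tree mirroring the layout above (temporary directory and sample name are stand-ins for your real job directory):

```shell
# Build a toy copy of the Alignments layout described above.
job=$(mktemp -d)
mkdir -p "$job/Alignments/Jobs/JOB_SRR504516_align" \
         "$job/Alignments/ProcessedAlignments"
touch "$job/Alignments/ProcessedAlignments/SRR504516.bam"

# Remove only the tomato log directories; the processed bams survive.
rm -r "$job/Alignments/Jobs"
```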

QC Metrics Output

  • A successful metrics run will generate a directory named Metrics, containing a metrics spreadsheet and job directories for each sample.
  • Job directories will be in the format JOB_SAMPLE_metrics, where SAMPLE is the name of your sample. These directories will contain the tomato cmd.txt, log.txt, stderr.txt and stdout.txt files, which can be used to help troubleshoot problems. If the metrics job failed and was resubmitted by TomatoFarmer, the previous tomato log.txt, stderr.txt and stdout.txt files will be stored in the sub-directory logs_TIMESTAMP. If the run was successful, the job directories can safely be deleted.
  • The metrics spreadsheet contains information for all samples in the study. The images sub-directory contains the PNG images used in the Excel worksheet.
  • Demo metrics: metrics

Interpretation:

  • Table 1 shows the alignment percentage to b37, phiX and adapter sequences. Low b37 alignment percentages or high adapter alignment percentages should be red flags.

  • Figure 1 shows the error rate across the phiX reads. It isn't uncommon to see the error rate increase slightly across the length of the read, or an occasional one-cycle spike in the error rate. If the error rate is very high across the entire read or increases drastically across the length of the read, you might consider dropping the sample.

  • Table 2 shows the number of duplicates in the sample. The duplicate levels in this project vary from 18% to 91%, which is a huge swing. High levels of duplication are often indicative of a problem with library prep and will drastically reduce coverage (see the coverage plot for SRR504515).


  • Table 3 shows alignment statistics after processing and duplicate removal. Ideally, most reads will be in pairs, the strand balance will be equal and the error rate should be low.

  • Figure 2 shows the insert size distribution. There should be a single peak between 200 and 500 bp. If the insert size is too small, reads will overlap and coverage will drop. GATK is aware of overlapping reads and handles them appropriately when calling variants.
  • Figure 3 shows the coverage across the captured regions. The captured regions are determined by the -t option. If -t is not specified, CCDS exons are used.



Variant Calling Output

  • A successful variant calling run will generate a directory named Variants, containing variant calls and a Jobs directory.
  • The Jobs directory will contain a subdirectory for each variant calling chunk, starting from 0. (If you selected calling by chromosome, the directories will be named by chromosome instead.) Each directory will contain the tomato cmd.txt, log.txt, stderr.txt and stdout.txt files, which can be used to help troubleshoot problems. If the variant job failed and was resubmitted by TomatoFarmer, the previous tomato log.txt, stderr.txt and stdout.txt files will be stored in the sub-directory logs_TIMESTAMP. If the run was successful, the job directories can safely be deleted.
  • The Variants directory contains one or more unannotated multi-sample VCF files. (If you used the HCI best-practices pipeline, there will be only one VCF file; if you used the GATK best-practices pipeline, there will be three.)
    • *.raw.vcf/idx - All called variants, no filtering. (Generated using Core best practice and GATK best practice runs)
    • *.filterFieldSetAll.vcf/idx - All called variants. Filtered variants are flagged, but not removed. (Only generated in GATK best practice runs)
    • *.filterFieldSetPassing.vcf/idx - All passing variants. Filtered variants are removed. (Only generated in GATK best practice runs)

Notes:

  • If you opt to use the GATK variant calling pipelines, use exome_bwa_raw if you have fewer than 30 samples and exome_bwa_vqsr if you have more than 30 samples.

Intersection

138,504 variants are called using the exome_best pipeline. To reduce the number of candidate variants, we are going to run a sample intersection. The researchers hypothesized that both affected individuals are homozygous for the deleterious variant, both parents are heterozygous and the unaffected brother is either heterozygous or homozygous for the common allele. We will use our intersection software to restrict the variant list to variants that match these criteria.

java -Xmx20g -jar /home/BioApps/USeq/Apps/MultiSampleVCFFilter \
    -v DEMO.raw.vcf.gz \
    -p DEMO.intersection.vcf.gz \
    -b \
    -n SRR504516,SRR776598,SRR504515,SRR504517,SRR504483 \
    -u 2,2,1 \
    -l M,H,-M -e

The path to the input file is specified with -v and the path to the output file with -p. Since we want to filter based on genotypes, we specify the -b argument. We then list the sample names (-n) used in the intersection, ordered by group: SRR504516 and SRR776598 make up the affected group, SRR504515 and SRR504517 make up the carrier group and SRR504483 makes up the non-homozygous-mutant group. Next, we specify the number of samples in each group, -u 2,2,1. This means the first two samples are in group1 (affected), the next two samples are in group2 (carriers) and the final sample is in group3. Finally, we specify the group requirements with the -l argument. Each requirement must be one of:

  • W : genotype must be homozygous common
  • H : genotype must be heterozygous
  • M : genotype must be homozygous rare
  • -W : genotype must NOT be homozygous common
  • -H : genotype must NOT be heterozygous
  • -M : genotype must NOT be homozygous rare

The argument -l M,H,-M means that group1 must be homozygous for the rare allele, group2 must be heterozygous and group3 must be heterozygous or homozygous common. The argument -e forces strict genotypes, meaning variants that are no-called in one or more samples won't be reported.
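The W/H/M codes correspond to VCF genotype (GT) strings. A toy mapping for illustration only (this is not the USeq implementation):

```shell
# Map a VCF GT string onto the -l requirement codes.
gt_code() {
    case "$1" in
        0/0)     echo W ;;   # homozygous common
        0/1|1/0) echo H ;;   # heterozygous
        1/1)     echo M ;;   # homozygous rare
        *)       echo . ;;   # no-call; excluded when -e (strict) is set
    esac
}

gt_code 1/1   # what the affected siblings must show (M)
gt_code 0/1   # what the carrier parents must show (H)
```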

Note that you can specify group requirements for each sample individually, if you find this easier:

java -Xmx20g -jar /home/BioApps/USeq/Apps/MultiSampleVCFFilter \
    -v DEMO.raw.vcf.gz \
    -p DEMO.intersection.vcf.gz \
    -b \
    -n SRR504516,SRR776598,SRR504515,SRR504517,SRR504483 \
    -u 1,1,1,1,1 \
    -l M,M,H,H,-M -e


MultiSampleVCFFilter can also filter variants on read depth, genotype quality and record quality. If any combination of these quality filters is set along with the genotype filter, samples that fail the quality filters aren't required to pass genotype filtering. However, at least one sample per group must pass the quality filters for the variant to be reported. In practice this means that if all the high-quality genotype calls match the expected genotypes and one poor-quality genotype call does not, the variant won't be filtered out.

After intersection, 695 variants remain.
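You can confirm counts like these yourself by counting non-header lines; for a bgzipped VCF, decompress on the fly first (e.g. zcat DEMO.intersection.vcf.gz | grep -vc '^#'). A self-contained illustration on a tiny hand-made VCF:

```shell
# Two header lines and two records; grep -vc counts lines NOT starting with '#'.
tiny=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n1\t200\n' > "$tiny"
grep -vc '^#' "$tiny"   # prints 2
```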

Notes

  • While strict genotype requirements work well in this demo, incorrect genotype assignment is not an uncommon problem with variant callers. Variant sites are not often missed, but precise genotype calls can be elusive. If you are looking for a homozygous rare variant, it might be worth filtering on both homozygous rare and heterozygous if the restrictive intersection doesn't yield results.
  • The input/output vcf files can be uncompressed (vcf) or bgzip compressed (vcf.gz). The application handles compression internally based on your specification. We recommend using the compressed versions because it saves space and compression is required for viewing in IGV.

Annotation

The next step is adding annotations to the VCF file. We are going to use the USeq application VCFAnnotator, which is essentially a wrapper around the Annovar annotation package.

java -Xmx20g -jar /home/BioApps/USeq/Apps/VCFAnnotator -v DEMO.intersection.vcf.gz -o DEMO.annotation.vcf.gz

The path to the input file is specified with -v, the path to the output file with -o. You can specify the suite of annotations you want to add using the -a argument; running the application without this argument adds all available annotations. Available annotations:

  • ENSEMBL - Ensembl gene annotations
  • REFSEQ - RefSeq gene names
  • TFBS - Transcription factor binding sites
  • SEGDUP - Segmental duplications
  • DGV - Database of genomic variants
  • SIFT - SIFT scores
  • PP2_HVAR - Polyphen2 HVAR score
  • PP2_HVAR_P - Polyphen2 HVAR prediction
  • PP2_HDIV - Polyphen2 HDIV score
  • PP2_HDIV_P - Polyphen2 HDIV prediction
  • LRT - LRT scores
  • LRT_P - LRT prediction
  • MT - Mutation taster scores
  • MT_P - Mutation taster prediction
  • MA - Mutation assessor
  • MA_P - Mutation assessor prediction
  • FATHMM - Fathmm score
  • PHYLOP - Phylop conservation score
  • SIPHY - SiPhy conservation score
  • GERP - Gerp++ conservation score
  • GWAS - GWAS catalog annotations
  • DBSNP - DBSNP annotations (snp137 unflagged by default)
  • ONEK - 1K Genomes frequency
  • COSMIC - COSMIC database annotations
  • ESP - ESP database frequency
  • V-FLAG - Flagged VAAST genes
  • ACMG - ACMG genes
  • NIST - NIST callable regions

Notes

  • Annovar and GATK annotations are both added to the INFO field, which can get very large. If you want to view the Annovar annotations when hovering over the variant in IGV, restrict the INFO field annotations to 'Annovar only' using VCFReporter.
  • If you ran VAAST prior to annotation, you can add the VAAST information to your annotation using the -n option.
  • The input/output vcf files can be uncompressed (vcf) or bgzip compressed (vcf.gz). The application handles compression internally based on your specification. We recommend using the compressed versions because it saves space and compression is required for viewing in IGV.

Report

The final step is creating a tab-delimited report from the annotated VCF file using VCFReporter. This report can be opened in excel for filtering and sorting.

java -Xmx10g -jar /home/BioApps/USeq/Apps/VCFReporter -v DEMO.annotation.vcf.gz -o DEMO.report.txt -a -k -x


The path to the input file is specified with -v, the path to the output file with -o. The output file can be tab-delimited text or a new VCF file, depending on the file name you specify. The -d argument lets you select which columns to output; -u lets you drop unwanted columns. If neither -d nor -u is specified, all annotations will be reported. If you are only interested in Annovar annotations, use the -a argument. If you only want to see damaging variants (nonsynonymous, frameshift or splicing), use the -x argument. If you want to generate a column key for your tab-delimited output, use the -k option.
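Because the report is plain tab-delimited text, it can also be filtered on the command line instead of in Excel. The column name below (DBSNP) and the toy rows are assumptions for illustration; check the -k column key from your own run for the real names:

```shell
# Toy report with a hypothetical DBSNP column standing in for the real output.
report=$(mktemp)
printf 'CHROM\tPOS\tGENE\tDBSNP\n1\t100\tNMNAT1\t\n2\t200\tGENE2\trs123\n' > "$report"

# Locate the DBSNP column by header name, then print the header plus rows
# where it is empty -- i.e. candidate 'rare' variants not in dbSNP.
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="DBSNP") c=i; print; next}
            $c=="" {print}' "$report"
```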

We specified -a, -k and -x, so the report contains only damaging variants with only Annovar annotations, along with a column key. This reduces our list to 116 variants in 86 genes. Three of the variants in our list aren't found in dbSNP, ESP or 1K Genomes and could be considered 'rare' variants: 2 SNPs and 1 InDel. Of the 2 SNPs, only the NMNAT1 variant is considered damaging by SIFT, PolyPhen2 HVAR, Mutation Taster and Mutation Assessor. The researchers confirmed that this variant was the cause of LCA in the affected individuals, and NMNAT1 is now considered associated with the disease (see the OMIM column).

We generally suggest using the NIST callable regions flag to prioritize variants. In this case, however, the causal variant fell within a region of the genome that has known CNVs or SVs and isn't included in the NIST callable regions.

GNomEx

TomatoFarmer currently does not have the ability to upload results directly to GNomEx once the analysis is complete. You will need to post the results to GNomEx yourself, or contact your friendly neighborhood analyst for assistance. If you want to do it yourself, please see the data transfer guide and the GNomEx upload guide for details.