bioinformatics banner
slider

Frequently Asked Questions

GNomEx

How do I link directly to GNomEx datasets?

Look up the Analysis, Experiment, or DataTrack number and modify the examples:

Other questions

Many Frequently Asked Questions about GNomEx are answered here: http://hci-scrum.hci.utah.edu/gnomexdoc/?page_id=94

Analysis

Sequencing

Microarray

What is P-value Adjustment?

P value adjustment is an essential part of next-gen sequencing or microarray analysis. Raw or unadjusted p values are calculated for each individual gene in a data set using a test statistic, for example a T test. Since there are 30 - 40,000 different genes in a typical gene expression experiment, you are doing 30 - 40,000 independent statistical tests on the data. Even at a reasonable p value of 0.05 or 0.01 you should expect many many false positive results. P value adjustment using a method such as the Benjamini and Hochberg method or Q value method will calculate a false discovery rate for the whole experiment from the unadjusted p values of individual genes.

Which expression microarray gene selection method should I use?

A comparison of three methods used to select differentially expressed genes from a microarray experiment: SAM, RankProd, LIMMA.

Cody Olsen, Huntsman Cancer Institute, University of Utah, July 2006

You have some microarray data (or you are planning on getting some) and you want to know how to find those interesting genes—the ones that are consistently differentially expressed in your experiment. Not only must you decide how to clean and process the raw data and do background correction and normalization, you must also choose a method with which to analyze your data. Once the data is analyzed, you can compile a list of possibly differentially expressed genes. Among common methods of identifying differentially expressed genes in microarray experiments are: Significance Analysis of Microarrays (SAM), linear models (such as limma in R), and non-parametric methods such as Rank Product Analysis (RankProd in R). Hopefully this quick comparison and overview will aid you in making a decision.


Data from Ed Levine’s lab at the Moran Eye Center at the University of Utah was used to compare these three methods. Three 2-color Agilent microarrays were treated with unique pooled samples of P0Het on the Cy3 channel, and P0Null on the Cy5 channel. Microarrays were processed in the University of Utah Microarray Shared Resource Lab. A list of genes which were differentially expressed between the two classes P0Het-P0Null was of interest.


The three methods were applied to the same set of normalized data, and three lists of 300 genes were obtained. Two important factors used to estimate whether a gene is differentially expressed are the magnitude of the difference between groups and the variability of the gene. Since each of the three methods estimate and deal with these quantities differently, we will not get the same list of differentially expressed genes from all three methods.


Linear Models

The R package “limma” is used to create and test linear models for microarray data. Limma uses a moderated t-statistic to test the average difference in log expression levels between the two groups for each gene. The moderated t-statistic is the average log ratio divided by a standard error which is calculated using information from the replicates of the given gene and information from across all genes. Once all possible tests have been done, a variety of multiple comparison procedures are available to control for the false discovery rate of the experiment.(Smythe, 2004)


The t-statistic used in limma assures that the final list of genes includes genes that are consistently different between groups. Limma will choose a gene that is moderately different and consistent, before it will choose a gene that is extremely different each sample, but whose expression is highly variable. A profile plot of the top 300 genes selected by limma is shown in the figure below.


Profile of LIMMA's Top 300 Genes

Profile of Lima Top 300


Some of the most extremely expressed genes are not selected by limma, while genes who are moderately over or under expressed but which have little variability across sample are chosen.


Why should I use a linear model?

Linear models are powerful and can deal with complicated experimental designs in an elegant way. Linear models are a standard statistical tool, and one can get a traditional t-statistic from limma in addition to the moderated one the authors of the package suggest using. There are also other statistics that can be calculated including Bayesian estimators. Of course, certain assumptions are made when fitting a linear model, which are not met in the microarray case, but similar assumptions are made for the other two methods as well. Limma is a package in R’s Bioconductor which is free open source software.


SAM

Significance Analysis of Microarrays uses a test statistic similar to that used by limma. SAM uses a different method to calculate the variability of a gene using information from the specific gene as well information from other genes. Although the test statistic is similar to that used by limma, SAM will choose a slightly different set of genes. The figure below shows a profile plot of genes chosen by SAM.


Profile of SAM's Top 300 Genes

SAM Top 300

In this experiment, SAM didn’t choose any genes which were under-expressed. They seem to have been too variable. SAM seems to value consistency more than limma does, and will choose genes with a smaller difference in expression as long as that difference is consistent across samples.

Why should I use SAM?

SAM is free for academic users and can be easily used in Excel or R. The genes selected by SAM will be very consistent, if not greatly over or under expressed. Like limma, SAM uses a t-test and one can expect a list similar to one obtained from limma, although p-values (SAM uses a “q-value”) will differ.


RankProd

The RankProd package in R can be used to analyze microarray data with a rank-based nonparametric method which may be able to identify differentially expressed genes not identified by the other two methods.


The RankProd statistic for a gene is the geometric mean rank of that gene and was propose by Breitling et al as a useful test statistic in the microarray case (2004). Its distribution is estimated by randomly permuting the observed ranks. Using this estimated distribution, RankProd gives estimates of the false discovery rate for each gene. RankProd calls this the “PFP" which is interpreted as the false discovery rate for the whole list of genes which have an equal or smaller p-value. The procedure ranks each gene two ways, from most to least expressed, and from least to most expressed.


The rank product of a gene is affected by extreme ranks more than, say, the sum of expression values is affected by extreme expression values. The rank product analysis, therefore, picks up genes that are extremely expressed in at least one sample, even though they are less extreme in other samples. A profile plot of genes selected by RankProd is shown below.


Profile of RankProd's Top 300 Genes

Prod Top 300

One can see that a whole group of genes who were extremely under expressed in one or two samples are chosen up by RankProd, but were not chosen by limma or SAM due to their variability. RankProd seems to value large differences more than small standard errors. Unlike the first two lists, genes with missing values are included in RankProd’s top 300.


Why choose RankProd?

Rank Prod seems to be more in line with a Fold-Change criterion and may select genes with more biological significance according to Breitling et al (2004). Genes selected by RankProd might be less consistent across samples, but will have large differences in some of the samples.


How do they Compare?

Below is a Venn Diagram showing the overlap between methods in this case. SAM and limma had the most overlap (268 common genes) while limma and RankProd were the least similar pair (152 common genes). The list of genes from RankProd was the most unique, with 122 genes not found significant by either of the other methods. There was some agreement between all three methods though; half of the genes selected by each method were significant in both of the other two.

SAM Rank Prod


More Information

R / Biocionductor:

LIMMA:

RankProd:

SAM:

How do I run the TiMAT2 CorrelationMap application on chIP-chip promoter array data?

##Here's an example:  #Make all of the possible intervals from a T2 run serialized window file java -jar -Xmx1000M ~/Apps/IntervalMaker -s -50 -i 1 -o 2 -g 250 -z 60 -f \  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win  #Make a text file containing chromosome, start, and stop for each interval,  #  this represents all of the interrogated promoters on the zebrafish array java -jar -Xmx1000M ~/Apps/IntervalReportPrinter -c -f \  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win1Indx86672   #Find the best window within each promoter (if desired use the -m option to  #  find the lowest scoring window for identifying reduced regions from a diff analysis) java -jar -Xmx1000M ~/Apps/BestWindowScoreExtractor -w \  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win \  -r /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win1Indx86672.xls \  -z 60 -i 1 > /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.xls  #Parse the output file printing the row number as the first column, the  #   first four columns, and skipping the first four lines java -jar ~/Apps/PrintSelectColumns -i 0,1,2,3 -n 4 -r -f \  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.xls   #Run the CorrelationMap application on the parsed file java -jar ~/Apps/CorrelationMaps -w 1000000 -g zv7 -f \  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.PSC.xls

How do I convert TiMAT2 xxx.bar files to text files?

There is a converter called Bar2Gr in the T2 package. Download it from SourceForge or use the installed version on hci-bio.

nixlaptop:~ nix$ ssh u0028003@hci-bio u0028003@hci-bio's password:  Last login: Wed Aug  8 09:36:36 2007 from 155.100.234.87 [u0028003@hci-bio ~]$ java -jar /home/BioApps/T2/Apps/Bar2Gr   ************************************************************************************** **                                 Bar2Gr: Nov 2006                                 ** ************************************************************************************** Converts xxx.bar to text xxx.gr files.  -f The full path directory/file name for your xxx.bar file(s).  Example: java -Xmx1500M -jar pathTo/T2/Apps/Bar2Gr -f /affy/BarFiles/   **************************************************************************************

How do I convert between different gene names?

There are several web sites, that can get you 90% of the way there. Check out these two http://discover.nci.nih.gov/matchminer/index.jsp and http://david.abcc.ncifcrf.gov/conversion.jsp . For those with no match try manually punching the name into the UCSC and Ensembl search bars (http://www.ensembl.org/index.htmlhttp://genome.ucsc.edu/cgi-bin/hgGateway).

Data

Sequencing

How do I convert refFlat or refSeq UCSC gene table files into brs (binary ref seq) files for uploading to a DAS/2 server like GNomEx?

  • Follow the initial steps for retrieving and uncompressing the igb_exe.jar file as described for converting fasta to bnib files, see above.
  • Launch the BrsParser (java -Xmx2G -cp genometry.jar:sam--igbext.bnd.jar com.affymetrix.genometryImpl.parsers.BrsParser) to get the usage information. Note you must launch this from the IGBCode directory.
  • Launch the BrsParser on a refFlat file (e.g. java -Xmx2G -cp genometry.jar:sam--igbext.bnd.jar com.affymetrix.genometryImpl.parsers.BrsParser /my/anno/hg19EnsGenes.ucsc /my/anno/hg19EnsGenes.brs)

What is the md5_checksums.txt file?

The Fastq folder for each sequencing experiment contains, in addition to the compressed Fastq files themselves, a file named "md5_checksums.txt". This file contains the MD5 checksum for each Fastq file from your experiment on our server. The MD5 checksum is a commonly-used method to check data integrity. Once you have downloaded your sequence data files from GNomEx, you can calculate the MD5 checksums on your local copy of the files and compare them with the checksums in md5_checksum.txt. The numbers should match exactly. If they do not match, the file was corrupted during transfer and must be downloaded again.

To calculate the MD5 checksum, use the "md5sum" command on Linux or the "md5" command in the terminal on Mac OS X:

$ md5sum 9835X1_130311_SN141_0663_BC1P97ACXX_5.txt.gz  946235368cf6b055f96b957c6640d8ea  9835X1_130311_SN141_0663_BC1P97ACXX_5.txt.gz

How do I convert fasta files into bnib files for uploading to a DAS/2 server like GNomEx?

  • Make a directory called IGBCode (mkdir IGBCode)
  • Change into that directory (cd IGBCode)
  • Download the IGB.zip file (NOT Webstart) from the latest IGB release, http://sourceforge.net/projects/genoviz/files/ into the IGBCode directory
  • Unzip IGB.zip (unzip IGB.zip)
  • Move the igb_exe.jar file from the uncompressed igb folder to the IGBCode folder (mv igb/igb_exe.jar .)
  • Unzip the igb_exe.jar file (unzip igb_exe.jar)
  • Launch the NibbleParser (java -Xmx2G -cp genometry.jar:sam--igbext.bnd.jar com.affymetrix.genometryImpl.parsers.NibbleResiduesParser) to get the usage information. Note you must launch this from the IGBCode directory.
  • Launch the Nibble Parser on a directory of uncompressed xxx.fasta files (e.g. for x in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 X Y M; do echo $x; java -Xmx2G -cp genometry.jar:sam--igbext.bnd.jar com.affymetrix.genometryImpl.parsers.NibbleResiduesParser chr$x /my/fasta/files/chr$x".fasta" /my/bnib/files/chr$x".bnib" H_sapiens_Feb_2009; done
  • Save the chromosome length information for each converted file, this will be needed by GNomEx for creating a new genome build

Microarray

What are all these Agilent data files?

  • The data files produced during an Agilent microarray experiment (whether gene expression, CGH, or ChIP-chip) include raw data in the form of TIFF images, low-resolution JPEG images for visual examination, quality control files, and numerical data in the form of a text file. All the files for a particular array will have names that begin with the array's experiment number. For example, in request number 5054R, the names of all the data files for the first array will begin with "5054E1".
  • Here is a list of files produced from a typical gene expression experiment:

experiment_251486815301_S01_GE2-v5_95_Aug07_1_1.jpg

JPEG image of array

experiment_251486815301_S01_GE2-v5_95_Aug07_1_1.pdf

quality control report

experiment_251486815301_S01_GE2-v5_95_Aug07_1_1.txt

Text file with the numeric data for each spot on the array

request.fep

XML document that contains the parameters used for this run of the Agilent Feature Extraction software - the software that translates the image of the array into numerical data

request_200708151242.rtf

RTF (Word) document with the Project Run Summary for the array, a brief report describing when the array was scanned, the array's format, and the number of saturated spots

request_251486815301_S01_H.tif

TIFF from high-intensity scan

request_251486815301_S01_L.tif

TIFF from low-intensity scan

H002334_LastBatchReport.rtf

same as the Project Run Summary

QCReport_Graphs

A folder that contains files used in the Quality Control Report

Where do I get the design file for an Agilent array?

The layout of each Agilent microarray is described in a "design file" which is needed for loading Agilent-format microarray data into analysis software such as MeV or Agilent CGH Analytics.

Download

The Agilent array design files can be obtained from Agilent. You will need a microarray bar code to download the design file. The bar code number of each array is a 12-digit number (typically beginning with "251...") that is embedded in the names of your Agilent microarray data files. You can also find the barcode number in the header of the .txt format Agilent data file. The barcode should be on row 3, directly beneath a cell with the word "FeatureExtractor_Barcode". If the data file is opened in Excel, the barcode is usually in cell T3.

Formats

The design files are available in several different formats. The format you need depends on which analysis software you use.

Agilent CGH Analytics 

GEML (.xml) format

MeV 

Tab-delimited text (.txt) format

Agilent scanner configuration

DNA: Back of slide

Barcode: Left side

Scan: Landscape

What analysis software is available?

See Software

How do I report a problem or issue with a USeq application?

Please fill out a bug report, do not send an email, these get buried and lost.


This bug report should contain at minimum 3 items:

  • A short description of the problem.
  • A complete copy of the command line used and all terminal output.
  • Links to all of the files used to execute the command line. Place them in a web accessible directory, on moab/ alta/ ember5, or (if you have an account) in GNomEx under a new Analysis.


Without these 3 essential items, we will not be able to reproduce your bug and your bug report will in all likelihood be deleted.

How do I upload my results to GEO or ArrayExpress?

How do I get the sequence for a particular microarray probe?

Agilent Probes Agilent probe sequences are frequently reported in the .txt files that contain the raw data for an Agilent microarray experiment. These are tab-delimited text files, easily opened in Microsoft Excel. Check there first.

Agilent probe sequences are also available via Agilent's eArray web site. You will need a login to use this site, which can be obtained from Brian Dalley or the Bioinformatics Core.

Step by step:

Once you've logged in to eArray, follow these steps to find the sequence of a probe:

  • Click on the "Probe" tab.
  • Select "Gene Symbol", "Accessions", or "ProbeID" from the Search type menu, whichever is appropriate for your search.
  • In the "Search Term" field enter a probe id (e.g. A_44_P409518), an accession number (e.g. NM_057188) or a gene symbol (e.g. Gmpr).
  • Select a species from the Species drop-down menu.
  • Select a Folder for the search: either "Agilent Catalog" for Agilent-designed probes (including commercial arrays) or "University of Utah" for custom-designed probes.
  • Click the "Search" button.
  • Finally, click on the "View" link for your probe of interest at the bottom of the page.

Affymetrix Probes Affymetrix array probe sequences can be found by searching for the platform (array) identifier at the Affymetrix web site (http://www.affymetrix.com/support/help/index.affx).