- Main Page
- Launch GNomeEx
- Launch 3G IPA
- Launch 2G IGB
- Cluster Computing with Pysano
- Tomato Analysis
- USeq Analysis
- Contact Us
- Utah Core Web Sites
- Cancer Genomics Resources
Look up the Analysis, Experiment, or DataTrack number and modify the examples:
Many Frequently Asked Questions about GNomEx are answered here: http://hci-scrum.hci.utah.edu/gnomexdoc/?page_id=94
P value adjustment is an essential part of next-gen sequencing or microarray analysis. Raw or unadjusted p values are calculated for each individual gene in a data set using a test statistic, for example a T test. Since there are 30 - 40,000 different genes in a typical gene expression experiment, you are doing 30 - 40,000 independent statistical tests on the data. Even at a reasonable p value of 0.05 or 0.01 you should expect many many false positive results. P value adjustment using a method such as the Benjamini and Hochberg method or Q value method will calculate a false discovery rate for the whole experiment from the unadjusted p values of individual genes.
A comparison of three methods used to select differentially expressed genes from a microarray experiment: SAM, RankProd, LIMMA.
Cody Olsen, Huntsman Cancer Institute, University of Utah, July 2006
You have some microarray data (or you are planning on getting some) and you want to know how to find those interesting genes—the ones that are consistently differentially expressed in your experiment. Not only must you decide how to clean and process the raw data and do background correction and normalization, you must also choose a method with which to analyze your data. Once the data is analyzed, you can compile a list of possibly differentially expressed genes. Among common methods of identifying differentially expressed genes in microarray experiments are: Significance Analysis of Microarrays (SAM), linear models (such as limma in R), and non-parametric methods such as Rank Product Analysis (RankProd in R). Hopefully this quick comparison and overview will aid you in making a decision.
Data from Ed Levine’s lab at the Moran Eye Center at the University of Utah was used to compare these three methods. Three 2-color Agilent microarrays were treated with unique pooled samples of P0Het on the Cy3 channel, and P0Null on the Cy5 channel. Microarrays were processed in the University of Utah Microarray Shared Resource Lab. A list of genes which were differentially expressed between the two classes P0Het-P0Null was of interest.
The three methods were applied to the same set of normalized data, and three lists of 300 genes were obtained. Two important factors used to estimate whether a gene is differentially expressed are the magnitude of the difference between groups and the variability of the gene. Since each of the three methods estimate and deal with these quantities differently, we will not get the same list of differentially expressed genes from all three methods.
The R package “limma” is used to create and test linear models for microarray data. Limma uses a moderated t-statistic to test the average difference in log expression levels between the two groups for each gene. The moderated t-statistic is the average log ratio divided by a standard error which is calculated using information from the replicates of the given gene and information from across all genes. Once all possible tests have been done, a variety of multiple comparison procedures are available to control for the false discovery rate of the experiment.(Smythe, 2004)
The t-statistic used in limma assures that the final list of genes includes genes that are consistently different between groups. Limma will choose a gene that is moderately different and consistent, before it will choose a gene that is extremely different each sample, but whose expression is highly variable. A profile plot of the top 300 genes selected by limma is shown in the figure below.
Profile of LIMMA's Top 300 Genes
Some of the most extremely expressed genes are not selected by limma, while genes who are moderately over or under expressed but which have little variability across sample are chosen.
Why should I use a linear model?
Linear models are powerful and can deal with complicated experimental designs in an elegant way. Linear models are a standard statistical tool, and one can get a traditional t-statistic from limma in addition to the moderated one the authors of the package suggest using. There are also other statistics that can be calculated including Bayesian estimators. Of course, certain assumptions are made when fitting a linear model, which are not met in the microarray case, but similar assumptions are made for the other two methods as well. Limma is a package in R’s Bioconductor which is free open source software.
Significance Analysis of Microarrays uses a test statistic similar to that used by limma. SAM uses a different method to calculate the variability of a gene using information from the specific gene as well information from other genes. Although the test statistic is similar to that used by limma, SAM will choose a slightly different set of genes. The figure below shows a profile plot of genes chosen by SAM.
Profile of SAM's Top 300 Genes
In this experiment, SAM didn’t choose any genes which were under-expressed. They seem to have been too variable. SAM seems to value consistency more than limma does, and will choose genes with a smaller difference in expression as long as that difference is consistent across samples.
Why should I use SAM?
SAM is free for academic users and can be easily used in Excel or R. The genes selected by SAM will be very consistent, if not greatly over or under expressed. Like limma, SAM uses a t-test and one can expect a list similar to one obtained from limma, although p-values (SAM uses a “q-value”) will differ.
The RankProd package in R can be used to analyze microarray data with a rank-based nonparametric method which may be able to identify differentially expressed genes not identified by the other two methods.
The RankProd statistic for a gene is the geometric mean rank of that gene and was propose by Breitling et al as a useful test statistic in the microarray case (2004). Its distribution is estimated by randomly permuting the observed ranks. Using this estimated distribution, RankProd gives estimates of the false discovery rate for each gene. RankProd calls this the “PFP" which is interpreted as the false discovery rate for the whole list of genes which have an equal or smaller p-value. The procedure ranks each gene two ways, from most to least expressed, and from least to most expressed.
The rank product of a gene is affected by extreme ranks more than, say, the sum of expression values is affected by extreme expression values. The rank product analysis, therefore, picks up genes that are extremely expressed in at least one sample, even though they are less extreme in other samples. A profile plot of genes selected by RankProd is shown below.
Profile of RankProd's Top 300 Genes
One can see that a whole group of genes who were extremely under expressed in one or two samples are chosen up by RankProd, but were not chosen by limma or SAM due to their variability. RankProd seems to value large differences more than small standard errors. Unlike the first two lists, genes with missing values are included in RankProd’s top 300.
Why choose RankProd?
Rank Prod seems to be more in line with a Fold-Change criterion and may select genes with more biological significance according to Breitling et al (2004). Genes selected by RankProd might be less consistent across samples, but will have large differences in some of the samples.
How do they Compare?
Below is a Venn Diagram showing the overlap between methods in this case. SAM and limma had the most overlap (268 common genes) while limma and RankProd were the least similar pair (152 common genes). The list of genes from RankProd was the most unique, with 122 genes not found significant by either of the other methods. There was some agreement between all three methods though; half of the genes selected by each method were significant in both of the other two.
R / Biocionductor:
|##Here's an example: #Make all of the possible intervals from a T2 run serialized window file java -jar -Xmx1000M ~/Apps/IntervalMaker -s -50 -i 1 -o 2 -g 250 -z 60 -f \ /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win #Make a text file containing chromosome, start, and stop for each interval, # this represents all of the interrogated promoters on the zebrafish array java -jar -Xmx1000M ~/Apps/IntervalReportPrinter -c -f \ /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win1Indx86672 #Find the best window within each promoter (if desired use the -m option to # find the lowest scoring window for identifying reduced regions from a diff analysis) java -jar -Xmx1000M ~/Apps/BestWindowScoreExtractor -w \ /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win \ -r /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win1Indx86672.xls \ -z 60 -i 1 > /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.xls #Parse the output file printing the row number as the first column, the # first four columns, and skipping the first four lines java -jar ~/Apps/PrintSelectColumns -i 0,1,2,3 -n 4 -r -f \ /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.xls #Run the CorrelationMap application on the parsed file java -jar ~/Apps/CorrelationMaps -w 1000000 -g zv7 -f \ /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.PSC.xls|
There is a converter called Bar2Gr in the T2 package. Download it from SourceForge or use the installed version on hci-bio.
|nixlaptop:~ nix$ ssh u0028003@hci-bio u0028003@hci-bio's password: Last login: Wed Aug 8 09:36:36 2007 from 184.108.40.206 [u0028003@hci-bio ~]$ java -jar /home/BioApps/T2/Apps/Bar2Gr ************************************************************************************** ** Bar2Gr: Nov 2006 ** ************************************************************************************** Converts xxx.bar to text xxx.gr files. -f The full path directory/file name for your xxx.bar file(s). Example: java -Xmx1500M -jar pathTo/T2/Apps/Bar2Gr -f /affy/BarFiles/ **************************************************************************************|
There are several web sites, that can get you 90% of the way there. Check out these two http://discover.nci.nih.gov/matchminer/index.jsp and http://david.abcc.ncifcrf.gov/conversion.jsp . For those with no match try manually punching the name into the UCSC and Ensembl search bars (http://www.ensembl.org/index.htmlhttp://genome.ucsc.edu/cgi-bin/hgGateway).
The Fastq folder for each sequencing experiment contains, in addition to the compressed Fastq files themselves, a file named "md5_checksums.txt". This file contains the MD5 checksum for each Fastq file from your experiment on our server. The MD5 checksum is a commonly-used method to check data integrity. Once you have downloaded your sequence data files from GNomEx, you can calculate the MD5 checksums on your local copy of the files and compare them with the checksums in md5_checksum.txt. The numbers should match exactly. If they do not match, the file was corrupted during transfer and must be downloaded again.
To calculate the MD5 checksum, use the "md5sum" command on Linux or the "md5" command in the terminal on Mac OS X:
|$ md5sum 9835X1_130311_SN141_0663_BC1P97ACXX_5.txt.gz 946235368cf6b055f96b957c6640d8ea 9835X1_130311_SN141_0663_BC1P97ACXX_5.txt.gz|
JPEG image of array
quality control report
Text file with the numeric data for each spot on the array
XML document that contains the parameters used for this run of the Agilent Feature Extraction software - the software that translates the image of the array into numerical data
RTF (Word) document with the Project Run Summary for the array, a brief report describing when the array was scanned, the array's format, and the number of saturated spots
TIFF from high-intensity scan
TIFF from low-intensity scan
same as the Project Run Summary
A folder that contains files used in the Quality Control Report
The layout of each Agilent microarray is described in a "design file" which is needed for loading Agilent-format microarray data into analysis software such as MeV or Agilent CGH Analytics.
The Agilent array design files can be obtained from Agilent. You will need a microarray bar code to download the design file. The bar code number of each array is a 12-digit number (typically beginning with "251...") that is embedded in the names of your Agilent microarray data files. You can also find the barcode number in the header of the .txt format Agilent data file. The barcode should be on row 3, directly beneath a cell with the word "FeatureExtractor_Barcode". If the data file is opened in Excel, the barcode is usually in cell T3.
The design files are available in several different formats. The format you need depends on which analysis software you use.
Agilent CGH Analytics
GEML (.xml) format
Tab-delimited text (.txt) format
Agilent scanner configuration
DNA: Back of slide
Barcode: Left side
Please fill out a bug report, do not send an email, these get buried and lost.
This bug report should contain at minimum 3 items:
Without these 3 essential items, we will not be able to reproduce your bug and your bug report will in all likelihood be deleted.
Agilent Probes Agilent probe sequences are frequently reported in the .txt files that contain the raw data for an Agilent microarray experiment. These are tab-delimited text files, easily opened in Microsoft Excel. Check there first.
Step by step:
Once you've logged in to eArray, follow these steps to find the sequence of a probe:
Affymetrix Probes Affymetrix array probe sequences can be found by searching for the platform (array) identifier at the Affymetrix web site (http://www.affymetrix.com/support/help/index.affx).