GSEA Data Files#
1. How do I create an expression dataset file? What types of expression data can I analyze?#
GSEA requires that expression data be in a RES, GCT, PCL, or TXT file. All four file formats are tab-delimited text files. For details of each file format, see Data Formats.
GenePattern provides several modules for converting expression data into gct and/or res files:
- ExpressionFileCreator converts raw expression data from Affymetrix CEL files.
- GEOImporter and caArrayImportViewer create a GCT file based on expression data extracted from the GEO or caArray microarray expression data repository, respectively.
- MAGEImportViewer module converts MAGE-ML format data. MAGE-ML is the standard format for storing both Affymetrix and cDNA microarray data at the ArrayExpress repository.
To use expression data stored in any other format (such as cDNA microarray data), first convert the data into a tab-delimited text file that contains expression measurements with genes as rows and samples as columns and then modify that text file to comply with the gct file format requirements as described in Expression Datasets in the GSEA User Guide.
If you are using two-color ratio data, see also cDNA Microarray Data.
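As an illustration of the conversion step described above, the following is a minimal sketch that renders a genes-by-samples matrix as GCT text. It assumes the GCT 1.2 layout documented on the Data Formats page (a `#1.2` version line, a row and column count line, a `NAME`/`Description` header, then one row per gene); the function name `to_gct` and its argument shapes are hypothetical and chosen only for this example.

```python
def to_gct(sample_names, rows):
    """Render a GCT (version 1.2) document as a string.

    rows is a list of (name, description, values) tuples.
    A value of None marks absent data and is written as a blank cell.
    """
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description", *sample_names]))
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        cells = ["" if v is None else str(v) for v in values]
        lines.append("\t".join([name, description, *cells]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_gct(["tumor_1", "control_1"], [("TP53", "na", [12.5, 8.0])]))
```

Save the resulting text with a .gct extension so that the extension matches the file format.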
Several errors indicate problems with data file formatting, in particular:
- "There were errors: ERRORS #:1 Parsing trouble… " "Bad format expect ncols: [x] but found: [y] on line >"
In older versions of GSEA, if a gct, res, or pcl file has a .txt file extension (for example, a .gct file saved as test.gct.txt), you will see this parsing error when you load the file into GSEA. Check that the file extension matches the file format. Note that some operating systems (such as Windows) can be configured to hide known file extensions. If your operating system is configured to hide known extensions, a file named test.gct.txt will be listed as test.gct. Look at the file type of the file: it should be GCT (or RES or PCL), not Text Document. GSEA 4.x and later will parse .gct.txt files as .gct.
- "There were errors: ERRORS #:1 Parsing trouble… " "java.lang.NumberFormatException: For input string: [x]"
This indicates that there are non-numeric values in the expression data matrix. A frequent cause of this is that non-numeric strings (such as NA, or NaN) were inserted into the matrix in place of absent data. Absent data should be represented by blank cells.
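Before loading a file into GSEA, a quick check such as the following sketch can locate the offending cells. It assumes a GCT-style layout with three header lines and two leading annotation columns; `find_non_numeric_cells` is a hypothetical helper, not part of GSEA. Note that Python itself parses "NaN" as a number, so the sketch flags non-finite values explicitly.

```python
import math


def find_non_numeric_cells(gct_text):
    """Return (line_number, column_number, value) for every data cell
    that is neither blank nor a finite number."""
    problems = []
    lines = gct_text.splitlines()
    # Data rows start on line 4; data columns start at column 3.
    for line_no, line in enumerate(lines[3:], start=4):
        cells = line.split("\t")
        for col_no, cell in enumerate(cells[2:], start=3):
            value = cell.strip()
            if value == "":
                continue  # blank cells are the accepted way to mark absent data
            try:
                number = float(value)
            except ValueError:
                problems.append((line_no, col_no, value))
                continue
            if not math.isfinite(number):
                problems.append((line_no, col_no, value))
    return problems
```

Each reported cell can then be replaced with a blank cell in the source file.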
2. How do I filter or pre-process my dataset for GSEA?#
How you filter or pre-process your data depends on your study and data origin. Here are a few guidelines to consider:
- Probe identifiers versus gene symbols. Microarray datasets contain the probe identifiers native to your platform. GSEA can analyze the probe identifiers or collapse each probe set to a gene vector, where the gene is identified by a gene symbol. Collapsing the probe sets prevents multiple probes per gene from inflating the enrichment scores and facilitates the biological interpretation of analysis results. In order to analyze microarray data with MSigDB gene sets, the microarray probes must be collapsed to gene symbols.
- AP call filters. You can run GSEA on filtered or unfiltered data. Typically, the GSEA team runs the analysis on unfiltered data. One suggested approach is to start with the unfiltered data: if the results seem dominated by gene sets with poorly expressed genes, you might gain insight into what thresholds to use for the call filters.
- Expression values. The GSEA algorithm examines the differences in expression values rather than the values themselves. For example, you might have natural scale data or logged expression levels; you might have Affymetrix data or two-color ratio data. As in most data analysis methodologies, the same expression data represented in different formats may generate different analysis results. The differences are expected. GSEA cannot determine which results are "correct."
- RNA-seq data. RNA-seq datasets should be preprocessed in accordance with the guidelines for Using RNA-seq Datasets with GSEA.
For more information, see Preparing Data Files in the GSEA User Guide.
3. Should I use natural or log scale data for GSEA?#
We recommend using natural scale data. We used it when we calibrated the GSEA method and it seems to work well in general cases.
Traditional modeling techniques, such as clustering, often benefit from data preprocessing. For example, one might filter expression data to remove genes that have low variance across the dataset and/or log transform the data to make the distribution more symmetric. The GSEA algorithm does not benefit from such preprocessing of the data.
RNA-seq datasets may benefit from the removal of genes that are not expressed in any of the samples in the dataset. See the guidelines for Using RNA-seq Datasets with GSEA.
4. How many samples do I need for GSEA?#
This depends on your specific problem and data characteristics; however, as a general recommendation, if there are fewer than 7 samples per phenotype, GSEA should be run with gene_set rather than phenotype permutation. Three samples per phenotype are the minimum for the default signal2noise ranking metric and for the tTest ranking metric.
If you have technical replicates, you generally want to remove them by averaging or some other data reduction technique. For example, assume you have five tumor samples and five control samples each run three times (three replicate columns) for a total of 30 data columns. You would average the three replicate columns for each sample and create a dataset containing 10 data columns (five tumor and five control).
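The averaging step can be sketched as follows. This example assumes that the replicate columns for each sample are adjacent and that every sample has the same number of replicates; `average_replicates` is a hypothetical helper name.

```python
from statistics import mean


def average_replicates(row, replicates_per_sample):
    """Average consecutive blocks of replicate columns in one expression row."""
    if len(row) % replicates_per_sample != 0:
        raise ValueError("row length is not a multiple of the replicate count")
    return [
        mean(row[start:start + replicates_per_sample])
        for start in range(0, len(row), replicates_per_sample)
    ]


if __name__ == "__main__":
    # 30 data columns (10 samples x 3 replicates) reduce to 10 columns.
    row = [float(i) for i in range(30)]
    print(len(average_replicates(row, 3)))
```

Apply the function to every gene row and keep one sample name per block of replicates.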
5. How do I create a phenotype label file? What types of experiments can I analyze?#
GSEA can be used to analyze experiments of any type (including time-series, three or more classes, and so on). The phenotype labels (cls) text file defines the experimental phenotypes and associates each sample in your dataset with one of those phenotypes. The cls file is an ASCII plain text tab-delimited file, which you can easily create using a text editor. For more information, see Preparing Data Files in the GSEA User Guide.
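For a categorical phenotype, a cls file can also be generated programmatically. The sketch below assumes the three-line categorical layout described in the User Guide (a counts line ending in 1, a `#` line naming the classes, and one label per sample in dataset column order) and assumes that a single space is an accepted separator; pass `sep="\t"` for tabs. `to_cls` is a hypothetical helper.

```python
def to_cls(labels, sep=" "):
    """Build categorical cls text from one label per sample, in column order."""
    for label in labels:
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"invalid class label: {label!r}")
    # Unique class names in order of first appearance in the label line.
    classes = list(dict.fromkeys(labels))
    header = sep.join([str(len(labels)), str(len(classes)), "1"])
    names = sep.join(["#", *classes])
    return "\n".join([header, names, sep.join(labels)]) + "\n"


if __name__ == "__main__":
    print(to_cls(["tumor", "tumor", "control", "control"]))
```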
6. What gene sets are available? Can I create my own gene sets?#
You can use the gene sets in the Molecular Signature Database (MSigDB) or create your own. For more information about the MSigDB gene sets, see the MSigDB page. For more information about creating gene sets or using gene sets with GSEA, see Preparing Data Files in the GSEA User Guide.
7. How many genes should there be in a gene set?#
GSEA automatically adjusts the enrichment statistics to account for different gene set sizes, as described in the Supplemental Information for the GSEA 2005 PNAS paper; however, as a general guideline, gene sets should optimally contain between 15 and 500 genes.
8. Can GSEA analyze a gene set that contains duplicate genes? duplicate gene sets?#
Duplicate genes in a gene set and duplicate gene sets both affect GSEA results. GSEA automatically removes duplicate genes from each gene set, but does not check for duplicate gene sets. For more information, see Gene Sets in the GSEA User Guide.
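Because GSEA does not check for duplicate gene sets, it can help to check for them when preparing your own gene set file. The following sketch assumes the GMT layout (one gene set per line: name, description, then genes, all tab-delimited); `to_gmt` and `duplicate_gene_sets` are hypothetical helpers.

```python
def to_gmt(gene_sets):
    """Render {name: (description, genes)} as GMT text, dropping repeated genes."""
    lines = []
    for name, (description, genes) in gene_sets.items():
        unique_genes = list(dict.fromkeys(genes))  # keeps first occurrence order
        lines.append("\t".join([name, description, *unique_genes]))
    return "\n".join(lines) + "\n"


def duplicate_gene_sets(gene_sets):
    """Return groups of gene set names whose gene membership is identical."""
    by_members = {}
    for name, (_description, genes) in gene_sets.items():
        by_members.setdefault(frozenset(genes), []).append(name)
    return [names for names in by_members.values() if len(names) > 1]
```

Any group returned by `duplicate_gene_sets` is a candidate for merging or removal before the analysis.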
9. Can GSEA analyze a gene set that contains genes that are not in my expression dataset?#
The Gene Set Enrichment Analysis application automatically restricts the gene sets to only the genes in the expression dataset. If as a result of this restriction the size of a gene set is reduced to below the gene set minimum size cutoff, the gene set will be excluded from analysis. The analysis report lists the gene sets and the number of genes that were included and excluded from the analysis.
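The restriction step can be previewed outside GSEA with a sketch like the one below. The default cutoffs of 15 and 500 are an assumption taken from the size guideline given earlier in this FAQ, and `restrict_gene_sets` is a hypothetical helper rather than GSEA's own implementation.

```python
def restrict_gene_sets(gene_sets, dataset_genes, min_size=15, max_size=500):
    """Restrict each gene set to genes present in the dataset.

    Returns (kept, excluded): both map gene set name -> restricted gene list.
    """
    dataset = set(dataset_genes)
    kept, excluded = {}, {}
    for name, genes in gene_sets.items():
        present = [g for g in dict.fromkeys(genes) if g in dataset]
        target = kept if min_size <= len(present) <= max_size else excluded
        target[name] = present
    return kept, excluded
```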
10. What array platforms and species does GSEA support?#
Typically, GSEA uses gene sets from MSigDB. MSigDB consists of a database of gene sets in human gene symbols and a database of gene sets in mouse gene symbols. GSEA has built-in tools for converting a variety of other gene identifiers to gene symbols by means of specially formatted CHIP files. The CHIP files provide the mapping between gene identifiers in your expression data and gene identifiers in the gene sets. Specifically, our CHIP files provide the mappings from many different platforms (e.g., mouse Affymetrix probe set IDs, human Affymetrix probe set IDs, previous versions of gene symbols, etc.) to either human or mouse gene symbols. MSigDB supports the majority of platforms available in the Ensembl Biomart for Human, Mouse, and Rat data.
Human, Mouse, and Rat datasets can be analyzed either against the Human MSigDB Collections or the Mouse MSigDB Collections. For analysis of mouse and rat using the human collections, or analysis of human and rat using the mouse collections, ortholog mapping chip files are provided.
For other species you have two options. The first is to prepare your own chip files that perform orthology mapping. The release notes for a given MSigDB version specify the Ensembl release that the current version of MSigDB's gene symbols were retrieved from. The annotations in the Ensembl biomart for the specified version can provide a starting place for constructing ortholog chip files. Alternatively, you may choose to provide your own database of gene sets as a GMT or GMX file. The file formats are described here. In either case, you still have to make sure that the gene identifiers in your data match those in your gene sets database. If the identifiers don't match each other, then you also have to provide a CHIP file with the appropriate mappings. The CHIP file format is described here.
To see what CHIP files are available in our distribution, start the GSEA desktop application and click [...] at "Chip platform(s)" on the "Run GSEA" page. The chip files for mapping gene identifiers to the human collections are available under the "Human Collection Chips (MSigDB)" tab, and the chip files for mapping gene identifiers to the mouse collections are available under the "Mouse Collection Chips (MSigDB)" tab.
If your platform is not in either list, you have the following options:
- Create your own CHIP file to map your platform-specific gene identifiers to human or mouse gene symbols, and then use your CHIP file to collapse the dataset in GSEA. The CHIP file format is described here.
- Convert your platform identifiers to human or mouse gene symbols outside GSEA, then run GSEA with 'Collapse dataset' = "No_collapse". Make sure that gene symbols in the collapsed dataset appear only once. Simply replacing the identifiers with human or mouse gene symbols usually is not sufficient because some of the identifiers can correspond to the same gene symbol, resulting in duplicate rows with different expression values. In this case, GSEA will arbitrarily pick one of the rows with the same gene symbols for the analysis, which we do not recommend.
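When converting identifiers outside GSEA, duplicate symbols have to be merged explicitly. The sketch below keeps the per-sample maximum across all rows that map to the same symbol; this rule is only one possible choice and is an assumption of the example, as are the helper name `collapse_to_symbols` and its argument shapes.

```python
def collapse_to_symbols(rows, id_to_symbol):
    """Collapse {identifier: values} to {symbol: values}.

    Rows that share a symbol are merged by taking the per-sample maximum.
    Identifiers without a mapping are dropped.
    """
    collapsed = {}
    for identifier, values in rows.items():
        symbol = id_to_symbol.get(identifier)
        if symbol is None:
            continue
        if symbol in collapsed:
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed
```

The result contains each gene symbol exactly once, so GSEA never has to pick arbitrarily between duplicate rows.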
In addition to "Collapse" which performs a mathematical operation to condense multiple identifiers to a single gene symbol, GSEA also supports a "Remap_only" mode. When working with data that is already in gene symbols, but may have been produced with an old database, it is recommended to attempt to "Remap" these symbols to the current versions used in MSigDB using the "Remap_Only" collapse option with the "Human_Gene_Symbol_with_Remapping" or "Mouse_Gene_Symbol_with_Remapping" .chip file for the target MSigDB version and species.
1. What is the difference between GSEA and an overlap statistic (hypergeometric) analysis tool?#
An overlap statistic analysis tool is typically used with data where a subset of genes have been selected on the basis of, for example, members at the top or bottom of a differentially expressed gene list, and just the Gene IDs are used. In contrast GSEA uses the rank information for the entire list without using a threshold. The introduction to the GSEA 2005 PNAS paper discusses the limitations of the former approach and how GSEA addresses them.
2. Why does GSEA use the Kolmogorov-Smirnov statistic rather than the Mann-Whitney test?#
The Kolmogorov-Smirnov statistic is slightly more suitable for less coherent data because it takes relatively fewer significant items to score well. The GSEA 2005 PNAS paper discusses the use of this statistic in detail (see the section titled Adjusting for Variation in Gene Set Size in the supplemental information).
3. How does GSEA rank the genes in my dataset?#
By default, GSEA uses the signal-to-noise ratio to rank the genes. Other options are available in the Metric for ranking genes parameter. For more information, see the Metric for ranking genes parameter in the Run GSEA section of the GSEA User Guide.
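As a rough illustration, the signal-to-noise ratio for one gene is the difference of the class means divided by the sum of the class standard deviations. The sketch below uses the sample standard deviation and omits any adjustments that GSEA applies internally to the standard deviations, so its values may differ from GSEA's; the function names are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene.

    Needs at least two values per class; raises ZeroDivisionError
    when both classes have zero variance.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_genes(expression, a_columns, b_columns):
    """Return (gene, score) pairs sorted from highest to lowest score."""
    scores = {
        gene: signal_to_noise(
            [values[i] for i in a_columns], [values[i] for i in b_columns]
        )
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```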
4. Can I use GSEA to analyze my own ranked list of genes?#
Yes. Use the GseaPreranked analysis to run the gene set enrichment analysis against your own ranked list of genes. For more information, see the GSEAPreranked section of the GSEA User Guide.
5. Can I use GSEA to compare two datasets?#
One option is to create a gene set that contains the top genes from the first dataset and use GSEA to analyze that gene set against the second dataset. Similarly, create a gene set that contains the top genes from the second dataset and use GSEA to analyze that gene set against the first dataset. For example, you might analyze the top 100 genes from each dataset.
If you want to compare the enrichment results of two different datasets against each other, this can be done using the EnrichmentMap Cytoscape application.
6. Can I use GSEA to analyze a dataset that contains a single sample?#
The recommended way to run GSEA on such a dataset is to use the ssGSEA module available through GenePattern. The classical GSEA algorithm can also be used; however, GSEA has no way of ranking the genes in such a dataset. Therefore, you must rank the genes and then use GSEA to analyze the ranked list of genes. For more information, see the GSEA Preranked Page in the GSEA User Guide.
7. Can I use GSEA to analyze paired samples?#
The GSEA algorithm does not consider "pairedness" of samples in the dataset. Paired data analysis, e.g. data with two samples from the same individual at different timepoints, has specific statistical considerations that GSEA does not model. It is recommended to create a ranked list of genes by running a paired-sample marker analysis outside of GSEA. You can then use GSEA to analyze that ranked list of genes. For more information about analyzing your own ranked lists of genes, see the GSEA Preranked Page in the GSEA User Guide.
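One way to build such a ranked list is to score each gene with a paired statistic and write the result as a two-column ranked list. The sketch below uses a paired t-statistic, which is only one possible choice of paired analysis, and assumes the tab-delimited gene and score layout of a ranked list (rnk) file; `paired_t` and `to_rnk` are hypothetical helpers.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t-statistic for one gene; sample i in both lists is the same individual."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def to_rnk(scores):
    """Render {gene: score} as ranked-list text, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ordered)
```

A gene whose paired differences are all identical has zero standard deviation and raises ZeroDivisionError; such genes need separate handling before ranking.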
8. Can I use GSEA to analyze time series data?#
Yes. The phenotype labels (.cls) file defines the experimental phenotypes and associates each sample in your dataset with one of those phenotypes. To analyze time course data, use the continuous phenotype label format. When you run the GSEA analysis, select Pearson in the Metric for ranking genes parameter. This is the only metric that can be used with time series data. For more information about the metrics used for ranking genes, see Metrics for Ranking Genes in the GSEA User Guide.
9. Can I use GSEA to find pathways that correlate with the expression of a gene of interest?#
Yes. In your phenotype file, create a continuous phenotype where the expression profile is that of your selected gene.
You can have GSEA create the necessary phenotype for you: on the Run GSEA page, click the ... button next to the Phenotype labels parameter; when GSEA prompts you to select a phenotype, click the Use a gene as the phenotype button to have GSEA create a continuous phenotype for your gene. For more information, see the Phenotype labels parameter on the Run GSEA Page in the GSEA User Guide.
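If you prefer to prepare the phenotype file yourself, the continuous phenotype can be written from the gene's expression profile. This sketch assumes the continuous cls layout described in the User Guide (a `#numeric` line, a `#` line naming the phenotype, then one value per sample in dataset column order); `to_continuous_cls` is a hypothetical helper.

```python
def to_continuous_cls(phenotype_name, profile):
    """Build continuous-phenotype cls text from one numeric value per sample."""
    values = " ".join(str(float(v)) for v in profile)
    return f"#numeric\n#{phenotype_name}\n{values}\n"


if __name__ == "__main__":
    # Expression of the gene of interest across four samples.
    print(to_continuous_cls("MYC", [206.0, 31.0, 252.0, 120.5]))
```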
10. Can I analyze gene sets that were constructed from both up- and down-regulated genes?#
The GSEA software is not aware of the original expression information from the data that was used to generate the gene set. When constructing your own gene sets for analysis against a different dataset it is recommended to split up and down regulated genes into separate gene sets.
11. Can I use GSEA to analyze SNP, SAGE, CHIP-Seq or RNA-Seq data?#
For detailed information on using RNA-seq data sets with GSEA, please see this help page.
For other data types, it is generally recommended to quantitatively rank the genes in order from most (largest value) to least (smallest value) "of interest" for use with GSEA-Preranked.
If the exact magnitude of the rank metric is not directly biologically meaningful, select "classic" for your enrichment statistic (so that each gene's contribution to the enrichment score is not weighted by the value of its ranking metric).
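The effect of the weighting can be seen in a simplified running-sum calculation. The sketch below follows the enrichment score definition in the GSEA 2005 PNAS paper, with an exponent of 0 for the classic statistic and 1 for the weighted statistic; it is an illustration, not the GSEA implementation, and `enrichment_score` is a hypothetical name.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score over a ranked list.

    ranked: (gene, score) pairs sorted from highest to lowest score.
    weight: 0 gives the classic statistic, 1 the weighted statistic.
    """
    members = set(gene_set)
    hit_weights = [
        abs(score) ** weight if gene in members else 0.0 for gene, score in ranked
    ]
    hit_total = sum(hit_weights)
    miss_count = sum(1 for gene, _ in ranked if gene not in members)
    if hit_total == 0 or miss_count == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for (gene, _score), hit in zip(ranked, hit_weights):
        running += hit / hit_total if gene in members else -1.0 / miss_count
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best
```

With `weight=0`, every gene in the set contributes equally, so only the positions in the ranked list matter.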
1. Where are the GSEA statistics (ES, NES, FDR, FWER, nominal p value) described?#
- for FDR and nominal p value, see the section titled Appendix: Mathematical Description of Methods
- for FWER, see the section titled FWER in the Supplemental Information.
2. Why do you recommend a false discovery rate (FDR) of 0.25 rather than the more classic 0.05 for GSEA?#
An FDR of 25% indicates that the result is likely to be valid 3 out of 4 times, which is reasonable in the setting of exploratory discovery where one is interested in finding candidate hypotheses to be further validated as a result of future research. Given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, using a more stringent FDR cutoff may lead you to overlook potentially significant results. For more information about gene set enrichment analysis results, see Interpreting GSEA in the GSEA User Guide.
Notably, the 0.25 FDR cutoff is recommended only for running GSEA in "phenotype" permutation mode; for gene_set permutation mode, the more typical 0.05 FDR is recommended.
3. Why does GSEA give me significant results with gene set (tag) permutation, but not with phenotype permutation?#
Phenotype permutation generally provides a more stringent assessment of significance and produces fewer false positives. Which permutation type you should use depends on the number of samples that you are analyzing. For more information, see the description of the Permutation type parameter on the Run GSEA Page in the GSEA User Guide.
4. How can I display details for more than the top 20 gene sets?#
By default, the GSEA analysis report generates a Details link, which provides summary plots and detailed analysis results, for the top 20 gene sets in each phenotype. To generate the Details link for additional gene sets, modify the Plot graphs for the top sets of each phenotype parameter under "Advanced fields" on the Run GSEA Page.
Note: if rerunning GSEA in this manner while using the "timestamp" option for the Seed for permutation parameter, the new result may differ slightly from the old result. This is expected. To generate identical results, input the numerical string from the "Timestamp used as random seed:" line of the previous run's index.html page.
5. What should I do if I have no significant gene sets or too many significant gene sets?#
The number of enriched gene sets depends on the structure of the data and the problem space. In general, one would expect to see at least a few gene sets enriched for a typical morphological or tissue-specific phenotype. If no enriched gene sets or a very large number of enriched gene sets pass the FDR threshold, first check that your gene sets and expression dataset use the same array format (see: Consistent Feature Identifiers Across Data Files) and that you have used the appropriate permutation type and number of permutations (see the Run GSEA Page). If you find no issues, consider the following:
- No enriched gene sets of significance may indicate that, in fact, no gene sets are enriched. It may also be that you are analyzing too few samples, the biological signal in question is subtle, or the gene sets that you are analyzing do not represent the biology in question very well. You may still want to look at the top ranked gene sets, keeping in mind that these results provide weak evidence for potentially interesting hypotheses. You might also want to consider analyzing other gene sets or, if possible, additional samples.
- Too many enriched gene sets of significance may indicate that, in fact, many gene sets are enriched between phenotypes. Perhaps the gene sets represent the same biological signal. You can check for this by looking for overlap in the leading-edge subsets within the gene sets (see Running a Leading Edge Analysis). Or, you might be seeing significant differences between the phenotypes due to technical artifacts, such as samples being run in different labs, by different operators, or against different arrays. As with too few enriched gene sets, you may still want to look at the top ranked gene sets, keeping in mind that these results provide potentially biased evidence for interesting hypotheses. You might also want to consider analyzing other gene sets or, if possible, additional samples.
For more information, see Interpreting GSEA in the GSEA User Guide.
6. What does it mean for a gene set to have a nominal p value of zero?#
A reported p value of zero (0.0) indicates an actual p-value of less than 1/number-of-permutations. For a more accurate p value, increase the number of permutations performed by the analysis. For more information about gene set enrichment analysis results, see Interpreting GSEA in the GSEA User Guide.
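The resolution limit follows from how an empirical p value is computed. The sketch below assumes the nominal p value is the fraction of same-sign permuted enrichment scores that are at least as extreme as the observed score, as described in the GSEA 2005 PNAS paper; `nominal_p_value` is a hypothetical helper.

```python
def nominal_p_value(observed_es, permuted_es):
    """Empirical p value from the same-sign portion of the permutation null."""
    if observed_es >= 0:
        same_sign = [es for es in permuted_es if es >= 0]
        extreme = sum(1 for es in same_sign if es >= observed_es)
    else:
        same_sign = [es for es in permuted_es if es < 0]
        extreme = sum(1 for es in same_sign if es <= observed_es)
    if not same_sign:
        return float("nan")
    return extreme / len(same_sign)
```

If no permuted score reaches the observed score, the count is zero and the reported value is 0.0; more permutations allow smaller non-zero values to be resolved.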
7. What does it mean for a gene set to have a small nominal p value (p<0.025), but a high FDR value (FDR=1)?#
The nominal p value estimates the significance of the observed enrichment score for a single gene set. However, when you are evaluating multiple gene sets, you must correct for multiple hypothesis testing. The FDR is the estimated probability that a gene set with a given enrichment score (normalized for gene set size) represents a false positive finding.
Generally, when your top gene sets have small nominal p values and high FDRs, it is because they are not as significant when compared with other gene sets in the empirical null distribution. This could be because you do not have enough samples, the biological signal is subtle, or the gene sets do not represent the biology in question very well. Also, the FDR is based on all gene sets; if only one of many gene sets is enriched, that gene set is likely to have a high FDR.
For more information, see Interpreting GSEA in the GSEA User Guide.
8. What is the difference between the weighted statistic and the classic statistic? Which should I use?#
See the description of the Enrichment statistic parameter on the Run GSEA Page in the GSEA User Guide.
9. I re-ran a previous analysis, or the GSEA example datasets. Why are my results different this time?#
There are several reasons why your results may vary when rerunning GSEA.
- The version of MSigDB used in the new run may have changed. Make sure you're using the same version of MSigDB that was used in the original analysis if attempting to replicate the results.
- The data was run with the "timestamp" option in the Seed for permutation parameter. To generate identical results, input the numerical string from the "Timestamp used as random seed:" line of the previous run's index.html page.
10. What does it mean for a gene set to have NES and nominal p-values of NaN (also shown as blanks)?#
When adjusting for variation in gene set size, we normalize the permuted ES(S, pi) and the observed ES(S) for a given S, separately rescaling the positive and negative scores by dividing by the mean of the ES(S, pi) of the corresponding sign to yield NES(S, pi) and NES(S). However, when the scores of the corresponding sign are absent, the mean ES(S, pi) = 0, and the division by zero will result in NaN values both for the NES(S, pi) and NES(S) as well as for the nominal p-values that follow from these calculations. The easiest way to overcome this issue is to increase the number of permutations. If this does not help, we recommend further exploring reasons for this bias. For example, it could be caused by a large difference in the number of samples of each phenotype class. For further details, please consult the GSEA 2005 PNAS paper, sections "Multiple Hypothesis Testing" in the "Mathematical Description of Methods" and the "Supporting Text" Appendix.
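The normalization described above can be sketched as follows. This is a simplified illustration of the sign-specific rescaling, not the GSEA implementation, and `normalized_es` is a hypothetical name.

```python
import math
from statistics import mean


def normalized_es(observed_es, permuted_es):
    """Divide the observed ES by the mean magnitude of same-sign permuted ES.

    Returns NaN when no permuted score shares the sign of the observed score.
    """
    if observed_es >= 0:
        same_sign = [es for es in permuted_es if es >= 0]
    else:
        same_sign = [es for es in permuted_es if es < 0]
    if not same_sign:
        return math.nan
    scale = abs(mean(same_sign))
    return observed_es / scale if scale else math.nan
```

With more permutations, it becomes more likely that at least one permuted score has the same sign as the observed score.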
11. Why didn't my gene set display an enrichment plot even though it is in the top hits?#
Sometimes, when running GSEA with metrics other than signal-to-noise or t-test, some plots may fail to render, resulting in a Java error message being printed in red text in place of the plot on the results page. This error is typically caused by GSEA encountering a divide-by-zero in the internal ranking metric computation, or an overflow where the result of the calculation between phenotypes returns an "infinity". The presence of infinite values in the resulting internal ranked list can be confirmed by navigating to the GSEA results directory for the given run, opening the "edb" folder, and then inspecting the .rnk file in this directory for "Inf", "NaN" or similar strings. It might be possible to resolve these issues by adding a small pseudocount to the dataset. It should be noted that any such selected value is arbitrary and we do not advise the selection of any specific value; consult your local bioinformaticians for specific guidance. This error may also occur in GSEA Preranked mode, in which case these non-finite values would have been present in the supplied preranked list and should be resolved through the original quantification pipeline.
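The inspection of the ranked list can be automated with a short script. The sketch below assumes a two-column, tab-delimited ranked list (gene, then score) and reports every entry whose score is missing, unparseable, or not finite; `non_finite_entries` is a hypothetical helper.

```python
import math


def non_finite_entries(rnk_text):
    """Return (line_number, gene, raw_value) for scores that are not finite numbers."""
    bad = []
    for line_no, line in enumerate(rnk_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        gene, _tab, raw = line.partition("\t")
        raw = raw.strip()
        try:
            value = float(raw)
        except ValueError:
            bad.append((line_no, gene, raw))
            continue
        if not math.isfinite(value):
            bad.append((line_no, gene, raw))
    return bad
```

To check a file, read it with `open(path).read()` and pass the text to the function.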
1. What is the difference between GSEA, GSEA-P, and GSEA-R?#
GSEA refers to either the gene set enrichment analysis or the GSEA software. GSEA-P refers to the GSEA Java desktop software. GSEA-R refers to the R implementation of the software.
We strongly recommend using the Java desktop GSEA software for standard analysis of microarray data. The Java implementation of GSEA does not require any programming experience, includes many additional features not present in GSEA-R, and comes with a tutorial and extended documentation.
The R implementation of GSEA is closer to a working prototype than a finished software product. We make it available for users who want to tweak the GSEA algorithm rather than run routine GSEA analysis. We assume that such users not only have a very good command of R but are also familiar with the GSEA algorithm. Consistent with this view, the R implementation offers minimal features, leaving it up to the user to add them.
2. How do I increase the amount of memory available to GSEA?#
From the GSEA website, the GSEA desktop application can be launched with 1, 2, 4, or 8 GB of memory. Use the dropdown menu above the launch button to specify the amount. Note that 32-bit Java will not support greater than 2 GB.
When running GSEA from one of our downloadable bundles, you can increase the memory specification by editing the launcher file. This is gsea.bat on Windows, gsea.sh or gsea-hidpi.sh on Linux, or gsea.command on Mac. Modify the -Xmx setting as desired; for example, to use an 8 GB memory configuration this could be set to -Xmx8g.
The same is true of the gsea-cli.sh and gsea-cli.bat scripts for command-line usage. Note that it's also possible to edit the JNLP files in a similar way to create an even larger Java Web Start memory configuration.
For users of the Mac.app it is not easily possible to modify the memory configuration at this time. This is a known issue that we will aim to address in a future release. If a higher memory configuration is absolutely required, we suggest using the command-line bundle with your own (separate) Java 11 installation. The gsea.command script is still available for launching the GUI from this bundle.
3. How do I run GSEA from the command line?#
As of GSEA 4.0.0, there is a dedicated bundle available from our Downloads page for command-line usage; please refer to the included README file for an explanation of the command-line launcher script. Note that a separate Java 11 installation is required.
See the Command feature of the various GSEA Desktop GUI screens for example constructions of command-lines. This is by far the easiest way to understand the various flags and options.
The Linux bundle contains exactly the same launcher script and README along with an embedded OpenJDK 11 JVM, so there's no need for a separate download for users on that platform.
4. What version of Java do I need for the GSEA desktop software?#
As of GSEA 4.0.0, the various platform-specific bundles (for Windows, Mac, and Linux) include an embedded OpenJDK 11 JVM, meaning a separate Java installation is no longer required for most users.
Java 8 is required for use of the Java Web Start JNLP launchers (OpenJDK 8 is fine).
5. I can't launch GSEA from the website on Windows. This used to work; did something change?#
Yes. This is likely due to recent updates by Oracle, Microsoft, and browser vendors changing the version of Java on your computer. See Windows Launching Issues for some possible solutions.
6. GSEA gives me an error message that it is unable to access CHIP or GMT files on the Broad site. How can I fix this?#
Many IT organizations block the use of FTP on their networks, making it impossible to reach these files. We recommend downloading the files from our website (available either individually or as full ZIP bundles) and then bringing them into GSEA via the Load Data screen. Note that you will need to look for the local file options in the Run GSEA file choosers.
7. How do I add GSEA to my microarray analysis pipeline?#
If you are using GenePattern pipelines, GSEA is available as a GenePattern analysis module.
If you are implementing your own microarray analysis pipeline, GSEA can be run from the command line. Use full file specifications to ensure that you are reading data from and writing data to the desired locations. For more information, see the FAQ item above.
8. Do I have to be connected to the internet to run GSEA software?#
No. If you download the GSEA desktop application, or .jar file, you can use most functions in GSEA without being connected to the internet; for example, you can load files, run analyses, and review analysis results. However:
- The Chip platform(s) and Gene sets database parameters (on pages such as Run GSEA) display data files available from the Broad ftp site; these data files are not available when you are working offline. Be sure to download the chip files and gene set files that you need before disconnecting from the internet.
- The GSEA documentation and help files are on the GSEA web site; they are not available when you are working offline.
When working offline, clear the menu item Option>Connect over the internet. By default, this item is selected and the Chip platform(s) and Gene sets database parameters display data files available from the GSEA ftp site. Clearing the menu item disables this feature and avoids time-consuming attempts to connect to the internet.
9. How do I create the input files for GSEA in R?#
The GSEA R code uses the same gct, cls and gmt file formats for input. For more information, see Preparing Data Files in the GSEA User Guide.