GSEA User Guide

Introduction#

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes). The Gene Set Enrichment Analysis PNAS paper fully describes the algorithm. The GSEA software makes it easy to run the analysis and review the results, allowing you to focus on interpreting the analysis results.

The basic steps for running an analysis in GSEA are as follows:

  1. Prepare your data files. See Preparing Data Files for GSEA.
  2. Load your data files into GSEA. See Loading Data.
  3. Set the analysis parameters and run the analysis. See Running Analyses.
  4. View the analysis results. See Viewing Analysis Results.

Installing and Starting GSEA#

To install GSEA, download the appropriate GSEA package for your operating system from the Downloads page of the GSEA website.

Downloads Page

You can run GSEA in multiple ways:

This guide focuses on the GSEA desktop application and provides instructions for running GSEA from the command line. It does not provide information about R-GSEA or the GenePattern modules.

When you start the GSEA desktop application, the main window appears with the Home page displayed:

Desktop

The icons on the left provide quick access to the most common actions. Typically, each action you select opens a new page in the GSEA window. For example, selecting the Load Data icon opens the Load Data page.

The Processes pane in the bottom left corner of the GSEA window displays status information when you run an analysis.

GSEA user preferences are stored in the gsea_home directory (Help>Show GSEA home folder). GSEA analysis reports are stored in the GSEA output folder (Help>Show GSEA output folder). To change the location of the GSEA output folder and other preferences, use the Preferences Window.

Most functions of the GSEA desktop application work without an internet connection; for example, you can load files, run analyses, and review analysis results offline. However, you need to be connected to the internet to view the GSEA documentation (including online Help), access the GSEA website, or access the hosted files (MSigDB gene sets and array annotations) on the GSEA-MSigDB file servers.

Getting Help#

The GSEA website is your primary source of help for GSEA. It includes the following resources:

If you cannot find the answers to your questions in the manual or the FAQ, please contact us.


Preparing Data Files for GSEA#

When you use GSEA, you supply four data files: an expression dataset file, phenotype labels file, gene sets file, and chip annotations file. The following table lists each data file and its valid file formats. All files are tab-delimited ASCII text files; they can be created and edited using any text editor.

For descriptions and examples of each file format, see GSEA file formats. For more information about each data file, click the data file link in the following table:

| Data File | Content | Format | Source |
|---|---|---|---|
| Expression dataset | Contains features (genes or probes), samples, and an expression value for each feature in each sample. Expression data can come from any source (Affymetrix, Stanford cDNA, and so on). | res, gct, pcl, or txt | You create the file. |
| Phenotype labels | Contains phenotype labels and associates each sample with a phenotype. | cls | You create the file or have GSEA create it for you. |
| Gene sets | Contains one or more gene sets. For each gene set, gives the gene set name and list of features (genes or probes) in that gene set. | gmx, gmt, or grp | You use the files hosted on the GSEA-MSigDB file servers, export gene sets from the Molecular Signatures Database (MSigDB), or create your own gene sets file. |
| Chip annotations | Lists each identifier on a platform and its matching HGNC gene symbol. Optional for the gene set enrichment analysis. | chip | You use the files hosted on the GSEA-MSigDB file servers, download the files from the GSEA website, or create your own chip file. |

Consistent Feature Identifiers Across Data Files#

The expression dataset, gene sets, and chip annotation files all contain lists of features (genes or probes) to be analyzed. It is critical that you use the same feature (gene or probe) identifiers across all of the data files.

Typically, the feature identifiers in your expression dataset are the identifiers for the assay used to produce the data, either microarray probe identifiers or transcriptomic gene identifiers. For example, an expression dataset produced using the HG_U133A chip contains HG_U133A probe identifiers and an expression dataset produced using the HG_U95Av2 chip contains HG_U95Av2 probe identifiers. When using GSEA, it is critical that your expression dataset, gene sets, and chip annotation files all use compatible feature identifiers.

Typically, you use one of two approaches to ensure that you are using consistent feature identifiers across files: you collapse your probe sets into genes and use the HGNC gene symbols as your consistent feature identifiers or you use the identifiers from your expression dataset as your consistent feature identifiers.

One final note concerning consistent feature identifiers across files: within GSEA, HGNC gene symbols and probe identifiers are case sensitive; that is, the identifiers “TestGene1” and “TESTGENE1” are not the same.
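Before running an analysis, it can help to check how many gene set members actually appear in the expression dataset, and whether any near-misses differ only by letter case. The following Python sketch illustrates such a check; the function names and the sample identifiers are hypothetical and are not part of GSEA.

```python
def identifier_overlap(dataset_ids, gene_set):
    """Return gene set members found in the dataset using exact, case-sensitive matching."""
    known = set(dataset_ids)
    return [gene for gene in gene_set if gene in known]


def case_only_mismatches(dataset_ids, gene_set):
    """Return gene set members that would match a dataset identifier only if case were ignored."""
    exact = set(dataset_ids)
    folded = {identifier.upper() for identifier in dataset_ids}
    return [gene for gene in gene_set if gene not in exact and gene.upper() in folded]


# Hypothetical identifiers for illustration only.
dataset_ids = ["TestGene1", "TP53", "BRCA1"]
gene_set = ["TESTGENE1", "TP53", "MYC"]

print(identifier_overlap(dataset_ids, gene_set))    # ['TP53']
print(case_only_mismatches(dataset_ids, gene_set))  # ['TESTGENE1']
```

A non-empty second list suggests that identifiers should be normalized to a single case convention in every file before loading them.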

Expression Datasets#

An expression dataset file contains features (genes or probes), samples, and an expression value for each feature in each sample. It is a tab-delimited text file in gct, res, pcl, or txt format. For descriptions and examples of each file format, see GSEA file formats.

Because most gene expression data is already in tab-delimited text files, or in spreadsheet and database programs that allow you to export the data into tab-delimited text files, creating expression dataset files for GSEA is relatively easy:

  1. Start with a tab-delimited file that contains your gene expression data.
  2. Open the file in Excel or a text editor.
  3. Make the necessary format changes: compare your current file with the file format described in GSEA file formats; add header rows, remove extra columns, and make any other changes necessary to create a properly formatted file.
  4. Save the file as a tab-delimited text file with the appropriate file extension (gct, res, pcl, or txt). Note: GSEA expects a very specific formatting for .txt files. See the file formats page for details. Also, be aware that some editors on some platforms automatically attach the “txt” extension to other file types (e.g. “.gct.txt”), which may confuse GSEA during parsing. Make sure to remove the extra .txt extension from the name before using the file with GSEA.

Note: When you create an expression dataset file, the GSEA team recommends that the file name include the name of the chip used to produce the expression data; for example, all_aml_dataset_hgu95av2.gct.
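The formatting step can also be scripted instead of done by hand. The Python sketch below assembles gct text from in-memory values. The three header rows it writes (a `#1.2` version line, a line with the row and column counts, and a `NAME`/`Description` column header) reflect the gct layout as it is commonly documented; treat that layout as an assumption and confirm it against GSEA file formats. The function name `to_gct` and the sample data are hypothetical.

```python
def to_gct(sample_names, rows):
    """Build gct-style text.

    rows is a list of (feature_id, description, values) tuples, with one
    value per sample in the same order as sample_names.
    """
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description", *sample_names]))
    for feature_id, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{feature_id}: expected {len(sample_names)} values")
        lines.append("\t".join([feature_id, description, *map(str, values)]))
    return "\n".join(lines) + "\n"


# Hypothetical two-sample dataset.
text = to_gct(
    ["sample_1", "sample_2"],
    [("feature_a", "na", [1.5, 2.0]), ("feature_b", "na", [0.3, 0.9])],
)
print(text)
```

To follow the naming recommendation above, the returned text would be saved under a name such as `my_dataset_hgu95av2.gct`, with no trailing `.txt`.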

When creating expression dataset files, keep in mind the following:

Phenotype Labels#

A phenotype label file, also known as a class file or template file, defines phenotype labels and assigns those labels to the samples in your expression dataset. A phenotype label file is a tab-delimited text file in cls format. For descriptions and examples of the cls file format, see GSEA file formats.

About Phenotype Labels#

The GSEA algorithm works with both categorical labels and continuous labels:

Creating Phenotype Labels#

To create a phenotype labels file:

  1. Open Excel or a text editor.
  2. Create the phenotype label file using the cls file format (see GSEA file formats). Every label defined in the phenotype labels file must be assigned to at least one sample in the expression dataset. Every sample in the expression dataset must be assigned a label. The order of the labels in the cls file is important; see GSEA file formats for details.
  3. Save the file as a tab-delimited text file with the file extension cls.

Typically, you create a phenotype label file before running the gene set enrichment analysis; however, you can also have GSEA create phenotype label files for you when you run the analysis, as described in the next section.
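For categorical labels, a small script can generate the file from the list of per-sample labels. This Python sketch assumes the three-line categorical cls layout (a counts line ending in `1`, a `#` line naming the classes, then one label per sample); that layout is an assumption here and should be checked against GSEA file formats. The function name `to_cls` is hypothetical. Deriving the class names from the labels themselves guarantees that every defined label is assigned to at least one sample.

```python
def to_cls(sample_labels):
    """Build categorical cls-style text from one label per sample, in dataset column order."""
    if not sample_labels:
        raise ValueError("at least one sample label is required")
    for label in sample_labels:
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"label {label!r} must be non-empty and contain no whitespace")
    # Class names in order of first appearance, so the names line and the
    # labels line agree on ordering.
    class_names = list(dict.fromkeys(sample_labels))
    header = f"{len(sample_labels)} {len(class_names)} 1"
    return "\n".join([header, "# " + " ".join(class_names), " ".join(sample_labels)]) + "\n"


print(to_cls(["MUT", "MUT", "WT"]))
```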

Selecting Phenotype Labels to Analyze#

When you run the gene set enrichment analysis, you select a continuous phenotype label or a pair of categorical phenotype labels. When you run an analysis using the Run GSEA Page, you can select the phenotype labels in the following ways:

Gene Sets#

A gene sets file defines one or more gene sets. For each gene set, the file contains the gene set name and the list of genes in that gene set. A gene sets file is a tab-delimited text file in gmx or gmt format. For descriptions and examples of each file format, see GSEA file formats.

The Molecular Signatures Database (MSigDB) is a publicly accessible collection of curated gene sets that is maintained by the GSEA team. The team appreciates contributions to this shared resource and encourages users to submit their gene sets to genesets@broadinstitute.org.

Selecting Hosted MSigDB Gene Sets#

When you run an analysis, you can select from a menu of MSigDB gene set files hosted by the GSEA team. The file name indicates the content of the file. For example, the gene set file c2.all.v7.0.symbols.gmt contains all C2 gene sets from version 7.0 of the MSigDB, with the genes in the gene sets identified by HGNC gene symbol.

For a list of the gene set files on the website, click the Run GSEA icon to display the Run GSEA page and click the … button next to the Gene sets database parameter:

Gene Sets Database

Exporting Gene Sets from MSigDB#

You can use the MSigDB XML Browser application to explore the gene sets in the MSigDB and to export gene sets of interest to gene set files. To display the gene sets, click the Load database button:

Load Database

From this page, you can:

For more information, see the MSigDB XML Browser section of this Guide. Alternatively, you can use the online tools on the MSigDB website to export gene sets.

Creating Gene Sets#

To create your own gene sets file:

  1. Open Excel or a text editor.
  2. Enter the gene sets details using the gmx or gmt file format (see GSEA file formats). When listing the genes in each gene set, be sure to use the appropriate gene identifiers (HGNC gene symbols or probe identifiers), as described in Consistent Feature Identifiers Across Data Files.
  3. Save the file as a tab-delimited text file with the appropriate file extension (gmx or gmt).

Note: When you create a gene sets file, the GSEA team recommends that the file name include the gene identifier format you used to list the genes; for example, setname_hgu95av2.gmt. Also, be aware that some editors on some platforms automatically attach the “txt” extension to other file types (e.g. “.gmt.txt”), which may confuse GSEA during parsing. Make sure to remove the extra .txt extension from the name before using the file with GSEA.
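Gene sets held in a script can be written out in one pass. The Python sketch below assumes the gmt layout of one gene set per row: name, description, then the member genes, all tab-separated. Verify that layout against GSEA file formats. The function name `to_gmt` and the set names are hypothetical.

```python
def to_gmt(gene_sets):
    """Build gmt-style text from a dict of name -> (description, genes)."""
    lines = []
    for name, (description, genes) in gene_sets.items():
        unique_genes = list(dict.fromkeys(genes))  # drop duplicates, keep order
        if not unique_genes:
            raise ValueError(f"gene set {name!r} has no genes")
        lines.append("\t".join([name, description, *unique_genes]))
    return "\n".join(lines) + "\n"


# Hypothetical gene sets for illustration only.
text = to_gmt({
    "MY_SET_UP": ("na", ["TP53", "BRCA1", "TP53"]),
    "MY_SET_DN": ("na", ["MYC", "EGFR"]),
})
print(text)
```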

Gene Sets and GSEA#

When choosing gene sets for a gene set enrichment analysis, keep in mind the following:

Chip Annotations#

A chip annotations file lists each identifier used in a platform and its matching HGNC gene symbol. A chip annotations file is a tab-delimited text file in chip or csv format. For descriptions and examples of the chip annotations file formats, see GSEA file formats.

How GSEA Uses Chip Annotations#

When you run the gene set enrichment analysis (Run GSEA or GSEAPreranked):

Also, when you use Chip2Chip to translate a gene set from gene symbols to the identifiers of a chip platform, the selected chip annotation file is used to translate HGNC gene symbols in the gene sets to the matching probe identifiers for the target chip(s). This can be thought of as reading the chip file backwards.
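The idea of reading the chip file backwards can be pictured as inverting a probe-to-symbol table. The following Python sketch is only an illustration of that idea, not Chip2Chip's implementation; the function names and the probe identifiers are hypothetical.

```python
from collections import defaultdict


def invert_chip(probe_to_symbol):
    """Turn a probe -> symbol mapping into symbol -> list of probes."""
    symbol_to_probes = defaultdict(list)
    for probe, symbol in probe_to_symbol.items():
        if symbol:  # probes with a blank symbol map to nothing
            symbol_to_probes[symbol].append(probe)
    return dict(symbol_to_probes)


def translate_gene_set(symbols, symbol_to_probes):
    """Replace each gene symbol with its matching probe identifiers, skipping unmapped symbols."""
    probes = []
    for symbol in symbols:
        probes.extend(symbol_to_probes.get(symbol, []))
    return probes


# Hypothetical chip annotation content.
chip = {"probe_1": "TP53", "probe_2": "TP53", "probe_3": "BRCA1", "probe_4": ""}
lookup = invert_chip(chip)
print(translate_gene_set(["TP53", "MYC"], lookup))  # ['probe_1', 'probe_2']
```

Note that one symbol can expand to several probes, so a translated gene set may be larger than the original.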

When you use the MSigDB XML Browser to export gene sets from the MSigDB, you select a target chip for the gene sets. Chip2Chip and the selected chip annotation files are used to translate the HGNC gene symbols in the MSigDB gene sets to the matching probe identifiers for the target chip.

Selecting Hosted Chip Annotations#

When you run an analysis, you can select from a menu of chip annotation files hosted on the GSEA-MSigDB file servers. These files are created and maintained by the GSEA team for your convenience. They include chip annotation files for commonly used platforms present in the Ensembl Biomart (human, mouse, and rat), as well as three specially defined chip files:

These special chip files are designed to provide “roll-up” updates for gene symbols used in older versions of genome annotations for their respective species to the versions of HGNC gene symbols used in the specific version of MSigDB targeted by the file.

For a list of the hosted chip annotation files, click the Run GSEA icon to display the Run GSEA page and click the … button next to the Chip platform(s) parameter.

Chips

Creating Chip Annotations#

If you cannot find a chip annotations file for the chip that you are using, you can create one. Creating a chip annotations file is easy; however, mapping your probe identifiers to HGNC gene symbols may be difficult or impossible. To create a chip annotations file:

  1. Start with a tab-delimited file (or text file) that lists the probes on the chip. The chip manufacturer generally provides this file.
  2. Open the file in Excel or a text editor. If using Excel be sure to import the data as “text” formatted columns and not “general”.
  3. Make the necessary format changes: compare your current file with the chip file format described in GSEA file formats; add header rows, remove extra columns, and make any other changes necessary to create a properly formatted file.
  4. Using the information you have available (for example, ortholog data), determine the matching HGNC gene symbol for each probe and add it to the file. If you cannot determine the HGNC gene symbol it can be left out of the file entirely as it will not map to any gene set member in that case. Alternatively, if you wish to keep the probe as a placeholder or for informational purposes, you can add it to the file but leave the gene symbol blank so it won’t map to anything. As mentioned above, depending on the chip that you are using, it may be difficult or impossible to determine matching gene symbols.
  5. Save the file as a tab-delimited text file with the file extension .chip.

Note: (1) The file name must not include hyphens (-). (2) When you create a chip annotation file, the GSEA team recommends that the file name be the name of the DNA chip; for example, hgu95av2.chip. Also, be aware that some editors on some platforms automatically attach the “txt” extension to other file types (e.g. “.chip.txt”), which may confuse GSEA during parsing. Make sure to remove the extra .txt extension from the name before using the file with GSEA.
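The two naming pitfalls in the note can be caught with a quick check before loading a file. This Python sketch is a hypothetical helper, not a GSEA feature; the set of extensions it recognizes is taken from the formats named in this guide.

```python
from pathlib import Path

# Extensions mentioned in this guide; ".txt" is handled separately.
GSEA_EXTENSIONS = {".gct", ".res", ".pcl", ".cls", ".gmx", ".gmt", ".grp", ".chip", ".rnk"}


def filename_problems(filename):
    """Return a list of naming problems described in this guide for the given file name."""
    path = Path(filename)
    suffixes = [suffix.lower() for suffix in path.suffixes]
    problems = []
    # An editor may have appended ".txt" after the real extension.
    if len(suffixes) >= 2 and suffixes[-1] == ".txt" and suffixes[-2] in GSEA_EXTENSIONS:
        problems.append("extra .txt extension")
    # Chip annotation file names must not include hyphens.
    if ".chip" in suffixes and "-" in path.name:
        problems.append("hyphen in chip file name")
    return problems


print(filename_problems("hgu95av2.chip.txt"))  # ['extra .txt extension']
print(filename_problems("hg-u133a.chip"))      # ['hyphen in chip file name']
print(filename_problems("hgu95av2.chip"))      # []
```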

cDNA Microarray Data#

An expression dataset file for cDNA ratio data contains features (genes or probes), samples, and a computed ratio value for each feature in each sample. A phenotype label file for cDNA ratio data assigns distinct phenotype labels to the samples in the expression dataset.

Ratio values for cDNA data can be computed using a variety of methods. How the ratios are computed determines whether it is possible to create a phenotype label file for the cDNA ratio data. For example:

normal sample (Cy3) / treated sample (Cy5) = phenotype

When you run the gene set enrichment analysis from the Run GSEA Page, GSEA ranks the features in the expression dataset and then analyzes the ranked list of features. GSEA provides a number of metrics for ranking genes; however, all of the metrics require a phenotype label file. Alternatively, you can create a ranked list of the features in the expression dataset and then use the GSEAPreranked Page to analyze that ranked list.

If you can assign distinct phenotypes to the samples in the cDNA ratio data, analyze the data using the Run GSEA page:

  1. Create an expression dataset file for the cDNA ratio data. The file must be formatted as a pcl, res, gct, or txt file. For descriptions and examples of these file formats, see GSEA file formats. Note: If the raw expression data contains two separate values for each gene in each sample, use external software to calculate the two-color ratios before creating the expression dataset file.
  2. Create a phenotype label file that assigns a distinct phenotype label to each sample in the expression dataset file. The file must be formatted as a cls file. For a description of this file format, see GSEA file formats.
  3. Run the analysis using the Run GSEA Page.

If you cannot assign distinct phenotypes to the samples in the cDNA ratio data, analyze the data using the GSEAPreranked page:

  1. Rank the features in the expression dataset using tools external to GSEA.
  2. Create a ranked list file that contains the rank ordered list of features. The file must be formatted as an rnk file. For a description of this file format, see GSEA file formats.
  3. Run the analysis using the GSEAPreranked Page.
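As an illustration of the first two steps, the Python sketch below ranks features by their mean ratio value across samples and emits two tab-separated columns (feature, score). Both the choice of ranking score and the two-column rnk layout are assumptions made for this example; choose a score suited to your experiment and confirm the layout against GSEA file formats. The function name `to_rnk` is hypothetical.

```python
from statistics import mean


def to_rnk(ratios_by_feature):
    """Build rnk-style text: features sorted by decreasing mean ratio value."""
    scores = {feature: mean(values) for feature, values in ratios_by_feature.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(f"{feature}\t{score:.4f}" for feature, score in ranked) + "\n"


# Hypothetical log-ratio values, one list per feature.
text = to_rnk({
    "feature_a": [1.0, 2.0],
    "feature_b": [-1.0, -1.0],
    "feature_c": [3.0, 3.0],
})
print(text)
```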

When selecting the chip platform(s) to use for analyzing cDNA data, the GSEA team recommends selecting both the Stanford and seq_accession chip annotation files. This helps to ensure that any non-standard cDNA identifiers in your dataset are used in annotating the analysis reports and/or collapsing the dataset (see Consistent Feature Identifiers Across Data Files).


Loading Data#

Before you can run an analysis, you must load the expression dataset (res, gct, pcl, or txt), phenotype label (cls), and gene set files (gmx or gmt) to be analyzed. Loading the files stores the data in memory, where GSEA can work with them. You must load the data files into GSEA each time you open the application.

To load the data files, click the Load Data icon in the Navigator area of the GSEA window. GSEA displays the Load Data page:

From the Load Data page, you can load files in four ways:

GSEA loads the files and adds them to the Recently Used Files and Object Cache panes (if they are not yet listed):

The icon next to a file name identifies the type of data it contains.

Select a file in the Recently Used Files or Object Cache pane and right-click in that area to display a context menu of tools appropriate for the selected file(s):

| Tool | Description | Notes |
|---|---|---|
| Dataset Viewer | Opens a page in the GSEA window that displays the expression dataset. | Expression datasets only |
| Phenotype Viewer | Opens a page in the GSEA window that displays the phenotype labels and the number of samples associated with each. | Phenotype labels only |
| Report Viewer | Opens a page in the GSEA window that displays the analysis report, as it appears in the Analysis History page. | Reports only |
| Gene Matrix Viewer | Opens a page in the GSEA window that displays the gene set data. | Gene sets (database) only |
| Extract GeneSets from GeneMatrix | Creates a gene set group (grp) for each gene set in the gene set file. The gene sets are created in memory and deleted when you exit from GSEA. When you run an analysis and need to select gene sets for the analysis, you will see the new gene sets listed in the gene set selection window. | Gene sets (database) only |
| Convert the GeneMatrix into a Single GeneSet | Creates one gene set group (grp) that combines the genes in all of the gene sets into one large gene set. The gene set is created in memory and deleted when you exit from GSEA. When you run an analysis and need to select gene sets for the analysis, you will see the new gene set listed in the gene set selection window. | Gene sets (database) only |
| Gene Set Viewer | Opens a page in the GSEA window that displays the genes in the gene set. | Gene sets (group) only |
| Remove duplicates from the GeneSet | Removes duplicate genes from the gene set, overwriting the original gene set with the new gene set. | Gene sets (group) only |
| View Chip Annotation | Opens a page in the GSEA window that displays the probe to symbol mapping in the chip annotation file. (Typically, you do not load chip annotation files.) | Chip annotations only |
| Ranked List Viewer | Opens a page in the GSEA window that displays the ranked list of genes. For a ranked list file generated by GSEA, the display includes rank and rank metric scores. | Ranked lists only |
| Force Data Reload | Loads the selected file again, overwriting the previously loaded data. | All files |
| Copy Files | Copies the full path of the selected files to the clipboard. You can then paste that file name where needed. | All files, Recently Used Files only |
| Import Data | Loads the selected files. | All files, Recently Used Files only |

Running Analyses#

The primary analysis that you run in GSEA is the gene set enrichment analysis; however, GSEA also offers other tools, which are run as analyses. This section describes how to start and track analyses:

Running a Gene Set Enrichment Analysis#

As described in the Gene Set Enrichment Analysis PNAS paper, the gene set enrichment analysis is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

To start a gene set enrichment analysis:

  1. Select the Run GSEA icon on the GSEA main page. The GSEA page appears:

GSEA Page

  2. Enter values for the parameters listed under Required Fields. Optionally, click Show to display the parameters under Basic Fields and Advanced Fields and enter values for those parameters as well. For descriptions of the parameters, click Help, which displays the Run GSEA Page of this guide.
  3. Click Run to start the analysis.
  4. Track analysis progress, as described in Tracking Analysis Progress.
  5. View analysis results, as described in Viewing Analysis Results.
  6. Interpret analysis results, as described in Interpreting GSEA Results.

Running a Leading Edge Analysis#

As described in the Gene Set Enrichment Analysis PNAS paper, the leading-edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero. The leading-edge subset can be interpreted as the core that accounts for the gene set’s enrichment signal.

After running the gene set enrichment analysis, you use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets. A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.

To run the leading edge analysis:

  1. Select the Leading Edge Analysis icon on the GSEA main page. The Leading Edge Analysis page appears.
  2. Select a Gene Set Enrichment Report either from the application cache (analyses that you have run) or from the file system (any analysis stored in the file system).
  3. Click the Load GSEA Results button. GSEA updates the Leading Edge Analysis page to display the gene sets that were analyzed in the selected gene set enrichment report:

Leading Edge Analysis

By default, gene sets are ordered by normalized enrichment score (NES). Click a column heading to reorder the gene sets based on the values in that column. For descriptions of the columns, see Interpreting GSEA Results.

By default, all gene sets are displayed. Use the filter box to display a subset of gene sets. As you enter text in the field, GSEA updates the list of gene sets to show only those that match the entered text. To change the text search options, click the magnifying glass icon in the filter box.

  4. Select one or more gene sets for the leading edge analysis. To select multiple gene sets, use SHIFT-click or CTRL-click.
  5. Click Run leading edge analysis or Build HTML Report to start the analysis:

    • Run leading edge analysis displays four graphs that help you visualize the overlap between the selected leading edge subsets. (Does not generate an analysis results report.)
    • Build HTML Report creates an analysis results report that provides details on the leading edge subsets and the overlap between them. (Does not display the graphs.)

    If you are building an HTML report: track analysis progress, as described in Tracking Analysis Progress, and view analysis results, as described in Viewing Analysis Results.

  6. Interpret analysis results, as described in Interpreting Leading Edge Analysis Results.

Running Other GSEA Analyses#

The GSEA application also provides the following analyses:

To run one of these analyses:

  1. Select the analysis from the Tools pane in the main GSEA window. The page for the selected analysis appears.
  2. Enter values for the analysis parameters. For parameter descriptions, click Help, which displays the Chip2Chip, GSEAPreranked, or CollapseDataset page of this guide.
  3. Click Run to start the analysis.
  4. Track analysis progress, as described in Tracking Analysis Progress.
  5. View analysis results, as described in Viewing Analysis Results.

Tracking Analysis Progress#

When you start an analysis, the gene set enrichment analysis or any other analysis, the Processes area in the bottom left corner of the window shows the analysis as running (blue). When the analysis is finished, it shows the analysis has succeeded (green). If an error occurs, it shows an error message (red). When you exit from GSEA, the Processes area is cleared.

  1. Click Success to display analysis results in a web browser.
  2. Click Error to display the error report. If you need help resolving the error, send us a description of the problem and the text of this error report via our help forum at groups.google.com/group/gsea-help.
  3. Running – Displayed when an analysis is ongoing. Clicking the label in this state has no effect.
  4. Click the analysis name to display the parameters used for the analysis. GSEA displays a page similar to the one you used to initially run the analysis. From this page, you can re-run the same analysis or modify the parameters to run a different analysis.
  5. Click the status bar at the bottom of the window to display the execution log file.

Rerunning an Analysis#

To rerun an analysis:

  1. From the Analysis History page, select the analysis that you want to re-run.
  2. If you want to reload the data for this analysis, check that the Load Data box is selected.

    An analysis can only be run on data that you have loaded during the current session (see Loading Data). If you have not yet loaded the data from this analysis and this is the data that you want to analyze, reload it. If you have already loaded the data, or you want to rerun the analysis with other data that you have already loaded, you do not need to reload data.

  3. Click Show in ToolRunner.

    GSEA displays a page similar to the one you used to initially run the analysis. From this page, you can leave the parameters unchanged to re-run the same analysis against the same data, or you can modify the parameters.

Note: When you analyze multiple gene sets, it is important to correct for multiple hypothesis testing. GSEA does this using sample permutation. Because sample permutation relies on random numbers, rerunning an analysis with the same data files and parameters gives results that are similar but not identical. Likewise, the order of the phenotypes does not in itself affect analysis results; however, if you change the order of the phenotypes and rerun the analysis, your results will be similar but not identical, again because of the random numbers used for sample permutation.

Viewing Analysis Results#

When an analysis completes, GSEA updates the Processes area to show an analysis status of Success. To view the analysis results in a web browser, click the Success status.

Alternatively, you can use the Analysis History page to view the results of any previously run analysis:

Displaying the Analysis History Page#

To display the Analysis History page, click the Analysis history icon in the GSEA main window. The Analysis History page displays all analysis results. The left side of the page lists analyses that you have run, organized by date. Select an analysis to display its parameters and generated files.

To view analysis results, double-click the index.html file in the list of files produced by the analysis. GSEA displays the results file in a web browser. Alternatively, select any file produced as part of the analysis and then right-click in the area. From the menu that appears, select the tool that you want to use to open the file.

Analysis History

Setting the Default Output Folder#

The Analysis History page displays all analysis results that are in the default output folder (Help>Show GSEA output folder). Your analysis results are in this folder unless you have taken one of the following actions:

The default output folder contains a subfolder named with today’s date (mmmdd, e.g. jan03). When you run an analysis, by default, GSEA creates a report subfolder in today’s output folder and writes all analysis results to that report subfolder. The report subfolder contains:

    • the analysis report (index.html)
    • the files linked to that report
    • a subfolder named edb that contains a machine-readable version of the report
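When locating results from a script, the dated subfolder name can be derived from a date. This Python sketch follows the mmmdd pattern given above (e.g. jan03); the function name is hypothetical, and the month abbreviations are spelled out explicitly so that the result does not depend on the system locale.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def dated_folder_name(day):
    """Return the mmmdd folder name for a date, for example jan03."""
    return f"{MONTHS[day.month - 1]}{day.day:02d}"


print(dated_folder_name(date(2024, 1, 3)))  # jan03
```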

Sharing Analysis Results#

To share analysis results with a colleague:

  1. Select Help>Show GSEA output folder. GSEA displays the default reports output folder in a file browser.
  2. Locate the report subfolder for the analysis whose results you want to share. Analyses are stored in subfolders named by the date when the analysis was run. If you are looking for an analysis that you previously ran, first go up one level in the folder hierarchy and then down into the folder for the date of interest.
  3. Create a copy of that folder for your colleague. The analysis report (index.html) and the links to related files are preserved when you copy the folder.

Alternatively, when you run a gene set enrichment analysis, you can use the Make a zipped file with all reports parameter to create a zip file that contains the analysis results. If you choose to do so, you can share the analysis results by sending the zip file to your colleague. The zip file is saved in the report subfolder with all of the other analysis results.

Deleting Analysis Results#

When GSEA writes analysis results to an output folder, it also creates a matching .rpt file in the gsea_home/reports_cache folder (Help>Show GSEA home folder). The Analysis History page displays analysis results based on the .rpt files.

To delete analysis results:

  1. Locate the matching .rpt file in the gsea_home/reports_cache folder. This file lists the analysis parameters and the full path name of the report output folder.
  2. To delete the analysis results, delete the report output folder listed in the .rpt file.
  3. To remove the analysis results from the Analysis History page, delete the .rpt file from the gsea_home/reports_cache folder. To update the Analysis History page, restart GSEA.

Interpreting GSEA Results#

This section discusses the results of the gene set enrichment analysis:

GSEA Statistics#

GSEA computes four key statistics for the gene set enrichment analysis report:

Enrichment Score (ES)#

The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The ES is the maximum deviation from zero encountered in walking the list. A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list.
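The running-sum walk described above can be sketched in Python. This is a simplified illustration, not GSEA's implementation: it assumes hits are weighted by the absolute ranking score raised to a power `p` and normalized to sum to 1, and that each miss subtracts an equal share so the walk ends at zero. The function name, the parameter `p`, and the data are hypothetical.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Walk a ranked list of (gene, score) pairs and return (ES, running sums)."""
    members = set(gene_set)
    hit_weights = [abs(score) ** p if gene in members else 0.0 for gene, score in ranked]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for gene, _ in ranked if gene not in members)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and at least one miss")

    running, sums = 0.0, []
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in members:
            running += weight / total_hit   # step up on a gene set member
        else:
            running -= 1.0 / n_miss         # step down otherwise
        sums.append(running)

    es = max(sums, key=abs)  # maximum deviation from zero, keeping its sign
    return es, sums


# Hypothetical ranked list, sorted by decreasing ranking score.
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
es, sums = enrichment_score(ranked, {"A", "B", "E"})
print(round(es, 4))  # 0.7143
```

Here two of the three members sit at the top of the list, so the running sum peaks early and the ES is positive.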

In the analysis results, the enrichment plot provides a graphical view of the enrichment score for a gene set:

Enrichment Score

Note: By default, the ranking metric is the signal-to-noise ratio. To have GSEA rank the genes based on a different metric, use the Metric for ranking genes parameter of the Run GSEA Page. To have GSEA analyze a ranked list of genes that you have created, use the GSEAPreranked Page.

Normalized Enrichment Score (NES)#

The normalized enrichment score (NES) is the primary statistic for examining gene set enrichment results. By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset; therefore, the normalized enrichment scores (NES) can be used to compare analysis results across gene sets. GSEA determines NES as follows:

NES
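As a rough sketch of the normalization, assuming the form described in the PNAS paper (the actual ES divided by the mean of the permutation ES values that have the same sign), the calculation might look like the following. The function name and arguments are hypothetical and chosen only for this example.

```python
def normalized_enrichment_score(es, permuted_es):
    """Divide an observed ES by the mean magnitude of same-signed null ES values.

    es: observed enrichment score for one gene set.
    permuted_es: ES values for the same gene set from each permutation.
    """
    same_sign = [p for p in permuted_es if (p >= 0) == (es >= 0)]
    if not same_sign:
        raise ValueError("no permutation ES values share the sign of the ES")
    mean_magnitude = sum(abs(p) for p in same_sign) / len(same_sign)
    if mean_magnitude == 0:
        raise ValueError("mean of the permutation ES values is zero")
    return es / mean_magnitude
```

Because the denominator comes from the permutations, anything that changes the permuted scores (permutation type, number of permutations, random seed) changes the NES even when the ES is unchanged.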

NES is based on the gene set enrichment scores for all dataset permutations; therefore, changing the permutation method, the number of permutations, or the size of the expression dataset affects the NES. As an example, consider two analyses: (1) you analyze an expression dataset, GSEA generates a ranked list and analyzes that ranked list; (2) you use GSEAPreranked to analyze the ranked list generated by the first analysis. If you use the same parameter settings, your enrichment scores are identical; however, the normalized enrichment scores reflect the very different datasets (the expression dataset versus the ranked list of genes) used for the permutations:

                                   Expression Dataset        Ranked List
Gene Set Name                      ES           NES          ES           NES
BRENTANI_DNA_MET_AND_MOD           0.1233649    0.37071982   0.1233649    0.42405358
BRCA_BRCA1_NEG                     0.13040805   0.6497973    0.13040808   0.6497975
PENG_RAPAMYCIN_DOWN                0.14286387   0.84542555   0.14286387   0.76681024
BASSO_REGULATORY_HUBS_SET          0.14299561   0.6870111    0.14299563   0.69177157
VENTRICLES_UP                      0.14565612   0.7033464    0.14565612   0.6915998
ALANINE_AND_ASPARTATE_METABOLISM   0.14693332   0.422703     0.14693332   0.36949828
BRCA1_OVEREXP_DN                   0.15077576   0.7929205    0.15077576   0.68026066

Analysis parameters: P53_hgu95av2.gct, P53.cls#MUT_versus_WT, c2.may_2006.symbols.gmt, permutation type = gene_set, seed for permutation = 149, number of permutations = 10

False Discovery Rate (FDR)#

The false discovery rate (FDR) is the estimated probability that a gene set with a given NES represents a false positive finding. For example, an FDR of 25% indicates that the result is likely to be valid 3 out of 4 times. The GSEA analysis report highlights enriched gene sets with an FDR of less than 25% as those most likely to generate interesting hypotheses and drive further research, but provides analysis results for all analyzed gene sets. In general, given the lack of coherence in most expression datasets and the relatively small number of gene sets being analyzed, an FDR cutoff of 25% is appropriate. However, if you have a small number of samples and use gene_set permutation (rather than phenotype permutation) for your analysis, you are using a less stringent assessment of significance and would then want to use a more stringent FDR cutoff, such as 5%.

The FDR is a ratio of two distributions: (1) the actual enrichment score versus the enrichment scores for all gene sets against all permutations of the dataset and (2) the actual enrichment score versus the enrichment scores of all gene sets against the actual dataset. For example, if you analyze four gene sets and run 1000 permutations, the first distribution contains 4000 data points and the second contains 4. For an example of what the enrichment score for a permutation of the dataset might look like, consider the two enrichment plots shown below. The plot on the left shows actual enrichment results for the P53HYPOXIAPATHWAY gene set against the P53 dataset. The plot on the right shows enrichment results for that gene set against a phenotype permutation of the dataset (that is, when phenotype labels are randomly assigned to the samples).

FDR

Generally speaking, the larger the absolute NES the smaller the FDR; that is, as the absolute NES decreases the corresponding FDR increases. However, because the distribution curves tend to be “bumpy” at the tails, you may notice exceptions to this in your GSEA results. For similar reasons, although FDR is less conservative than FWER, you may notice instances in the GSEA results where the FWER is less than FDR.
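To make the ratio of the two distributions concrete, here is a minimal Python sketch. It assumes the definition given in the PNAS paper: for an observed NES, the fraction of same-signed permutation NES values that are at least as extreme, divided by the fraction of same-signed observed NES values that are at least as extreme. The function name, the argument names, and the cap at 1.0 are assumptions for this example; the GSEA implementation may differ in its details.

```python
def fdr_q_value(nes_star, observed_nes, permuted_nes):
    """Estimate an FDR for one observed NES (nes_star).

    observed_nes: NES of every gene set against the actual dataset.
    permuted_nes: NES of every gene set against every permutation.
    """
    sign = 1.0 if nes_star >= 0 else -1.0
    # Keep scores on the same side of zero as nes_star, as magnitudes.
    null = [sign * x for x in permuted_nes if sign * x >= 0]
    observed = [sign * x for x in observed_nes if sign * x >= 0]
    target = sign * nes_star

    null_tail = sum(1 for x in null if x >= target)
    observed_tail = sum(1 for x in observed if x >= target)
    if not null or observed_tail == 0:
        raise ValueError("not enough same-signed scores to form the ratio")

    ratio = (null_tail / len(null)) / (observed_tail / len(observed))
    return min(ratio, 1.0)  # capping at 1.0 is a choice made for this sketch
```

With only a handful of observed gene sets the denominator changes in large jumps, which is one way to see why the estimate can be "bumpy" at the tails.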

The Gene Set Enrichment Analysis PNAS paper describes the FDR statistic in the section titled Appendix: Mathematical Description of Methods. For a more detailed discussion of the FDR, including a comparison with the more conservative familywise-error rate (FWER) statistic, see Benjamini and Hochberg (1995).

Nominal P Value#

The nominal p value estimates the statistical significance of the enrichment score for a single gene set. However, when you are evaluating multiple gene sets, you must correct for gene set size and multiple hypothesis testing. Because the p value is not adjusted for either, it is of limited value when comparing gene sets. The Gene Set Enrichment Analysis PNAS paper describes the p value statistic in the section titled Appendix: Mathematical Description of Methods.

The FDR is adjusted for gene set size and multiple hypothesis testing while the p value is not. When a top gene set has a small nominal p value and a high FDR value, it generally indicates that it is not as significant when compared with other gene sets in the empirical null distribution. This could be because you do not have enough samples, the biological signal is subtle, or the gene sets do not represent the biology in question very well. On the other hand, the FDR is based on two distributions of all gene sets; if only one of many gene sets is enriched, that gene set is likely to have a high FDR. Finally, a top gene set with a high nominal p value and a low FDR value generally indicates a negative result: the gene set itself is not significant and other sets are weaker.

In the GSEA report, a p value of zero (0.0) indicates an actual p value of less than 1/number-of-permutations. For example, if the analysis performed 100 permutations, a reported p value of 0.0 indicates an actual p value of less than 0.01. For a more accurate p value, increase the number of permutations performed by the analysis. Typically, you will want to perform 1000 permutations (phenotype or gene_set). (If you attempt to perform significantly more than 1000 permutations, GSEA may run out of memory.)
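The relationship between the number of permutations and the smallest reportable p value can be illustrated with a short sketch. It assumes the nominal p value is the fraction of same-signed permutation ES values that are at least as extreme as the observed ES, as described in the PNAS paper; the function name is hypothetical.

```python
def nominal_p_value(es, permuted_es):
    """Fraction of same-signed permutation ES values at least as extreme as es."""
    same_sign = [p for p in permuted_es if (p >= 0) == (es >= 0)]
    if not same_sign:
        raise ValueError("no permutation ES values share the sign of the ES")
    extreme = sum(1 for p in same_sign if abs(p) >= abs(es))
    return extreme / len(same_sign)
```

When no permuted score reaches the observed ES, the function returns exactly 0.0; the only way to resolve a smaller nonzero value is to supply more permutations.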

GSEA Report#

This section discusses the content of the report generated by the gene set enrichment analysis:

Enrichment in Phenotype#

Enrichment in Phenotype

The analysis report contains two “Enrichment in Phenotype” sections. The first section shows results for gene sets that have a positive enrichment score (gene sets that show enrichment at the top of the ranked list) and the second section shows results for gene sets that have a negative enrichment score (gene sets that show enrichment at the bottom of the ranked list). For categorical phenotypes, a positive enrichment score indicates correlation with the first phenotype and a negative enrichment score indicates correlation with the second phenotype. For continuous phenotypes (time series or gene of interest), a positive value indicates correlation with the phenotype profile and a negative value indicates no correlation or inverse correlation with the phenotype profile.

For each phenotype, the report shows:

The number of enriched gene sets depends on the structure of the data and the problem space. In general, one would expect to see at least a few gene sets enriched for a typical morphological or tissue-specific phenotype. If no enriched gene sets or a very large number of enriched gene sets pass the FDR threshold, first check that your gene sets and expression dataset use the same array format (see Consistent Feature Identifiers Across Data Files) and that you have used the appropriate permutation type and number of permutations (see the Run GSEA Page). If you find no issues, consider the following:

Dataset Details#

Dataset Details

The Dataset Details section of the analysis report provides information about the expression dataset:

Dataset Details Part 2

Gene Set Details#

Gene Set Details

The Gene Set Details section of the analysis report provides information about the gene sets:

Note: If all gene sets are filtered out, the analysis fails. Typically, this occurs for one of the following reasons:

Gene Markers#

Gene Markers

The Gene Markers section of the analysis report provides information about the ranked list of genes used for the analysis:

The bottom portion of the enrichment plot shows the observed correlation between gene rank and the ranking metric score for all genes in the ranked list. The butterfly plot shows the observed correlation, as well as permuted (1%, 5%, 50%) positive and negative correlation, for the top genes. The butterfly plot offers one way to visualize the extent to which dataset permutations change the correlation between gene rank and the ranking metric score.

Diabetes Dataset

Global Statistics and Plots#

Global Statistics and Plots

The Global Statistics and Plots section provides additional information about the gene sets and enrichment results:

Other#

Other

The final section of the report, Other, lists the analysis parameters. Knowing the parameters is critical for reproducing analysis results.

Detailed Enrichment Results#

From the Enrichment in Phenotype section of the analysis report, you can click a link to display the detailed enrichment results report, which lists all gene sets enriched in this phenotype ordered by the normalized enrichment score (NES):

Gene Sets Enriched In Phenotype

GS Gene set name. Click the gene set name for a detailed description of the gene set. For MSigDB gene sets, the description is the gene set page on the GSEA website. For other gene sets, the description is provided by the author of the gene set.
GS DETAILS For the top 20 gene sets, click the Details link to display the Gene Set Details Report. To generate the Details link for a different number of gene sets, use the Plot graphs for the top sets of each phenotype parameter on the Run GSEA Page.
SIZE Number of genes in the gene set after filtering out those genes not in the expression dataset.
ES Enrichment score for the gene set; that is, the degree to which this gene set is overrepresented at the top or bottom of the ranked list of genes in the expression dataset.
NES Normalized enrichment score; that is, the enrichment score for the gene set after it has been normalized across analyzed gene sets.
NOM p-value Nominal p value; that is, the statistical significance of the enrichment score. The nominal p value is not adjusted for gene set size or multiple hypothesis testing; therefore, it is of limited use in comparing gene sets.
FDR q-value False discovery rate; that is, the estimated probability that the normalized enrichment score represents a false positive finding.
FWER p-value Familywise-error rate; that is, a more conservatively estimated probability that the normalized enrichment score represents a false positive finding. Because the goal of GSEA is to generate hypotheses, the GSEA team recommends focusing on the FDR statistic.
RANK AT MAX The position in the ranked list at which the maximum enrichment score occurred. The more interesting gene sets achieve the maximum enrichment score near the top or bottom of the ranked list; that is, the rank at max is either very small or very large.
LEADING EDGE

Displays the three statistics used to define the leading edge subset.

  • Tags. The percentage of gene hits before (for positive ES) or after (for negative ES) the peak in the running enrichment score. This gives an indication of the percentage of genes contributing to the enrichment score.
  • List. The percentage of genes in the ranked gene list before (for positive ES) or after (for negative ES) the peak in the running enrichment score. This gives an indication of where in the list the enrichment score is attained.
  • Signal. The enrichment signal strength that combines the two previous statistics:

    (Tag %) * (1 - Gene %) * (N / (N - Nh))

where Tag % and Gene % are the Tags and List statistics above, N is the number of genes in the list, and Nh is the number of genes in the gene set. If the gene set is entirely within the first Nh positions in the list, then the signal strength is maximal or 100%. If the gene set is spread throughout the list, then the signal strength decreases towards 0%.
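The three leading-edge statistics can be computed together, as in the following sketch. It assumes the signal strength is tags * (1 - list) * N / (N - Nh), a form that reproduces the limiting cases described above (100% when the gene set occupies the first Nh positions). The function name and the peak_index argument (the zero-based position of the peak in the running enrichment score) are hypothetical.

```python
def leading_edge_stats(ranked_genes, gene_set, peak_index, positive_es=True):
    """Return (tags, list_fraction, signal) as fractions between 0 and 1."""
    members = set(gene_set)
    n = len(ranked_genes)
    n_h = sum(1 for g in ranked_genes if g in members)
    if n_h == 0 or n_h == n:
        raise ValueError("gene set needs both hits and misses in the list")

    # Positive ES: genes up to and including the peak; negative ES: from the peak on.
    if positive_es:
        region = ranked_genes[:peak_index + 1]
    else:
        region = ranked_genes[peak_index:]

    tags = sum(1 for g in region if g in members) / n_h
    list_fraction = len(region) / n
    signal = tags * (1 - list_fraction) * (n / (n - n_h))
    return tags, list_fraction, signal
```

For a 10-gene list in which the gene set is exactly the top two genes and the peak falls on the second gene, the function returns tags of 1.0, a list fraction of 0.2, and a signal of 1.0.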

These statistics describe the leading-edge subset of a single gene set. Use the Leading Edge analysis to analyze the overlap between multiple leading-edge subsets.

Gene Set Details Report#

From the Detailed Enrichment Results table, click the Details link for a gene set to display a Gene Set Details report that contains the following:


Interpreting Leading Edge Analysis Results#

When you click Run leading edge analysis on the Leading Edge Analysis Page, GSEA displays four graphs that help you visualize the overlap between the selected leading edge subsets. When you click Build HTML Report, GSEA generates an analysis results report that provides details on the leading edge subsets and the overlap between them. This section describes each graph and then the report:

Heat Map#

The heat map shows the (clustered) genes in the leading edge subsets. In a heat map, expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest).

Heat Map

Set-to-Set#

The top right graph uses color intensity to show the overlap between subsets: the darker the color, the greater the overlap between the subsets. Specifically, the intensity of the cell for sets A and B corresponds to an X/Y ratio where X is the number of leading edge genes that sets A and B have in common and Y is the number of genes in the union of the leading edge genes in sets A and B. A dark green cell indicates that sets A and B have the same leading edge genes and a white cell indicates that sets A and B have no leading edge genes in common.

Set to Set

Gene in Subsets#

The bottom left graph shows each gene and the number of subsets in which it appears.

Gene Subsets
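The count shown in this graph is simple to reproduce outside GSEA. The sketch below assumes you have already collected the leading-edge genes of each gene set into a dictionary; the function name and the input layout are hypothetical.

```python
from collections import Counter


def genes_in_subsets(leading_edges):
    """Count, for each gene, how many leading-edge subsets contain it.

    leading_edges: dict mapping gene set name -> iterable of gene names.
    """
    counts = Counter()
    for genes in leading_edges.values():
        counts.update(set(genes))  # set() so a gene counts once per subset
    return counts
```

Calling counts.most_common() on the result lists the genes shared by the most subsets first.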

Histogram#

The last plot is a histogram, where the Jacquard (that is, the Jaccard index) is the size of the intersection divided by the size of the union for a pair of leading edge subsets. Number of Occurrences is the number of leading edge subset pairs in a particular bin. In this example, most subset pairs have no overlap (Jacquard = 0).

Number of Occurrences
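As an illustration of the intersection-over-union measure behind the histogram, the following sketch computes it for every pair of leading-edge subsets. The function names and the dictionary input are assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(leading_edges):
    """Return {(name1, name2): overlap} for every pair of subsets."""
    return {(n1, n2): jaccard(leading_edges[n1], leading_edges[n2])
            for n1, n2 in combinations(sorted(leading_edges), 2)}
```

Binning the returned values (for example, into tenths) and counting the pairs in each bin gives the Number of Occurrences axis.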

HTML Report#

The HTML Report for the leading edge analysis contains the following sections:


Running GSEA from the Command Line#

GSEA is most commonly used by running the GSEA desktop application that provides the user interface for controlling the analyses. However, you can also run GSEA from the command line. This can be useful, for example, when you want to analyze several datasets at once or analyze a large dataset, or a large number of gene sets, on a server or compute cluster.

Syntax#

To run GSEA from the command line, download the “GSEA for the command line (all platforms)” zip bundle from the website and use either the gsea-cli.sh (Mac and Linux) or gsea-cli.bat (Windows) script. Note that this bundle requires the user to have the correct version of Java installed and available. The required Java version is noted on the Downloads page of the website. The “GSEA for Linux” zip bundle also contains the gsea-cli script along with an embedded platform-specific copy of Java.

Use a command of the form: gsea-cli.sh operation-name parameters

operation-name Specifies the analysis to use. One of: GSEA, GSEAPreranked, CollapseDataset, Chip2Chip, or LeadingEdgeTool
parameters

Specifies the analysis parameters. To find the parameters for an analysis, open the GSEA desktop application, display the page that runs the analysis, enter the parameters that you want to use, and click the Command button at the bottom of the page. GSEA displays the command line used to run the analysis. If you omit a parameter, GSEA uses the default value as displayed in the GSEA application.

  • Paths to file names must be fully specified or relative to the execution directory. When creating batch files, you generally want to use full path names for all files.
  • File names are platform-specific and may require editing. For example, on Windows, a file name that contains spaces must be enclosed in quotation marks.
  • Files cannot be directly referenced from the GSEA-MSigDB file servers. Download the desired gene set or array annotations files from the GSEA website downloads page and reference the downloaded files in the command line.
  • Parameter values cannot include hyphens (-); therefore, file names cannot include hyphens. If necessary, change hyphens to underscores. For example, you cannot use -res my-dataset.gct, but can use -res my_dataset.gct instead. GSEA also has issues with parameter values containing spaces or other special characters.
  • Optionally, use the -param_file parameter to specify a parameter file, which can contain any parameter except -param_file. If you specify the same parameter on the command line and in the parameter file, the value on the command line takes precedence. A parameter file is a text file that defines one parameter per line. Each line contains a parameter name (without the initial hyphen), a tab (not spaces), and the parameter value.
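The parameter file rules above (one parameter per line, name without the initial hyphen, a tab, then the value) lend themselves to a small helper. The following Python sketch is one way to generate such a file; the function name is hypothetical, and rejecting values that contain hyphens or whitespace is a conservative choice based on the restrictions listed above, not a documented GSEA check.

```python
def write_param_file(path, params):
    """Write a tab-separated parameter file, one parameter per line.

    params: dict mapping parameter name (with or without a leading hyphen)
            to its value.
    """
    lines = []
    for name, value in params.items():
        name = name.lstrip("-")
        text = str(value)
        if name == "param_file":
            raise ValueError("a parameter file cannot contain param_file")
        if "-" in text or any(ch.isspace() for ch in text):
            raise ValueError(f"value for {name} contains a hyphen or space: {text!r}")
        lines.append(f"{name}\t{text}\n")

    # Validate everything first so a bad value never leaves a partial file.
    with open(path, "w", newline="\n") as handle:
        handle.writelines(lines)
```

For example, write_param_file("params.txt", {"res": "my_dataset.gct", "nperm": 1000}) produces a two-line file that could then be referenced with -param_file.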

    Note: The Leading Edge Analysis Page does not include a Command button; therefore, the command line syntax for building a leading edge HTML report is provided here:

    gsea-cli.sh LeadingEdgeTool -dir path_to_gsea_report_dir -gsets set_names_comma_delimited

    For very large analyses, the GSEA default memory specification might not be sufficient. You can increase the memory available to GSEA by editing the gsea-cli script and changing the -Xmx4g parameter to a larger value (alternatively, on memory-constrained machines it might be necessary to reduce this value). The default specifies 4 GB of memory, so change this to -Xmx8g for 8 GB or -Xmx2g for 2 GB, for example.

    Only edit the script with a plain-text editor (such as TextEdit.app on Mac, emacs or vi on Linux, or Notepad on Windows). Be sure to save the file as plain text only, no matter which editor you use.

    Output#

    By default, the GSEA command line writes analysis reports to a dated subfolder, mmmdd, in the current working directory. To write analysis reports to a different location, use the -out parameter. (GSEA will still create the mmmdd subfolder in the current directory, but writes the reports to the specified location.) To specify a report name, rather than using the default name of my_analysis, use the -rpt_label parameter.

    Note: The GSEA application uses a different graphical imaging package than the GSEA command line; therefore, heat maps generated from the GSEA command line look different from heat maps generated by the GSEA application.

    Examples#

    1. Following is a command line that might appear when you click the Command button in GSEA. In this example, the -gmx and -chip parameters reference files hosted on the GSEA-MSigDB file servers. You must download these files from the GSEA website and update the command line to reference the downloaded files. If necessary, quote file names that include spaces and/or remove hyphens from the file names.

      gsea-cli.sh GSEA \
        -res P53_hgu95av2.gct -cls P53.cls#MUT_versus_WT -gmx c1.v2.symbols.gmt -chip HG_U95Av2.chip \
        -collapse true -mode Max_probe -norm meandiv -nperm 1000 -permute phenotype \
        -rnd_type no_balance -scoring_scheme weighted -rpt_label my_analysis \
        -metric Signal2Noise -sort real -order descending -include_only_symbols true \
        -make_sets true -median false -num 100 -plot_top_x 20 -rnd_seed timestamp \
        -save_rnd_lists false -set_max 500 -set_min 15 -zip_report false \
        -out dec18 -gui false

    2. Following is a command line that assumes that the identifiers in your dataset match those in your gene sets:

      gsea-cli.sh GSEA -res test.gct -cls test.cls -gmx test.gmx -collapse false

    3. Following is a command line that assumes that your dataset uses HG_U133A probe identifiers and your gene sets use gene symbols, so you want to collapse your dataset:

      gsea-cli.sh GSEA -res foo.gct -cls foo.cls -gmx foo.gmx -chip HG_U133A.chip
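When you want to analyze several datasets at once, it can help to generate the command lines with a script. The sketch below only builds and prints the commands; it uses the -res, -cls, -gmx, -collapse, -out, and -rpt_label parameters shown in this section, while the function name and all file paths are hypothetical placeholders.

```python
import shlex


def build_gsea_command(res, cls, gmx, out_dir, label, script="gsea-cli.sh"):
    """Return the argument list for one GSEA run (identifiers already match)."""
    return [script, "GSEA",
            "-res", res, "-cls", cls, "-gmx", gmx,
            "-collapse", "false",
            "-out", out_dir, "-rpt_label", label]


# Hypothetical datasets; full path names are used, as recommended for batch files.
datasets = [
    ("/data/study_a.gct", "/data/study_a.cls"),
    ("/data/study_b.gct", "/data/study_b.cls"),
]
commands = [
    build_gsea_command(res, cls, "/data/sets.gmx", "/data/gsea_out", f"run_{i}")
    for i, (res, cls) in enumerate(datasets, start=1)
]
for command in commands:
    print(shlex.join(command))  # quoted form, ready to paste into a shell script
```

Keeping each command as a list of arguments also makes it straightforward to hand the same list to a job scheduler or to subprocess.run later.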


    Quick Reference#

    This section provides descriptions of the GSEA menu bar and windows:

    File#

    Downloads#

    Help#

    GSEA Main Window#

    The GSEA main window appears when you start the GSEA desktop application. The one page open in the window is the Startup page. As you open new pages, tabs appear next to the Startup tab. To close a page, click the close (X) icon on the tab.

    GSEA Main Window

    Load Data Page#

    Use the Load Data page to load data files into GSEA. You must load data files before you can analyze them. To display the Load Data page, select the Load Data icon in the GSEA main window. For more information, see Loading Data.

    Load Data Page

    Run GSEA Page#

    Use the Run GSEA page to run the gene set enrichment analysis. To display this page, click the Run GSEA icon in the GSEA main window.

    Run GSEA Page

    Place your cursor on a parameter name to see a brief description of the parameter.

    Required Fields lists parameters that are essential for the analysis. Enter values for these parameters before starting the analysis.

    Note: In previous versions of GSEA, gene_set permutation was referred to as tag permutation.

    Basic Fields lists additional parameters with standard defaults. Typically, you use the default values for these parameters. Click Show/Hide to display and hide these parameters.

    Note: The default metric for ranking genes is the signal-to-noise ratio. To use this metric, your phenotype file must define at least two categorical phenotypes and your expression dataset must contain at least three (3) samples for each phenotype. If you are using a continuous phenotype or your expression dataset contains fewer than three samples per phenotype, you must choose a different ranking metric. If your expression dataset contains only one sample, you must rank the genes and use the GSEAPreranked Page to analyze the ranked list; none of the GSEA metrics for ranking genes can be used to rank genes based on a single sample.

    Advanced Fields lists parameters that control details of the GSEA algorithm and its Java implementation. Do not change the default values of these parameters unless you are conversant with the algorithm and its Java implementation. Click Show/Hide to display and hide these parameters.

    Buttons at the bottom of the page:

    Metrics for Ranking Genes#

    When you run the gene set enrichment analysis from the Run GSEA Page, GSEA ranks the genes in the expression dataset and then analyzes that ranked list of genes. You use the Metric for ranking genes parameter to select the metric used to score and rank the genes; the Gene list sorting mode parameter to determine whether to sort the genes using the real (default) or absolute value of the metric score; and the Gene list ordering mode parameter to determine whether to sort the genes in descending (default) or ascending order.

    This section describes each of the ranking metrics in the drop-down list of the Metric for ranking genes parameter. If your favorite metric is not listed here, you can rank the genes in your dataset using that metric and then use the GSEAPreranked Page to analyze your ranked list of genes. If your dataset contains only one sample, GSEA cannot rank the genes; however, you can rank the genes and then use the GSEAPreranked Page to analyze your ranked list of genes.

    Three settings in the Algorithms tab of the Preferences window (selected from the File menu in the application menu bar) affect the calculations shown here:

    For categorical phenotypes, GSEA determines a gene’s mean expression value for each phenotype and then uses one of the following metrics to calculate the gene’s differential expression with respect to the two phenotypes. To use median rather than mean expression values, set the Median for class metrics parameter to True, as described above.

    Signal2Noise

    where μ is the mean and σ is the standard deviation; σ has a minimum value of .2 * absolute(μ), where μ=0 is adjusted to μ=1. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    tTest

    where μ is the mean, n is the number of samples, and σ is the standard deviation; σ has a minimum value of .2 * absolute(μ), where μ=0 is adjusted to μ=1. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    Ratio_of_Classes

    where μ is the mean. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    log2_Ratio_of_Classes

    where μ is the mean. This is the recommended statistic for calculating fold change for natural scale data.
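As a rough illustration of the four categorical metrics, the sketch below uses their conventional definitions: difference of means over the sum of standard deviations for Signal2Noise, difference of means over the pooled standard error for tTest, and the (log2) ratio of means for the two ratio metrics. These formulas, the use of the sample standard deviation, and all function names are assumptions for this example; the preference settings mentioned above can change what GSEA actually computes.

```python
import math
from statistics import mean, stdev


def sigma_with_floor(values):
    """Standard deviation with the 0.2 * |mean| minimum (mean 0 treated as 1)."""
    mu = mean(values)
    floor = 0.2 * abs(mu if mu != 0 else 1.0)
    return max(stdev(values), floor)


def signal_to_noise(a, b):
    return (mean(a) - mean(b)) / (sigma_with_floor(a) + sigma_with_floor(b))


def t_test(a, b):
    sa, sb = sigma_with_floor(a), sigma_with_floor(b)
    return (mean(a) - mean(b)) / math.sqrt(sa ** 2 / len(a) + sb ** 2 / len(b))


def ratio_of_classes(a, b):
    return mean(a) / mean(b)


def log2_ratio_of_classes(a, b):
    return math.log2(mean(a) / mean(b))
```

Here a and b are the expression values of one gene in the samples of the first and second phenotype; each needs at least two values for a standard deviation to be defined.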

    For continuous phenotypes, GSEA determines an ideal expression profile based on the phenotype (.cls) file, determines a gene’s expression profile based on the expression dataset (.gct) file, and then uses one of the following metrics to calculate the correlation between the two expression profiles. Note: You can also use these metrics to analyze categorical phenotypes: in your phenotype labels file, specify the categorical phenotype labels as numbers.

    Pearson is the only metric that does not require the two profiles to use the same unit of measure; therefore, Pearson is the only metric that can be used with a time series phenotype. For the same reason, of the continuous phenotype metrics, Pearson is the most useful for analyzing categorical phenotypes.
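The unit independence of the Pearson correlation is easy to check numerically. The following sketch implements the standard Pearson coefficient; the function name and the sample values are made up for this illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (spread_x * spread_y)


# A time series phenotype (hours) against one gene's expression values.
hours = [0, 1, 2, 3]
expression = [1.0, 2.0, 2.5, 4.0]
same_in_minutes = pearson([h * 60 for h in hours], expression)
```

Rescaling either profile (hours to minutes in this example) leaves the coefficient unchanged, which is why the two profiles do not need to share a unit of measure.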

    Statistics reference: Statistics for Microarrays, Wit, E. and McClure J., John Wiley & Sons Ltd., 2004.

    Select a Phenotype Window#

    On the Run GSEA page, next to the Phenotype labels parameter, click the ellipsis (…) button to display the following window, which allows you to select a phenotype to analyze.

    Select a Phenotype

    1. Enter the names of one or more samples in the box on the left and enter a name for that phenotype class (by default, ClassA). For easier entering of sample names, cut and paste from your dataset file.
    2. Enter the names of one or more samples in the box on the right and enter a name for that phenotype class (by default, ClassB). If your dataset contains samples not included in the two phenotypes, GSEA automatically excludes them from the gene set enrichment analysis of these phenotypes.
    3. Select your dataset and click Apply to dataset. GSEA confirms that all of the samples that you specified are in the selected dataset and creates a phenotype labels file (ClassAvsClassB.cls) in the default output folder. When you close this window, the new phenotype labels file appears in the Select source file drop-down list of the Select a Phenotype window.

    On-the-fly Phenotype

    Gene as Phenotype

    Leading Edge Analysis Page#

    To display the Leading Edge Analysis page, select the Leading Edge Analysis icon in the GSEA main window. For more information, see Running a Leading Edge Analysis and Interpreting Leading Edge Analysis Results.

    Leading Edge Analysis

    Chip2Chip Page#

    The Chip2Chip analysis translates the gene identifiers in a gene set from HGNC gene symbols to the probe identifiers for a selected DNA chip. If you prefer to analyze your dataset without collapsing the probe sets to gene symbols, you can use Chip2Chip to translate MSigDB gene sets to the required chip platform format (see Consistent Feature Identifiers Across Data Files).

    To display the Chip2Chip page, select the Chip2Chip icon in the GSEA main window.

    Chip2Chip

    Place your cursor on a parameter name to see a brief description of the parameter.

    Required Fields lists parameters that are essential for the analysis. Enter values for these parameters before starting the analysis.

    Basic Fields lists additional parameters with standard defaults. Typically, you use the default values for these parameters. Click Show/Hide to display and hide these parameters.

    Advanced Fields lists parameters that control details of the GSEA algorithm and its Java implementation. Do not change the default values of these parameters unless you are conversant with the algorithm and its Java implementation. Click Show/Hide to display and hide these parameters.

    Buttons at the bottom of the page:

    Analysis History Page#

    To display the Analysis History page, select the Analysis History icon in the GSEA main window. The tree on the left lists all analyses in the GSEA output folder: those from the current session and those from previous sessions. When you select an analysis from the Analysis History tree, the analysis parameters and a list of files generated by the analysis appear on the right. For more information about using the Analysis History page, see Viewing Analysis Results.

    Analysis History

    GSEAPreranked Page#

    The GSEAPreranked page runs the gene set enrichment analysis against a ranked list of genes, which you supply.

    Best Practices for Creating and Running Your Ranked List#

    The GSEAPreranked tool can be very helpful for performing gene set enrichment analysis on data that do not conform to the typical GSEA scenario. For example, it can be used when the ranking metric choices provided by GSEA are not appropriate for the data, or when a ranked list of genomic features deviates from traditional gene expression data (e.g., GWAS results or ChIP-seq). However, there are several important points that you should keep in mind when creating your input ranked list and running the GSEAPreranked tool.

    Understand and keep in mind the sorting of your ranked list.#

    GSEAPreranked always sorts your data, without consideration of the data type. The numbers are treated the same whether they represent ranking metrics, significance p values, or something else. The list is sorted in descending numerical order, and there is no option to change this in the GSEAPreranked tool (unlike standard GSEA).

    Avoid using GSEA to collapse your ranked list to gene symbols.#

    In order to calculate enrichment scores, GSEA needs to match genes from gene sets to those in your input ranked list. Typically, GSEA is run using gene sets from MSigDB, which consist of human gene symbols. If the input data contain other types of identifiers, such as Affymetrix probe set identifiers, they need to be converted to gene symbols to match the identifiers in MSigDB sets. GSEA provides the ‘Collapse/Remap to gene symbols’ option to perform this conversion, which includes handling the case of several feature identifiers mapping to the same gene identifier. However, this option was developed and tuned with gene expression data in mind, whereas the numbers in a user-defined ranked list represent a metric that was computed by an unspecified ranking procedure outside of GSEA. Therefore, when using the GSEAPreranked tool, we recommend you provide a ranked list that already has unique human gene symbols and select ‘No_Collapse’ for the parameter Collapse/Remap to gene symbols. Alternatively, you can use GSEA’s collapse/remap method to convert your features to human gene symbols as long as there are no duplicate features in the list and they have a one-to-one correspondence to human gene symbols. For this, use the ‘Remap_Only’ option which will display an error message if multiple mappings to the same gene are detected.

    Choose the right ranking metric.#

    We strongly recommend making sure that the data do not include duplicate ranking values, because GSEA does not resolve ties. In the case of a tie, the order of the tied genes is arbitrary, which can produce erroneous results.
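A quick check for duplicate ranking values before running the analysis can be written as follows. This is a minimal sketch; the function name and the (gene, score) input layout are assumptions for the example.

```python
from collections import defaultdict


def find_tied_scores(ranked):
    """Return {score: [genes]} for every ranking value shared by 2+ genes.

    ranked: iterable of (gene, score) pairs.
    """
    by_score = defaultdict(list)
    for gene, score in ranked:
        by_score[score].append(gene)
    return {score: genes for score, genes in by_score.items() if len(genes) > 1}
```

An empty result means every gene has a unique ranking value; otherwise the returned groups show which genes would be ordered arbitrarily.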

    Understand and keep in mind the permutation test type.#

    In GSEAPreranked, permutations are always done by gene set. In standard GSEA you can choose to set the parameter Permutation type to ‘phenotype’ (the default) or ‘gene set’, but this option is not available in GSEAPreranked.
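    The idea behind gene set permutation can be illustrated with a simplified sketch: the observed score of a gene set is compared with the scores of random gene sets of the same size drawn from the ranked list. The function names, the unweighted score, the number of permutations, and the one-sided p value below are assumptions made for illustration; this is not the GSEA implementation.

```python
import random


def classic_es(ranked_genes, gene_set):
    """Unweighted running-sum score over an already ranked gene list.

    Assumes the gene set contains some, but not all, of the ranked genes.
    """
    n_hit = sum(gene in gene_set for gene in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    running_sum, best = 0.0, 0.0
    for gene in ranked_genes:
        running_sum += 1.0 / n_hit if gene in gene_set else -1.0 / n_miss
        if abs(running_sum) > abs(best):
            best = running_sum  # keep the largest deviation from zero
    return best


def gene_set_permutation_null(ranked_genes, set_size, n_perm=1000, seed=0):
    """Scores of random gene sets of the same size drawn from the ranked list."""
    rng = random.Random(seed)
    return [classic_es(ranked_genes, set(rng.sample(ranked_genes, set_size)))
            for _ in range(n_perm)]


# Hypothetical, already ranked list of 100 genes.
ranked = [f"G{i}" for i in range(100)]
# A gene set concentrated at the top of the list.
observed = classic_es(ranked, set(ranked[:10]))
null = gene_set_permutation_null(ranked, set_size=10)
# Simplified one-sided p value: share of random sets scoring at least as high.
p_value = sum(score >= observed for score in null) / len(null)
print(observed, p_value)
```

    Because only gene membership is shuffled, this kind of null distribution does not depend on sample labels, which is why it remains available when the input is a ranked list rather than an expression dataset.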

    Understand and keep in mind how GSEA computes enrichment scores.#

    The GSEA PNAS 2005 paper introduced a method where a running sum statistic is incremented by the absolute value of the ranking metric when a gene belongs to the set. This method has proven to be efficient and facilitates intuitive interpretation of ranking metrics that reflect correlation of gene expression with phenotype. In the case of GSEAPreranked, you should make sure that this weighted scoring scheme applies to your choice of ranking statistic — i.e. the magnitude of the ranking metric is biologically meaningful. When in doubt, we recommend using a more conservative scoring approach by setting Enrichment statistic = ‘classic’. Please refer to the GSEA PNAS 2005 paper for further details.
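    The difference between the weighted and the 'classic' scoring schemes can be seen in a minimal sketch of the running-sum statistic. The function name enrichment_score, its weight argument, and the sample data are hypothetical; the sketch follows the description above (hits incremented in proportion to the absolute ranking value raised to a weight, misses decremented uniformly) and omits normalization and significance testing.

```python
def enrichment_score(ranked_genes, ranking_values, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes and ranking_values must already be sorted in descending
    order of ranking value. weight=0 gives the unweighted ('classic')
    statistic; weight=1 increments hits by the absolute ranking value.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    hit_weights = [abs(value) ** weight if hit else 0.0
                   for value, hit in zip(ranking_values, in_set)]
    total_hit_weight = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running_sum = 0.0
    best = 0.0  # running-sum value with the largest deviation from zero
    for hit, hit_weight in zip(in_set, hit_weights):
        running_sum += hit_weight / total_hit_weight if hit else -1.0 / n_miss
        if abs(running_sum) > abs(best):
            best = running_sum
    return best


# Hypothetical ranked list; the gene set contains one gene with a large
# ranking value (A) and one with a small ranking value (C).
genes = ["A", "B", "C", "D", "E", "F"]
values = [5.0, 4.0, 1.0, -1.0, -2.0, -3.0]
gene_set = {"A", "C"}
weighted = enrichment_score(genes, values, gene_set, weight=1.0)
classic = enrichment_score(genes, values, gene_set, weight=0.0)
print(f"weighted={weighted:.3f} classic={classic:.3f}")
```

    In this example the weighted score exceeds the classic score because gene A carries most of the gene set's total weight. If the magnitude of your ranking statistic has no such meaning, the unweighted form avoids giving it that influence.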

    Using GSEAPreranked#

    To display this page, select Tools>GseaPreranked.

    GSEA Preranked

    Place your cursor on a parameter name to see a brief description of the parameter.

    Required Fields lists parameters that are essential for the analysis. Enter values for these parameters before starting the analysis.

    If necessary, create a ranked gene list file (rnk) that defines the list of ranked genes. For a description of this file format, see GSEA file formats. You can create and edit the file using any text editor. If you use Excel, be sure to save the file as a tab-delimited text file. Load the file into GSEA, as described in Loading Data.
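    If your ranking values are computed in a script, the file can also be written programmatically. The sketch below assumes that the rnk format is a two-column, tab-delimited text file (gene identifier, then ranking value), as described in GSEA file formats; the function name write_rnk, the file name, and the gene names and values are hypothetical.

```python
import csv
import os
import tempfile


def write_rnk(path, scores):
    """Write {gene symbol: ranking value} as a two-column, tab-delimited file."""
    # Sorting is for readability only; the order in the file is not relied on.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(ordered)
    return ordered


# Hypothetical ranking values keyed by unique gene symbol.
scores = {"GENE_A": 2.5, "GENE_B": -1.2, "GENE_C": 0.4}
rnk_path = os.path.join(tempfile.mkdtemp(), "example.rnk")
write_rnk(rnk_path, scores)
with open(rnk_path) as handle:
    rnk_contents = handle.read()
print(rnk_contents)
```

    Using a dictionary keyed by gene symbol guarantees that each symbol appears only once in the file.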

    Basic Fields lists additional parameters with standard defaults. Typically, you use the default values for these parameters. Click Show/Hide to display and hide these parameters.

    Advanced Fields lists parameters that control details of the GSEA algorithm and its Java implementation. Do not change the default values of these parameters unless you are conversant with the algorithm and its Java implementation. Click Show/Hide to display and hide these parameters.

    Buttons at the bottom of the page:

      • Help. Displays this documentation.
      • Reset. Restores the default values for all parameters.
      • Last. Loads the data used the last time you ran this analysis.
      • Command. Displays the command line used to run the analysis, as described in Running GSEA from the Command Line.
      • Run. Starts the analysis.

    CollapseDataset Page#

    CollapseDataset creates a new dataset by collapsing all probe set values for a gene into a single vector of values. The new dataset uses gene symbols as the gene identifier format. When you use the new dataset in a gene set enrichment analysis, be sure that your gene sets and array annotations also use gene symbols as the gene identifier format. For more information, see Consistent Feature Identifiers Across Data Files.

    Note: By default, when you use the Run GSEA icon to run the gene set enrichment analysis, GSEA uses the CollapseDataset tool to collapse the dataset before running the gene set enrichment analysis. For more information, see the Collapse dataset to gene symbols parameter on the Run GSEA Page.

    CollapseDataset

    Place your cursor on a parameter name to see a brief description of the parameter.

    Required Fields lists parameters that are essential for the analysis. Enter values for these parameters before starting the analysis.

    Basic Fields lists additional parameters with standard defaults. Typically, you use the default values for these parameters. Click Show/Hide to display and hide these parameters.

    Buttons at the bottom of the page:

    Preferences Window#

    To display the Preferences window, select File>Preferences. Use this window to set GSEA configuration options.

    General Settings#

    General Settings

    Algorithm Settings#

    Preferences

    MSigDB XML Browser#

    The MSigDB XML browser is a separate application available from the GSEA website downloads page. Use the MSigDB XML Browser to explore the gene sets and to export the gene sets of interest to gene set files that can be used with the gene set enrichment analysis.

    To display the latest gene sets on the Browse MSigDB page, click the Load database button.

    MSigDB XML Browser

    Note that you can also use this window to load archived MSigDB files. For example, to load the MSigDB files from the v6.2 release, enter "msigdb_v6.2.xml" in the File path or URL to the MSigDB database field and click the Load database button.

    You can load multiple versions of MSigDB and toggle between them by clicking their respective tabs.

    Use the filter field and the QuickFilterPane to filter the gene sets displayed in the table. GSEA displays gene sets that meet both the filter field AND quick filter criteria that you specify:

    Filter Field

    The filter shown above uses the default options All, Case insensitive, and Match anywhere to display all gene sets that have the characters “ca” anywhere in any column. To display gene sets whose names begin with the characters “ca”, click the NAME and Match from start options.

    The Deep Search Options provide more advanced search options:

    When you select a Deep Search Option, GSEA performs the search and displays the results in a new tab identified by the name you supplied. The original MSigDB page remains displayed in a separate tab, as shown below:

    Deep Search

    To export gene sets from the MSigDB to a gene set file that can be used with GSEA:

    1. Select one or more of the gene sets in the table on the MSigDB page. To select multiple gene sets, use SHIFT-Click or CTRL-click.
    2. Click Export sets as GeneSetMatrix. GSEA displays the following window:

    Export Selected Gene Sets

    1. Select the target chip. Click the ellipsis (…) button and select one or more DNA chip (array) annotation files:

      • Chips (from website) lists the chip annotation files hosted on the GSEA-MSigDB file servers.
      • Chips (local .chip) lists the chip annotation files that you have loaded (see Loading Data).

      GSEA uses Chip2Chip to translate the gene identifiers in the gene sets from HGNC gene symbols to the probe identifiers for the selected DNA chips.

    2. Select the items to export:

      • All items: exports all gene sets displayed in the table on the MSigDB page.
      • Selected items: exports only the selected gene sets in the table on the MSigDB page.
    3. Select the file format for the gene set file. Typically, you want to select gmt.

    4. Enter a name for the resulting gene set file.
    5. Click OK. GSEA writes the gene sets file to the default output folder (Help>Show GSEA output folder).

    Appendix A: GSEA Error Codes#

    Error 1001#

    None of the gene sets that you specified passed the size threshold. Check that the selected gene sets:

    1. Contain more than the minimum number of genes. The minimum gene set size is set by the Min size parameter (by default, 15).
    2. Contain fewer than the maximum number of genes. The maximum gene set size is set by the Max size parameter (by default, 500).
    3. Use the gene identifiers appropriate for this analysis:
      • If you are collapsing your dataset (Collapse dataset to gene symbols parameter = True), the genes in the collapsed dataset are identified by gene symbol, so the gene identifiers in your gene sets must be gene symbols.
      • If you are not collapsing your dataset (Collapse dataset to gene symbols parameter = False), the gene identifiers in your gene sets must be the same as those in your expression dataset.

    For more information about "appropriate" gene identifiers, see Consistent Feature Identifiers Across Data Files.
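    To see in advance which gene sets are likely to be excluded, you can apply the same kind of size check outside GSEA. The sketch below assumes that set size is counted after genes absent from the dataset are removed and that the Min size and Max size bounds are inclusive; the function name filter_gene_sets and all gene and set names are hypothetical.

```python
def filter_gene_sets(gene_sets, dataset_genes, min_size=15, max_size=500):
    """Split gene sets into (kept, rejected) by size after restricting to the dataset."""
    dataset_genes = set(dataset_genes)
    kept, rejected = {}, {}
    for name, genes in gene_sets.items():
        present = set(genes) & dataset_genes  # members found in the dataset
        target = kept if min_size <= len(present) <= max_size else rejected
        target[name] = present
    return kept, rejected


# Hypothetical dataset whose genes are identified by symbols G0..G99.
dataset_genes = [f"G{i}" for i in range(100)]
gene_sets = {
    "SET_OK": [f"G{i}" for i in range(20)],
    "SET_SMALL": [f"G{i}" for i in range(5)],
    # Identifiers in a different format never match the dataset.
    "SET_WRONG_IDS": [f"{1000 + i}_at" for i in range(20)],
}
kept, rejected = filter_gene_sets(gene_sets, dataset_genes)
print(sorted(kept), sorted(rejected))
```

    A gene set whose identifiers use a different format than the dataset ends up with zero matching members, so it fails the minimum size check even though it lists enough genes.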

    Illumina data: The probe identifiers for Illumina chips contain leading zeros. Certain programs, such as Excel, automatically remove leading zeros. If you are collapsing your dataset, the probe identifiers in your dataset must match the probe identifiers in the chip annotation file. If this error occurs and you are using Illumina data, check that the probe identifiers in your dataset include the leading zeros.
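    A quick diagnostic for this situation is to look for dataset identifiers that match a chip identifier only when leading zeros are ignored. The sketch below is an illustration with hypothetical identifiers and a hypothetical function name, diagnose_leading_zeros; it assumes that you have already read both identifier columns as text.

```python
def diagnose_leading_zeros(dataset_ids, chip_ids):
    """Return dataset IDs that match a chip ID only after restoring leading zeros."""
    chip_ids = set(chip_ids)
    by_stripped = {}
    for chip_id in chip_ids:
        by_stripped.setdefault(chip_id.lstrip("0"), []).append(chip_id)
    suspects = {}
    for dataset_id in dataset_ids:
        if dataset_id in chip_ids:
            continue  # exact match, nothing to repair
        candidates = by_stripped.get(dataset_id.lstrip("0"), [])
        if candidates:
            suspects[dataset_id] = sorted(candidates)
    return suspects


# Hypothetical identifiers: the first dataset ID has lost its leading zeros.
chip_ids = ["0006450255", "0002570615", "1234567890"]
dataset_ids = ["6450255", "0002570615", "1234567890"]
suspects = diagnose_leading_zeros(dataset_ids, chip_ids)
print(suspects)
```

    If the result is not empty, re-export the dataset with the identifier column formatted as text so that the original identifiers are preserved.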

    Error 1002#

    When you set the Collapse/Remap to gene symbols parameter to "Collapse" or "Remap_Only", you must select one or more chip annotation files using the Chip platform(s) parameter.

    Setting the Collapse/Remap to gene symbols parameter to "Collapse" tells GSEA to collapse each probe set in the dataset to a single vector for the gene, which is identified by its HUGO gene symbol. In contrast, "Remap_Only" uses the information in the chip annotation file to perform simple, non-mathematical translations of gene identifiers from one format to another. To do these operations, GSEA needs to map the probes in the dataset to their matching HUGO gene symbols. The chip annotation file(s) supply that mapping information. For more information, see DNA Chip (Array) Annotations.

    Alternatively, you can set the Collapse/Remap to gene symbols parameter to "No_Collapse". In this case, GSEA does not perform any mapping operations on your dataset and the Chip platform(s) parameter is not used.

    Error 1005#

    GSEA attempted to collapse each probe set in the dataset to a gene, but there was no data in the resulting dataset. The most likely reasons for this are:

    Check the gene/probe identifiers in your dataset. If they are HUGO gene symbols, set the Collapse dataset to gene symbols parameter to False; the dataset is already collapsed. Otherwise, check that the probe identifiers in your dataset match the probe identifiers listed in the chip file that you selected in the Chip platform(s) parameter. For more information, see Consistent Feature Identifiers Across Data Files.

    Error 1006#

    The metric that you selected in the Metric for ranking genes parameter (Signal2Noise or tTest) requires that you have at least three samples for each phenotype. You have too few samples for at least one of the phenotypes selected in the Phenotype labels parameter.

    To analyze a categorical phenotype that has fewer than three samples, use one of the following ranking metrics:

    For information about the metrics, see the Metric for ranking genes parameter on the Run GSEA Page.

    For information about phenotypes, see Phenotype Labels.

    Error 1010#

    You selected a categorical phenotype for the Phenotype labels parameter; however, the metric that you selected in the Metric for ranking genes parameter is used with continuous phenotypes. Select one of the following metrics for a categorical phenotype:

    For information about phenotypes, see Phenotype Labels.

    For information about the metrics, see the Metric for ranking genes parameter on the Run GSEA Page.

    Error 1011#

    You selected a continuous phenotype for the Phenotype labels parameter; however, the metric that you selected in the Metric for ranking genes parameter is used with categorical phenotypes. Select one of the following metrics for a continuous phenotype:

    For information about phenotypes, see Phenotype Labels.

    For information about the metrics, see the Metric for ranking genes parameter on the Run GSEA Page.

    Error 1020#

    Multiple identifiers mapped to a single gene in "Remap_Only" mode#

    GSEA's "Remap_Only" collapsing mode is designed to convert gene identifiers across transcriptome versions, but it does not perform any mathematical operations to handle multiple input symbols that map to the same output symbol. The GSEA application supports only a single instance of a given gene symbol in an analysis run. When GSEA encounters multiple instances of the same gene symbol, the first instance is retained and the later instances are arbitrarily discarded. To avoid producing this arbitrary output, when multiple input values that map to a single output value are encountered in Remap_Only mode, the attempt to remap symbols fails and returns Error 1020.

    To resolve this error, choose an appropriate mathematical collapse function for the input dataset, so that multiple identifiers corresponding to the same gene are collapsed to a single entry in the output file.
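    The condition that triggers this error, and the effect of a mathematical collapse, can be illustrated outside GSEA. In the sketch below, the function names, identifiers, and values are hypothetical, and taking the maximum is used only as one example of a collapse function.

```python
from collections import defaultdict


def find_multi_mapped(id_to_symbol):
    """Return {symbol: [ids]} for symbols that more than one identifier maps to."""
    ids_by_symbol = defaultdict(list)
    for identifier, symbol in id_to_symbol.items():
        ids_by_symbol[symbol].append(identifier)
    return {symbol: ids for symbol, ids in ids_by_symbol.items() if len(ids) > 1}


def collapse_by_max(values_by_id, id_to_symbol):
    """Collapse identifier-level values to one value per symbol, keeping the maximum."""
    collapsed = {}
    for identifier, value in values_by_id.items():
        symbol = id_to_symbol.get(identifier)
        if symbol is None:
            continue  # identifier has no mapping and is dropped
        collapsed[symbol] = max(value, collapsed.get(symbol, value))
    return collapsed


# Hypothetical mapping in which two identifiers map to the same symbol.
id_to_symbol = {"ID_1": "GENE_A", "ID_2": "GENE_A", "ID_3": "GENE_B"}
values_by_id = {"ID_1": 1.5, "ID_2": 3.0, "ID_3": -0.5}
multi_mapped = find_multi_mapped(id_to_symbol)
collapsed = collapse_by_max(values_by_id, id_to_symbol)
print(multi_mapped, collapsed)
```

    A non-empty result from find_multi_mapped corresponds to the situation in which a remap without collapsing would have to discard values arbitrarily.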

    GSEA supports the following options for performing this mathematical collapse:

    Appendix B: CHIP File Selection Help#

    GSEA 4.3.0 introduced support for the new gene set database files and chip files, first made available with MSigDB v2022.1, that enable native analysis of mouse data.

    These files are now split into two different tabs in both the GSEA gene sets and chip files UI windows.

    Gene set files from the Human Collections have the suffix ".Hs.symbols.gmt" following the MSigDB version (v2022.1 in this initial release), and their contents are provided as HGNC gene symbols. Gene set files from the Mouse Collections have the suffix ".Mm.symbols.gmt", and their contents are provided as MGI gene symbols. These symbols follow different canonical formats (e.g. MTOR for human vs. Mtor for mouse).

    As such, it is critical to pick CHIP files that map your data into the appropriate namespace (either human or mouse symbols).

    Therefore, if a gene set file from the Human Collection (MSigDB) window is selected, you must also select a CHIP file from the Human Collection Chips (MSigDB) tab of the Chip platform selector window, and if a gene set file from the Mouse Collection (MSigDB) window is selected, you must also select a CHIP file from the Mouse Collection Chips (MSigDB) tab of the Chip platform selector window.

    Chip files, like gene set database files, are versioned and have a species-specific suffix: ".Hs.chip" following the MSigDB version (v2022.1 in this initial release) for files targeting the Human Collections, and ".Mm.chip" for files targeting the Mouse Collections.

    Both the Human Collection Chips tab and the Mouse Collection Chips tab contain a full complement of chips for native as well as orthology-based analysis. Chips from the Human Collection Chips tab (files ending in .Hs.chip) that begin with Human_ convert human gene IDs to human symbols. Files from this tab (still with the .Hs.chip suffix) that begin with Mouse_ or Rat_ also have the text "Human_Orthologs" in the file name; when used, they apply orthology mapping data to convert the dataset to match the human gene symbols namespace of the MSigDB Human Collections database.

    Likewise, chips from the Mouse Collection Chips tab (files ending in .Mm.chip) that begin with Mouse_ convert mouse gene IDs to mouse symbols. Files from this tab (still with the .Mm.chip suffix) that begin with Human_ or Rat_ also have the text "Mouse_Orthologs" in the file name; when used, they apply orthology mapping data to convert the dataset to match the mouse gene symbols namespace of the MSigDB Mouse Collections database.

    So, if the gene set database file c5.go.bp.v2022.1.Hs.symbols.gmt is selected from the Human Collection tab, the appropriate CHIP file, regardless of dataset species, is one of the _MSigDB.v.2022.1.Hs.chip files available from the Human Collection CHIPs (MSigDB) tab. If the gene set database file m5.go.bp.v2022.1.Mm.symbols.gmt is selected from the Mouse Collection tab, the appropriate CHIP file, regardless of dataset species, is one of the _MSigDB.v.2022.1.Mm.chip files available from the Mouse Collection CHIPs (MSigDB) tab.
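    The pairing rule can be checked mechanically from the file names. The sketch below relies only on the ".Hs."/".Mm." suffix convention described above; the function names species_tag and check_pairing and the chip file name in the example are hypothetical.

```python
def species_tag(filename):
    """Return 'Hs' or 'Mm' from names such as '...Hs.symbols.gmt' or '...Mm.chip'."""
    for suffix in (".symbols.gmt", ".chip"):
        if filename.endswith(suffix):
            # The species tag is the last dot-separated field before the suffix.
            tag = filename[: -len(suffix)].rsplit(".", 1)[-1]
            if tag in ("Hs", "Mm"):
                return tag
    raise ValueError(f"no species tag found in {filename!r}")


def check_pairing(gene_set_file, chip_file):
    """True when the gene set file and the chip file target the same collection."""
    return species_tag(gene_set_file) == species_tag(chip_file)


# The chip file name is a made-up example that follows the suffix convention.
human_sets = "c5.go.bp.v2022.1.Hs.symbols.gmt"
mouse_sets = "m5.go.bp.v2022.1.Mm.symbols.gmt"
example_chip = "Example_MSigDB.v2022.1.Hs.chip"
print(check_pairing(human_sets, example_chip))
print(check_pairing(mouse_sets, example_chip))
```

    The check looks only at the collection that a file targets, not at the species of the dataset, which matches the rule that the CHIP file is chosen by gene set collection regardless of dataset species.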