Using RNA-seq Datasets with GSEA
Quantification Types and Input Data
GSEA requires as input an expression dataset, which contains expression profiles for multiple samples. While the software supports multiple input file formats for these datasets, the tab-delimited GCT format is the most common. The first column of the GCT file contains feature identifiers (gene IDs or symbols in the case of data derived from RNA-Seq experiments). The second column contains a description of the feature; this column is ignored by GSEA and may be filled with “NA”s. Subsequent columns contain the expression values for each feature, with one sample's expression value per column. There are no hard and fast rules regarding how a GCT file's expression values are derived; what matters is that the values are comparable across features within a sample and comparable across samples. RNA-seq quantification pipelines typically produce quantifications containing one or more of the following:
- Counts/Expected Counts
- Transcripts per Million (TPM)
- FPKM/RPKM
These quantifications are not properly normalized for comparisons across samples.
Note: ssGSEA (single-sample GSEA) projections perform substantially different mathematical operations from standard GSEA. For the ssGSEA implementation, gene-level summed TPM serves as an appropriate metric for analysis of RNA-seq quantifications.
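As an illustration of the GCT layout described in this section, the following is a minimal Python sketch that writes a small expression table as tab-delimited GCT text. It assumes the GCT v1.2 convention of a `#1.2` version line, a dimensions line, and a header row starting with `NAME` and `Description`, none of which is spelled out above. The function name `write_gct`, the sample names, and the expression values are hypothetical and for illustration only.

```python
import io


def write_gct(rows, sample_names, handle):
    """Write expression rows to `handle` as tab-delimited GCT text.

    rows: list of (feature_id, [one value per sample]) tuples.
    Assumes the GCT v1.2 layout (version line, dimensions line, header row).
    """
    for feature_id, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{feature_id}: expected {len(sample_names)} values")
    handle.write("#1.2\n")  # version line
    handle.write(f"{len(rows)}\t{len(sample_names)}\n")  # feature count, sample count
    handle.write("\t".join(["NAME", "Description", *sample_names]) + "\n")
    for feature_id, values in rows:
        # The description column is ignored by GSEA, so "NA" is used as a filler.
        fields = [feature_id, "NA", *(str(v) for v in values)]
        handle.write("\t".join(fields) + "\n")


# Made-up values for two features across three hypothetical samples.
rows = [
    ("ENSG00000139618", [812.4, 790.1, 1203.7]),
    ("ENSG00000141510", [1530.0, 1498.2, 960.5]),
]
buf = io.StringIO()
write_gct(rows, ["ctrl_1", "ctrl_2", "treat_1"], buf)
gct_text = buf.getvalue()
print(gct_text)
```

Writing to a file handle rather than a path keeps the sketch easy to test; passing an `open("dataset.gct", "w")` handle would produce a file on disk.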
Count Normalization for Standard GSEA
Normalizing RNA-seq quantifications so that a feature's expression levels can be compared across samples is important for GSEA. Normalization methods that operate on raw count data (such as TMM or geometric mean) should be applied before running GSEA.
Tools such as DESeq2 can produce properly normalized data (normalized counts) that are compatible with GSEA. The DESeq2 module available through the GenePattern environment produces a GSEA-compatible “normalized counts” table in the GCT format, which can be used directly in the GSEA application.
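To make the idea of geometric-mean normalization concrete, here is a minimal pure-Python sketch of a median-of-ratios size-factor calculation. It is similar in spirit to the approach used by tools such as DESeq2, but it is not that library's implementation and should not replace it in a real analysis. The function names `size_factors` and `normalize_counts` and the count values are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per sample by the median-of-ratios idea.

    counts: dict mapping sample name -> list of raw counts (shared gene order).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for g in range(n_genes):
        gene_counts = [counts[s][g] for s in samples]
        if min(gene_counts) <= 0:
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / len(gene_counts)
        for s in samples:
            log_ratios[s].append(math.log(counts[s][g]) - log_geo_mean)
    if not log_ratios[samples[0]]:
        raise ValueError("no gene has non-zero counts in every sample")
    return {s: math.exp(median(log_ratios[s])) for s in samples}


def normalize_counts(counts):
    """Divide each sample's raw counts by that sample's size factor."""
    factors = size_factors(counts)
    return {s: [c / factors[s] for c in values] for s, values in counts.items()}


# Made-up counts: sample "s2" was sequenced roughly twice as deeply as "s1".
raw = {"s1": [100, 200, 300, 0], "s2": [200, 400, 600, 10]}
factors = size_factors(raw)
normalized = normalize_counts(raw)
print(factors)
print(normalized)
```

In this toy input the two samples differ only by sequencing depth, so the normalized values for the first three genes agree between samples.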
Note: While GSEA can accept transcript-level quantification directly and sum these to gene-level, these quantifications are not typically properly normalized for between-sample comparisons. As such, transcript-level CHIP annotations are no longer provided by the GSEA-MSigDB team.
The GSEA algorithm ranks the features listed in a GCT file, and it provides a number of alternative statistics that can be used for this ranking. In all cases (or at least in the cases where the dataset represents expression profiles for differing categorical phenotypes), the ranking statistic captures some measure of a gene's differential expression between a pair of categorical phenotypes. While these metrics are widely used for RNA-seq datasets, the GSEA team has yet to fully evaluate whether these ranking statistics, originally selected for their effectiveness with microarray-based expression data, are entirely appropriate for use with data derived from RNA-seq experiments.
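As an example of what such a ranking statistic can look like, the sketch below ranks genes by a simple signal-to-noise ratio: the difference of the two phenotype means divided by the sum of their standard deviations. This is an illustrative assumption about one possible statistic; GSEA's own implementation may differ in details such as variance adjustments. The gene names, phenotype labels, and values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Return (mean_a - mean_b) / (sd_a + sd_b) for two groups of values.

    Each group needs at least two samples so that a standard deviation exists.
    """
    noise = stdev(values_a) + stdev(values_b)
    if noise == 0:
        raise ValueError("both phenotypes have zero variance")
    return (mean(values_a) - mean(values_b)) / noise


def rank_genes(expression, labels, class_a, class_b):
    """Rank genes from most up-regulated in class_a to most down-regulated.

    expression: dict mapping gene -> list of values, one per sample.
    labels: list of phenotype labels, in the same sample order.
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, label in zip(values, labels) if label == class_a]
        b = [v for v, label in zip(values, labels) if label == class_b]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up normalized expression values for three hypothetical genes.
expression = {
    "GENE_UP": [1.0, 2.0, 8.0, 9.0],
    "GENE_DOWN": [9.0, 8.0, 2.0, 1.0],
    "GENE_FLAT": [5.0, 6.0, 5.0, 6.0],
}
labels = ["ctrl", "ctrl", "treat", "treat"]
ranked = rank_genes(expression, labels, "treat", "ctrl")
for gene, score in ranked:
    print(gene, round(score, 3))
```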
RNA-Seq Data and Ensembl CHIP Files
A GSEA analysis requires three different types of input data: a gene expression dataset in GCT format, the corresponding sample annotations in CLS format, and a collection of gene sets in GMT format. GSEA is typically used with gene sets from the Molecular Signatures Database (MSigDB), which consist of HUGO human gene symbols. However, gene expression data files may use other types of identifiers, depending on how the data were produced. To proceed with the analysis, GSEA converts the identifiers found in the data file to match the human symbols used in the gene set files. The conversion is performed using a CHIP file that provides the mapping between the two types of identifiers. Over the years, we have provided CHIP files for all major microarray platforms. For example, we have CHIP files that list the mappings between Affymetrix probe set IDs and human gene symbols.
In RNA-Seq, gene expression is quantified by counting the number of sequencing reads that align to a genomic range, according to a reference genome assembly or transcript annotations. The majority of tools use Ensembl reference annotations for this purpose. To facilitate GSEA analysis of RNA-Seq data, we now also provide CHIP files to convert human, mouse, and rat Ensembl IDs to HUGO gene symbols. Ensembl annotation uses a system of stable IDs that have prefixes based on the species name plus the feature type, followed by a series of digits and a version, e.g., ENSG00000139618.1. The new GSEA Ensembl CHIP files provide mappings for human, mouse, and rat gene identifiers (i.e., Ensembl IDs with prefixes ENSG, ENSMUSG, ENSRNOG).
To run GSEA with gene expression data specified with Ensembl identifiers:
- Prepare the GCT gene expression file such that identifiers are in the form of Ensembl IDs, but without the version suffix, e.g., ENSG00000139618.
- For RNA-Seq data, you will need to normalize the data, filter out low-count measurements, and perform other preprocessing as needed. Consult your local bioinformatician for help if unsure.
- Load the GCT and corresponding CLS files into GSEA.
- Choose gene sets to test—we usually recommend starting with the Hallmarks collection.
- Choose the CHIP file that matches the identifiers in the GCT file:
  - Human_ENSEMBL_Gene_ID_MSigDB.vX.chip => Ensembl ID prefix ENSG
  - Mouse_ENSEMBL_Gene_ID_MSigDB.vX.chip => Ensembl ID prefix ENSMUSG
  - Rat_ENSEMBL_Gene_ID_MSigDB.vX.chip => Ensembl ID prefix ENSRNOG
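The identifier preparation and CHIP selection steps above can be sketched in Python as follows. This is a minimal illustration, not part of GSEA: the helper names `strip_version` and `choose_chip` are hypothetical, and the `vX` in each file name is left as the placeholder used above for the MSigDB version.

```python
import re

# CHIP file names as listed above; "vX" stands in for the MSigDB version.
CHIP_BY_PREFIX = {
    "ENSG": "Human_ENSEMBL_Gene_ID_MSigDB.vX.chip",
    "ENSMUSG": "Mouse_ENSEMBL_Gene_ID_MSigDB.vX.chip",
    "ENSRNOG": "Rat_ENSEMBL_Gene_ID_MSigDB.vX.chip",
}

# Gene-level Ensembl ID: species/feature prefix, digits, optional ".version".
_ENSEMBL_GENE = re.compile(r"^(ENSG|ENSMUSG|ENSRNOG)(\d+)(?:\.\d+)?$")


def strip_version(ensembl_id):
    """Return the Ensembl gene ID without its version suffix."""
    match = _ENSEMBL_GENE.match(ensembl_id.strip())
    if match is None:
        raise ValueError(f"not a supported Ensembl gene ID: {ensembl_id!r}")
    return match.group(1) + match.group(2)


def choose_chip(ensembl_ids):
    """Pick the CHIP file whose prefix matches every ID in the dataset."""
    prefixes = {_ENSEMBL_GENE.match(i.strip()).group(1)
                for i in map(strip_version, ensembl_ids)}
    if len(prefixes) != 1:
        raise ValueError(f"expected one species prefix, found {sorted(prefixes)}")
    return CHIP_BY_PREFIX[prefixes.pop()]


ids = ["ENSG00000139618.1", "ENSG00000141510.17"]
clean_ids = [strip_version(i) for i in ids]
chip_file = choose_chip(ids)
print(clean_ids)
print(chip_file)
```

Rejecting mixed prefixes early is a design choice of this sketch; it surfaces datasets that accidentally combine species before they reach GSEA.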
We have also added the gene-level Ensembl IDs to the website for use with the Investigate Gene Sets tools such as Compute Overlaps. As noted above, it is necessary to remove the version suffix from any supplied IDs.
Note: While GSEA can accept transcript-level quantification directly and sum these to gene-level, these quantifications are not typically properly normalized for between-sample comparisons. As such, transcript-level CHIP annotations are no longer provided by the GSEA-MSigDB team.
Alternative Method: GSEA-Preranked
As an alternative to standard GSEA, analysis of data derived from RNA-seq experiments may also be conducted through the GSEA-Preranked tool. This approach previously served as the GSEA team's recommended pipeline for analysis of RNA-seq data; however, we now recommend the normalized counts procedure described above.
In particular:
- Prior to conducting gene set enrichment analysis, conduct your differential expression analysis using any of the tools developed by the bioinformatics community (e.g., cuffdiff, edgeR, or DESeq).
- Based on your differential expression analysis, rank your features and capture your ranking in an RNK-formatted file. The ranking metric can be whatever measure of differential expression you choose from the output of your selected DE tool. For example, cuffdiff provides the (base 2) log of the fold change.
- Run GSEAPreranked. If the exact magnitude of the rank metric is not directly biologically meaningful, select "classic" for your enrichment score (thus not weighting each gene's contribution to the enrichment score by the value of its ranking metric).
Please note that if you choose to use any of the gene sets available from MSigDB in your analysis, you need to make sure that the features listed in your RNK file are genes, and that the genes are identified by their HUGO gene symbols. All gene symbols listed in the RNK file must be unique and must match the Ensembl version used in the targeted version of MSigDB. We also recommend that the values of the ranking metric be unique.
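The ranking step can be sketched in Python as below, assuming the RNK file is plain tab-delimited text with one gene symbol and one ranking value per line. The helper names `build_rnk` and `write_rnk` are hypothetical, the fold-change values are made up, and keeping the entry with the largest absolute value when a symbol repeats is only one possible policy chosen for this illustration.

```python
import io


def build_rnk(de_results):
    """Turn (gene_symbol, score) pairs into a ranked list with unique symbols.

    When a symbol appears more than once, the entry with the largest
    absolute score is kept (an illustrative policy, not a GSEA rule).
    """
    best = {}
    for symbol, score in de_results:
        if symbol not in best or abs(score) > abs(best[symbol]):
            best[symbol] = score
    # Highest score first, so up-regulated genes lead the list.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


def write_rnk(ranked, handle):
    """Write one 'symbol<TAB>score' line per gene."""
    for symbol, score in ranked:
        handle.write(f"{symbol}\t{score}\n")


# Made-up log2 fold changes, as might be taken from a DE tool's output.
de_results = [("TP53", -1.8), ("BRCA2", 2.4), ("MYC", 0.3), ("TP53", -0.2)]
ranked = build_rnk(de_results)
buf = io.StringIO()
write_rnk(ranked, buf)
rnk_text = buf.getvalue()
print(rnk_text)
```

This sketch guarantees unique symbols but does not check for tied ranking values; a real pipeline may want to detect ties as well, given the recommendation above.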