MSigDB v7.0 (Aug 2019)

This is a major release that includes substantial updates to gene set annotations and gene symbol mapping procedures, an overhaul of several collections and sub-collections, and corrections to miscellaneous errors.

Note: Due to substantial changes in MSigDB, we recommend that users migrate to GSEA 4.0.0+ when using MSigDB 7.0+ resources. Advisory: Users of MSigDB 7.0 and GSEA 4.0 are strongly advised to always use the GSEA "Collapse dataset to gene symbols" feature with the provided Symbol Remapping chip file if their dataset was generated with a transcriptome other than Ensembl 97/GENCODE 31.

Changes to MSigDB Gene Symbol Mapping Procedures

Now Using Ensembl as the Platform Annotation Authority

Beginning in MSigDB 7.0, gene identifiers are mapped to their HGNC-approved Gene Symbol and NCBI Gene ID through annotations extracted from Ensembl's BioMart data service. These mappings will be updated at each MSigDB release using the latest available version of Ensembl. This change mitigates a previous issue in which outdated microarray and transcriptome annotations caused MSigDB to retain retired gene symbols and symbol aliases that did not reflect the current annotation of the human genome. As a result, genes could be excluded from some gene sets and GSEA analyses, both because differing source annotations could give the same gene different symbols in different gene sets, and because the symbols in a user-supplied dataset could fail to match those included in MSigDB.

Change to Gene Orthology Mapping Procedure for Non-Human Gene Sets

CHIP File Updates

Changes to Data Set Handling Recommendations

This remapping is not necessary if your data set was generated using Ensembl 97 or GENCODE 31 transcriptomes. This is a change from our previous recommendation.

Global Change to MSigDB Gene Set Inclusion Criteria

As of MSigDB 7.0, the minimum size threshold for inclusion of a gene set in an MSigDB collection has been reduced to 5 unique gene symbols. This global filter threshold was previously set at 10 unique symbols. The change primarily affects gene sets in the C5:GO and C2:CP:Reactome collections. It does not affect the default thresholds in the GSEA application.
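As a rough illustration of how such an inclusion threshold behaves, the sketch below filters a dictionary of gene sets by their number of unique symbols. The function name `filter_by_min_size` and the example sets are hypothetical and are not taken from the MSigDB build pipeline.

```python
def filter_by_min_size(gene_sets, min_size=5):
    """Keep only gene sets with at least `min_size` unique gene symbols.

    `gene_sets` maps a set name to a list of symbols (duplicates allowed).
    Hypothetical helper for illustration only.
    """
    kept = {}
    for name, symbols in gene_sets.items():
        unique_symbols = set(symbols)
        if len(unique_symbols) >= min_size:
            kept[name] = sorted(unique_symbols)
    return kept


# Made-up example sets: SET_B lists five symbols but only four are unique.
example_sets = {
    "SET_A": ["TP53", "MDM2", "CDKN1A", "BAX", "BBC3"],
    "SET_B": ["TP53", "TP53", "MDM2", "BAX", "ATM"],
}

kept = filter_by_min_size(example_sets)
print(list(kept))
```

Counting unique symbols rather than raw entries matters here, because a set that repeats a symbol could otherwise pass the threshold with fewer distinct genes than intended.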

Updates to Gene Sets by Collection

C1 (Positional Gene Sets) - Major Overhaul

C1 has been rebuilt to reflect the primary assembly of the current release of the Human Genome as present in Ensembl 97 and GENCODE 31 (GRCh38.p12). Gene annotations for this collection are derived from the Chromosome and Karyotype band tracks from the Ensembl BioMart (version 97) and reflect the gene architecture as represented on the primary assembly. This resulted in a small reduction in the number of gene sets (-27), as sets representing complete chromosome arms with few annotated genes were removed.

C2:CP:Reactome - Major Overhaul

C2:CP:BioCarta - Content Revision

Pathways curated from BioCarta have been revised to reflect the final available versions of the human BioCarta pathways as represented on the NCI CGAP website. This resulted in a net increase of 72 gene sets. Gene set names were also revised as a result of this change, and several gene sets were removed, including:

Additionally, genes that were missing from the BIOCARTA_STATHMIN_PATHWAY gene set have been added.

C2:CP:PID - New Sub-Collection Heading

Gene sets from the Pathway Interaction Database have been given a top-level sub-collection heading (PID) within C2:CP.

C2:CGP - Miscellaneous Corrections to Curated Gene Sets

C2:CGP - Miscellaneous Deprecated Sets Removed

C5 (Gene Ontology Collection) - Major Overhaul

Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project (The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet 2000). The gene sets are named by GO term and contain genes annotated by that term. We have replaced the entire collection with new gene sets using recent GO term annotations (based on downloads from GO on February 21, 2019).

This collection is divided into three sub-collections:

Outline of the procedure:

All sets are based on associations of GO terms to human genes. Genes annotated with the same GO term make the corresponding GO term gene set.

The input files are:

This file reports GO terms that have been associated with genes in NCBI Entrez Gene. It is generated by processing the gene_association file on the GO FTP site and comparing the DB_Object_ID to annotation in NCBI Entrez Gene, as also reported in gene_info.gz. The file is available here. It is a tab delimited plain text file with one tax_id / gene_id / evidence_code per line.
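The following sketch shows one way the human portion of such a file could be read into a mapping from GO term to gene IDs. It assumes the column order tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category with a header line starting with `#`; the function name `read_go_to_genes` is hypothetical, and the sample rows are made up for illustration rather than copied from the real file.

```python
import io
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def read_go_to_genes(handle, tax_id=HUMAN_TAX_ID):
    """Collect gene IDs per GO term for one organism from a gene2go-style file.

    Assumes tab-delimited rows whose first three columns are
    tax_id, GeneID and GO_ID (an assumption about the file layout).
    """
    go_to_genes = defaultdict(set)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split("\t")
        row_tax_id, gene_id, go_id = fields[0], fields[1], fields[2]
        if row_tax_id == tax_id:
            go_to_genes[go_id].add(gene_id)
    return dict(go_to_genes)


# Illustrative rows only; the third row is a non-human entry that is skipped.
sample = io.StringIO(
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "9606\t7157\tGO:0006915\tIDA\t-\tapoptotic process\t-\tProcess\n"
    "9606\t581\tGO:0006915\tIDA\t-\tapoptotic process\t-\tProcess\n"
    "10090\t22059\tGO:0006915\tIDA\t-\tapoptotic process\t-\tProcess\n"
)

go_to_genes = read_go_to_genes(sample)
print(go_to_genes)
```

For the real file, the same function could be given a handle opened with `gzip.open(path, "rt")` if the download is compressed.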

This file contains the entire GO ontology in OBO v1.2 format.

This procedure has been modified from that described previously for MSigDB v5.2. First, for each GO term we retrieved the corresponding human genes from the gene2go file. Next, we applied the path rule: gene products are associated with the most specific GO terms possible, and all parent terms up to the root automatically apply to the gene product, so each parent GO term gene set includes all genes associated with its child GO terms. Then we removed sets with fewer than 5 or more than 2,000 Gene IDs. Finally, we resolved redundancies as follows. We computed the Jaccard coefficient for each pair of sets and marked a pair as highly similar if its coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the hclust function from the R stats package according to their GO terms and applied two rounds of filtering to every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).
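The first steps of this procedure (path rule, size filter, and pairwise Jaccard marking) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual MSigDB pipeline: the `parents` mapping, the function names, and the toy term and gene names are all hypothetical, and the hclust-based clustering and "chunk" pruning are not reproduced.

```python
from itertools import combinations


def ancestors(term, parents):
    """Return all ancestor terms of `term` given a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def apply_path_rule(direct, parents):
    """Propagate each term's genes to the term itself and all its ancestors."""
    full = {}
    for term, genes in direct.items():
        for target in {term} | ancestors(term, parents):
            full.setdefault(target, set()).update(genes)
    return full


def filter_by_size(sets, min_size=5, max_size=2000):
    """Drop sets with fewer than `min_size` or more than `max_size` genes."""
    return {t: g for t, g in sets.items() if min_size <= len(g) <= max_size}


def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def highly_similar_pairs(sets, threshold=0.85):
    """List pairs of set names whose Jaccard coefficient exceeds `threshold`."""
    return [
        (s, t)
        for s, t in combinations(sorted(sets), 2)
        if jaccard(sets[s], sets[t]) > threshold
    ]


# Toy ontology (hypothetical): LEAF -> MID -> ROOT, OTHER -> ROOT, TINY -> ROOT.
parents = {"LEAF": ["MID"], "MID": ["ROOT"], "OTHER": ["ROOT"], "TINY": ["ROOT"]}
direct = {
    "LEAF": {"g1", "g2", "g3", "g4", "g5", "g6", "g7"},
    "MID": {"g8"},
    "OTHER": {"g9", "g10", "g11", "g12", "g13"},
    "TINY": {"g14", "g15"},
}

full = apply_path_rule(direct, parents)
filtered = filter_by_size(full)
pairs = highly_similar_pairs(filtered)
print(sorted(filtered), pairs)
```

In this toy example TINY is dropped by the size filter, and LEAF and MID are marked as highly similar because they share 7 of 8 genes (Jaccard 0.875). Under the pruning rule described above, the larger set (MID) would then be the one retained.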

A previous version of the C5 collection contained 864 gene sets that were founder sets for one or more gene sets in the MSigDB Hallmark collection. These deprecated C5 sets are included in MSigDB 7.0 as an ARCHIVED collection in order to preserve links to their pages from the hallmark gene set pages.

C6 (Oncogenic Signatures) - Miscellaneous Corrections

The gene sets have been corrected to PGF_UP.V1.UP and PGF_UP.V1.DN respectively, and correctly linked to NCBI Gene ID: 5228.

This gene set had been incorrectly annotated as a signature of genes up-regulated in response to knockout of the nuclear factor NRF2. This gene set properly represents the signature of genes down-regulated upon NFE2L2.V2 knockout and has been corrected to reflect this.

Additionally, this gene set had been misattributed to Malhotra et al., PubMed ID: 20460467. The correct publication, Kim et al., PubMed ID: 27088724, has been assigned.

Appendix 1: UniGene Derived Gene Sets Removed from C2:CGP