MSigDB v7.1 (Mar 2020)

This release includes new sets for potential transcription factor and microRNA regulatory target genes, updates to gene sets from some external resources, and updates to gene symbol mappings.

Note: Due to substantial changes introduced in MSigDB 7.0, GSEA 4.0.0 or later is recommended when using MSigDB 7.0+ resources.

Advisory: Users of MSigDB 7.1 are strongly advised to always use the GSEA "Collapse dataset to gene symbols" feature with the provided Symbol Remapping chip file if their dataset was generated with a transcriptome other than Ensembl v99/GENCODE v33.

C3 Collection: New Gene Set Resources and Other Updates

MSigDB 7.1 introduces new gene sets for analyzing genes as potential regulatory targets of microRNAs or transcription factors. Along with the new content, several changes were made to the structure of the collection.

New miRNA Target Content from miRDB

2377 new gene sets have been added to the "MIR: microRNA targets" sub-collection of C3.

This new collection subset consists of sets of human genes predicted to contain miRNA binding sites for the indicated human miRNA. The gene sets are derived from computationally predicted human gene targets of miRNAs using the MirTarget algorithm. Data was curated from miRDB v6.0 target predictions with MirTarget scores >80 (high confidence predictions). miRNAs catalogued in miRDB v6.0 are derived from miRBase v22 (March 2018).

Gene sets were filtered to include only those with more than 5 and fewer than 2000 members in the raw input list, in accordance with MSigDB standard practice. Additionally, miRNAs with identical target lists were merged into a single gene set record, indicated by multiple MIR numbers joined with an underscore.
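The filtering and merging steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the MSigDB pipeline: the `(mirna, gene, score)` input layout, the function name, and the placeholder miRNA names are assumptions, and the size check is applied here to the score-filtered list, which may differ from the order used in practice.

```python
from collections import defaultdict

MIN_SCORE = 80    # MirTarget score threshold (exclusive), as stated in the text
MIN_SIZE = 5      # a set must have more than this many members
MAX_SIZE = 2000   # ...and fewer than this many


def build_mir_gene_sets(predictions):
    """Build gene sets from (mirna, gene, score) tuples (hypothetical layout)."""
    # Collect high-confidence targets for each miRNA.
    targets = defaultdict(set)
    for mirna, gene, score in predictions:
        if score > MIN_SCORE:
            targets[mirna].add(gene)

    # Apply the size filter, then group miRNAs with identical target lists.
    by_targets = defaultdict(list)
    for mirna, genes in targets.items():
        if MIN_SIZE < len(genes) < MAX_SIZE:
            by_targets[frozenset(genes)].append(mirna)

    # Merged records are named by joining the miRNA names with an underscore.
    return {
        "_".join(sorted(names)): sorted(genes)
        for genes, names in by_targets.items()
    }
```

For example, two miRNAs that share exactly the same six predicted targets would yield one record whose name joins both miRNA names, while a miRNA with only three high-confidence targets would be dropped.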

See also the MIRDB subset collection details page.

New Transcription Factor Target Content from the Gene Transcription Regulation Database (GTRD)

221 new gene sets have been added to the "TFT: transcription factor targets" sub-collection of C3.

This new collection subset consists of sets of human genes predicted to contain transcription factor binding sites in their promoter regions (-1000,+100 bp around the transcription start site) for the indicated transcription factor. The gene sets are derived from the Gene Transcription Regulation Database (GTRD) v19.10 uniform processing pipeline and represent a candidate list of potential regulatory targets for each transcription factor. Gene sets were filtered to include only those sets which contained >5 and <2000 members in the raw input list in accordance with MSigDB standard practice.
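To make the promoter definition concrete, the window of -1000 to +100 bp around a transcription start site can be computed as in the sketch below. The function names, the use of inclusive genomic coordinates, and the strand-aware interpretation (upstream meaning lower coordinates on the plus strand and higher coordinates on the minus strand) are assumptions for illustration, not details taken from the GTRD pipeline.

```python
def promoter_window(tss, strand, upstream=1000, downstream=100):
    """Return inclusive (start, end) coordinates of the promoter window.

    Assumes a strand-aware window: 'upstream' extends against the
    direction of transcription.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError(f"unknown strand: {strand!r}")


def overlaps(site_start, site_end, window):
    """True if a binding site [site_start, site_end] overlaps the window."""
    start, end = window
    return site_start <= end and site_end >= start


# Hypothetical gene with its TSS at position 5000 on the plus strand.
window = promoter_window(5000, "+")
print(window, overlaps(4990, 5010, window))
```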

See also the GTRD subset collection details page.

Changes to the C3 Collection Structure

C3 has been renamed from "C3: motif gene sets" to "C3: regulatory target gene sets" to better reflect the new content in the collection. The two C3 sub-collections have each been split into subsets:

Updates to Gene Sets by Collection

C1 (Positional Gene Sets)

C1 has been updated to reflect the primary assembly of the current release of the Human Genome as present in Ensembl 99 and GENCODE 33 (GRCh38). Gene annotations for this collection are derived from the Chromosome and Karyotype band tracks from the Ensembl BioMart (version 99) and reflect the gene architecture as represented on the primary assembly.

C2:CP:Reactome

C5 (Gene Ontology Collection)

Gene sets in this collection are derived from the controlled vocabulary of the Gene Ontology (GO) project (The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet 2000). The gene sets are named by GO term and contain genes annotated by that term. This collection has been updated to the most recent GO annotations as of January 15, 2020.

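This collection is divided into three sub-collections, one for each GO ontology: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF).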

These updates were generated in accordance with the procedure described in the GO release notes for MSigDB 7.0.

Updates to MSigDB Gene Symbol Mapping Procedures

Update to Ensembl Annotations

Beginning in MSigDB 7.0, identifiers for genes are mapped to their HGNC approved Gene Symbol and NCBI Gene ID through annotations extracted from Ensembl's BioMart data service. MSigDB 7.1 incorporates annotation information exported from Ensembl release 99. For any analysis run against MSigDB 7.1 gene sets, the dataset gene symbols should match this Ensembl version (GENCODE release 33). Alternatively, MSigDB 7.1 provides CHIP files for use with the GSEA Collapse/Remap dataset feature, which can re-annotate the dataset.
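As a rough illustration of what re-annotating a dataset with a CHIP file involves, the sketch below reads a tab-delimited mapping and collapses expression rows onto the mapped gene symbols. It assumes the CHIP file has "Probe Set ID" and "Gene Symbol" columns, and it uses a per-sample maximum as the collapsing rule. The identifiers and values are made up, and the code is not intended to reproduce GSEA's own implementation.

```python
import csv
import io


def load_chip(handle):
    """Map 'Probe Set ID' -> 'Gene Symbol' from a tab-delimited CHIP file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Probe Set ID"]: row["Gene Symbol"] for row in reader}


def collapse(expression, chip):
    """Collapse {identifier: [values per sample]} onto mapped gene symbols.

    When several identifiers map to one symbol, the per-sample maximum is
    kept (one possible collapsing rule, assumed here).
    """
    collapsed = {}
    for identifier, values in expression.items():
        symbol = chip.get(identifier)
        if symbol is None:
            continue  # identifiers absent from the CHIP file are dropped
        if symbol in collapsed:
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed


# Hypothetical CHIP content and dataset, for illustration only.
chip_text = (
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "OLD_A\tGENE1\tna\n"
    "ALIAS_A\tGENE1\tna\n"
    "OLD_B\tGENE2\tna\n"
)
chip = load_chip(io.StringIO(chip_text))
expr = {"OLD_A": [1.0, 5.0], "ALIAS_A": [2.0, 3.0], "OLD_B": [4.0, 4.0]}
print(collapse(expr, chip))
```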

Change to Gene Orthology Mapping Procedure for Non-Human Genes

In MSigDB 7.0, the best human orthologue for each non-human gene was selected by a ranking procedure that relied solely on Ensembl orthology table statistics. MSigDB 7.1 replaces this procedure with best match orthology tables exported via the Alliance of Genome Resources orthology API. These tables implement a consensus-based best match procedure designed in collaboration with Mouse Genome Informatics at the Jackson Lab.
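The general idea of a consensus best match can be illustrated with the sketch below, which keeps, for each non-human gene, the human candidate supported by the largest number of orthology prediction methods. The input layout, the function name, the alphabetical tie-break, and the gene names are all hypothetical. The actual criteria used by the Alliance of Genome Resources tables are not described here and may differ.

```python
def best_human_orthologs(candidates):
    """Pick one human orthologue per source gene.

    candidates: iterable of (source_gene, human_gene, supporting_methods)
    tuples (hypothetical layout). The candidate with the most supporting
    methods wins; ties are broken alphabetically for determinism.
    """
    best = {}
    for source, human, support in candidates:
        current = best.get(source)
        if current is None or (-support, human) < (-current[1], current[0]):
            best[source] = (human, support)
    return {source: human for source, (human, _) in best.items()}


# Made-up candidates for two non-human genes.
candidates = [
    ("Abc1", "ABC1", 9),
    ("Abc1", "ABC1L", 3),
    ("Xyz2", "XYZ2B", 4),
    ("Xyz2", "XYZ2A", 4),
]
print(best_human_orthologs(candidates))
```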

CHIP File Updates

All CHIP files previously provided in the standard MSigDB 7.0 release have been updated for MSigDB 7.1 in accordance with previously described procedures.

Miscellaneous Revisions