MSigDB v2023.1.Hs (Mar 2023)

Important Notices#

This page describes updates made to the Molecular Signatures Database Human Collections for release 2023.1 (MSigDB 2023.1.Hs). To access the MSigDB mouse collections through the GSEA UI, GSEA 4.3.0 or newer is required. MSigDB v2023.1 is based on gene annotation data from Ensembl Release 109 (Feb 2023).

Updates to Human Collections (MSigDB v2023.1.Hs)#

C1: positional gene sets#

Updated human gene annotations to Ensembl 109 (+1 gene set).

C2:CGP#

Six gene sets contributed by MSigDB users have been added to C2:CGP.

C2:CP:Reactome#

C2:CP:WikiPathways#

WikiPathways gene sets have been updated to the February 10, 2023 release (+21 gene sets).

C3:TFT:GTRD#

GTRD data was updated to the 21.12 release (-12 gene sets).

C5:GO (Gene Ontology)#

Gene sets in these sub-collections are derived from the controlled vocabulary of the Gene Ontology (GO) project: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology (Nature Genet 2000). The gene sets are named by GO term and contain genes annotated by that term. This collection has been updated to the most recent GO annotations as present in the GO-basic obo file released on 2023-01-01 and NCBI gene2go annotations downloaded on 2023-02-10.

This collection is divided into three sub-collections:

These updates were generated in accordance with the procedure described in the GO release notes for MSigDB 7.0.

C5:HPO (Human Phenotype Ontology)#

Gene sets in this sub-collection have been updated to reflect the 2023-01-27 release of the Human Phenotype Ontology database (+263 gene sets). This sub-collection has been redundancy filtered through a procedure comparable to that of the GO and Reactome sub-collections.

C8 cell type signature gene sets#

Added gene sets describing lung cell type identity signatures from He P., Lim K., et al. 2022, "A human fetal lung cell atlas uncovers proximal-distal gradients of differentiation and key regulators of epithelial fates" (https://lungcellatlas.org) (+126 gene sets).

CHIP file updates#

SQLite Database#

With this release we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. See our documentation for more details on the contents and usage.
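As a rough illustration of the kind of query a relational format allows, the sketch below looks up the gene sets that contain a given gene symbol. It runs against a small in-memory stand-in rather than the real file. The table names follow the cleanup commands later in these notes, but the `standard_name` and `symbol` columns, the example set names, and the `sets_containing` helper are assumptions made for illustration; consult the database documentation for the actual schema.

```python
import sqlite3

# Toy stand-in for the MSigDB SQLite file. Table names follow these notes;
# the standard_name and symbol columns are ASSUMED for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_set (id INTEGER PRIMARY KEY, standard_name TEXT);
CREATE TABLE gene_symbol (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE gene_set_gene_symbol (gene_set_id INTEGER, gene_symbol_id INTEGER);
INSERT INTO gene_set VALUES (1, 'EXAMPLE_SET_A'), (2, 'EXAMPLE_SET_B');
INSERT INTO gene_symbol VALUES (1, 'TP53'), (2, 'BRCA1'), (3, 'MYC');
INSERT INTO gene_set_gene_symbol VALUES (1, 1), (1, 2), (2, 2), (2, 3);
""")


def sets_containing(conn, symbol):
    """Return the names of gene sets that include the given gene symbol."""
    rows = conn.execute(
        """
        SELECT gs.standard_name
        FROM gene_set AS gs
        JOIN gene_set_gene_symbol AS link ON link.gene_set_id = gs.id
        JOIN gene_symbol AS sym ON sym.id = link.gene_symbol_id
        WHERE sym.symbol = ?
        ORDER BY gs.standard_name
        """,
        (symbol,),
    ).fetchall()
    return [name for (name,) in rows]


print(sets_containing(conn, "BRCA1"))  # ['EXAMPLE_SET_A', 'EXAMPLE_SET_B']
```

To try this against a downloaded file, replace `":memory:"` with the path to that file, drop the `executescript` call, and adjust the column names to the real schema.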

Note that we will continue producing the XML file for now, but it should be considered deprecated; we intend to remove it entirely in a future release.

Known Issue: Incidental extra Gene Symbol and Source Member records#

When building MSigDB we create a somewhat larger database and then remove a number of gene sets based on various criteria. The most important of those is a size-based threshold, where gene sets with fewer than 5 or more than 2000 members mapped to gene symbols are removed, but there are others as well.
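The size-based threshold can be sketched as follows. This is a minimal illustration, not the actual build code: the function names are hypothetical, and counting unique mapped symbols (rather than raw member records) is an assumption.

```python
MIN_SIZE = 5     # sets with fewer mapped members are removed
MAX_SIZE = 2000  # sets with more mapped members are removed


def passes_size_filter(members):
    """True if the number of unique mapped gene symbols is within bounds.

    Counting unique symbols is an assumption made for this sketch.
    """
    return MIN_SIZE <= len(set(members)) <= MAX_SIZE


def filter_gene_sets(gene_sets):
    """Keep only the gene sets (name -> list of symbols) that pass the filter."""
    return {
        name: members
        for name, members in gene_sets.items()
        if passes_size_filter(members)
    }


example = {
    "TOO_SMALL": ["A", "B", "C"],
    "KEPT": ["A", "B", "C", "D", "E"],
}
print(sorted(filter_gene_sets(example)))  # ['KEPT']
```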

Well after the release, it was discovered (in July 2023) that certain tables related to Gene Symbols and Source Members had not been properly purged after removal of such gene sets, resulting in the retention of unrelated records. We have created a patched database file that removes these extra records (available on our Download page).
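To check whether a given copy of the database is affected, one option is to count the rows whose gene set no longer exists. The sketch below does this on a toy in-memory database; the table and column names are taken from the cleanup commands in these notes, while the `count_orphans` helper and the sample rows are hypothetical.

```python
import sqlite3

# Tables keyed by gene_set_id that may hold leftover records.
LINK_TABLES = ("gene_set_details", "gene_set_source_member", "gene_set_gene_symbol")


def count_orphans(conn):
    """Count rows in each link table whose gene_set_id is not in gene_set."""
    counts = {}
    for table in LINK_TABLES:  # fixed names, so string formatting is safe here
        query = (
            f"SELECT COUNT(*) FROM {table} "
            "WHERE gene_set_id NOT IN (SELECT id FROM gene_set)"
        )
        counts[table] = conn.execute(query).fetchone()[0]
    return counts


# Toy data: gene set 1 survives; ids 97, 98 and 99 refer to removed sets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_set (id INTEGER PRIMARY KEY);
CREATE TABLE gene_set_details (gene_set_id INTEGER);
CREATE TABLE gene_set_source_member (gene_set_id INTEGER, source_member_id INTEGER);
CREATE TABLE gene_set_gene_symbol (gene_set_id INTEGER, gene_symbol_id INTEGER);
INSERT INTO gene_set VALUES (1);
INSERT INTO gene_set_details VALUES (1);
INSERT INTO gene_set_source_member VALUES (1, 5), (98, 6), (97, 7);
INSERT INTO gene_set_gene_symbol VALUES (1, 10), (99, 11);
""")

print(count_orphans(conn))
```

A result with all counts at zero would suggest the copy is already clean.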

SQLite can be somewhat loose when it comes to referential integrity and allows situations like this to exist even when foreign keys are enabled. This won't cause a problem in most cases where queries are based on the gene_set table, but for those users with the original database it's still a good idea to remove these extra records. This can be accomplished with the following SQL commands from the official SQLite command line shell:

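    -- Remove rows that still point at gene sets no longer in gene_set.
    DELETE FROM gene_set_details WHERE gene_set_id NOT IN (SELECT id FROM gene_set);
    DELETE FROM gene_set_source_member WHERE gene_set_id NOT IN (SELECT id FROM gene_set);
    -- Remove source members that no remaining gene set references.
    DELETE FROM source_member WHERE id NOT IN (SELECT DISTINCT source_member_id FROM gene_set_source_member);
    DELETE FROM gene_set_gene_symbol WHERE gene_set_id NOT IN (SELECT id FROM gene_set);
    -- Remove gene symbols that no remaining gene set references.
    DELETE FROM gene_symbol WHERE id NOT IN (SELECT gene_symbol_id FROM gene_set_gene_symbol);
    -- Optional: compact the file and refresh query planner statistics.
    VACUUM;
    PRAGMA optimize;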

That final DELETE has no actual effect on the database since it turns out that all gene symbol records are referenced by some remaining gene set. It's included here anyway since that might not always be the case in the future, should this issue ever come up again.

We recommend running these commands without turning on the foreign key PRAGMA (PRAGMA foreign_keys), as they will otherwise take an immensely long time to complete. With foreign key enforcement off, they run very quickly.

The last two commands (VACUUM and PRAGMA optimize) are optional: they compact the database file and refresh the query planner statistics. The benefit is probably marginal, but it is worth doing as a one-time operation.
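For users who prefer scripting over the command line shell, the same cleanup could be run from Python's standard `sqlite3` module, roughly as sketched below. The `clean` and `build_toy_db` helpers and the toy rows are hypothetical; the statements are the ones listed above. The connection is opened with `isolation_level=None` because SQLite does not allow VACUUM inside an open transaction.

```python
import sqlite3

CLEANUP = [
    "DELETE FROM gene_set_details WHERE gene_set_id NOT IN (SELECT id FROM gene_set)",
    "DELETE FROM gene_set_source_member WHERE gene_set_id NOT IN (SELECT id FROM gene_set)",
    "DELETE FROM source_member WHERE id NOT IN "
    "(SELECT DISTINCT source_member_id FROM gene_set_source_member)",
    "DELETE FROM gene_set_gene_symbol WHERE gene_set_id NOT IN (SELECT id FROM gene_set)",
    "DELETE FROM gene_symbol WHERE id NOT IN "
    "(SELECT gene_symbol_id FROM gene_set_gene_symbol)",
]


def clean(conn):
    """Run the cleanup with foreign key enforcement off; return rows removed."""
    conn.execute("PRAGMA foreign_keys = OFF")
    removed = 0
    conn.execute("BEGIN")  # group the deletes into one transaction
    for statement in CLEANUP:
        removed += conn.execute(statement).rowcount
    conn.execute("COMMIT")
    conn.execute("VACUUM")  # must run outside a transaction
    conn.execute("PRAGMA optimize")
    return removed


def build_toy_db():
    """Hypothetical miniature database: gene set 1 survives, gene set 2 was removed."""
    # isolation_level=None gives manual transaction control (autocommit mode).
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
    CREATE TABLE gene_set (id INTEGER PRIMARY KEY);
    CREATE TABLE gene_set_details (gene_set_id INTEGER);
    CREATE TABLE source_member (id INTEGER PRIMARY KEY);
    CREATE TABLE gene_set_source_member (gene_set_id INTEGER, source_member_id INTEGER);
    CREATE TABLE gene_symbol (id INTEGER PRIMARY KEY);
    CREATE TABLE gene_set_gene_symbol (gene_set_id INTEGER, gene_symbol_id INTEGER);
    INSERT INTO gene_set VALUES (1);
    INSERT INTO gene_set_details VALUES (1), (2);
    INSERT INTO source_member VALUES (10), (11);
    INSERT INTO gene_set_source_member VALUES (1, 10), (2, 11);
    INSERT INTO gene_symbol VALUES (100), (101);
    INSERT INTO gene_set_gene_symbol VALUES (1, 100), (1, 101), (2, 101);
    """)
    return conn


conn = build_toy_db()
print(clean(conn))  # 4
```

In this toy data the final DELETE removes nothing, mirroring the situation described above: every gene symbol is still referenced by a remaining gene set. For a real file, open it with `sqlite3.connect(path, isolation_level=None)` and work on a backup copy first.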

We have decided not to update the files on the server, so that their contents stay consistent with copies that users might have already downloaded. We might find another way to handle this in the future, in which case these Release Notes will be updated.