MSigDB SQLite Database

Introduction#

With the release of MSigDB 2023.1 we have created a new SQLite database for the fully annotated gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Ms) resources. Each ships as a single-file database usable with any compliant SQLite client. No other downloads are necessary. This new format provides the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. Like the XML format that has been made available since the early days of MSigDB, the SQLite format has the advantage of being self-contained and portable and thus easy to distribute, archive, etc. In addition, the SQLite format allows us to open up the data to ad-hoc SQL queries.

Note that we will continue producing the XML file for now, but it should be considered deprecated with the intention to eventually be entirely removed in a future release.

Below we describe the design of the MSigDB relational database and provide some examples of useful SQL queries. General information about SQLite can be found at the end of this document.

The License Terms for MSigDB are available on our website.

Database Design#

Design Considerations#

The schema is designed to be easy and (reasonably) fast for end-users. We decided that some amount of denormalization (e.g. the collection_name and license_code columns on the gene_set table) makes the database easier to understand and use.

Similarly, we wanted to prevent extraneous information from causing the design to be more difficult to use. Thus, each database file will hold only ONE MSigDB release for ONE resource, either Human or Mouse, with very little in the way of history tracking. It was necessary to ship the resources separately to prevent conflicts between them (there are gene sets in both with identical names, for example), but doing so also simplifies their use.

This schema is designed to be a read-only resource. After an MSigDB version is released it doesn't change. Any changes mean a new version. Notably, this allows us to side-step the known limitations and potential issues of using SQLite in the context of multiple concurrent writers. These simply do not apply other than during initial creation. SQLite has no issues around multiple concurrent readers.

Schema#

Referring to the schema diagram below, the tables in blue are core to defining the gene sets and the genes they contain, while those in purple provide the metadata about the gene sets, the genes, and MSigDB itself. The tables in gray give data about gene sets that were considered for, but excluded from, the MSigDB release, as explained below.

MSigDB SQLite DB schema diagram

Note that in all cases of tables with an id primary key column, these primary key values are generated synthetically and will not be considered stable across different versions of MSigDB (and likewise when used as a foreign key). In other words, the id of a particular gene set, gene symbol, author, etc. will likely have a different value in the next version of MSigDB. While usable within a given database for JOIN queries and so on, these values should not be relied upon outside of that context.

The core (blue) tables:

The metadata (purple) tables:

The "external item" (gray) tables:

Example Queries#

The examples given here assume we are working with the MSigDB Human database from our Downloads page (msigdb_v2023.1.Hs.db is the current version at the time of this writing). Note that we ZIP the database to reduce its size, so you must decompress it first before use.

These examples also assume the use of the official SQLite command line shell to keep everything consistent across all platforms. The exact results may vary depending on the version of the database you are using and the particular query.

NOTE: there is a known issue where extraneous records were retained in certain tables in the MSigDB 2023.1.Hs and 2023.1.Mm release. While that won't affect the flow of the following discussion, we recommend addressing it before using the databases beyond the given examples. Please refer to those Release Notes for specific instructions.

Extracting gene sets in the GMT format#

One key use-case for performing SQL queries against the database involves building custom collections of gene sets, so those have been designed to be fast and convenient. For example, the following will select all the WikiPathways sets in the Human database into a GMT file named wikipathways.gmt:

    .mode tabs
    .once wikipathways.gmt
    SELECT standard_name 'na', group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE collection_name = 'C2:CP:WIKIPATHWAYS'
    GROUP BY standard_name ORDER BY standard_name ASC;

The basic template for creating GMTs is as follows:

    .mode tabs
    .once <filename>
    SELECT standard_name 'na', group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE <selection criteria>
    GROUP BY standard_name ORDER BY standard_name ASC;

Simply vary the criteria in the WHERE clause to determine the contents of the output GMT. The first two lines are SQLite specific directives (fill in the desired file name on line 2). Note that the second argument to the group_concat function is a quoted tab character.

Finding gene sets containing one or more specified genes#

Here's another simple example that finds the names of all gene sets which have BRCA1 or BRCA2 as a member:

    SELECT distinct(standard_name)
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id
      INNER JOIN gene_symbol gsym ON gsym.id = gsgs.gene_symbol_id
    WHERE symbol in ('BRCA1', 'BRCA2') ORDER BY standard_name;

    AAAYWAACM_HFH4_01
    ACTAYRNNNCCCR_UNKNOWN
    ACTGAAA_MIR30A3P_MIR30E3P
    ARID3B_TARGET_GENES
    ASH1L_TARGET_GENES
    <...etc...>

Extracting gene sets and their metadata#

This query gets all the Reactome sets after applying a size threshold of between 15 and 500 genes. Here we are also providing a full link to the gene set on the GSEA-MSigDB website in place of the ‘na’ of the earlier example:

    .mode tabs
    .once wikipathways_threshold.gmt
    SELECT standard_name,
           ( SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs' )
               ||'/geneset/'||standard_name,
           group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE collection_name = 'C2:CP:WIKIPATHWAYS'
    GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500
    ORDER BY standard_name ASC;

Note that here we are using a subquery to get the MSigDB_base_URL to build the website link:

    SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs'

This next query builds on our earlier example combined with the above to get all sets with either BRCA1 or BRCA2 as a member in that size range and save them to a GMT:

    .mode tabs
    .once BRCA1_BRCA2_sets.gmt
    SELECT standard_name,
           (SELECT MSigDB_base_URL FROM MSigDB WHERE version_name = '2023.1.Hs')
               ||'/geneset/'||standard_name,
           group_concat(symbol, '       ')
    FROM gene_set gset
      INNER JOIN gene_set_gene_symbol gsgs on gset.id = gene_set_id
      INNER JOIN gene_symbol gsym on gsym.id = gene_symbol_id
    WHERE gset.id IN
      ( SELECT distinct(gene_set_id)
        FROM gene_set_gene_symbol gsgs2
          INNER JOIN gene_symbol gsym2 ON gsym2.id = gsgs2.gene_symbol_id
        WHERE symbol in ('BRCA1', 'BRCA2') )
    GROUP BY standard_name HAVING count(symbol) BETWEEN 15 AND 500
    ORDER BY standard_name ASC;

This query gets some more detailed information about a particular named gene set, including the PubMed ID:

    .mode tabs
    .headers on
    SELECT collection_name, license_code, PMID AS PubMedID, description_brief
    FROM gene_set gset
      INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id
      INNER JOIN publication pub ON pub.id = publication_id
    WHERE standard_name = 'ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR';

    collection_name license_code    PubMedID    description_brief
    C2:CGP  CC-BY-4.0   17404513    Cell cycle genes significantly (p =< 0.05) changed in fibroblast cells at 6 h after exposure to ionizing radiation.

Now, get the Title and Authors for the PubMed ID from the above:

    SELECT title, group_concat(display_name) AS Authors
    FROM publication pub
      INNER JOIN publication_author pa ON publication_id = pub.id
      INNER JOIN author au ON author_id = au.id
    WHERE PMID = 17404513;

    title   Authors
    Identification of primary transcriptional regulation of cell cycle-regulated genes upon DNA damage. Zhou T,Chou J,Mullen TE,Elkon R,Zhou Y,Simpson DA,Bushel PR,Paules RS,Lobenhofer EK,Hurban P,Kaufmann WK

This query will find the External Term(s) and Name(s) that were filtered out as similar by our redundancy check for a given GOBP gene set:

    SELECT et.term, external_name
    FROM external_term et
      INNER JOIN external_term_filtered_by_similarity etfbs ON etfbs.term = et.term
      INNER JOIN gene_set gset ON gset.id = etfbs.gene_set_id
    WHERE standard_name = 'GOBP_MITOTIC_SPINDLE_ELONGATION';

    term    external_name
    GO:0051256  mitotic spindle midzone assembly

Extracting a summary of gene sets#

This query will extract a summary of selected gene sets with a short example WHERE clause to restrict it to the C5:GO collection only. You can add a more detailed WHERE clause and the column selection can be expanded or reduced as desired:

    SELECT standard_name, count(gene_symbol_id), collection_name,
           source_species_code, ns.label, contributor, PMID
    FROM gene_set gset
      INNER JOIN gene_set_details gsd ON gsd.gene_set_id = gset.id
      INNER JOIN namespace ns ON ns.id = primary_namespace_id
      LEFT JOIN publication pub ON publication_id = pub.id
      INNER JOIN gene_set_gene_symbol gsgs ON gsgs.gene_set_id = gset.id
    WHERE collection_name LIKE "C5:GO:%"
    GROUP BY standard_name ORDER BY standard_name limit 3;

    standard_name   count(gene_symbol_id)   collection_name source_species_code label   contributor PMID
    GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS    6   C5:GO:BP    HS  Human_NCBI_Gene_ID  Gene Ontology   
    GOBP_2FE_2S_CLUSTER_ASSEMBLY    11  C5:GO:BP    HS  Human_NCBI_Gene_ID  Gene Ontology   
    GOBP_2_OXOGLUTARATE_METABOLIC_PROCESS   17  C5:GO:BP    HS  Human_NCBI_Gene_ID  Gene Ontology   

Finding gene sets overlapping with a list of genes using Jaccard Similarity#

This query computes the Jaccard Similarity of a list of HUGO gene symbols, held one per line in a text file named members.txt, across all of the gene sets in MSigDB (here is an example file). Use MGI symbols if working with the Mouse database:

    .import --schema test members.txt member_list
    .mode tabs
    .headers on
    WITH QuerySet(member) AS (SELECT symbol FROM member_list)
    SELECT standard_name, sum(InQuerySet) AS UnionCount,
      (sum(NotInQuerySet) + (SELECT count(member) FROM QuerySet)) AS IntersectionCount,
      CAST(sum(InQuerySet) AS REAL)/(sum(NotInQuerySet) +
          (SELECT count(member) FROM QuerySet)) AS JaccSim
    FROM ( SELECT standard_name,
              CASE WHEN symbol IN ( SELECT member FROM QuerySet ) 
                   THEN 1 ELSE 0 END   AS InQuerySet,
              CASE WHEN symbol NOT IN ( SELECT member FROM QuerySet ) 
                   THEN 1 ELSE 0 END   AS NotInQuerySet
           FROM gene_set gset
             INNER JOIN gene_set_gene_symbol gsgs ON gset.id = gsgs.gene_set_id
             INNER JOIN gene_symbol gsym ON gsgs.gene_symbol_id = gsym.id )
    GROUP BY standard_name ORDER BY JaccSim DESC LIMIT 20;

    standard_name   UnionCount  IntersectionCount   JaccSim
    SOGA_COLORECTAL_CANCER_MYC_UP   79  170 0.464705882352941
    WP_PYRIMIDINE_METABOLISM    24  227 0.105726872246696
    KEGG_PURINE_METABOLISM  31  295 0.105084745762712
    KEGG_PYRIMIDINE_METABOLISM  24  241 0.0995850622406639
    GOBP_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS  18  191 0.0942408376963351
    GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS  16  185 0.0864864864864865
    GOBP_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 19  225 0.0844444444444444
    REACTOME_METABOLISM_OF_NUCLEOTIDES  20  244 0.0819672131147541
    GOBP_RIBONUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS 16  211 0.0758293838862559
    REACTOME_NUCLEOTIDE_BIOSYNTHESIS    11  170 0.0647058823529412
    GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_BIOSYNTHETIC_PROCESS   11  178 0.0617977528089888
    MODULE_219  11  183 0.0601092896174863
    SCHUHMACHER_MYC_TARGETS_UP  14  233 0.0600858369098712
    GOBP_PURINE_NUCLEOSIDE_MONOPHOSPHATE_METABOLIC_PROCESS  11  201 0.054726368159204
    GSE33292_WT_VS_TCF1_KO_DN3_THYMOCYTE_DN 19  348 0.0545977011494253
    GOBP_NUCLEOSIDE_PHOSPHATE_BIOSYNTHETIC_PROCESS  24  440 0.0545454545454545
    GOBP_GMP_BIOSYNTHETIC_PROCESS   9   172 0.0523255813953488
    GOBP_RIBOSE_PHOSPHATE_BIOSYNTHETIC_PROCESS  20  385 0.051948051948052
    MODULE_102  9   177 0.0508474576271186
    GOBP_NUCLEOBASE_BIOSYNTHETIC_PROCESS    9   177 0.0508474576271186

Building the Legacy MSigDB XML File#

The following query will create the legacy deprecated MSigDB XML file, for any users that still need it. It will leave out a small amount of information related to the Hallmark collection gene sets, specifically the FOUNDER_NAMES, REFINEMENT_DATASETS, and VALIDATION_DATASETS. This info remains available from the GSEA website.

    .output "msigdb_user.v2023.1.Hs.xml"
    .headers off
    .print '<?xml version="1.0" encoding="UTF-8"?>'
    .print '<!-- Copyright (c) 2004-2023 Broad Institute, Inc., Massachusetts Institute of Technology, and Regents of the University of California.  All rights reserved. -->'
    SELECT format('<MSIGDB NAME="msigdb_user" VERSION="%s" BUILD_DATE="%s">',
        version_name, version_date)
    FROM MSigDB
    WHERE version_name = '2023.1.Hs';
    SELECT format(
        '  <GENESET STANDARD_NAME="%s" SYSTEMATIC_NAME="%s" HISTORICAL_NAME="" ORGANISM="%s" PMID="%s" AUTHORS="%s" GEOID="%s" EXACT_SOURCE="%s" GENESET_LISTING_URL="" EXTERNAL_DETAILS_URL="%s" CHIP="%s" CATEGORY_CODE="%s" SUB_CATEGORY_CODE="%s" CONTRIBUTOR="%s"" CONTRIBUTOR_ORG="%s" DESCRIPTION_BRIEF="%s" DESCRIPTION_FULL="%s" TAGS="" MEMBERS="%s" MEMBERS_SYMBOLIZED="%s" MEMBERS_EZID="%s" MEMBERS_MAPPING="%s" FILTERED_BY_SIMILARITY="%s" FOUNDER_NAMES="" REFINEMENT_DATASETS="" VALIDATION_DATASETS=""/>',
         standard_name, systematic_name, species_name, PMID, authorList, GEO_id,
         exact_source, external_details_URL, ns.label,
         iif(instr(collection_name, ':'), substr(collection_name,0,instr(collection_name,':')), collection_name),
         iif(instr(collection_name, ':'), substr(collection_name,instr(collection_name,':')+1), ''),
         contributor, contrib_organization,
         description_brief, description_full,
         memberList, symbolList, ezidList, mappingList, fbsTerms
      )
    FROM gene_set gset
    INNER JOIN gene_set_details gsd ON gset.id = gsd.gene_set_id
    INNER JOIN namespace ns ON primary_namespace_id = ns.id
    INNER JOIN species sp ON ns.species_code = sp.species_code
    INNER JOIN (
      SELECT gene_set_id, group_concat(source_id) as memberList,
         group_concat(symbol) as symbolList,
         group_concat(NCBI_id) as ezidList,
         group_concat(mapping,'|') as mappingList
      FROM (
        SELECT gssm.gene_set_id, source_id, symbol, NCBI_id,
               source_id||','||symbol||','||NCBI_id AS mapping
        FROM gene_set_source_member gssm
        INNER JOIN source_member sm ON gssm.source_member_id = sm.id
        INNER JOIN gene_symbol gsym ON sm.gene_symbol_id = gsym.id
        ORDER BY gssm.gene_set_id, symbol) AS sym_info
      GROUP BY gene_set_id ) AS syms ON syms.gene_set_id = gset.id
    LEFT JOIN (
      SELECT gene_set_id, group_concat(term) AS fbsTerms
      FROM external_term_filtered_by_similarity
      GROUP BY gene_set_id) AS extTerms ON extTerms.gene_set_id = gset.id
    LEFT JOIN (
      SELECT publication_id, PMID, group_concat(pub_auth.display_name, ',') AS authorList
      FROM publication pub
      INNER JOIN (
        SELECT publication_id, display_name
        FROM publication_author pa
        INNER JOIN author au ON au.id = pa.author_id
        ORDER BY publication_id, author_order
        ) AS pub_auth ON pub_auth.publication_id = pub.id
      GROUP BY publication_id) AS pub_info ON pub_info.publication_id = gsd.publication_id
    ORDER BY collection_name, lower(standard_name);
    .print '</MSIGDB>'
    .output
    .headers on

About SQLite#

The official SQLite documentation is available at https://www.sqlite.org and an (unofficial) introductory tutorial is available at https://www.sqlitetutorial.net.

As a single-file database format, SQLite is well suited to our needs.