Gene Ontology (GO)#
Pathways are interconnected molecular networks of signaling cascades that govern critical cellular processes. They help explain the mechanisms behind cellular behavior and offer insight into disease progression and treatment response. In an R&D organization, managing pathways across different datasets is crucial for identifying potential therapeutic targets and intervention strategies.
In this notebook, we manage a pathway registry based on the “2023 GO Biological Process” ontology. We’ll walk through registering pathways and linking them to genes.
In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.
Setup#
!lamin init --storage ./use-cases-registries --schema bionty
💡 connected lamindb: testuser1/use-cases-registries
import lamindb as ln
import bionty as bt
import gseapy as gp
bt.settings.organism = "human" # globally set organism
💡 connected lamindb: testuser1/use-cases-registries
Fetch GO pathways annotated with human genes using Enrichr#
First, we fetch the “GO_Biological_Process_2023” pathways for humans using GSEApy, which wraps GSEA and Enrichr.
go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")
Number of pathways 5406
go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF']
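Since the fetched library behaves like a plain dictionary mapping a term name to a list of gene symbols, a quick summary of gene set sizes can serve as a sanity check before registering anything. The following is a minimal sketch on toy data; `toy_library` and `summarize_library` are hypothetical names, not part of GSEApy or LaminDB.

```python
from statistics import median

# Toy stand-in for the fetched library (hypothetical terms and symbols).
toy_library = {
    "toy pathway A (GO:0000001)": ["GENE1", "GENE2", "GENE3"],
    "toy pathway B (GO:0000002)": ["GENE2"],
    "toy pathway C (GO:0000003)": ["GENE1", "GENE4"],
}


def summarize_library(library):
    """Return basic size statistics for a {term: [gene symbols]} mapping."""
    sizes = [len(genes) for genes in library.values()]
    unique_genes = {gene for genes in library.values() for gene in genes}
    return {
        "n_terms": len(library),
        "n_unique_genes": len(unique_genes),
        "min_size": min(sizes),
        "median_size": median(sizes),
        "max_size": max(sizes),
    }


summary = summarize_library(toy_library)
print(summary)
```

The same function could presumably be called on `go_bp` directly, because it only relies on `.values()` and `len()`.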
Parse the ontology ID out of each key and convert the library into the format {ontology_id: (name, genes)}:
def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
    # the ontology id is the last space-separated token, wrapped in parentheses
    ontology_id = key.split(" ")[-1].replace("(", "").replace(")", "")
    name = key.replace(f" ({ontology_id})", "")
    return (ontology_id, name)
go_bp_parsed = {}
for key, genes in go_bp.items():
    ontology_id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[ontology_id] = (name, genes)
go_bp_parsed["GO:0036500"]
('ATF6-mediated Unfolded Protein Response',
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF'])
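The parsing step relies on every key ending in an ID wrapped in parentheses. A stricter variant that fails loudly on unexpected keys might look like the sketch below. It assumes the key layout `<name> (GO:<7 digits>)` seen in the example above; `parse_go_key` is a hypothetical helper, not part of any library.

```python
import re

# Assumed key layout: "<name> (GO:<7 digits>)", as in the example above.
_GO_KEY = re.compile(r"^(?P<name>.+) \((?P<ontology_id>GO:\d{7})\)$")


def parse_go_key(key: str) -> tuple[str, str]:
    """Return (ontology_id, name), or raise ValueError for an unexpected key."""
    match = _GO_KEY.match(key)
    if match is None:
        raise ValueError(f"unexpected key format: {key!r}")
    return match["ontology_id"], match["name"]


parsed_example = parse_go_key("ATF6-mediated Unfolded Protein Response (GO:0036500)")
print(parsed_example)
```

Anchoring the pattern at the end of the string means names that themselves contain parentheses are still handled, because only the final parenthesized token is treated as the ID.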
Register pathway ontology in LaminDB#
bionty = bt.Pathway.public()
bionty
PublicOntology
Entity: Pathway
Organism: all
Source: go, 2023-05-10
#terms: 47514
📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object
Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.
Register pathway terms#
To register the pathways, we use .from_values to directly parse the annotated GO pathway ontology IDs into LaminDB.
pathway_records = bt.Pathway.from_values(go_bp_parsed.keys(), bt.Pathway.ontology_id)
ln.save(pathway_records, parents=False) # not recursing through parents
Register gene symbols#
Similarly, we use .from_values to register all pathway-associated genes with LaminDB.
all_genes = {g for genes in go_bp.values() for g in genes}
gene_records = bt.Gene.from_values(all_genes, bt.Gene.symbol)
❗ ambiguous validation in Bionty for 1082 records: 'DEFB104A', 'DPY19L2P2', 'AGBL2', 'BORCS5', 'WDR45', 'ZNF501', 'HNRNPH1', 'RFLNB', 'CLDN7', 'SOX7', 'ZNF496', 'ADAM2', 'GPI', 'OR12D1', 'CDH4', 'TRIM71', 'RNF31', 'SH3D19', 'ZNF780B', 'DUS4L', ...
❗ did not create Gene records for 37 non-validated symbols: 'AFD1', 'MTRNR2L2', 'MTRNR2L11', 'MDRV', 'AZF1', 'DGS2', 'MTRNR2L8', 'LOC344967', 'TRL-AAG2-3', 'MTRNR2L13', 'MTRNR2L5', 'LOC122319436', 'MTRNR2L6', 'SEPTIN14P20', 'LOC112268384', 'MTRNR2L1', 'MTRNR2L4', 'MTRNR2L12', 'TRA', 'LOC100653049', ...
gene_records[:3]
[Gene(uid='3AwtfV2qGHSG', symbol='RPS6KA4', ensembl_gene_id='ENSG00000162302', ncbi_gene_ids='8986', biotype='protein_coding', description='ribosomal protein S6 kinase A4 ', synonyms='RSK-B|MSK2', organism_id=1, public_source_id=9, created_by_id=1),
Gene(uid='3fKtDy8FNj8Q', symbol='SLC13A3', ensembl_gene_id='ENSG00000158296', ncbi_gene_ids='64849', biotype='protein_coding', description='solute carrier family 13 member 3 ', synonyms='NADC3|SDCT2', organism_id=1, public_source_id=9, created_by_id=1),
Gene(uid='3AS84oO4Y4xh', symbol='PECR', ensembl_gene_id='ENSG00000115425', ncbi_gene_ids='55825', biotype='protein_coding', description='peroxisomal trans-2-enoyl-CoA reductase ', synonyms='TERP|SDR29C1|HSA250303', organism_id=1, public_source_id=9, created_by_id=1)]
ln.save(gene_records);
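The warnings above report that some symbols did not yield Gene records. Before linking, it can help to know which pathways contain such symbols. The following is a library-free sketch on toy data; `toy_parsed`, `registered_symbols`, and `find_unregistered` are hypothetical names used only for illustration.

```python
# Toy stand-ins (hypothetical): a parsed library in the
# {ontology_id: (name, genes)} format and the set of symbols that
# ended up with a Gene record.
toy_parsed = {
    "GO:0000001": ("toy pathway A", ["GENE1", "GENE2"]),
    "GO:0000002": ("toy pathway B", ["GENE2", "UNKNOWN1"]),
}
registered_symbols = {"GENE1", "GENE2"}


def find_unregistered(parsed, registered):
    """Map ontology_id -> symbols of that pathway without a registered record."""
    missing = {}
    for ontology_id, (_name, genes) in parsed.items():
        unmatched = [gene for gene in genes if gene not in registered]
        if unmatched:
            missing[ontology_id] = unmatched
    return missing


unregistered = find_unregistered(toy_parsed, registered_symbols)
print(unregistered)
```

With the real objects, the registered set could presumably be built as `{record.symbol for record in gene_records}` and passed in together with `go_bp_parsed`.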
Link pathway to genes#
Now that we are tracking all pathway and gene records, we can link them to make the pathways even more queryable.
# map each gene symbol to its registered record
gene_records_ids = {record.symbol: record for record in gene_records}
for pathway_record in pathway_records:
    pathway_genes = go_bp_parsed.get(pathway_record.ontology_id)[1]
    # skip symbols that were not validated and therefore have no Gene record
    pathway_genes_records = [
        gene_records_ids[gene] for gene in pathway_genes if gene in gene_records_ids
    ]
    pathway_record.genes.set(pathway_genes_records)
Genes are now linked to pathways. For example, for the last pathway processed in the loop:
pathway_record.genes.list("symbol")
['XIAP', 'CAST', 'CARD8', 'CST7', 'CARD18']
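To illustrate what the pathway-to-gene links make possible, the sketch below builds the reverse lookup (gene to pathways) in plain Python on toy data. It does not use the LaminDB query API; `toy_parsed` and `build_gene_index` are hypothetical names, and the symbols are placeholders.

```python
from collections import defaultdict

# Toy stand-in (hypothetical) in the {ontology_id: (name, genes)} format.
toy_parsed = {
    "GO:0000001": ("toy pathway A", ["GENE_A", "GENE_B"]),
    "GO:0000002": ("toy pathway B", ["GENE_B", "GENE_C"]),
}


def build_gene_index(parsed):
    """Invert {ontology_id: (name, genes)} into {gene: set of ontology ids}."""
    index = defaultdict(set)
    for ontology_id, (_name, genes) in parsed.items():
        for gene in genes:
            index[gene].add(ontology_id)
    return dict(index)


gene_index = build_gene_index(toy_parsed)
print(sorted(gene_index["GENE_B"]))
```

In the registry, this kind of question is answered through the stored links rather than an in-memory dictionary, which is what keeps it queryable across notebooks and datasets.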
Move on to the next analysis: Standardize metadata on-the-fly