
Standardize and append a batch of data

Here, we’ll learn

  • how to standardize a less well-curated collection

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.settings.verbosity = "hint"
bt.settings.auto_save_parents = False
ln.settings.transform.stem_uid = "ManDYgmftZ8C"
ln.settings.transform.version = "1"
ln.track()
💡 connected lamindb: testuser1/test-scrna
💡 Assuming editor is Jupyter Lab.
💡 notebook imports: bionty==0.42.9 lamindb==0.70.3
💡 saved: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type='notebook', updated_at=2024-04-22 10:42:13 UTC, created_by_id=1)
💡 saved: Run(uid='dB2PAgeaKiIqmwXpzyDG', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_dB2PAgeaKiIqmwXpzyDG.txt

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We are still working with human data, and can globally set an organism:

bt.settings.organism = "human"
annotate = ln.Annotate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.cell_type.name: bt.CellType.name},
)
3 non-validated categories are not saved in Feature.name: ['n_genes', 'percent_mito', 'louvain']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 5 records from public with Gene.symbol for var_index: ['GPX1', 'SOD2', 'RN7SL1', 'SNORD3B-2', 'IGLL5']
11 non-validated categories are not saved in Gene.symbol: ['RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index

Standardize & validate genes

Let’s convert gene symbols to Ensembl gene IDs via standardize(). The mapping is not unique: a symbol can match more than one Ensembl ID, and the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# keep the original symbols as a "symbol" column and use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
💡 standardized 754/765 terms
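
To make the "first match is kept" behavior concrete, here is a minimal pure-Python sketch of a non-unique symbol-to-ID lookup followed by a validation mask. It is not how bionty implements standardize(). The reference table, the placeholder IDs, and the helper names build_first_match_mapper and standardize_symbols are hypothetical, and the sketch assumes that symbols without a match pass through unchanged, which is why a separate validation step is still needed.

```python
# Hypothetical reference table of (symbol, Ensembl ID) pairs.
# The IDs are placeholders, not real Ensembl identifiers.
REFERENCE = [
    ("GENE_A", "ENSG_PLACEHOLDER_1"),
    ("GENE_A", "ENSG_PLACEHOLDER_2"),  # second match for the same symbol
    ("GENE_B", "ENSG_PLACEHOLDER_3"),
]


def build_first_match_mapper(pairs):
    """Build a symbol -> ID dict that keeps the first ID seen per symbol."""
    mapper = {}
    for symbol, ensembl_id in pairs:
        # setdefault only stores the value if the symbol is not yet present
        mapper.setdefault(symbol, ensembl_id)
    return mapper


def standardize_symbols(symbols, mapper):
    """Map symbols to IDs; unmatched symbols are returned unchanged (assumption)."""
    return [mapper.get(symbol, symbol) for symbol in symbols]


mapper = build_first_match_mapper(REFERENCE)
standardized = standardize_symbols(["GENE_A", "GENE_B", "GENE_X"], mapper)

# Mirror of the validation step: flag entries that are known IDs
known_ids = set(mapper.values())
validated = [value in known_ids for value in standardized]

print(standardized)
print(validated)
```

In this toy example, GENE_X has no match, stays a symbol, and is flagged as not validated, analogous to the 11 symbols that could not be standardized above.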

Here, we’ll also keep .raw, subset to the validated genes and indexed by Ensembl gene ID like .var:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
annotate = ln.Annotate.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name},
)
3 non-validated categories are not saved in Feature.name: ['n_genes', 'percent_mito', 'louvain']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
annotate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
💡 mapping cell_type on CellType.name
9 terms are not validated: 'CD8+/CD45RA+ Naive Cytotoxic', 'CD56+ NK', 'CD4+/CD45RO+ Memory', 'CD34+', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD19+ B', 'Dendritic cells', 'CD14+ Monocytes'
      → save terms via .add_new_from('cell_type')
False

Standardize & validate cell types

Since none of the cell types are validated, let’s search for each cell type name in the public ontology and add the name found in the AnnData object as a synonym to the top match:

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_public(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
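
The search-and-synonym pattern can be sketched without a database. The example below swaps in Python's difflib string similarity for the ontology search, so its ranking is not the one bionty's search() uses. The mini-ontology and the helper top_match are hypothetical and only illustrate the workflow: take the top match, map the raw name to it, and remember the raw name as a synonym.

```python
import difflib

# Hypothetical mini-ontology: standardized name -> set of registered synonyms.
ontology = {"B cell": set(), "monocyte": set(), "dendritic cell": set()}


def top_match(query, names):
    """Return the candidate most similar to `query` (case-insensitive)."""
    by_lower = {name.lower(): name for name in names}
    # cutoff=0.0 always returns a best candidate, however poor the match is
    best = difflib.get_close_matches(query.lower(), list(by_lower), n=1, cutoff=0.0)
    return by_lower[best[0]]


name_mapper = {}
for raw_name in ["Dendritic cells", "CD14+ Monocytes"]:
    standardized = top_match(raw_name, ontology)
    name_mapper[raw_name] = standardized  # raw name -> standardized name
    ontology[standardized].add(raw_name)  # remember the raw name as a synonym

print(name_mapper)
```

Taking the top match unconditionally always yields some result, even when no candidate is a good fit, so it is worth reviewing the resulting mapper by eye before applying it.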

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
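
In pandas, mapping a column through a dict turns every value without a matching key into a missing value. Here the mapper was built from the same column, so every name has an entry. If a mapper is reused on another batch, a quick check like the following sketch can catch gaps first; the helper find_unmapped, the labels, and example_mapper are hypothetical.

```python
def find_unmapped(values, mapper):
    """Return the sorted distinct values that have no entry in `mapper`."""
    return sorted(set(values) - set(mapper))


# Hypothetical labels and mapper, for illustration only
labels = ["CD19+ B", "CD56+ NK", "CD19+ B", "Unknown"]
example_mapper = {"CD19+ B": "B cell", "CD56+ NK": "natural killer cell"}

missing = find_unmapped(labels, example_mapper)
if missing:
    print(f"not in mapper: {missing}")
```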

Now, all cell types are validated:

annotate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name
True

Register

artifact = annotate.save_artifact(description="10x reference adata")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/2rM782zVcyLx46qGM2EU.h5ad')
✅ storing artifact '2rM782zVcyLx46qGM2EU' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/2rM782zVcyLx46qGM2EU.h5ad'
💡 parsing feature names of X stored in slot 'var'
754 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='QTZfL7djujTuGfD3Kf5T', n=754, type='number', registry='bionty.Gene', hash='j8QkIeLBgJwsscY4vVPx', created_by_id=1)
💡 parsing feature names of slot 'obs'
1 term (25.00%) is validated for name
3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅    linked: FeatureSet(uid='8gZ1gzgpX0BJIpLsfG1a', n=1, registry='core.Feature', hash='fp36kVAp_2vTEn7J_95q', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
artifact.view_lineage()
_images/f91284e769023876127b356eb10df591f3f9d3139a71f03662ef961edf780b1c.svg

Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.artifacts[0]],
    is_new_version_of=collection_v1,
)
collection_v2.save()
collection_v2.labels.add_from(artifact)
collection_v2.labels.add_from(collection_v1)
✅ loaded: FeatureSet(uid='cz0e1Iz4dZSe91DZy0we', n=4, registry='core.Feature', hash='99kZKBzklDDXJUe9Q2mZ', updated_at=2024-04-22 10:42:06 UTC, created_by_id=1)
💡 adding collection [1] as input for run 2, adding parent transform 1
💡 adding artifact [1] as input for run 2, adding parent transform 1
✅ saved 1 feature set for slot: 'var'
💡 transferring cell_type
💡 transferring donor
💡 transferring tissue
💡 transferring cell_type
💡 transferring assay

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()
Collection(uid='EKDkFpNDlBJg7JJUHUBQ', name='My versioned scRNA-seq collection', version='2', hash='HNR3VFV60_yqRnUka11E', visibility=1, updated_at=2024-04-22 10:42:34 UTC)

Provenance:
  📎 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type='notebook')
  📎 run: Run(uid='dB2PAgeaKiIqmwXpzyDG', started_at=2024-04-22 10:42:13 UTC, is_consecutive=True)
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
  var: FeatureSet(uid='LLmkxVjoM8TbB4g7Co6m', n=36508, type='number', registry='bionty.Gene')
    'TIMM23', 'LCE6A', 'CR1', 'FLRT2-AS1', 'COL4A2-AS2', 'PHIP', 'JRKL', 'UBR3', 'CUEDC2', 'GPR50-AS1', 'MSL2', 'ALDH1A2-AS1', 'P3H2', 'CFAP36', 'ACOD1', 'DBN1', 'CHP1', 'GLYATL3', 'NDUFA8', 'CDK2AP2', ...
  obs: FeatureSet(uid='cz0e1Iz4dZSe91DZy0we', n=4, registry='core.Feature')
    🔗 donor (12, core.ULabel): 'A37', 'A36', 'A29', '640C', 'A52', 'A31', '637C', '582C', 'D496', 'A35', ...
    🔗 tissue (17, bionty.Tissue): 'transverse colon', 'jejunal epithelium', 'thymus', 'liver', 'duodenum', 'caecum', 'thoracic lymph node', 'bone marrow', 'lamina propria', 'lung', ...
    🔗 cell_type (40, bionty.CellType): 'effector memory CD4-positive, alpha-beta T cell', 'CD8-positive, alpha-beta memory T cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'alveolar macrophage', 'animal cell', 'CD16-negative, CD56-bright natural killer cell, human', 'CD16-positive, CD56-dim natural killer cell, human', ...
    🔗 assay (3, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v1', '10x 5' v2'
Labels:
  📎 tissues (17, bionty.Tissue): 'transverse colon', 'jejunal epithelium', 'thymus', 'liver', 'duodenum', 'caecum', 'thoracic lymph node', 'bone marrow', 'lamina propria', 'lung', ...
  📎 cell_types (40, bionty.CellType): 'effector memory CD4-positive, alpha-beta T cell', 'CD8-positive, alpha-beta memory T cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'alveolar macrophage', 'animal cell', 'CD16-negative, CD56-bright natural killer cell, human', 'CD16-positive, CD56-dim natural killer cell, human', ...
  📎 experimental_factors (3, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v1', '10x 5' v2'
  📎 ulabels (12, core.ULabel): 'A37', 'A36', 'A29', '640C', 'A52', 'A31', '637C', '582C', 'D496', 'A35', ...

View data lineage:

collection_v2.view_lineage()
_images/15df28cf783b98f0238d9fcc3ba1247212ea3f97352ae8f58bd24e98a2f16e87.svg