Train a machine learning model on a collection

Here, we iterate over the artifacts within a collection to train a machine learning model at scale.

import lamindb as ln

💡 connected lamindb: testuser1/test-scrna

ln.settings.transform.stem_uid = "Qr1kIHvK506r"
ln.settings.transform.version = "1"
ln.track()

💡 notebook imports: lamindb==0.70.3 torch==2.2.2

💡 saved: Transform(uid='Qr1kIHvK506r5zKv', name='Train a machine learning model on a collection', key='scrna5', version='1', type='notebook', updated_at=2024-04-22 10:42:57 UTC, created_by_id=1)

💡 saved: Run(uid='UCL7Iwd4PRERQmGqlNFc', transform_id=5, created_by_id=1)

Query our collection:

collection = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="2"
).one()
collection.describe()

Collection(uid='EKDkFpNDlBJg7JJUHUBQ', name='My versioned scRNA-seq collection', version='2', hash='HNR3VFV60_yqRnUka11E', visibility=1, updated_at=2024-04-22 10:42:34 UTC)

Provenance:
  📎 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type='notebook')
  📎 run: Run(uid='dB2PAgeaKiIqmwXpzyDG', started_at=2024-04-22 10:42:13 UTC, is_consecutive=True)
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
  📎 input_of (core.Run): ['2024-04-22 10:42:45 UTC']
Features:
  var: FeatureSet(uid='LLmkxVjoM8TbB4g7Co6m', n=36508, type='number', registry='bionty.Gene')
    'AATK', 'PCDH19', 'RCBTB1', 'BCAN-AS1', 'GRIA1', 'DPEP3', 'FAM222B', 'RASGEF1B', 'AFG1L', 'BHLHE41', 'RHNO1', 'EYS', 'NDFIP2', 'UNC5B', 'CSNK1G3', 'C19orf53', 'LINC00557', 'TRMT9B', 'RAB43', 'TOR4A', ...
  obs: FeatureSet(uid='cz0e1Iz4dZSe91DZy0we', n=4, registry='core.Feature')
    🔗 donor (12, core.ULabel): 'D496', 'A31', 'A29', 'D503', 'A35', 'A52', '621B', '637C', '582C', 'A37', ...
    🔗 tissue (17, bionty.Tissue): 'blood', 'omentum', 'lamina propria', 'sigmoid colon', 'skeletal muscle tissue', 'thoracic lymph node', 'mesenteric lymph node', 'caecum', 'thymus', 'liver', ...
    🔗 cell_type (40, bionty.CellType): 'CD4-positive helper T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'animal cell', 'lymphocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'alveolar macrophage', 'alpha-beta T cell', 'dendritic cell', 'non-classical monocyte', 'progenitor cell', ...
    🔗 assay (3, bionty.ExperimentalFactor): '10x 5' v1', '10x 3' v3', '10x 5' v2'
Labels:
  📎 tissues (17, bionty.Tissue): 'blood', 'omentum', 'lamina propria', 'sigmoid colon', 'skeletal muscle tissue', 'thoracic lymph node', 'mesenteric lymph node', 'caecum', 'thymus', 'liver', ...
  📎 cell_types (40, bionty.CellType): 'CD4-positive helper T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'animal cell', 'lymphocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'alveolar macrophage', 'alpha-beta T cell', 'dendritic cell', 'non-classical monocyte', 'progenitor cell', ...
  📎 experimental_factors (3, bionty.ExperimentalFactor): '10x 5' v1', '10x 3' v3', '10x 5' v2'
  📎 ulabels (12, core.ULabel): 'D496', 'A31', 'A29', 'D503', 'A35', 'A52', '621B', '637C', '582C', 'A37', ...

Create a map-style dataset

Let us create a map-style dataset using mapped(), which returns a MappedCollection. This is the kind of input that, for example, the PyTorch DataLoader expects.

Under the hood, it performs a virtual join of the features of the underlying AnnData objects, which makes it possible to work with very large collections.

You can perform either a virtual inner join:

with collection.mapped(obs_keys=["cell_type"], join="inner") as dataset:
    print(len(dataset.var_joint))

Or a virtual outer join:

dataset = collection.mapped(obs_keys=["cell_type"], join="outer")

len(dataset.var_joint)
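
The difference between the two joins can be illustrated with a toy example in plain Python. The gene lists are made up, and the helpers `inner_join` and `outer_join` are hypothetical names that are not part of lamindb. This is only a sketch of the idea, not how mapped() is implemented.

```python
# Toy illustration of inner vs. outer joins over feature (gene) names.
# The gene lists and helper names are hypothetical, not lamindb API.


def inner_join(var_lists):
    """Keep only features present in every list, in first-list order."""
    shared = set(var_lists[0]).intersection(*map(set, var_lists[1:]))
    return [name for name in var_lists[0] if name in shared]


def outer_join(var_lists):
    """Keep every feature seen in any list, in order of first appearance."""
    seen = {}
    for names in var_lists:
        for name in names:
            seen.setdefault(name, None)
    return list(seen)


var_a = ["AATK", "GRIA1", "EYS"]
var_b = ["GRIA1", "EYS", "RAB43"]

inner = inner_join([var_a, var_b])
outer = outer_join([var_a, var_b])
print(inner)  # ['GRIA1', 'EYS']
print(outer)  # ['AATK', 'GRIA1', 'EYS', 'RAB43']
```

An inner join can only shrink the feature space as more artifacts are added, while an outer join can only grow it, so the two values of len(dataset.var_joint) bound each other.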

A MappedCollection is compatible with a PyTorch DataLoader because it implements __getitem__ over a list of backed AnnData objects. The cell at index 5 (the sixth cell, since indexing is zero-based) can be accessed like this:

dataset[5]
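
To show what indexing over a list of stores can look like, here is a minimal sketch of a map-style dataset that translates a global index into a store and a position within that store. `ToyMappedDataset` and its returned keys are hypothetical; MappedCollection reads from backed AnnData objects and may be implemented differently.

```python
from bisect import bisect_right
from itertools import accumulate


class ToyMappedDataset:
    """Map-style dataset over several in-memory 'stores' (lists of rows).

    A hypothetical stand-in for illustration only.
    """

    def __init__(self, stores):
        self.stores = stores
        # Cumulative end offsets, e.g. store sizes [3, 2] -> [3, 5]
        self.offsets = list(accumulate(len(s) for s in stores))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the store whose index range contains the global index
        store_idx = bisect_right(self.offsets, idx)
        start = self.offsets[store_idx - 1] if store_idx > 0 else 0
        row = self.stores[store_idx][idx - start]
        return {"store_idx": store_idx, "row": row}


toy = ToyMappedDataset([["a0", "a1", "a2"], ["b0", "b1"]])
print(len(toy))  # 5
print(toy[3])    # {'store_idx': 1, 'row': 'b0'}
```

Only `__len__` and `__getitem__` are needed for a data loader to treat an object as a map-style dataset, which is why no data has to be concatenated up front.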

The labels are encoded into integers:

dataset.encoders
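
As a rough illustration of what such an encoding does, the sketch below maps made-up labels to integer codes with a plain dictionary. The labels and the ordering of the codes are assumptions for this example; the codes lamindb assigns may differ.

```python
# Hypothetical labels for illustration
labels = ["lymphocyte", "dendritic cell", "lymphocyte", "progenitor cell"]

# Assign each distinct label an integer code, in order of first appearance
encoder = {}
for label in labels:
    encoder.setdefault(label, len(encoder))

# Encode the labels and build the reverse mapping for decoding predictions
encoded = [encoder[label] for label in labels]
decoder = {code: label for label, code in encoder.items()}

print(encoder)  # {'lymphocyte': 0, 'dendritic cell': 1, 'progenitor cell': 2}
print(encoded)  # [0, 1, 0, 2]
```

Keeping the reverse mapping around makes it possible to turn a model's integer predictions back into readable labels.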

Create a PyTorch DataLoader

Let us use a weighted sampler:

from torch.utils.data import DataLoader, WeightedRandomSampler

# the label key used for the weights does not have to be among the keys passed when creating the dataset
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("cell_type"), num_samples=len(dataset)
)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)
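
The effect of weighted sampling can be sketched without PyTorch. The example below assumes that each cell is weighted by the inverse frequency of its label, a common choice for balancing classes; whether get_label_weights uses exactly this scheme is an assumption here. The labels are made up, and `random.choices` stands in for the sampler by drawing indices with replacement in proportion to their weights.

```python
import random
from collections import Counter

# Hypothetical, imbalanced labels for illustration
labels = ["T cell"] * 8 + ["monocyte"] * 2

# Assumed weighting scheme: each cell gets 1 / (count of its label)
counts = Counter(labels)
weights = [1.0 / counts[label] for label in labels]

# The total weight per class is now equal across classes
class_weight = Counter()
for label, weight in zip(labels, weights):
    class_weight[label] += weight

# Draw indices with replacement, proportional to weight
rng = random.Random(0)
sampled = rng.choices(range(len(labels)), weights=weights, k=1000)
sampled_counts = Counter(labels[i] for i in sampled)

print(dict(class_weight))
print(dict(sampled_counts))
```

Although monocytes make up only 20% of the toy data, both classes carry the same total weight, so each is drawn roughly half of the time.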

We can now iterate through the data loader:

for batch in dataloader:
    pass

Close the connections in MappedCollection:

dataset.close()