scrna4/6 Jupyter Notebook lamindata

Analyze a collection in memory

Here, we’ll analyze the growing collection by loading it into memory. This is only possible if it’s not too large. If your data is large, you’ll likely want to iterate over the collection to train a model, the topic of the next page (scrna5/6).

import lamindb as ln
import bionty as bt

ln.context.uid = "mfWKm8OtAzp80000"
ln.context.track()
→ connected lamindb: testuser1/test-scrna
→ notebook imports: bionty==0.50.2 lamindb==0.76.8 scanpy==1.10.3
→ created Transform(uid='mfWKm8OtAzp80000') & created Run(started_at='2024-09-24 13:45:22 UTC')
ln.Collection.df()
id  uid                   version  is_latest  name                               description  hash                    reference  reference_type  visibility  transform_id  meta_artifact_id  run_id  created_by_id  updated_at
2   PO6HzRR8uqjnVkJa0001  2        True       My versioned scRNA-seq collection  None         dBJLoG6NFZ8WwlWqnfyFdQ  None       None            1           2             None              2       1              2024-09-24 13:45:12.134025+00:00
1   PO6HzRR8uqjnVkJa0000  None     False      My versioned scRNA-seq collection  None         exJtsBYH53iiebYH-Qx0sw  None       None            1           1             None              1       1              2024-09-24 13:45:12.123461+00:00
collection = ln.Collection.get(
    name="My versioned scRNA-seq collection", version="2"
)
collection.artifacts.df()
id  uid                   version  is_latest  description                      key   suffix  type     size      hash                    n_objects  n_observations  _hash_type  _accessor  visibility  _key_is_virtual  storage_id  transform_id  run_id  created_by_id  updated_at
1   CjnpmTLs4AR6KB2i0000  None     True       Human immune cells from Conde22  None  .h5ad   dataset  57612943  9sXda5E7BYiVoDOQkTC0KB  None       1648            sha1-fl     AnnData    1           True             1           1             1       1              2024-09-24 13:44:44.607825+00:00
2   nlQUiApoL6ZPRh2Y0000  None     True       10x reference adata              None  .h5ad   dataset  853388    jxR7kj0-xk-84u5sv3J9CQ  None       70              md5         AnnData    1           True             1           2             2       1              2024-09-24 13:45:12.030123+00:00

If the collection isn’t too large, we can now load it into memory.

Under the hood, the AnnData objects are concatenated during loading.

How long this takes depends on factors such as the number and size of the artifacts and whether they first need to be fetched from remote storage.

If you load the collection often, consider storing a concatenated version of the collection rather than the individual pieces.

adata = collection.load()

By default, concatenation uses an outer join, as in pandas:

adata
AnnData object with n_obs × n_vars = 1718 × 36503
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain', 'donor', 'tissue', 'assay', 'artifact_uid'
    obsm: 'X_pca', 'X_umap'
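
To make the join behavior concrete, here is a minimal sketch in plain pandas, not lamindb or AnnData. The two toy tables and their gene names are hypothetical stand-ins for per-artifact matrices whose variables only partially overlap.

```python
import pandas as pd

# Hypothetical toy tables: rows are cells, columns are genes.
# The gene sets overlap only on "gene_b".
a = pd.DataFrame(
    {"gene_a": [1.0, 2.0], "gene_b": [3.0, 4.0]}, index=["cell1", "cell2"]
)
b = pd.DataFrame({"gene_b": [5.0], "gene_c": [6.0]}, index=["cell3"])

# Outer join (the pandas default): keep the union of columns.
outer = pd.concat([a, b], join="outer")

# Inner join: keep only the columns shared by all inputs.
inner = pd.concat([a, b], join="inner")

print(list(outer.columns))  # union of genes
print(list(inner.columns))  # shared genes only
print(int(outer.isna().sum().sum()))  # entries absent from one of the inputs
```

An outer join keeps every variable seen in any input, which is consistent with the large n_vars of the loaded object above. An inner join would keep only the variables that all inputs share.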

The AnnData object references the individual artifacts in its .obs annotations:

adata.obs.artifact_uid.cat.categories
Index(['nlQUiApoL6ZPRh2Y0000', 'CjnpmTLs4AR6KB2i0000'], dtype='object')
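
The artifact_uid column makes it possible to trace each observation back to its source artifact. The following is a rough sketch on a toy pandas DataFrame standing in for adata.obs. The frame itself is an assumption, built so that its row counts mirror the n_observations values in the artifact table above.

```python
import pandas as pd

# Toy stand-in for adata.obs (an assumption, not the real object).
# Row counts mirror n_observations of the two artifacts listed above.
obs = pd.DataFrame(
    {
        "artifact_uid": pd.Categorical(
            ["CjnpmTLs4AR6KB2i0000"] * 1648 + ["nlQUiApoL6ZPRh2Y0000"] * 70
        )
    }
)

# Number of observations contributed by each source artifact.
counts = obs["artifact_uid"].value_counts()
print(counts)

# Boolean mask selecting the rows that came from one artifact.
mask = obs["artifact_uid"] == "nlQUiApoL6ZPRh2Y0000"
subset = obs[mask]
print(len(subset))
```

The two counts add up to the 1718 observations of the concatenated object. With the real data, a boolean mask like this one could presumably be used to subset the AnnData itself, for example adata[adata.obs.artifact_uid == "nlQUiApoL6ZPRh2Y0000"].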

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

genes = bt.Gene.lookup(field="symbol")
genes.itm2b.ensembl_gene_id
'ENSG00000136156'
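
To illustrate the idea behind attribute-style lookup, here is a hypothetical pure-Python sketch. It is not how bionty implements lookup. The helper make_lookup and the record list are assumptions made for illustration, and the only behavior taken from the example above is that the symbol ITM2B is reachable under the lowercased attribute itm2b.

```python
from types import SimpleNamespace

# Hypothetical records with only the two fields used here.
records = [
    {"symbol": "ITM2B", "ensembl_gene_id": "ENSG00000136156"},
    {"symbol": "TP53", "ensembl_gene_id": "ENSG00000141510"},
]


def make_lookup(records, field):
    """Build an attribute-access lookup keyed by the lowercased value of `field`.

    Illustrative helper only; assumes the lowercased values are valid
    Python identifiers.
    """
    entries = {r[field].lower(): SimpleNamespace(**r) for r in records}
    return SimpleNamespace(**entries)


genes_sketch = make_lookup(records, field="symbol")
print(genes_sketch.itm2b.ensembl_gene_id)
```

Attribute access of this kind works well with tab completion in a notebook, which is the main convenience of a lookup object over a plain dictionary.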

Let us create a plot:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)
sc.pl.pca(
    adata,
    color=genes.itm2b.ensembl_gene_id,
    title=(
        f"{genes.itm2b.symbol} / {genes.itm2b.ensembl_gene_id} /"
        f" {genes.itm2b.description}"
    ),
    save="_itm2b",
)
WARNING: saving figure to file figures/pca_itm2b.pdf
[Figure: PCA plot of the collection, colored by ITM2B expression]

The call above already saved the plot as a PDF. We can register that file as an artifact and then see it in the lineage graph:

artifact = ln.Artifact("./figures/pca_itm2b.pdf", description="My result on ITM2B")
artifact.save()
artifact.view_lineage()
[Figure: lineage graph of the saved artifact]

But since the image is part of the notebook, we can also rely on the report that is created when the notebook is saved:

ln.context.finish()