license: mit
Model description
InstructCell is a multi-modal AI copilot that integrates natural language with single-cell RNA sequencing data, enabling researchers to perform tasks like cell type annotation, pseudo-cell generation, and drug sensitivity prediction through intuitive text commands. By leveraging a specialized multi-modal architecture and our multi-modal single-cell instruction dataset, InstructCell reduces technical barriers and enhances accessibility for single-cell analysis.
Instruct version: Generates only the answer, without additional explanatory text, so outputs stay concise and task-specific.
How to use
We provide a simple example for quick reference. This demonstrates a basic cell type annotation workflow.
Make sure to set H5AD_PATH and GENE_VOCAB_PATH appropriately:
H5AD_PATH: Path to your .h5ad single-cell data file (e.g., H5AD_PATH = "path/to/your/data.h5ad").
GENE_VOCAB_PATH: Path to your gene vocabulary file (e.g., GENE_VOCAB_PATH = "path/to/your/gene_vocab.npy").
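Before loading anything, it can help to confirm that both paths point at real files. The following is a minimal sketch using only the Python standard library; the helper name `missing_inputs` is hypothetical and not part of InstructCell, and the paths are the placeholders from the description above.

```python
from pathlib import Path

# Placeholder paths from the description above; replace them with your own.
H5AD_PATH = "path/to/your/data.h5ad"
GENE_VOCAB_PATH = "path/to/your/gene_vocab.npy"


def missing_inputs(*paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


missing = missing_inputs(H5AD_PATH, GENE_VOCAB_PATH)
if missing:
    # Report the problem early instead of failing later inside a loader.
    print("Missing input files:", ", ".join(missing))
```

Running this check first gives a clearer message than a failure deep inside `anndata.read_h5ad` or `np.load`.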
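from mmllm.module import InstructCell
import anndata
import numpy as np
from utils import unify_gene_features
# Placeholder paths; replace them with the locations of your own files
H5AD_PATH = "path/to/your/data.h5ad"
GENE_VOCAB_PATH = "path/to/your/gene_vocab.npy"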
# Load the pre-trained InstructCell model from HuggingFace
model = InstructCell.from_pretrained("zjunlp/InstructCell-instruct")
# Load the single-cell data (H5AD format) and gene vocabulary file (numpy format)
adata = anndata.read_h5ad(H5AD_PATH)
gene_vocab = np.load(GENE_VOCAB_PATH)
adata = unify_gene_features(adata, gene_vocab, force_gene_symbol_uppercase=False)
# Select a random single-cell sample and extract its gene counts and metadata
k = np.random.randint(0, len(adata))
gene_counts = adata[k, :].X.toarray()
sc_metadata = adata[k, :].obs.iloc[0].to_dict()
# Define the model prompt with placeholders for metadata and gene expression profile
prompt = (
    "Can you help me annotate this single cell from a {species}? "
    "It was sequenced using {sequencing_method} and is derived from {tissue}. "
    "The gene expression profile is {input}. Thanks!"
)
# Use the model to generate predictions
for key, value in model.predict(
    prompt,
    gene_counts=gene_counts,
    sc_metadata=sc_metadata,
    do_sample=True,
    top_p=0.95,
    top_k=50,
    max_new_tokens=256,
).items():
    # Print each key-value pair
    print(f"{key}: {value}")
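The prompt above contains named placeholders: `{species}`, `{sequencing_method}`, and `{tissue}` correspond to metadata fields, while `{input}` marks where the gene expression profile goes. How `model.predict` substitutes them internally is not shown here, so the sketch below only illustrates the general idea with plain Python string formatting. The helper `fill_metadata` and the example metadata values are hypothetical.

```python
class _KeepMissing(dict):
    """Dict that leaves unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def fill_metadata(template, metadata):
    """Fill the placeholders found in `metadata`; keep the rest (e.g. {input}) as-is."""
    return template.format_map(_KeepMissing(metadata))


prompt = (
    "Can you help me annotate this single cell from a {species}? "
    "It was sequenced using {sequencing_method} and is derived from {tissue}. "
    "The gene expression profile is {input}. Thanks!"
)
# Made-up metadata, for illustration only.
example_metadata = {
    "species": "human",
    "sequencing_method": "10xGenomics",
    "tissue": "liver",
}
print(fill_metadata(prompt, example_metadata))
```

If the model fills placeholders in a similar way, the keys of `sc_metadata` (the column names of `adata.obs`) would need to match the placeholder names used in the prompt; the notebook mentioned below is the place to confirm the actual behavior.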
For more detailed explanations and additional examples, please refer to the Jupyter notebook demo.ipynb.
Citation
If you use the code or data, please cite the following paper:
@article{fang2025instructcell,
  title={A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following},
  author={Fang, Yin and Deng, Xinle and Liu, Kangwei and Zhang, Ningyu and Qian, Jingyang and Yang, Penghui and Fan, Xiaohui and Chen, Huajun},
  journal={arXiv preprint arXiv:2501.08187},
  year={2025}
}