---
language:
- en
base_model: EleutherAI/pythia-410m
library_name: transformers
tags:
- biology
- scRNAseq
---
# Overview
This is the C2S-Pythia-410m-cell-type-prediction model, based on the Pythia-410m architecture developed by EleutherAI
and fine-tuned with Cell2Sentence (C2S) on a diverse set of single-cell RNA sequencing (scRNA-seq) datasets from CellxGene
and the Human Cell Atlas. Cell2Sentence adapts large language models (LLMs) to single-cell biology by transforming
scRNA-seq data into "cell sentences": sequences of gene names ordered by descending expression level.
Because a cell sentence is plain text, an LLM can be applied to single-cell tasks through its usual text interface.
This model focuses on cell type prediction.
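
The cell sentence transformation described above can be illustrated with a minimal sketch. It is not the Cell2Sentence implementation: the function name `cell_to_sentence`, the toy counts, and the choice to drop genes with zero counts are assumptions made for illustration.

```python
import numpy as np

def cell_to_sentence(expression, gene_names):
    """Return gene names ordered from highest to lowest expression.

    Genes with zero counts are skipped (an assumption of this sketch).
    """
    expression = np.asarray(expression)
    # A stable sort on the negated values keeps the original gene order for ties.
    order = np.argsort(-expression, kind="stable")
    return " ".join(gene_names[i] for i in order if expression[i] > 0)

# Toy example with made-up counts for four genes.
genes = ["CD3E", "MS4A1", "GAPDH", "ACTB"]
counts = [3, 0, 12, 7]
sentence = cell_to_sentence(counts, genes)
print(sentence)  # GAPDH ACTB CD3E
```

For real data, the Cell2Sentence GitHub repository linked below provides the conversion utilities that were actually used.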

# Training Data
This model was trained on more than 57 million human and mouse cells drawn from more than 800 scRNA-seq
datasets from CellxGene and the Human Cell Atlas. The data covers a broad range of cell types and conditions
across multiple tissues in both species.

Cell sentences used for training were limited to the top 200 genes per cell, ranked by expression.
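
Since training used the top 200 genes, inputs at inference time probably work best when cut to the same length. A minimal sketch, assuming the cell sentence is a space-separated string already ordered by descending expression (`truncate_cell_sentence` is a hypothetical helper, not part of any library):

```python
def truncate_cell_sentence(sentence, max_genes=200):
    """Keep only the first `max_genes` gene names of a space-separated cell sentence."""
    return " ".join(sentence.split()[:max_genes])

# Placeholder gene names, standing in for a real ranked gene list.
long_sentence = " ".join(f"GENE{i}" for i in range(500))
short = truncate_cell_sentence(long_sentence)
print(len(short.split()))  # 200
```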

# Tasks
This model is designed for:
- Cell type prediction: Predicting the cell type based on the "cell sentence" generated from scRNA-seq data.
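
A cell type prediction call might look like the sketch below, which uses the standard `transformers` causal language model interface. Two things here are assumptions rather than facts from this card: the repository id in `MODEL_ID`, and the prompt wording in `build_prompt`, which is a placeholder and not the template used during fine-tuning. The Cell2Sentence repository is the place to check for the exact prompt format.

```python
# Assumed Hugging Face repository id; adjust if the model is hosted elsewhere.
MODEL_ID = "vandijklab/C2S-Pythia-410m-cell-type-prediction"

def build_prompt(cell_sentence, n_genes=200):
    """Wrap a cell sentence in a placeholder prompt (wording is hypothetical)."""
    genes = cell_sentence.split()[:n_genes]
    return (
        f"The following is a list of {len(genes)} gene names ordered by descending "
        f"expression level in a cell: {' '.join(genes)}. The cell type of this cell is:"
    )

def predict_cell_type(cell_sentence, max_new_tokens=20):
    """Generate a cell type label for one cell sentence."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(cell_sentence), return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # avoids a missing-pad-token warning
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

prompt = build_prompt("GAPDH ACTB CD3E")
print(prompt)
```

Calling `predict_cell_type("GAPDH ACTB CD3E")` downloads the weights on first use and returns the generated text; the quality of that output depends on how closely the prompt matches the training format.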

# Cell2Sentence Links
- GitHub: https://github.com/vandijklab/cell2sentence
- Paper: https://www.biorxiv.org/content/10.1101/2023.09.11.557287v3

# Pythia Links
- Paper: https://arxiv.org/pdf/2304.01373
- Hugging Face: https://huggingface.co/EleutherAI/pythia-410m