Christina Theodoris committed on Jan 15, 2024

Commit

2a0dcbe

1 Parent(s): 10d3f10

add sphinx docs

Files changed (18)

README.md +9 -7
docs/Makefile +20 -0
docs/make.bat +35 -0
docs/source/_static/css/custom.css +34 -0
docs/source/_static/gf_logo.png +0 -0
docs/source/about.rst +45 -0
docs/source/api.rst +35 -0
docs/source/conf.py +80 -0
docs/source/geneformer.emb_extractor.rst +26 -0
docs/source/geneformer.in_silico_perturber.rst +8 -0
docs/source/geneformer.in_silico_perturber_stats.rst +25 -0
docs/source/geneformer.tokenizer.rst +14 -0
docs/source/getstarted.rst +36 -0
docs/source/index.rst +16 -0
geneformer/emb_extractor.py +124 -98
geneformer/in_silico_perturber.py +103 -96
geneformer/in_silico_perturber_stats.py +99 -91
geneformer/tokenizer.py +36 -16

README.md CHANGED

@@ -3,28 +3,30 @@ datasets: ctheodoris/Genecorpus-30M
 license: apache-2.0
 ---
 # Geneformer
-Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
 See [our manuscript](https://rdcu.be/ddrx0) for details.
 # Model Description
-Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus comprised of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell’s transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell’s transcriptome and takes advantage of the many observations of each gene’s expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly-expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts value while the overall relative ranking of genes within each cell remains more stable.
 The rank value encoding of each single cell’s transcriptome then proceeds through six transformer encoder units. Pretraining was accomplished using a masked learning objective where 15% of the genes within each transcriptome were masked and the model was trained to predict which gene should be within each masked position in that specific cell state using the context of the remaining unmasked genes. A major strength of this approach is that it is entirely self-supervised and can be accomplished on completely unlabeled data, which allows the inclusion of large amounts of training data without being restricted to samples with accompanying labels.
-We detail applications and results in [our manuscript](https://rdcu.be/ddrx0).
-During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model’s attention weights in a completely self-supervised manner. Fine-tuning Geneformer towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modeling with limited patient data, Geneformer identified candidate therapeutic targets. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
 In [our manuscript](https://rdcu.be/ddrx0), we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within this repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
 # Application
 The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.
 Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:
 *Fine-tuning*:
-- transcription factor dosage sensitivity
 - chromatin dynamics (bivalently marked promoters)
 - transcription factor regulatory range
 - gene network centrality
@@ -64,6 +66,6 @@ For usage, see [examples](https://huggingface.co/ctheodoris/Geneformer/tree/main
 - extracting and plotting cell embeddings
 - in silico perturbation
-Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) in the dataset repository, but these only represent a few example fine-tuning applications.
-Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).

 license: apache-2.0
 ---
 # Geneformer
+Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
 See [our manuscript](https://rdcu.be/ddrx0) for details.
 # Model Description
+Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus comprised of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell’s transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell’s transcriptome and takes advantage of the many observations of each gene’s expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly-expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts value while the overall relative ranking of genes within each cell remains more stable.
 The rank value encoding of each single cell’s transcriptome then proceeds through six transformer encoder units. Pretraining was accomplished using a masked learning objective where 15% of the genes within each transcriptome were masked and the model was trained to predict which gene should be within each masked position in that specific cell state using the context of the remaining unmasked genes. A major strength of this approach is that it is entirely self-supervised and can be accomplished on completely unlabeled data, which allows the inclusion of large amounts of training data without being restricted to samples with accompanying labels.
+We detail applications and results in [our manuscript](https://rdcu.be/ddrx0).
+During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model’s attention weights in a completely self-supervised manner. With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. In silico perturbation with zero-shot learning identified a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. In silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an iPSC model of the disease. Overall, Geneformer represents a foundational deep learning model pretrained on ~30 million human single cell transcriptomes to gain a fundamental understanding of gene network dynamics that can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.
 In [our manuscript](https://rdcu.be/ddrx0), we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within this repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
+Both the 6 and 12 layer Geneformer models were pretrained in June 2021.
 # Application
 The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.
 Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:
 *Fine-tuning*:
+- transcription factor dosage sensitivity
 - chromatin dynamics (bivalently marked promoters)
 - transcription factor regulatory range
 - gene network centrality
 - extracting and plotting cell embeddings
 - in silico perturbation
+Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) in the dataset repository, but these only represent a few example fine-tuning applications.
+Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).
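The rank value encoding described in the Model Description can be illustrated with a minimal sketch. This is not the Geneformer tokenizer: the function name `rank_value_encode`, the gene labels, and the per-gene normalization factors are hypothetical, and the only idea taken from the README is that each gene's expression in a cell is normalized by its expression across the corpus before ranking.

```python
def rank_value_encode(counts, gene_norm_factors):
    """Rank the expressed genes of one cell by corpus-normalized expression.

    counts: dict mapping gene -> expression count in this cell.
    gene_norm_factors: dict mapping gene -> typical expression of that gene
        across the corpus (hypothetical values in this sketch).
    Returns the expressed genes ordered from highest to lowest normalized
    expression.
    """
    # Keep only genes detected in this cell and divide each by its
    # corpus-wide normalization factor.
    normalized = {
        gene: count / gene_norm_factors[gene]
        for gene, count in counts.items()
        if count > 0
    }
    # Highest normalized value first.
    return sorted(normalized, key=normalized.get, reverse=True)


# Hypothetical cell: a housekeeping gene with a high raw count and a
# transcription factor with a low raw count.
cell_counts = {"HOUSEKEEPING": 100, "TF": 5, "UNEXPRESSED": 0}
corpus_factors = {"HOUSEKEEPING": 200.0, "TF": 2.0, "UNEXPRESSED": 1.0}

encoding = rank_value_encode(cell_counts, corpus_factors)
print(encoding)
```

In this toy input the transcription factor has the lower raw count but the higher normalized value, so it ranks ahead of the housekeeping gene, which is the behavior the README describes.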

docs/Makefile ADDED

	@@ -0,0 +1,20 @@

+# Minimal makefile for Sphinx documentation
+#
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = build
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+.PHONY: help Makefile
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/make.bat ADDED

	@@ -0,0 +1,35 @@

+@ECHO OFF
+pushd %~dp0
+REM Command file for Sphinx documentation
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=source
+set BUILDDIR=build
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+if "%1" == "" goto help
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+:end
+popd
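Both docs/Makefile and docs/make.bat forward the requested target to `sphinx-build -M <target> source build`. As a rough sketch of what that routing amounts to, the helper below assembles the same command line in Python. The function name `sphinx_make_command` is hypothetical, and the call that would actually run the build is left commented out because it assumes Sphinx is installed and that the working directory is `docs/`.

```python
import subprocess


def sphinx_make_command(target, sourcedir="source", builddir="build",
                        sphinxbuild="sphinx-build", sphinxopts=()):
    """Build the argument list that the Makefile's catch-all rule would run."""
    # Mirrors: $(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS)
    return [sphinxbuild, "-M", target, sourcedir, builddir, *sphinxopts]


html_cmd = sphinx_make_command("html")
print(" ".join(html_cmd))

# To run the build (assumes Sphinx is installed and cwd is docs/):
# subprocess.run(html_cmd, check=True)
```

With the defaults from the Makefile, `make html` corresponds to the printed command, and the rendered pages would be written under `build/html`.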

docs/source/_static/css/custom.css ADDED

	@@ -0,0 +1,34 @@

+.wy-side-nav-search, .wy-nav-top {
+    background: linear-gradient(15deg, #13547a 0%, #80d0c7 100%);
+}
+/* unvisited link */
+.wy-nav-content a:link {
+  color: #067abd;
+}
+/* visited link */
+.wy-nav-content a:visited {
+  color: #4b827c;
+}
+/* mouse over link */
+.wy-nav-content a:hover {
+  color: #80d0c7;
+}
+/* selected link */
+.wy-nav-content a:active {
+  color: #4b827c;
+}
+.sig.sig-object {
+    padding: 5px 5px 5px 5px;
+    background-color: #e6e6e6;
+    border-style: solid;
+    border-color: black;
+    border-width: 1px 0;
+}

docs/source/_static/gf_logo.png ADDED

docs/source/about.rst ADDED

	@@ -0,0 +1,45 @@

+About
+=====
+Model Description
+-----------------
+**Geneformer** is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of ~30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. In silico perturbation with zero-shot learning identified a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. In silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an iPSC model of the disease. Overall, Geneformer represents a foundational deep learning model pretrained on ~30 million human single cell transcriptomes to gain a fundamental understanding of gene network dynamics that can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.
+In `our manuscript <https://rdcu.be/ddrx0>`_, we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within the repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
+Both the 6 and 12 layer Geneformer models were pretrained in June 2021.
+Application
+-----------
+The pretrained Geneformer model can be used directly for zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification.
+Example applications demonstrated in `our manuscript <https://rdcu.be/ddrx0>`_ include:
+| *Fine-tuning*:
+| - transcription factor dosage sensitivity
+| - chromatin dynamics (bivalently marked promoters)
+| - transcription factor regulatory range
+| - gene network centrality
+| - transcription factor targets
+| - cell type annotation
+| - batch integration
+| - cell state classification across differentiation
+| - disease classification
+| - in silico perturbation to determine disease-driving genes
+| - in silico treatment to determine candidate therapeutic targets
+| *Zero-shot learning*:
+| - batch integration
+| - gene context specificity
+| - in silico reprogramming
+| - in silico differentiation
+| - in silico perturbation to determine impact on cell state
+| - in silico perturbation to determine transcription factor targets
+| - in silico perturbation to determine transcription factor cooperativity
+Citation
+--------
+| C V Theodoris #, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor #. `Transfer learning enables predictions in network biology. <https://rdcu.be/ddrx0>`_ *Nature*, 31 May 2023. (# co-corresponding authors)

docs/source/api.rst ADDED

	@@ -0,0 +1,35 @@

+API
+===
+Tokenizer
+---------
+.. toctree::
+   :maxdepth: 1
+   geneformer.tokenizer
+Embedding Extractor
+-------------------
+.. toctree::
+   :maxdepth: 1
+   geneformer.emb_extractor
+In Silico Perturber
+-------------------
+.. toctree::
+   :maxdepth: 1
+   geneformer.in_silico_perturber
+In Silico Perturber Stats
+-------------------------
+.. toctree::
+   :maxdepth: 1
+   geneformer.in_silico_perturber_stats

docs/source/conf.py ADDED

	@@ -0,0 +1,80 @@

+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+import pathlib
+import re
+import sys
+from sphinx.ext import autodoc
+sys.path.insert(0, pathlib.Path(__file__).parents[2].resolve().as_posix())
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+project = "geneformer"
+copyright = "2024, Christina Theodoris"
+author = "Christina Theodoris"
+release = "0.1.0"
+repository_url = "https://huggingface.co/ctheodoris/Geneformer"
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+extensions = [
+    "sphinx.ext.autodoc",
+    "sphinx.ext.autosummary",
+    "nbsphinx",
+    "sphinx.ext.viewcode",
+    "sphinx.ext.doctest",
+]
+templates_path = ["_templates"]
+exclude_patterns = [
+    "**.ipynb_checkpoints",
+]
+autoclass_content = "both"
+class MockedClassDocumenter(autodoc.ClassDocumenter):
+    def add_line(self, line: str, source: str, *lineno: int) -> None:
+        if line == "   Bases: :py:class:`object`":
+            return
+        super().add_line(line, source, *lineno)
+autodoc.ClassDocumenter = MockedClassDocumenter
+add_module_names = False
+def process_signature(app, what, name, obj, options, signature, return_annotation):
+    # loop through each line in the docstring and replace path with
+    # the generic path text
+    signature = re.sub(r"PosixPath\(.*?\)", "FILEPATH", signature)
+    return (signature, None)
+def setup(app):
+    app.connect("autodoc-process-signature", process_signature)
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+html_theme = "sphinx_rtd_theme"
+html_show_sphinx = False
+html_static_path = ["_static"]
+html_logo = "_static/gf_logo.png"
+html_theme_options = {
+    "collapse_navigation": False,
+    "sticky_navigation": True,
+    "navigation_depth": 3,
+    "logo_only": True,
+}
+html_css_files = [
+    "css/custom.css",
+]
+html_show_sourcelink = False
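The `process_signature` hook in conf.py rewrites signatures so that machine-specific default paths do not appear in the rendered API docs. The standalone sketch below applies the same regular expression outside of Sphinx. The helper name `mask_paths` and the example signature string are made up for illustration.

```python
import re


def mask_paths(signature):
    """Replace every PosixPath(...) repr in a signature with FILEPATH."""
    # Same pattern as in conf.py: non-greedy match up to the first ")".
    return re.sub(r"PosixPath\(.*?\)", "FILEPATH", signature)


# Hypothetical signature string of the kind autodoc might render.
raw_signature = (
    "(token_dictionary_file=PosixPath('/home/user/geneformer/token_dictionary.pkl'), "
    "nproc=4)"
)

masked_signature = mask_paths(raw_signature)
print(masked_signature)
```

Because the match is non-greedy, each `PosixPath(...)` default is replaced separately when a signature contains more than one.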

docs/source/geneformer.emb_extractor.rst ADDED

	@@ -0,0 +1,26 @@

+geneformer.emb\_extractor
+=========================
+.. automodule:: geneformer.emb_extractor
+   :members:
+   :undoc-members:
+   :show-inheritance:
+   :exclude-members:
+       accumulate_tdigests,
+       gen_heatmap_class_colors,
+       gen_heatmap_class_dict,
+       get_embs,
+       label_cell_embs,
+       label_gene_embs,
+       make_colorbar,
+       plot_heatmap,
+       plot_umap,
+       summarize_gene_embs,
+       tdigest_mean,
+       tdigest_median,
+       test_emb,
+       update_tdigest_dict,
+       update_tdigest_dict_mean,
+       update_tdigest_dict_median,
+       valid_option_dict,
+       validate_options

docs/source/geneformer.in_silico_perturber.rst ADDED

	@@ -0,0 +1,8 @@

+geneformer.in\_silico\_perturber
+=======================================
+.. automodule:: geneformer.in_silico_perturber
+   :members:
+   :undoc-members:
+   :show-inheritance:
+   :exclude-members:  valid_option_dict, validate_options, apply_additional_filters, isp_perturb_all, isp_perturb_set, update_perturbation_dictionary

docs/source/geneformer.in_silico_perturber_stats.rst ADDED

	@@ -0,0 +1,25 @@

+geneformer.in\_silico\_perturber\_stats
+==============================================
+.. automodule:: geneformer.in_silico_perturber_stats
+   :members:
+   :undoc-members:
+   :show-inheritance:
+   :exclude-members:
+        find,
+        get_fdr,
+        get_gene_list,
+        get_impact_component,
+        invert_dict,
+        isp_aggregate_gene_shifts,
+        isp_aggregate_grouped_perturb,
+        isp_stats_mixture_model,
+        isp_stats_to_goal_state,
+        isp_stats_vs_null,
+        n_detections,
+        read_dict,
+        read_dictionaries,
+        token_to_gene_name,
+        token_tuple_to_ensembl_ids,
+        valid_option_dict,
+        validate_options

docs/source/geneformer.tokenizer.rst ADDED

	@@ -0,0 +1,14 @@

+geneformer.tokenizer
+====================
+.. automodule:: geneformer.tokenizer
+   :members:
+   :undoc-members:
+   :show-inheritance:
+   :exclude-members:
+        create_dataset,
+        tokenize_anndata,
+        tokenize_files,
+        tokenize_loom,
+        rank_genes,
+        tokenize_cell

docs/source/getstarted.rst ADDED

	@@ -0,0 +1,36 @@

+Getting Started
+===============
+Installation
+------------
+Geneformer installation instructions.
+Make sure you have git-lfs installed (https://git-lfs.com).
+.. code-block:: bash
+    git lfs install
+    git clone https://huggingface.co/ctheodoris/Geneformer
+    cd Geneformer
+    pip install .
+Tutorials
+---------
+| See `examples <https://huggingface.co/ctheodoris/Geneformer/tree/main/examples>`_ for:
+| - tokenizing transcriptomes
+| - pretraining
+| - hyperparameter tuning
+| - fine-tuning
+| - extracting and plotting cell embeddings
+| - in silico perturbation
+Please note that the fine-tuning examples are meant to be generally applicable and the input datasets and labels will vary dependent on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located within the `example_input_files directory <https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files>`_ in the dataset repository, but these only represent a few example fine-tuning applications.
+Tips
+----
+Please note that GPU resources are required for efficient usage of Geneformer. Additionally, we strongly recommend tuning hyperparameters for each downstream fine-tuning application as this can significantly boost predictive potential in the downstream task (e.g. max learning rate, learning schedule, number of layers to freeze, etc.).

docs/source/index.rst ADDED

	@@ -0,0 +1,16 @@

+Geneformer
+==========
+Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in network biology.
+See `our manuscript <https://rdcu.be/ddrx0>`_ for details.
+Table of Contents
+-----------------
+.. toctree::
+   :maxdepth: 2
+   about
+   getstarted
+   api

geneformer/emb_extractor.py CHANGED

@@ -1,29 +1,11 @@
 """
 Geneformer embedding extractor.
-Usage:
-  from geneformer import EmbExtractor
-  embex = EmbExtractor(model_type="CellClassifier",
-                       num_classes=3,
-                       emb_mode="cell",
-                       cell_emb_style="mean_pool",
-                       gene_emb_style="mean_pool",
-                       filter_data={"cell_type":["cardiomyocyte"]},
-                       max_ncells=1000,
-                       max_ncells_to_plot=1000,
-                       emb_layer=-1,
-                       emb_label=["disease","cell_type"],
-                       labels_to_plot=["disease","cell_type"],
-                       nproc=16,
-                       summary_stat=None)
-  embs = embex.extract_embs("path/to/model",
-                            "path/to/input_data",
-                            "path/to/output_directory",
-                            "output_prefix")
-  embex.plot_embs(embs=embs,
-                  plot_style="heatmap",
-                  output_directory="path/to/output_directory",
-                  output_prefix="output_prefix")
 """
@@ -414,51 +396,69 @@ class EmbExtractor:
         Initialize embedding extractor.
         Parameters
-        ----------
         model_type : {"Pretrained","GeneClassifier","CellClassifier"}
-            Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
         num_classes : int
-            If model is a gene or cell classifier, specify number of classes it was trained to classify.
-            For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
         emb_mode : {"cell","gene"}
-            Whether to output cell or gene embeddings.
         cell_emb_style : "mean_pool"
-            Method for summarizing cell embeddings.
-            Currently only option is mean pooling of gene embeddings for given cell.
         gene_emb_style : "mean_pool"
-            Method for summarizing gene embeddings.
-            Currently only option is mean pooling of contextual gene embeddings for given gene.
         filter_data : None, dict
-            Default is to extract embeddings from all input data.
-            Otherwise, dictionary specifying .dataset column name and list of values to filter by.
         max_ncells : None, int
-            Maximum number of cells to extract embeddings from.
-            Default is 1000 cells randomly sampled from input data.
-            If None, will extract embeddings from all cells.
         emb_layer : {-1, 0}
-            Embedding layer to extract.
-            The last layer is most specifically weighted to optimize the given learning objective.
-            Generally, it is best to extract the 2nd to last layer to get a more general representation.
-            -1: 2nd to last layer
-            0: last layer
         emb_label : None, list
-            List of column name(s) in .dataset to add as labels to embedding output.
         labels_to_plot : None, list
-            Cell labels to plot.
-            Shown as color bar in heatmap.
-            Shown as cell color in umap.
-            Plotting umap requires labels to plot.
         forward_batch_size : int
-            Batch size for forward pass.
         nproc : int
-            Number of CPU processes to use.
         summary_stat : {None, "mean", "median", "exact_mean", "exact_median"}
-            If exact_mean or exact_median, outputs only exact mean or median embedding of input data.
-            If mean or median, outputs only approximated mean or median embedding of input data.
-            Non-exact recommended if encountering memory constraints while generating goal embedding positions.
-            Non-exact is slower but more memory-efficient.
         token_dictionary_file : Path
-            Path to pickle file containing token dictionary (Ensembl ID:token).
         """
         self.model_type = model_type
@@ -533,20 +533,32 @@ class EmbExtractor:
         Extract embeddings from input data and save as results in output_directory.
         Parameters
-        ----------
         model_directory : Path
-            Path to directory containing model
         input_data_file : Path
-            Path to directory containing .dataset inputs
         output_directory : Path
-            Path to directory where embedding data will be saved as csv
         output_prefix : str
-            Prefix for output file
         output_torch_embs : bool
-            Whether or not to also output the embeddings as a tensor.
-            Note, if true, will output embeddings as both dataframe and tensor.
         cell_state : dict
-            Cell state key and value for state embedding extraction.
         """
         filtered_input_data = pu.load_and_filter(
@@ -618,41 +630,43 @@ class EmbExtractor:
         Extract exact mean or exact median cell state embedding positions from input data and save as results in output_directory.
         Parameters
-        ----------
         cell_states_to_model : None, dict
-            Cell states to model if testing perturbations that achieve goal state change.
-            Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
-            state_key: key specifying name of column in .dataset that defines the start/goal states
-            start_state: value in the state_key column that specifies the start state
-            goal_state: value in the state_key column taht specifies the goal end state
-            alt_states: list of values in the state_key column that specify the alternate end states
-            For example: {"state_key": "disease",
-                          "start_state": "dcm",
-                          "goal_state": "nf",
-                          "alt_states": ["hcm", "other1", "other2"]}
         model_directory : Path
-            Path to directory containing model
         input_data_file : Path
-            Path to directory containing .dataset inputs
         output_directory : Path
-            Path to directory where embedding data will be saved as csv
         output_prefix : str
-            Prefix for output file
         output_torch_embs : bool
-            Whether or not to also output the embeddings as a tensor.
-            Note, if true, will output embeddings as both dataframe and tensor.
         Outputs
-        ----------
-        Outputs state_embs_dict for use with in silico perturber.
-        Format is dictionary of embedding positions of each cell state to model shifts from/towards.
-        Keys specify each possible cell state to model.
-        Values are target embedding positions as torch.tensor.
-        For example: {"nf": emb_nf,
-                      "hcm": emb_hcm,
-                      "dcm": emb_dcm,
-                      "other1": emb_other1,
-                      "other2": emb_other2}
         """
         pu.validate_cell_states_to_model(cell_states_to_model)
@@ -708,21 +722,33 @@ class EmbExtractor:
         Plot embeddings, coloring by provided labels.
         Parameters
-        ----------
         embs : pandas.core.frame.DataFrame
-            Pandas dataframe containing embeddings output from extract_embs
         plot_style : str
-            Style of plot: "heatmap" or "umap"
         output_directory : Path
-            Path to directory where plots will be saved as pdf
         output_prefix : str
-            Prefix for output file
         max_ncells_to_plot : None, int
-            Maximum number of cells to plot.
-            Default is 1000 cells randomly sampled from embeddings.
-            If None, will plot embeddings from all cells.
         kwargs_dict : dict
-            Dictionary of kwargs to pass to plotting function.
         """
         if plot_style not in ["heatmap", "umap"]:

 """
 Geneformer embedding extractor.
+**Description:**
+| Extracts gene or cell embeddings.
+| Plots cell embeddings as heatmaps or UMAPs.
+| Generates cell state embedding dictionary for use with InSilicoPerturber.
 """
         Initialize embedding extractor.
         Parameters
+        ~~~~~~~~~~
         model_type : {"Pretrained","GeneClassifier","CellClassifier"}
+            | Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
         num_classes : int
+            | If model is a gene or cell classifier, specify number of classes it was trained to classify.
+            | For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
         emb_mode : {"cell","gene"}
+            | Whether to output cell or gene embeddings.
         cell_emb_style : "mean_pool"
+            | Method for summarizing cell embeddings.
+            | Currently only option is mean pooling of gene embeddings for given cell.
         gene_emb_style : "mean_pool"
+            | Method for summarizing gene embeddings.
+            | Currently only option is mean pooling of contextual gene embeddings for given gene.
         filter_data : None, dict
+            | Default is to extract embeddings from all input data.
+            | Otherwise, dictionary specifying .dataset column name and list of values to filter by.
         max_ncells : None, int
+            | Maximum number of cells to extract embeddings from.
+            | Default is 1000 cells randomly sampled from input data.
+            | If None, will extract embeddings from all cells.
         emb_layer : {-1, 0}
+            | Embedding layer to extract.
+            | The last layer is most specifically weighted to optimize the given learning objective.
+            | Generally, it is best to extract the 2nd to last layer to get a more general representation.
+            | -1: 2nd to last layer
+            | 0: last layer
         emb_label : None, list
+            | List of column name(s) in .dataset to add as labels to embedding output.
         labels_to_plot : None, list
+            | Cell labels to plot.
+            | Shown as color bar in heatmap.
+            | Shown as cell color in umap.
+            | Plotting umap requires labels to plot.
         forward_batch_size : int
+            | Batch size for forward pass.
         nproc : int
+            | Number of CPU processes to use.
         summary_stat : {None, "mean", "median", "exact_mean", "exact_median"}
+            | If exact_mean or exact_median, outputs only exact mean or median embedding of input data.
+            | If mean or median, outputs only approximated mean or median embedding of input data.
+            | Non-exact recommended if encountering memory constraints while generating goal embedding positions.
+            | Non-exact is slower but more memory-efficient.
         token_dictionary_file : Path
+            | Path to pickle file containing token dictionary (Ensembl ID:token).
+        Examples
+        ~~~~~~~~
+        .. code-block :: python
+            >>> from geneformer import EmbExtractor
+            >>> embex = EmbExtractor(model_type="CellClassifier",
+                ...    num_classes=3,
+                ...    emb_mode="cell",
+                ...    filter_data={"cell_type":["cardiomyocyte"]},
+                ...    max_ncells=1000,
+                ...    emb_layer=-1,
+                ...    emb_label=["disease","cell_type"],
+                ...    labels_to_plot=["disease","cell_type"],
+                ... )
         """
         self.model_type = model_type
         Extract embeddings from input data and save as results in output_directory.
         Parameters
+        ~~~~~~~~~~
         model_directory : Path
+            | Path to directory containing model
         input_data_file : Path
+            | Path to directory containing .dataset inputs
         output_directory : Path
+            | Path to directory where embedding data will be saved as csv
         output_prefix : str
+            | Prefix for output file
         output_torch_embs : bool
+            | Whether or not to also output the embeddings as a tensor.
+            | Note, if true, will output embeddings as both dataframe and tensor.
         cell_state : dict
+            | Cell state key and value for state embedding extraction.
+        Examples
+        ~~~~~~~~
+        .. code-block :: python
+            >>> embs = embex.extract_embs("path/to/model",
+                ...    "path/to/input_data",
+                ...    "path/to/output_directory",
+                ...    "output_prefix",
+                ... )
         """
         filtered_input_data = pu.load_and_filter(
         Extract exact mean or exact median cell state embedding positions from input data and save as results in output_directory.
         Parameters
+        ~~~~~~~~~~
         cell_states_to_model : None, dict
+            | Cell states to model if testing perturbations that achieve goal state change.
+            | Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
+            | state_key: key specifying name of column in .dataset that defines the start/goal states
+            | start_state: value in the state_key column that specifies the start state
+            | goal_state: value in the state_key column that specifies the goal end state
+            | alt_states: list of values in the state_key column that specify the alternate end states
+            | For example:
+            |      {"state_key": "disease",
+            |      "start_state": "dcm",
+            |      "goal_state": "nf",
+            |      "alt_states": ["hcm", "other1", "other2"]}
         model_directory : Path
+            | Path to directory containing model
         input_data_file : Path
+            | Path to directory containing .dataset inputs
         output_directory : Path
+            | Path to directory where embedding data will be saved as csv
         output_prefix : str
+            | Prefix for output file
         output_torch_embs : bool
+            | Whether or not to also output the embeddings as a tensor.
+            | Note, if true, will output embeddings as both dataframe and tensor.
         Outputs
+        ~~~~~~~
+        | Outputs state_embs_dict for use with in silico perturber.
+        | Format is dictionary of embedding positions of each cell state to model shifts from/towards.
+        | Keys specify each possible cell state to model.
+        | Values are target embedding positions as torch.tensor.
+        | For example:
+        |      {"nf": emb_nf,
+        |      "hcm": emb_hcm,
+        |      "dcm": emb_dcm,
+        |      "other1": emb_other1,
+        |      "other2": emb_other2}
         """
         pu.validate_cell_states_to_model(cell_states_to_model)
         Plot embeddings, coloring by provided labels.
         Parameters
+        ~~~~~~~~~~
         embs : pandas.core.frame.DataFrame
+            | Pandas dataframe containing embeddings output from extract_embs
         plot_style : str
+            | Style of plot: "heatmap" or "umap"
         output_directory : Path
+            | Path to directory where plots will be saved as pdf
         output_prefix : str
+            | Prefix for output file
         max_ncells_to_plot : None, int
+            | Maximum number of cells to plot.
+            | Default is 1000 cells randomly sampled from embeddings.
+            | If None, will plot embeddings from all cells.
         kwargs_dict : dict
+            | Dictionary of kwargs to pass to plotting function.
+        Examples
+        ~~~~~~~~
+        .. code-block :: python
+            >>> embex.plot_embs(embs=embs,
+                ...    plot_style="heatmap",
+                ...    output_directory="path/to/output_directory",
+                ...    output_prefix="output_prefix",
+                ... )
         """
         if plot_style not in ["heatmap", "umap"]:
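Reviewer note: the docstrings describe cell_emb_style="mean_pool" as mean pooling of the gene embeddings for a given cell. The sketch below shows that idea on toy data with plain Python lists. It is an illustration under simplifying assumptions, not the library's implementation: it ignores batching and padding, and `mean_pool` is a hypothetical name.

```python
from statistics import fmean


def mean_pool(gene_embeddings):
    """Average per-gene vectors (one list per gene) into one cell vector."""
    if not gene_embeddings:
        raise ValueError("need at least one gene embedding")
    # zip(*rows) iterates over embedding dimensions, one column at a time.
    return [fmean(dim) for dim in zip(*gene_embeddings)]


# Three genes, each with a 2-dimensional toy embedding.
cell_emb = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(cell_emb)
```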

geneformer/in_silico_perturber.py CHANGED

@@ -1,28 +1,35 @@
 """
 Geneformer in silico perturber.
-Usage:
-  from geneformer import InSilicoPerturber
-  isp = InSilicoPerturber(perturb_type="delete",
-                          perturb_rank_shift=None,
-                          genes_to_perturb="all",
-                          combos=0,
-                          anchor_gene=None,
-                          model_type="CellClassifier",
-                          num_classes=0,
-                          emb_mode="cell",
-                          cell_emb_style="mean_pool",
-                          filter_data={"cell_type":["cardiomyocyte"]},
-                          cell_states_to_model={"state_key": "disease", "start_state": "dcm", "goal_state": "nf", "alt_states": ["hcm", "other1", "other2"]},
-                          state_embs_dict ={"nf": emb_nf, "hcm": emb_hcm, "dcm": emb_dcm, "other1": emb_other1, "other2": emb_other2},
-                          max_ncells=None,
-                          emb_layer=0,
-                          forward_batch_size=100,
-                          nproc=16)
-  isp.perturb_data("path/to/model",
-                   "path/to/input_data",
-                   "path/to/output_directory",
-                   "output_prefix")
 """
 import logging
@@ -94,89 +101,89 @@ class InSilicoPerturber:
         Initialize in silico perturber.
         Parameters
-        ----------
-        perturb_type : {"delete","overexpress","inhibit","activate"}
-            Type of perturbation.
-            "delete": delete gene from rank value encoding
-            "overexpress": move gene to front of rank value encoding
-            "inhibit": move gene to lower quartile of rank value encoding
-            "activate": move gene to higher quartile of rank value encoding
-        perturb_rank_shift : None, {1,2,3}
-            Number of quartiles by which to shift rank of gene.
-            For example, if perturb_type="activate" and perturb_rank_shift=1:
-                genes in 4th quartile will move to middle of 3rd quartile.
-                genes in 3rd quartile will move to middle of 2nd quartile.
-                genes in 2nd quartile will move to middle of 1st quartile.
-                genes in 1st quartile will move to front of rank value encoding.
-            For example, if perturb_type="inhibit" and perturb_rank_shift=2:
-                genes in 1st quartile will move to middle of 3rd quartile.
-                genes in 2nd quartile will move to middle of 4th quartile.
-                genes in 3rd or 4th quartile will move to bottom of rank value encoding.
         genes_to_perturb : "all", list
-            Default is perturbing each gene detected in each cell in the dataset.
-            Otherwise, may provide a list of ENSEMBL IDs of genes to perturb.
-            If gene list is provided, then perturber will only test perturbing them all together
-            (rather than testing each possible combination of the provided genes).
         combos : {0,1}
-            Whether to perturb genes individually (0) or in pairs (1).
         anchor_gene : None, str
-            ENSEMBL ID of gene to use as anchor in combination perturbations.
-            For example, if combos=1 and anchor_gene="ENSG00000148400":
-                anchor gene will be perturbed in combination with each other gene.
-        model_type : {"Pretrained","GeneClassifier","CellClassifier"}
-            Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
         num_classes : int
-            If model is a gene or cell classifier, specify number of classes it was trained to classify.
-            For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
-        emb_mode : {"cell","cell_and_gene"}
-            Whether to output impact of perturbation on cell and/or gene embeddings.
-            Gene embedding shifts only available as compared to original cell, not comparing to goal state.
         cell_emb_style : "mean_pool"
-            Method for summarizing cell embeddings.
-            Currently only option is mean pooling of gene embeddings for given cell.
         filter_data : None, dict
-            Default is to use all input data for in silico perturbation study.
-            Otherwise, dictionary specifying .dataset column name and list of values to filter by.
         cell_states_to_model : None, dict
-            Cell states to model if testing perturbations that achieve goal state change.
-            Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
-            state_key: key specifying name of column in .dataset that defines the start/goal states
-            start_state: value in the state_key column that specifies the start state
-            goal_state: value in the state_key column that specifies the goal end state
-            alt_states: list of values in the state_key column that specify the alternate end states
-            For example: {"state_key": "disease",
-                          "start_state": "dcm",
-                          "goal_state": "nf",
-                          "alt_states": ["hcm", "other1", "other2"]}
         state_embs_dict : None, dict
-            Embedding positions of each cell state to model shifts from/towards (e.g. mean or median).
-            Dictionary with keys specifying each possible cell state to model.
-            Values are target embedding positions as torch.tensor.
-            For example: {"nf": emb_nf,
-                          "hcm": emb_hcm,
-                          "dcm": emb_dcm,
-                          "other1": emb_other1,
-                          "other2": emb_other2}
         max_ncells : None, int
-            Maximum number of cells to test.
-            If None, will test all cells.
         cell_inds_to_perturb : "all", dict
-            Default is perturbing each cell in the dataset.
-            Otherwise, may provide a dict of indices of cells to perturb with keys start_ind and end_ind.
-            start_ind: the first index to perturb.
-            end_ind: the last index to perturb (exclusive).
-            Indices will be selected *after* the filter_data criteria and sorting.
-            Useful for splitting extremely large datasets across separate GPUs.
         emb_layer : {-1, 0}
-            Embedding layer to use for quantification.
-            0: last layer (recommended for questions closely tied to model's training objective)
-            -1: 2nd to last layer (recommended for questions requiring more general representations)
         forward_batch_size : int
-            Batch size for forward pass.
         nproc : int
-            Number of CPU processes to use.
         token_dictionary_file : Path
-            Path to pickle file containing token dictionary (Ensembl ID:token).
         """
         self.perturb_type = perturb_type
@@ -392,15 +399,15 @@ class InSilicoPerturber:
         Perturb genes in input data and save as results in output_directory.
         Parameters
-        ----------
         model_directory : Path
-            Path to directory containing model
         input_data_file : Path
-            Path to directory containing .dataset inputs
         output_directory : Path
-            Path to directory where perturbation data will be saved as batched pickle files
         output_prefix : str
-            Prefix for output files
         """
         ### format output path ###
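Reviewer note: cell_inds_to_perturb is documented above as a dict with start_ind and an exclusive end_ind, intended for splitting large datasets across GPUs. One way to build such dicts is sketched below. The helper `split_cell_indices` is hypothetical, and how each chunk is dispatched to a GPU is left out.

```python
def split_cell_indices(n_cells, n_chunks):
    """Split range(n_cells) into contiguous start_ind/end_ind dicts.

    end_ind is exclusive, matching the docstring's description.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(n_cells, n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional cell each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more chunks than cells: skip empty ranges
        chunks.append({"start_ind": start, "end_ind": start + size})
        start += size
    return chunks


chunks = split_cell_indices(n_cells=10, n_chunks=3)
print(chunks)
```

Each dict could then be passed as cell_inds_to_perturb to a separate InSilicoPerturber run, keeping in mind that indices are selected after filter_data and sorting.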

 """
 Geneformer in silico perturber.
+**Usage:**
+.. code-block :: python
+    >>> from geneformer import InSilicoPerturber
+    >>> isp = InSilicoPerturber(perturb_type="delete",
+        ...                     perturb_rank_shift=None,
+        ...                     genes_to_perturb="all",
+        ...                     model_type="CellClassifier",
+        ...                     num_classes=0,
+        ...                     emb_mode="cell",
+        ...                     filter_data={"cell_type":["cardiomyocyte"]},
+        ...                     cell_states_to_model={"state_key": "disease", "start_state": "dcm", "goal_state": "nf", "alt_states": ["hcm", "other1", "other2"]},
+        ...                     state_embs_dict={"nf": emb_nf, "hcm": emb_hcm, "dcm": emb_dcm, "other1": emb_other1, "other2": emb_other2},
+        ...                     max_ncells=None,
+        ...                     emb_layer=0,
+        ...                     forward_batch_size=100,
+        ...                     nproc=16)
+    >>> isp.perturb_data("path/to/model",
+        ...              "path/to/input_data",
+        ...              "path/to/output_directory",
+        ...              "output_prefix")
+**Description:**
+| Performs in silico perturbation (e.g. deletion or overexpression) of defined set of genes or all genes in sample of cells.
+| Outputs impact of perturbation on cell or gene embeddings.
+| Output files are analyzed with ``in_silico_perturber_stats``.
 """
 import logging
         Initialize in silico perturber.
         Parameters
+        ~~~~~~~~~~
+        perturb_type : {"delete", "overexpress", "inhibit", "activate"}
+            | Type of perturbation.
+            | "delete": delete gene from rank value encoding
+            | "overexpress": move gene to front of rank value encoding
+            | *(TBA)* "inhibit": move gene to lower quartile of rank value encoding
+            | *(TBA)* "activate": move gene to higher quartile of rank value encoding
+        *(TBA)* perturb_rank_shift : None, {1,2,3}
+            | Number of quartiles by which to shift rank of gene.
+            | For example, if perturb_type="activate" and perturb_rank_shift=1:
+            |     genes in 4th quartile will move to middle of 3rd quartile.
+            |     genes in 3rd quartile will move to middle of 2nd quartile.
+            |     genes in 2nd quartile will move to middle of 1st quartile.
+            |     genes in 1st quartile will move to front of rank value encoding.
+            | For example, if perturb_type="inhibit" and perturb_rank_shift=2:
+            |     genes in 1st quartile will move to middle of 3rd quartile.
+            |     genes in 2nd quartile will move to middle of 4th quartile.
+            |     genes in 3rd or 4th quartile will move to bottom of rank value encoding.
         genes_to_perturb : "all", list
+            | Default is perturbing each gene detected in each cell in the dataset.
+            | Otherwise, may provide a list of ENSEMBL IDs of genes to perturb.
+            | If gene list is provided, then perturber will only test perturbing them all together
+            | (rather than testing each possible combination of the provided genes).
         combos : {0,1}
+            | Whether to perturb genes individually (0) or in pairs (1).
         anchor_gene : None, str
+            | ENSEMBL ID of gene to use as anchor in combination perturbations.
+            | For example, if combos=1 and anchor_gene="ENSG00000148400":
+            |     anchor gene will be perturbed in combination with each other gene.
+        model_type : {"Pretrained", "GeneClassifier", "CellClassifier"}
+            | Whether model is the pretrained Geneformer or a fine-tuned gene or cell classifier.
         num_classes : int
+            | If model is a gene or cell classifier, specify number of classes it was trained to classify.
+            | For the pretrained Geneformer model, number of classes is 0 as it is not a classifier.
+        emb_mode : {"cell", "cell_and_gene"}
+            | Whether to output impact of perturbation on cell and/or gene embeddings.
+            | Gene embedding shifts only available as compared to original cell, not comparing to goal state.
         cell_emb_style : "mean_pool"
+            | Method for summarizing cell embeddings.
+            | Currently only option is mean pooling of gene embeddings for given cell.
         filter_data : None, dict
+            | Default is to use all input data for in silico perturbation study.
+            | Otherwise, dictionary specifying .dataset column name and list of values to filter by.
         cell_states_to_model : None, dict
+            | Cell states to model if testing perturbations that achieve goal state change.
+            | Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
+            | state_key: key specifying name of column in .dataset that defines the start/goal states
+            | start_state: value in the state_key column that specifies the start state
+            | goal_state: value in the state_key column that specifies the goal end state
+            | alt_states: list of values in the state_key column that specify the alternate end states
+            | For example: {"state_key": "disease",
+            |               "start_state": "dcm",
+            |               "goal_state": "nf",
+            |               "alt_states": ["hcm", "other1", "other2"]}
         state_embs_dict : None, dict
+            | Embedding positions of each cell state to model shifts from/towards (e.g. mean or median).
+            | Dictionary with keys specifying each possible cell state to model.
+            | Values are target embedding positions as torch.tensor.
+            | For example: {"nf": emb_nf,
+            |               "hcm": emb_hcm,
+            |               "dcm": emb_dcm,
+            |               "other1": emb_other1,
+            |               "other2": emb_other2}
         max_ncells : None, int
+            | Maximum number of cells to test.
+            | If None, will test all cells.
         cell_inds_to_perturb : "all", dict
+            | Default is perturbing each cell in the dataset.
+            | Otherwise, may provide a dict of indices of cells to perturb with keys start_ind and end_ind.
+            | start_ind: the first index to perturb.
+            | end_ind: the last index to perturb (exclusive).
+            | Indices will be selected *after* the filter_data criteria and sorting.
+            | Useful for splitting extremely large datasets across separate GPUs.
         emb_layer : {-1, 0}
+            | Embedding layer to use for quantification.
+            | 0: last layer (recommended for questions closely tied to model's training objective)
+            | -1: 2nd to last layer (recommended for questions requiring more general representations)
         forward_batch_size : int
+            | Batch size for forward pass.
         nproc : int
+            | Number of CPU processes to use.
         token_dictionary_file : Path
+            | Path to pickle file containing token dictionary (Ensembl ID:token).
         """
         self.perturb_type = perturb_type
         Perturb genes in input data and save as results in output_directory.
         Parameters
+        ~~~~~~~~~~
         model_directory : Path
+            | Path to directory containing model
         input_data_file : Path
+            | Path to directory containing .dataset inputs
         output_directory : Path
+            | Path to directory where perturbation data will be saved as batched pickle files
         output_prefix : str
+            | Prefix for output files
         """
         ### format output path ###
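Reviewer note: the perturb_type docstring defines "delete" as removing a gene from the rank value encoding and "overexpress" as moving it to the front. A minimal sketch of those two operations on a toy encoding follows. It assumes the encoding is an ordered list with the highest-ranked gene first, uses placeholder gene IDs rather than real tokens, and the function names are hypothetical.

```python
def delete_gene(encoding, gene):
    """Return the encoding without `gene`; other genes keep their order."""
    return [g for g in encoding if g != gene]


def overexpress_gene(encoding, gene):
    """Return the encoding with `gene` moved to the front."""
    if gene not in encoding:
        # This sketch only handles genes already present in the cell.
        raise ValueError(f"{gene} is not in the encoding")
    return [gene] + [g for g in encoding if g != gene]


# Placeholder IDs; a real encoding would hold integer tokens.
encoding = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]

deleted = delete_gene(encoding, "ENSG_C")
overexpressed = overexpress_gene(encoding, "ENSG_C")
print(deleted)
print(overexpressed)
```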

geneformer/in_silico_perturber_stats.py CHANGED

@@ -1,19 +1,27 @@
 """
 Geneformer in silico perturber stats generator.
-Usage:
-  from geneformer import InSilicoPerturberStats
-  ispstats = InSilicoPerturberStats(mode="goal_state_shift",
-                                    combos=0,
-                                    anchor_gene=None,
-                                    cell_states_to_model={"state_key": "disease",
-                                                          "start_state": "dcm",
-                                                          "goal_state": "nf",
-                                                          "alt_states": ["hcm", "other1", "other2"]})
-  ispstats.get_stats("path/to/input_data",
-                     None,
-                     "path/to/output_directory",
-                     "output_prefix")
 """
@@ -645,41 +653,41 @@ class InSilicoPerturberStats:
         Initialize in silico perturber stats generator.
         Parameters
-        ----------
-        mode : {"goal_state_shift","vs_null","mixture_model","aggregate_data","aggregate_gene_shifts"}
-            Type of stats.
-            "goal_state_shift": perturbation vs. random for desired cell state shift
-            "vs_null": perturbation vs. null from provided null distribution dataset
-            "mixture_model": perturbation in impact vs. no impact component of mixture model (no goal direction)
-            "aggregate_data": aggregates cosine shifts for single perturbation in multiple cells
-            "aggregate_gene_shifts": aggregates cosine shifts of genes in response to perturbation(s)
         genes_perturbed : "all", list
-            Genes perturbed in isp experiment.
-            Default is assuming genes_to_perturb in isp experiment was "all" (each gene in each cell).
-            Otherwise, may provide a list of ENSEMBL IDs of genes perturbed as a group all together.
         combos : {0,1,2}
-            Whether to perturb genes individually (0), in pairs (1), or in triplets (2).
         anchor_gene : None, str
-            ENSEMBL ID of gene to use as anchor in combination perturbations or in testing effect on downstream genes.
-            For example, if combos=1 and anchor_gene="ENSG00000136574":
-                analyzes data for anchor gene perturbed in combination with each other gene.
-            However, if combos=0 and anchor_gene="ENSG00000136574":
-                analyzes data for the effect of anchor gene's perturbation on the embedding of each other gene.
         cell_states_to_model: None, dict
-            Cell states to model if testing perturbations that achieve goal state change.
-            Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
-            state_key: key specifying name of column in .dataset that defines the start/goal states
-            start_state: value in the state_key column that specifies the start state
-            goal_state: value in the state_key column that specifies the goal end state
-            alt_states: list of values in the state_key column that specify the alternate end states
-            For example: {"state_key": "disease",
-                          "start_state": "dcm",
-                          "goal_state": "nf",
-                          "alt_states": ["hcm", "other1", "other2"]}
         token_dictionary_file : Path
-            Path to pickle file containing token dictionary (Ensembl ID:token).
         gene_name_id_dictionary_file : Path
-            Path to pickle file containing gene name to ID dictionary (gene name:Ensembl ID).
         """
         self.mode = mode
@@ -847,64 +855,64 @@ class InSilicoPerturberStats:
         Get stats for in silico perturbation data and save as results in output_directory.
         Parameters
-        ----------
         input_data_directory : Path
-            Path to directory containing cos_sim dictionary inputs
         null_dist_data_directory : Path
-            Path to directory containing null distribution cos_sim dictionary inputs
         output_directory : Path
-            Path to directory where perturbation data will be saved as .csv
         output_prefix : str
-            Prefix for output .csv
         null_dict_list: dict
-            List of loaded null distribution dictionary if more than one comparison vs. the null is to be performed
         Outputs
-        ----------
         Definition of possible columns in .csv output file.
-        Of note, not all columns will be present in all output files.
-        Some columns are specific to particular perturbation modes.
-        "Gene": gene token
-        "Gene_name": gene name
-        "Ensembl_ID": gene Ensembl ID
-        "N_Detections": number of cells in which each gene or gene combination was detected in the input dataset
-        "Sig": 1 if FDR<0.05, otherwise 0
-        "Shift_to_goal_end": cosine shift from start state towards goal end state in response to given perturbation
-        "Shift_to_alt_end": cosine shift from start state towards alternate end state in response to given perturbation
-        "Goal_end_vs_random_pval": pvalue of cosine shift from start state towards goal end state by Wilcoxon
-            pvalue compares shift caused by perturbing given gene compared to random genes
-        "Alt_end_vs_random_pval": pvalue of cosine shift from start state towards alternate end state by Wilcoxon
-            pvalue compares shift caused by perturbing given gene compared to random genes
-        "Goal_end_FDR": Benjamini-Hochberg correction of "Goal_end_vs_random_pval"
-        "Alt_end_FDR": Benjamini-Hochberg correction of "Alt_end_vs_random_pval"
-        "Test_avg_shift": cosine shift in response to given perturbation in cells from test distribution
-        "Null_avg_shift": cosine shift in response to given perturbation in cells from null distribution (e.g. random cells)
-        "Test_vs_null_avg_shift": difference in cosine shift in cells from test vs. null distribution
-            (i.e. "Test_avg_shift" minus "Null_avg_shift")
-        "Test_vs_null_pval": pvalue of cosine shift in test vs. null distribution
-        "Test_vs_null_FDR": Benjamini-Hochberg correction of "Test_vs_null_pval"
-        "N_Detections_test": "N_Detections" in cells from test distribution
-        "N_Detections_null": "N_Detections" in cells from null distribution
-        "Anchor_shift": cosine shift in response to given perturbation of anchor gene
-        "Test_token_shift": cosine shift in response to given perturbation of test gene
-        "Sum_of_indiv_shifts": sum of cosine shifts in response to individually perturbing test and anchor genes
-        "Combo_shift": cosine shift in response to given perturbation of both anchor and test gene(s) in combination
-        "Combo_minus_sum_shift": difference of cosine shifts in response combo perturbation vs. sum of individual perturbations
-            (i.e. "Combo_shift" minus "Sum_of_indiv_shifts")
-        "Impact_component": whether the given perturbation was modeled to be within the impact component by the mixture model
-            1: within impact component; 0: not within impact component
-        "Impact_component_percent": percent of cells in which given perturbation was modeled to be within impact component
-        In case of aggregating gene shifts:
-        "Perturbed": ID(s) of gene(s) being perturbed
-        "Affected": ID of affected gene or "cell_emb" indicating the impact on the cell embedding as a whole
-        "Cosine_shift_mean": mean of cosine shift of modeled perturbation on affected gene or cell
-        "Cosine_shift_stdev": standard deviation of cosine shift of modeled perturbation on affected gene or cell
         """
         if self.mode not in [
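Reviewer note: the output columns above define the FDR columns as Benjamini-Hochberg corrections of the p-value columns, and "Sig" as 1 if FDR<0.05. The sketch below shows a textbook Benjamini-Hochberg adjustment on toy p-values in pure Python. It is not taken from the library, whose implementation is not shown in this diff, and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]  # toy values, not real results
fdr = benjamini_hochberg(pvals)
sig = [1 if q < 0.05 else 0 for q in fdr]
print(fdr)
print(sig)
```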

 """
 Geneformer in silico perturber stats generator.
+**Usage:**
+.. code-block :: python
+    >>> from geneformer import InSilicoPerturberStats
+    >>> ispstats = InSilicoPerturberStats(mode="goal_state_shift",
+        ...    cell_states_to_model={"state_key": "disease",
+        ...                          "start_state": "dcm",
+        ...                          "goal_state": "nf",
+        ...                          "alt_states": ["hcm", "other1", "other2"]}
+        ... )
+    >>> ispstats.get_stats("path/to/input_data",
+        ...                None,
+        ...                "path/to/output_directory",
+        ...                "output_prefix")
+**Description:**
+| Collates data or calculates stats for in silico perturbations based on type of statistics specified in InSilicoPerturberStats.
+| Input data is raw in silico perturbation results in the form of dictionaries outputted by ``in_silico_perturber``.
 """
         Initialize in silico perturber stats generator.
         Parameters
+        ~~~~~~~~~~
+        mode : {"goal_state_shift", "vs_null", "mixture_model", "aggregate_data", "aggregate_gene_shifts"}
+            | Type of stats.
+            | "goal_state_shift": perturbation vs. random for desired cell state shift
+            | "vs_null": perturbation vs. null from provided null distribution dataset
+            | "mixture_model": perturbation in impact vs. no impact component of mixture model (no goal direction)
+            | "aggregate_data": aggregates cosine shifts for single perturbation in multiple cells
+            | "aggregate_gene_shifts": aggregates cosine shifts of genes in response to perturbation(s)
         genes_perturbed : "all", list
+            | Genes perturbed in isp experiment.
+            | Default assumes genes_to_perturb in the isp experiment was "all" (each gene in each cell).
+            | Otherwise, may provide a list of ENSEMBL IDs of genes perturbed as a group all together.
         combos : {0,1,2}
+            | Whether to perturb genes individually (0), in pairs (1), or in triplets (2).
         anchor_gene : None, str
+            | ENSEMBL ID of gene to use as anchor in combination perturbations or in testing effect on downstream genes.
+            | For example, if combos=1 and anchor_gene="ENSG00000136574":
+            |    analyzes data for anchor gene perturbed in combination with each other gene.
+            | However, if combos=0 and anchor_gene="ENSG00000136574":
+            |    analyzes data for the effect of anchor gene's perturbation on the embedding of each other gene.
         cell_states_to_model: None, dict
+            | Cell states to model if testing perturbations that achieve goal state change.
+            | Four-item dictionary with keys: state_key, start_state, goal_state, and alt_states
+            | state_key: key specifying name of column in .dataset that defines the start/goal states
+            | start_state: value in the state_key column that specifies the start state
+            | goal_state: value in the state_key column that specifies the goal end state
+            | alt_states: list of values in the state_key column that specify the alternate end states
+            | For example: {"state_key": "disease",
+            |               "start_state": "dcm",
+            |               "goal_state": "nf",
+            |               "alt_states": ["hcm", "other1", "other2"]}
         token_dictionary_file : Path
+            | Path to pickle file containing token dictionary (Ensembl ID:token).
         gene_name_id_dictionary_file : Path
+            | Path to pickle file containing gene name to ID dictionary (gene name:Ensembl ID).
         """
         self.mode = mode
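
The four-key layout of `cell_states_to_model` described in the docstring above can be checked before constructing the stats generator. The following is a minimal sketch only: `check_cell_states` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration, and the validation geneformer itself performs may differ.

```python
# Hypothetical helper (not part of geneformer): checks that a
# cell_states_to_model dictionary has the four documented keys.
REQUIRED_KEYS = {"state_key", "start_state", "goal_state", "alt_states"}


def check_cell_states(cell_states_to_model):
    """Return True if the dictionary matches the documented four-key layout."""
    if cell_states_to_model is None:
        # None is documented as an allowed value (no goal state modeling).
        return True
    if set(cell_states_to_model) != REQUIRED_KEYS:
        return False
    # alt_states is documented as a list of values in the state_key column.
    return isinstance(cell_states_to_model["alt_states"], list)


example = {
    "state_key": "disease",
    "start_state": "dcm",
    "goal_state": "nf",
    "alt_states": ["hcm", "other1", "other2"],
}
is_valid = check_cell_states(example)
missing_key_is_valid = check_cell_states({"state_key": "disease"})
```

Here `example` reuses the dictionary from the docstring, so `is_valid` is `True`, while the incomplete dictionary is rejected.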
         Get stats for in silico perturbation data and save as results in output_directory.
         Parameters
+        ~~~~~~~~~~
         input_data_directory : Path
+            | Path to directory containing cos_sim dictionary inputs
         null_dist_data_directory : Path
+            | Path to directory containing null distribution cos_sim dictionary inputs
         output_directory : Path
+            | Path to directory where perturbation data will be saved as .csv
         output_prefix : str
+            | Prefix for output .csv
         null_dict_list: dict
+            | List of loaded null distribution dictionaries if more than one comparison vs. the null is to be performed
         Outputs
+        ~~~~~~~
         Definition of possible columns in .csv output file.
+        | Of note, not all columns will be present in all output files.
+        | Some columns are specific to particular perturbation modes.
+        | "Gene": gene token
+        | "Gene_name": gene name
+        | "Ensembl_ID": gene Ensembl ID
+        | "N_Detections": number of cells in which each gene or gene combination was detected in the input dataset
+        | "Sig": 1 if FDR<0.05, otherwise 0
+        | "Shift_to_goal_end": cosine shift from start state towards goal end state in response to given perturbation
+        | "Shift_to_alt_end": cosine shift from start state towards alternate end state in response to given perturbation
+        | "Goal_end_vs_random_pval": pvalue of cosine shift from start state towards goal end state by Wilcoxon
+        |     pvalue compares shift caused by perturbing given gene compared to random genes
+        | "Alt_end_vs_random_pval": pvalue of cosine shift from start state towards alternate end state by Wilcoxon
+        |     pvalue compares shift caused by perturbing given gene compared to random genes
+        | "Goal_end_FDR": Benjamini-Hochberg correction of "Goal_end_vs_random_pval"
+        | "Alt_end_FDR": Benjamini-Hochberg correction of "Alt_end_vs_random_pval"
+        | "Test_avg_shift": cosine shift in response to given perturbation in cells from test distribution
+        | "Null_avg_shift": cosine shift in response to given perturbation in cells from null distribution (e.g. random cells)
+        | "Test_vs_null_avg_shift": difference in cosine shift in cells from test vs. null distribution
+        |     (i.e. "Test_avg_shift" minus "Null_avg_shift")
+        | "Test_vs_null_pval": pvalue of cosine shift in test vs. null distribution
+        | "Test_vs_null_FDR": Benjamini-Hochberg correction of "Test_vs_null_pval"
+        | "N_Detections_test": "N_Detections" in cells from test distribution
+        | "N_Detections_null": "N_Detections" in cells from null distribution
+        | "Anchor_shift": cosine shift in response to given perturbation of anchor gene
+        | "Test_token_shift": cosine shift in response to given perturbation of test gene
+        | "Sum_of_indiv_shifts": sum of cosine shifts in response to individually perturbing test and anchor genes
+        | "Combo_shift": cosine shift in response to given perturbation of both anchor and test gene(s) in combination
+        | "Combo_minus_sum_shift": difference of cosine shifts in response to combo perturbation vs. sum of individual perturbations
+        |     (i.e. "Combo_shift" minus "Sum_of_indiv_shifts")
+        | "Impact_component": whether the given perturbation was modeled to be within the impact component by the mixture model
+        |     1: within impact component; 0: not within impact component
+        | "Impact_component_percent": percent of cells in which given perturbation was modeled to be within impact component
+        | In case of aggregating gene shifts:
+        | "Perturbed": ID(s) of gene(s) being perturbed
+        | "Affected": ID of affected gene or "cell_emb" indicating the impact on the cell embedding as a whole
+        | "Cosine_shift_mean": mean of cosine shift of modeled perturbation on affected gene or cell
+        | "Cosine_shift_stdev": standard deviation of cosine shift of modeled perturbation on affected gene or cell
         """
         if self.mode not in [

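To illustrate how an FDR column such as "Goal_end_FDR" and the "Sig" column relate to the raw p-values listed in the output definitions above, the sketch below applies a Benjamini-Hochberg correction in plain Python and flags entries with FDR<0.05. The function name `benjamini_hochberg` and the example p-values are hypothetical; geneformer's own implementation may differ (for instance, it may rely on a statistics package).

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four perturbed genes (illustration only).
pvals = [0.001, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(pvals)
# "Sig" is documented as 1 if FDR<0.05, otherwise 0.
sig = [1 if q < 0.05 else 0 for q in fdr]
```

With these example values, only the first gene remains significant after correction, even though three of the raw p-values are below 0.05.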
geneformer/tokenizer.py CHANGED Viewed

@@ -1,17 +1,37 @@
 """
 Geneformer tokenizer.
-Input data:
-Required format: raw counts scRNAseq data without feature selection as .loom file
-Required row (gene) attribute: "ensembl_id"; Ensembl ID for each gene
-Required col (cell) attribute: "n_counts"; total read counts in that cell
-Optional col (cell) attribute: "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria
-Optional col (cell) attributes: any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below
-Usage:
-  from geneformer import TranscriptomeTokenizer
-  tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
-  tk.tokenize_data("data_directory", "output_directory", "output_prefix")
 """
 from __future__ import annotations
@@ -68,11 +88,11 @@ class TranscriptomeTokenizer:
         Initialize tokenizer.
         Parameters
-        ----------
         custom_attr_name_dict : None, dict
-            Dictionary of custom attributes to be added to the dataset.
-            Keys are the names of the attributes in the loom file.
-            Values are the names of the attributes in the dataset.
         nproc : int
             Number of processes to use for dataset mapping.
         chunk_size: int = 512
@@ -119,7 +139,7 @@ class TranscriptomeTokenizer:
         Tokenize .loom files in data_directory and save as tokenized .dataset in output_directory.
         Parameters
-        ----------
         data_directory : Path
             Path to directory containing loom files or anndata files
         output_directory : Path

 """
 Geneformer tokenizer.
+**Input data:**
+| *Required format:* raw counts scRNAseq data without feature selection as .loom or anndata file.
+| *Required row (gene) attribute:* "ensembl_id"; Ensembl ID for each gene.
+| *Required col (cell) attribute:* "n_counts"; total read counts in that cell.
+| *Optional col (cell) attribute:* "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria.
+| *Optional col (cell) attributes:* any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below.
+**Usage:**
+.. code-block :: python
+    >>> from geneformer import TranscriptomeTokenizer
+    >>> tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ"}, nproc=4)
+    >>> tk.tokenize_data("data_directory", "output_directory", "output_prefix")
+**Description:**
+| Input data is a directory with .loom or .h5ad files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. The input file type is specified by the argument file_format in the tokenize_data function.
+| The discussion below references the .loom file format, but the analogous labels are required for .h5ad files; they will be column attributes instead of row attributes, and vice versa, because the two file types are transposed relative to each other.
+| Genes should be labeled with Ensembl IDs (loom row attribute "ensembl_id"), which provide a unique identifier for conversion to tokens. Other forms of gene annotations (e.g. gene names) can be converted to Ensembl IDs via Ensembl Biomart. Cells should be labeled with the total read count in the cell (loom column attribute "n_counts") to be used for normalization.
+| No cell metadata is required, but custom cell attributes may be passed on to the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes "cell_type" and "organ_major" and one would like to retain these attributes as labels in the tokenized dataset with the new names "cell_type" and "organ", respectively, the following custom attribute dictionary should be provided: {"cell_type": "cell_type", "organ_major": "organ"}.
+| Additionally, if the original .loom file contains a cell column attribute called "filter_pass", this column will be used as a binary indicator of whether to include these cells in the tokenized data. All cells with "1" in this attribute will be tokenized, whereas the others will be excluded. One may use this column to indicate QC filtering or other criteria for selection for inclusion in the final tokenized dataset.
+| If one's data is in a format other than .loom or .h5ad, one can use the relevant tools (such as Anndata tools) to convert the file to .loom or .h5ad format prior to running the transcriptome tokenizer.
 """
 from __future__ import annotations
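
To make the custom attribute dictionary and the "filter_pass" behavior described in the module docstring concrete, here is a minimal sketch in plain Python. It does not call the tokenizer or read loom files: `col_attrs` is a hypothetical stand-in for loom column attributes, and `select_and_rename` is an illustrative helper that is not part of geneformer.

```python
def select_and_rename(col_attrs, custom_attr_name_dict):
    """Keep cells with filter_pass == 1 and rename the retained attributes."""
    n_cells = len(next(iter(col_attrs.values())))
    if "filter_pass" in col_attrs:
        # Only cells marked 1 are kept, mirroring the documented behavior.
        keep = [i for i, flag in enumerate(col_attrs["filter_pass"]) if flag == 1]
    else:
        # Without a filter_pass attribute, every cell is kept.
        keep = list(range(n_cells))
    # Keys are attribute names in the input; values are names in the output.
    return {
        new_name: [col_attrs[old_name][i] for i in keep]
        for old_name, new_name in custom_attr_name_dict.items()
    }


# Made-up metadata for three cells (illustration only).
col_attrs = {
    "cell_type": ["cardiomyocyte", "fibroblast", "endothelial"],
    "organ_major": ["heart", "heart", "lung"],
    "filter_pass": [1, 0, 1],
}
metadata = select_and_rename(
    col_attrs, {"cell_type": "cell_type", "organ_major": "organ"}
)
```

The second cell is dropped because its "filter_pass" value is 0, and "organ_major" appears in the result under the new name "organ".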
         Initialize tokenizer.
         Parameters
+        ~~~~~~~~~~
         custom_attr_name_dict : None, dict
+            | Dictionary of custom attributes to be added to the dataset.
+            | Keys are the names of the attributes in the loom file.
+            | Values are the names of the attributes in the dataset.
         nproc : int
             Number of processes to use for dataset mapping.
         chunk_size: int = 512
         Tokenize .loom files in data_directory and save as tokenized .dataset in output_directory.
         Parameters
+        ~~~~~~~~~~
         data_directory : Path
             Path to directory containing loom files or anndata files
         output_directory : Path