Christina Theodoris committed
Commit f75f5ac
1 Parent(s): b294421

Update readthedocs for classifier

docs/source/geneformer.classifier.rst CHANGED
@@ -6,4 +6,5 @@ geneformer.classifier
    :undoc-members:
    :show-inheritance:
    :exclude-members:
+      valid_option_dict,
       validate_options
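For reference, :exclude-members: takes a comma-separated list of names to omit from the generated documentation. Assuming the top of the file is a standard automodule block (those lines fall outside this hunk), the directive plausibly reads as follows after the change:

.. automodule:: geneformer.classifier
   :members:
   :undoc-members:
   :show-inheritance:
   :exclude-members:
      valid_option_dict,
      validate_options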
geneformer/classifier.py CHANGED
@@ -3,14 +3,11 @@ Geneformer classifier.
 
 **Input data:**
 
-Cell state classifier:
-| Single-cell transcriptomes as Geneformer rank value encodings with cell state labels
-| in Geneformer .dataset format (generated from single-cell RNAseq data by tokenizer.py)
-
-Gene classifier:
-| Dictionary in format {Gene_label: list(genes)} for gene labels
-| and single-cell transcriptomes as Geneformer rank value encodings
-| in Geneformer .dataset format (generated from single-cell RNAseq data by tokenizer.py)
+| Cell state classifier:
+| Single-cell transcriptomes as Geneformer rank value encodings with cell state labels in Geneformer .dataset format (generated from single-cell RNAseq data by tokenizer.py)
+
+| Gene classifier:
+| Dictionary in format {Gene_label: list(genes)} for gene labels and single-cell transcriptomes as Geneformer rank value encodings in Geneformer .dataset format (generated from single-cell RNAseq data by tokenizer.py)
 
 **Usage:**
 
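For concreteness, a minimal sketch of the two input shapes described in this docstring; the paths and gene IDs are placeholders, and loading via the Hugging Face datasets library reflects the fact that Geneformer .dataset files are datasets saved to disk:

import pickle
from datasets import load_from_disk

# Cell state classifier input: tokenized transcriptomes with a cell state
# label column, produced from raw single-cell RNAseq data by tokenizer.py.
cell_data = load_from_disk("my_cells.dataset")
print(cell_data.column_names)  # e.g. ["input_ids", "length", "cell_type"]

# Gene classifier input: {Gene_label: list(genes)} mapping each class name
# to the Ensembl IDs belonging to it (the IDs here are illustrative).
gene_class_dict = {
    "dosage_sensitive": ["ENSG00000187634", "ENSG00000188976"],
    "dosage_insensitive": ["ENSG00000131591", "ENSG00000177757"],
}
with open("gene_class_dict.pkl", "wb") as f:
    pickle.dump(gene_class_dict, f)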
geneformer/tokenizer.py CHANGED
@@ -89,7 +89,9 @@ class TranscriptomeTokenizer:
     ):
         """
         Initialize tokenizer.
+
         **Parameters:**
+
         custom_attr_name_dict : None, dict
             | Dictionary of custom attributes to be added to the dataset.
             | Keys are the names of the attributes in the loom file.
@@ -98,15 +100,16 @@ class TranscriptomeTokenizer:
             | Number of processes to use for dataset mapping.
         chunk_size : int = 512
             | Chunk size for anndata tokenizer.
-        model_input_size: int = 2048
+        model_input_size : int = 2048
             | Max input size of model to truncate input to.
-        special_token: bool = False
-            | Option to add CLS and SEP tokens
+        special_token : bool = False
+            | Adds CLS token before and SEP token after rank value encoding.
         gene_median_file : Path
             | Path to pickle file containing dictionary of non-zero median
             | gene expression values across Genecorpus-30M.
         token_dictionary_file : Path
             | Path to pickle file containing token dictionary (Ensembl IDs:token).
+
         """
         # dictionary of custom attributes {output dataset column name: input .loom column name}
         self.custom_attr_name_dict = custom_attr_name_dict
@@ -148,7 +151,9 @@ class TranscriptomeTokenizer:
     ):
         """
         Tokenize .loom files in data_directory and save as tokenized .dataset in output_directory.
+
         **Parameters:**
+
         data_directory : Path
             | Path to directory containing loom files or anndata files
         output_directory : Path
@@ -159,6 +164,7 @@ class TranscriptomeTokenizer:
             | Format of input files. Can be "loom" or "h5ad".
         use_generator : bool
             | Whether to use generator or dict for tokenization.
+
         """
         tokenized_cells, cell_metadata = self.tokenize_files(
             Path(data_directory), file_format
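A minimal usage sketch of the tokenizer as documented in the hunks above. The paths are placeholders, and the output_prefix argument is an assumption drawn from the repo's tokenize_data signature rather than from these hunks:

from geneformer import TranscriptomeTokenizer

# Instantiate with the parameters documented above; all values are illustrative.
tk = TranscriptomeTokenizer(
    custom_attr_name_dict={"cell_type": "cell_type"},  # {output dataset column: input .loom column}
    nproc=4,                # processes for dataset mapping
    chunk_size=512,         # chunk size for the anndata tokenizer
    model_input_size=2048,  # truncate rank value encodings to the model's max input size
    special_token=False,    # True adds CLS before and SEP after each encoding
)

# Tokenize every .loom file in data/ and save a tokenized .dataset under out/.
tk.tokenize_data(
    data_directory="data",
    output_directory="out",
    output_prefix="my_dataset",  # assumption: this argument is not shown in the hunks above
    file_format="loom",
    use_generator=False,
)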