
Thank you for sharing your contribution in this pull request!

  1. Regarding Dataset.from_generator: when testing it with multiple runs to get a better estimate (since run times vary at random), it unfortunately does seem to be slower than Dataset.from_dict for larger datasets. This is perhaps to be expected given the for loop and the generation step. Of note, Datasets caches the generator, so repeated runs can give the misleading impression that it is faster, while repeated generation is not how users would access this function in practice. (A rough sketch of the timing setup follows the table below.)

| Dataset size | from_dict | from_generator |
| --- | --- | --- |
| Small (~5K cells) | 0.071 | 0.072 |
| Medium (~200K cells) | 3.000 | 20.635 |
| Large (~1M cells) | 15.761 | 100.186 |
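For reference, a rough sketch of how such a comparison can be timed. The data here is hypothetical, and a fresh cache_dir is passed so the generator caching mentioned above does not skew repeated runs:

import tempfile
import time
from datasets import Dataset

# hypothetical stand-in for tokenizer output: one token-id list per cell
tokenized_cells = [[10, 25, 3, 7]] * 200_000
dataset_dict = {"input_ids": tokenized_cells}

def cell_generator():
    for input_ids in tokenized_cells:
        yield {"input_ids": input_ids}

start = time.perf_counter()
Dataset.from_dict(dataset_dict)
print(f"from_dict:      {time.perf_counter() - start:.3f}")

start = time.perf_counter()
# a fresh cache_dir forces regeneration, so a cached Arrow file from a
# previous run cannot make from_generator look artificially fast
Dataset.from_generator(cell_generator, cache_dir=tempfile.mkdtemp())
print(f"from_generator: {time.perf_counter() - start:.3f}")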

Upon searching the Hugging Face Datasets issues, though, the error in discussion #80 appears to be a known problem that they are working on resolving, so hopefully this will not be an issue in future versions. I would like to explore some other options, but I am not encountering the error myself when testing datasets with ~1M cells. Would you be able to share one of the problematic datasets with me so that I can reproduce the error and work on resolving it?

  2. For the anndata tokenizer, I tested it with an .h5ad dataset of ~1M cells and encountered a few errors so far. The first error was not accounting for the possibility of there being no metadata dictionary, so I added if statements to handle that case similarly to the loom version. The next error is below. I believe it may be due to not handling the case where filter_pass exists: on line 203 below, adata_filter.X is filtered by both cells and genes, while adata.X is not filtered by cells, so the per-cell sums do not broadcast.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 3
      1 from geneformer import TranscriptomeTokenizer
      2 tk = TranscriptomeTokenizer(nproc=16)
----> 3 tk.tokenize_data("/path/to/h5ad_1Mcells", 
      4                  "/path/to/h5ad_1Mcells/output", 
      5                  "tokenized_h5ad_1Mcells",
      6                  file_format="h5ad")

File ~/Geneformer/geneformer/tokenizer.py:117, in TranscriptomeTokenizer.tokenize_data(self, data_directory, output_directory, output_prefix, file_format)
     97 def tokenize_data(
     98     self,
     99     data_directory: Path | str,
   (...)
    102     file_format: Literal["loom", "h5ad"] = "loom",
    103 ):
    104     """
    105     Tokenize .loom files in loom_data_directory and save as tokenized .dataset in output_directory.
    106     Parameters
   (...)
    115         Format of input files. Can be "loom" or "h5ad".
    116     """
--> 117     tokenized_cells, cell_metadata = self.tokenize_files(
    118         Path(data_directory), file_format
    119     )
    120     tokenized_dataset = self.create_dataset(tokenized_cells, cell_metadata)
    122     output_path = (Path(output_directory) / output_prefix).with_suffix(".dataset")

File ~/Geneformer/geneformer/tokenizer.py:146, in TranscriptomeTokenizer.tokenize_files(self, data_directory, file_format)
    144 file_found = 1
    145 print(f"Tokenizing {file_path}")
--> 146 file_tokenized_cells, file_cell_metadata = tokenize_file_fn(file_path)
    147 tokenized_cells += file_tokenized_cells
    148 if self.custom_attr_name_dict is not None:

File ~/Geneformer/geneformer/tokenizer.py:203, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path)
    198 tokenized_cells = []
    199 adata_filter = adata[
    200     filter_pass_loc, coding_miRNA_loc  # filter cells and genes
    201 ]
--> 203 X_norm = (adata_filter.X / adata.X.sum(1) * 10_000 / norm_factor_vector).tocsr()
    205 tokenized_cells += [
    206     tokenize_cell(X_norm[i, ...].A.flatten(), coding_miRNA_tokens)
    207     for i in range(X_norm.shape[0])
    208 ]
    210 # add custom attributes for subview to dict

File ~/miniconda3/lib/python3.10/site-packages/scipy/sparse/_base.py:686, in spmatrix.__truediv__(self, other)
    685 def __truediv__(self, other):
--> 686     return self._divide(other, true_divide=True)

File ~/miniconda3/lib/python3.10/site-packages/scipy/sparse/_base.py:665, in spmatrix._divide(self, other, true_divide, rdivide)
    663 if not rdivide:
    664     if true_divide:
--> 665         return np.true_divide(self.todense(), other)
    666     else:
    667         return np.divide(self.todense(), other)

ValueError: operands could not be broadcast together with shapes (986122,24124) (1002756,1) 

However, when I add filtering for the cells to get adata_cell_filter (see below), the operation fails with an error that the matrix object has no attribute "tocsr" (a possible workaround is sketched after the trace).

File ~/Geneformer/geneformer/tokenizer.py:209, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path)
    202 adata_filter = adata[
    203     filter_pass_loc, coding_miRNA_loc  # filter cells and genes
    204 ]
    205 adata_cell_filter = adata[
    206     filter_pass_loc, :  # filter cells only
    207 ]
--> 209 X_norm = (adata_filter.X / adata_cell_filter.X.sum(1) * 10_000 / norm_factor_vector).tocsr()
    211 tokenized_cells += [
    212     tokenize_cell(X_norm[i, ...].A.flatten(), coding_miRNA_tokens)
    213     for i in range(X_norm.shape[0])
    214 ]
    216 # add custom attributes for subview to dict

AttributeError: 'matrix' object has no attribute 'tocsr'
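For context, scipy's sparse division materializes a dense numpy.matrix (via todense(), as shown in the traceback), and numpy.matrix has no tocsr method. One possible workaround, sketched here with the variable names from line 209 above and not meant as the final fix, is to convert the result back to CSR explicitly:

import scipy.sparse as sp

# dividing a sparse matrix by a dense per-cell total yields a dense numpy.matrix,
# so convert back to CSR explicitly rather than calling .tocsr() on the result
X_norm = sp.csr_matrix(
    adata_filter.X / adata_cell_filter.X.sum(1) * 10_000 / norm_factor_vector
)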

It would be great if you could take a look into resolving these for the anndata version. I'm not sure if you encountered something similar.

Additionally, I noticed that the anndata version does not perform this operation by scanning through the file the way the loom version does. It requires >500 GB of RAM for large datasets of ~1M cells. If you know of an anndata function that would allow scanning through the file in chunks, similar to the loom version, that would be great to avoid memory constraints.
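For example, anndata's backed mode keeps X on disk and only reads the rows that are sliced, which could be combined with chunking over the filter-passing cells. A minimal sketch, assuming the file has a filter_pass column in .obs (the path and chunk size are placeholders):

import anndata as ad
import numpy as np

# open the .h5ad file in backed (read-only) mode so X is not loaded into memory
adata = ad.read_h5ad("/path/to/file.h5ad", backed="r")

filter_pass_loc = np.where(adata.obs["filter_pass"] == 1)[0]
chunk_size = 512

for i in range(0, len(filter_pass_loc), chunk_size):
    idx = filter_pass_loc[i : i + chunk_size]
    X_chunk = adata[idx].X  # only this slice of cells is read from disk
    # ... normalize and tokenize X_chunk here ...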

Thank you for your collaboration on this!

> The first error was not accounting for the possibility of no metadata dictionary, so I added if statements to handle that case similarly to the loom version

What arguments did you pass to the TranscriptomeTokenizer that made those if statements necessary? I ran it with:

tk = TranscriptomeTokenizer({})

# and

tk = TranscriptomeTokenizer({"cell_type": "cell_type"}) 

and both worked fine.

I also did not run into the 2nd issue you mentioned above. For me, changing line 203 in tokenizer.py to:

X_norm = (adata_filter.X / adata[filter_pass_loc].X.sum(1) * 10_000 / norm_factor_vector).tocsr()

made it work.

My relevant package versions are:

anndata==0.9.1
arrow==1.2.3
datasets==2.13.1
numpy==1.24.3
pyarrow==12.0.1
scipy==1.11.0
torch==2.0.1
transformers==4.30.2

Which versions are you using?

@ctheodoris I addressed the issues you mentioned above by casting the matrix to a CSR matrix, scanning through the anndata object (in "backed" mode) instead of loading it into memory, and adding a parameter to decide whether to use from_dict or from_generator.
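A rough sketch of the selection logic (simplified illustration; the exact column handling in the PR may differ):

from datasets import Dataset

def create_dataset(tokenized_cells, cell_metadata, use_generator=False):
    # one column per custom attribute, plus the tokenized cells themselves
    dataset_dict = {"input_ids": tokenized_cells}
    if cell_metadata is not None:
        dataset_dict.update(cell_metadata)

    if use_generator:
        # stream examples one cell at a time instead of building one large in-memory dict
        def dict_generator():
            for i in range(len(tokenized_cells)):
                yield {k: v[i] for k, v in dataset_dict.items()}
        return Dataset.from_generator(dict_generator)
    return Dataset.from_dict(dataset_dict)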

Thank you so much! That’s wonderful. I’m currently away with limited internet access so will test the updated version when I return and merge it if all looks good. Thank you for your key contribution to the code base!

Thank you again for your collaboration on this. I have returned and am testing out the new version. A couple of remaining issues:

  1. The default for custom_attr_name_dict is None, so if it isn't specified (see how I ran the code in the error trace below), the anndata tokenizer raises an error. There are if statements in the loom version that account for this case; I added the following to resolve it:

Changes to add:
Lines 168-171:

        if self.custom_attr_name_dict is not None:
            file_cell_metadata = {
                attr_key: [] for attr_key in self.custom_attr_name_dict.keys()
            }

Lines 218-223:

            # add custom attributes for subview to dict
            if self.custom_attr_name_dict is not None:
                for k in file_cell_metadata.keys():
                    file_cell_metadata[k] += adata[idx].obs[k].tolist()
            else:
                file_cell_metadata = None
  2. I encountered the error below when calculating X_norm. Are you able to resolve this? (See how I ran the code in the error trace below; the anndata file has a filter_pass cell attribute. A guess at the cause is sketched after the trace.)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 3
      1 from geneformer import TranscriptomeTokenizer
      2 tk = TranscriptomeTokenizer(nproc=16)
----> 3 tk.tokenize_data("/path/to/h5ad_1Mcells", 
      4                  "/path/to/h5ad_1Mcells/output", 
      5                  "tokenized_h5ad_1Mcells",
      6                  file_format="h5ad")

File ~/Geneformer/geneformer/tokenizer.py:128, in TranscriptomeTokenizer.tokenize_data(self, data_directory, output_directory, output_prefix, file_format, use_generator)
    105 def tokenize_data(
    106     self,
    107     data_directory: Path | str,
   (...)
    111     use_generator: bool = False,
    112 ):
    113     """
    114     Tokenize .loom files in loom_data_directory and save as tokenized .dataset in output_directory.
    115     Parameters
   (...)
    126         Whether to use generator or dict for tokenization.
    127     """
--> 128     tokenized_cells, cell_metadata = self.tokenize_files(
    129         Path(data_directory), file_format
    130     )
    131     tokenized_dataset = self.create_dataset(tokenized_cells, cell_metadata, use_generator=use_generator)
    133     output_path = (Path(output_directory) / output_prefix).with_suffix(".dataset")

File ~/Geneformer/geneformer/tokenizer.py:152, in TranscriptomeTokenizer.tokenize_files(self, data_directory, file_format)
    150 file_found = 1
    151 print(f"Tokenizing {file_path}")
--> 152 file_tokenized_cells, file_cell_metadata = tokenize_file_fn(file_path)
    153 tokenized_cells += file_tokenized_cells
    154 if self.custom_attr_name_dict is not None:

File ~/Geneformer/geneformer/tokenizer.py:210, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path, target_sum, chunk_size)
    207 idx = filter_pass_loc[i:i+chunk_size]
    208 X = adata[idx].X
--> 210 X_norm = (X / X[:, coding_miRNA_loc].sum(axis=1) * target_sum / norm_factor_vector)
    211 X_norm = sp.csr_matrix(X_norm)
    213 tokenized_cells += [
    214     rank_genes(X_norm[i].data, coding_miRNA_tokens[X_norm[i].indices])
    215     for i in range(X_norm.shape[0])
    216 ]

ValueError: operands could not be broadcast together with shapes (512,63561) (24124,) 
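The shapes suggest that the numerator still spans all 63,561 genes while norm_factor_vector covers only the 24,124 protein-coding/miRNA genes. A guess at a fix, sketched only and assuming the per-cell totals are meant to be taken over all genes as in the earlier suggested change to line 203:

import scipy.sparse as sp

# sketch of one possible fix (not necessarily the final code): restrict the
# numerator to the coding/miRNA genes so its width matches norm_factor_vector,
# while the per-cell totals are still computed over all genes
X_norm = X[:, coding_miRNA_loc] / X.sum(axis=1) * target_sum / norm_factor_vector
X_norm = sp.csr_matrix(X_norm)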

@ctheodoris I addressed your issues, let me know if they're fixed

Thank you so much for addressing these! Indeed, the code is able to run now. However, when I check the results, they unfortunately are not the same between the anndata and loom versions. I used scanpy to convert a .loom file to an .h5ad file so the inputs would be identical, and then ran each through the transcriptome tokenizer, either specifying the anndata version or not specifying a file type (thereby using the default loom one). When I use the following to create a checksum column in the datasets and then make a set out of the checksum column, the two sets are not the same: each set has 986122 entries (cells), but their union has 986320 entries, indicating that some checksums differ between the two versions. I am happy to send you these input/output datasets if you email me so we can troubleshoot the reasons behind this. Did you check the outputs previously and find that they were the same?

def create_checksum(example):
    example["checksum"] = hash(tuple(example["input_ids"]))
    return example
test_dataset = test_dataset.map(create_checksum, num_proc=16)
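For completeness, the set comparison described above might look like this (the dataset variable names are placeholders for the loom- and anndata-derived outputs):

loom_checksums = set(loom_dataset["checksum"])
anndata_checksums = set(anndata_dataset["checksum"])

print(len(loom_checksums), len(anndata_checksums))  # 986122 each
print(len(loom_checksums | anndata_checksums))      # 986320, so some cells' tokenizations differ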

Thank you again for your collaboration on this!

Looks great, the checksums match. Thank you so much for all your collaboration on this. It is a valuable contribution that will be helpful to many researchers.

ctheodoris changed pull request status to open
ctheodoris changed pull request status to merged
