Error when tokenizing large datasets

#80
by yanwu2014 - opened

Thanks for the great package! When I try to tokenize a large dataset (1 million+ cells), I get this error:

Traceback (most recent call last):
  File "/home/ywu/git-repos/prime-analysis/scripts/tokenize_data.py", line 12, in <module>

  File "/share/ywu/anaconda3/envs/geneformer-env/lib/python3.10/site-packages/geneformer/tokenizer.py", line 102, in tokenize_data
    tokenized_dataset = self.create_dataset(tokenized_cells, cell_metadata)
  File "/share/ywu/anaconda3/envs/geneformer-env/lib/python3.10/site-packages/geneformer/tokenizer.py", line 194, in create_dataset
    output_dataset = Dataset.from_dict(dataset_dict)
  File "/share/ywu/anaconda3/envs/geneformer-env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 897, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "/share/ywu/anaconda3/envs/geneformer-env/lib/python3.10/site-packages/datasets/table.py", line 785, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow/table.pxi", line 3725, in pyarrow.lib.Table.from_pydict
  File "pyarrow/table.pxi", line 5254, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 350, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/share/ywu/anaconda3/envs/geneformer-env/lib/python3.10/site-packages/datasets/arrow_writer.py", line 186, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "/share/ywu/anaconda3/envs/geneformer-env/lib/python3.10/site-packages/datasets/features/features.py", line 1406, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "/share/ywu/anaconda3/envs/geneformer-env/lib/python3.10/site-packages/datasets/features/features.py", line 1398, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2147486544 too large to fit in C integer type

It's unclear to me where this very large number that doesn't fit in a 32-bit integer even comes from. Any thoughts on what might be happening?
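For reference, the failing value sits just past the signed 32-bit ceiling, and the `offsets = pa.array(offsets, type=pa.int32())` frame in the traceback suggests it is the list offsets (running token totals across all cells) that overflow. A quick check:

INT32_MAX = 2**31 - 1            # 2147483647
print(2147486544 - INT32_MAX)    # 2897 -- just past the signed 32-bit limit
# pyarrow's plain list type stores int32 offsets, i.e. cumulative element
# counts, so once the total number of tokens across all cells exceeds
# ~2.1 billion the array can no longer be built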

Thank you for your interest in Geneformer! We did not encounter this error when tokenizing Genecorpus-30M, which is ~30M cells. We would suggest checking the issues reported for Huggingface Datasets and/or opening a new issue there. It would be helpful to include the number of examples (cells) and the number of tokens per example (genes per cell) you have in your dataset.

Closing this discussion for now, but if you resolve this, please feel free to update with information that may be helpful to others who have the same question.

ctheodoris changed discussion status to closed

@ctheodoris @yanwu2014 I ran into the same error and fixed it by changing the TranscriptomeTokenizer.create_dataset() function to the following:

# [...]

def create_dataset(self, tokenized_cells, cell_metadata):
    # create dict for dataset creation
    dataset_dict = {"input_ids": tokenized_cells}
    if self.custom_attr_name_dict is not None:
        dataset_dict.update(cell_metadata)

    # changed this line:
    # output_dataset = Dataset.from_dict(dataset_dict)

    # to this: stream examples through a generator; datasets then writes the
    # table in batches, so no single pyarrow list array accumulates enough
    # tokens to overflow its int32 offsets
    def gen():
        for i in range(len(tokenized_cells)):
            yield {
                "input_ids": dataset_dict["input_ids"][i],
                "cell_type": dataset_dict["cell_type"][i],
            }

    output_dataset = Dataset.from_generator(gen, num_proc=self.nproc)

    # truncate each cell's token list to the model's 2048-token input size
    def truncate(example):
        example["input_ids"] = example["input_ids"][0:2048]
        return example

    output_dataset_truncated = output_dataset.map(truncate, num_proc=self.nproc)

    # record the (post-truncation) length of each example
    def measure_length(example):
        example["length"] = len(example["input_ids"])
        return example

    output_dataset_truncated_w_length = output_dataset_truncated.map(
        measure_length, num_proc=self.nproc
    )

    return output_dataset_truncated_w_length
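One caveat with the snippet above: gen() hardcodes the cell_type column. If your custom_attr_name_dict maps other metadata columns, a drop-in generator that yields every column in dataset_dict would be safer (a sketch, not part of the posted fix):

def gen():
    # hypothetical drop-in replacement: yield every metadata column, not just cell_type
    for i in range(len(tokenized_cells)):
        yield {key: column[i] for key, column in dataset_dict.items()}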

Thank you for this update! From your experience, did you notice that this error was occurring with datasets above a given size? Also, did you happen to note whether the method you changed it to is just as fast as the original method for datasets that don't trigger this error? Thank you!

From your experience, did you notice that this error was occurring with datasets above a given size?

Roughly, it worked for 500k cells and broke for 1M. This updated version immediately worked for 1M.
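That threshold lines up with int32 offsets overflowing. Back of the envelope (assuming, hypothetically, a few thousand expressed genes per cell before truncation):

INT32_MAX = 2**31 - 1          # 2,147,483,647 total tokens before the offsets overflow
print(INT32_MAX // 500_000)    # 4294 tokens/cell of headroom at 500k cells
print(INT32_MAX // 1_000_000)  # 2147 tokens/cell of headroom at 1M cells
# with ~2,000-4,000 genes per cell (assumption), 1M cells lands past the limit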

Also, did you happen to note whether the method you changed it to is just as fast as the original method for datasets that don't trigger this error?

I should mention that I used the anndata tokenizer from here: https://huggingface.co/ctheodoris/Geneformer/discussions/102 because, when converting from AnnData to loom, loom apparently needs to allocate a whole dense array, which exhausted my RAM. With that, tokenizing 500k cells took 251.39s with my updated version above and 311.28s with the old version (which used .from_dict()).

Thank you for the information, and for also confirming that the anndata tokenizer version works with large datasets. Would you like to submit a pull request with the version you implemented? We did not encounter the .from_dict error when tokenizing the 30M cells in Genecorpus-30M, so I am not certain which specific case reveals this error, but it's very helpful that you updated here with your solution.

Just tested the PR and it seemed to fix my issue, thanks for writing this!

I am still seeing this exception

pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type

even with the changes committed to fix it. @ctheodoris @ricomnl, any idea what might still be causing this? It is happening with just 1.8 million cells (1,775,249). Here is the full exception:

Traceback (most recent call last):
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 143, in <module>
    main()
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 139, in main
    tokenize_training_data()
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 134, in tokenize_training_data
    tokenize(downsampling_method=downsampling_method, percentage=percentage, seed=seed, data_dir=sctab_directory, output_dir=output_dir)
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 92, in tokenize
    tk.tokenize_data(h5ad_directory, output_directory, output_prefix, file_format="h5ad")
  File "/scratch/amlt_code/Geneformer/geneformer/tokenizer.py", line 172, in tokenize_data
    tokenized_dataset = self.create_dataset(
  File "/scratch/amlt_code/Geneformer/geneformer/tokenizer.py", line 369, in create_dataset
    output_dataset = Dataset.from_dict(dataset_dict)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 963, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/table.py", line 758, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow/table.pxi", line 1920, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5992, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 385, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 247, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 112, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/arrow_writer.py", line 190, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/features/features.py", line 1465, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/features/features.py", line 1457, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 345, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type

Thank you!

Thanks for your question! Please set use_generator to True if you encounter this error.
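For reference, a minimal sketch of the call (the directory paths and metadata mapping here are placeholders, and the exact signature may differ between Geneformer versions):

from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=16)
tk.tokenize_data(
    "path/to/h5ad_dir",    # placeholder input directory
    "path/to/output_dir",  # placeholder output directory
    "my_dataset",          # output file prefix
    file_format="h5ad",
    use_generator=True,    # build the dataset through Dataset.from_generator
)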

Linking prior discussions here: https://huggingface.co/ctheodoris/Geneformer/discussions/315
