pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type
#347, opened by alandenadel
I am attempting to tokenize 1.8 million cells and am seeing this exception:
pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type
It was first reported in https://huggingface.co/ctheodoris/Geneformer/discussions/80. I commented there, but because that discussion is closed, I am opening this new one, since I am unsure whether this is a regression.
Here is the full traceback, which occurs with ~1.8 million cells:
Traceback (most recent call last):
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 143, in <module>
    main()
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 139, in main
    tokenize_training_data()
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 134, in tokenize_training_data
    tokenize(downsampling_method=downsampling_method, percentage=percentage, seed=seed, data_dir=sctab_directory, output_dir=output_dir)
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 92, in tokenize
    tk.tokenize_data(h5ad_directory, output_directory, output_prefix, file_format="h5ad")
  File "/scratch/amlt_code/Geneformer/geneformer/tokenizer.py", line 172, in tokenize_data
    tokenized_dataset = self.create_dataset(
  File "/scratch/amlt_code/Geneformer/geneformer/tokenizer.py", line 369, in create_dataset
    output_dataset = Dataset.from_dict(dataset_dict)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 963, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/table.py", line 758, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow/table.pxi", line 1920, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5992, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 385, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 247, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 112, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/arrow_writer.py", line 190, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/features/features.py", line 1465, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/features/features.py", line 1457, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 345, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type
Thank you!
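For context on the number in the error: the last frame in my code path is `offsets = pa.array(offsets, type=pa.int32())`, and 2147483705 is just past the largest signed 32-bit integer. The sketch below is plain Python, not code from `datasets` or Geneformer. It checks that arithmetic and shows how cumulative list offsets can cross the limit when all cells go into one array. The helper name `first_overflowing_row` and the 2048 tokens-per-cell figure are hypothetical, chosen only for illustration.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value a signed 32-bit offset can hold

def first_overflowing_row(row_lengths):
    """Return the index of the first row whose end offset exceeds INT32_MAX, or None."""
    offset = 0
    for i, n in enumerate(row_lengths):
        offset += n  # end offset of row i is the running total of all lengths so far
        if offset > INT32_MAX:
            return i
    return None

reported = 2147483705  # value taken from the exception message
print(reported - INT32_MAX)  # how far the reported offset is past the int32 limit

# Hypothetical workload: 1.8 million cells, each assumed to hold 2048 tokens.
hypothetical_lengths = [2048] * 1_800_000
print(first_overflowing_row(hypothetical_lengths))
```

If this reading is right, the failure depends on the total number of tokens across all cells rather than on any single cell, which would explain why it only shows up at this dataset size.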
Thanks for your question! Please set use_generator to True if you encounter this error.
Linking prior discussions here: https://huggingface.co/ctheodoris/Geneformer/discussions/315
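A toy illustration of why a generator-based path may avoid the overflow, written in plain Python. It is not the Geneformer or `datasets` implementation, and it rests on an assumption: that a generator-fed writer consumes examples in batches, so list offsets restart at 0 in each batch instead of accumulating over the whole dataset. The names `cell_generator` and `batched_offsets` are hypothetical. Where exactly `use_generator` is passed (for example as an argument to `tk.tokenize_data`) should be checked against the installed Geneformer version.

```python
from itertools import islice

def cell_generator(tokenized_cells):
    """Yield one example at a time instead of building one large dict of lists."""
    for ids in tokenized_cells:
        yield {"input_ids": ids, "length": len(ids)}

def batched_offsets(examples, batch_size):
    """Yield the list offsets of each batch; offsets restart at 0 per batch."""
    it = iter(examples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        offsets = [0]
        for ex in batch:
            offsets.append(offsets[-1] + ex["length"])
        yield offsets

# Tiny stand-in data: four "cells" with different numbers of tokens.
cells = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
for offsets in batched_offsets(cell_generator(cells), batch_size=2):
    print(offsets)
```

In this sketch the largest offset in any batch is bounded by the tokens in that batch alone, so it stays far below the int32 limit as long as batches are reasonably sized.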
ctheodoris changed discussion status to closed