pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type

#347
by alandenadel - opened

I am attempting to tokenize 1.8 million cells and seeing the exception

pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type

This was first reported in https://huggingface.co/ctheodoris/Geneformer/discussions/80. I commented there, but because that discussion is closed, I'm opening this new one, as I am unsure whether this is a regression.

Here is the full exception, which occurs with ~1.8 million cells:

Traceback (most recent call last):
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 143, in <module>
    main()
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 139, in main
    tokenize_training_data()
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 134, in tokenize_training_data
    tokenize(downsampling_method=downsampling_method, percentage=percentage, seed=seed, data_dir=sctab_directory, output_dir=output_dir)
  File "/scratch/amlt_code/train_scripts/geneformer/tokenize_geneformer.py", line 92, in tokenize
    tk.tokenize_data(h5ad_directory, output_directory, output_prefix, file_format="h5ad")
  File "/scratch/amlt_code/Geneformer/geneformer/tokenizer.py", line 172, in tokenize_data
    tokenized_dataset = self.create_dataset(
  File "/scratch/amlt_code/Geneformer/geneformer/tokenizer.py", line 369, in create_dataset
    output_dataset = Dataset.from_dict(dataset_dict)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 963, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/table.py", line 758, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow/table.pxi", line 1920, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5992, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 385, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 247, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 112, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/arrow_writer.py", line 190, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/features/features.py", line 1465, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/datasets/features/features.py", line 1457, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 345, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2147483705 too large to fit in C integer type

Thank you!

Thanks for your question! Please set use_generator to True if you encounter this error.
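
For context, the last datasets frame in the traceback builds the list offsets with pa.array(offsets, type=pa.int32()), so the running total of tokens across all cells has to stay at or below 2147483647. The sketch below is plain Python with no Arrow or Geneformer calls. It only illustrates that arithmetic and the general idea of splitting the data into smaller batches, which is presumably why a generator-based path avoids the error. The helper names and the 2048-tokens-per-cell figure are hypothetical, and this is not a description of how use_generator is implemented.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value a signed 32-bit offset can hold


def fits_int32_offsets(lengths):
    """True if the final list offset (the total token count) fits in int32."""
    return sum(lengths) <= INT32_MAX


def batches_within_limit(lengths, limit=INT32_MAX):
    """Yield consecutive batches of lengths whose running total stays within limit.

    A single length larger than the limit is still yielded alone.
    """
    batch, total = [], 0
    for n in lengths:
        if batch and total + n > limit:
            yield batch
            batch, total = [], 0
        batch.append(n)
        total += n
    if batch:
        yield batch


# The value from the error message is just past the int32 maximum.
overflow = 2147483705 - INT32_MAX
print(overflow)

# Hypothetical workload: 1.8 million cells at 2048 tokens each (an assumed figure).
hypothetical_lengths = [2048] * 1_800_000
print(fits_int32_offsets(hypothetical_lengths))
print(len(list(batches_within_limit(hypothetical_lengths))))
```

With these assumed numbers, the total token count does not fit in a single int32 offset array, but each batch produced by batches_within_limit does.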

Linking prior discussions here: https://huggingface.co/ctheodoris/Geneformer/discussions/315

ctheodoris changed discussion status to closed
