RuntimeError: CUDA error: device-side assert triggered during Geneformer validation

#411
by fs00431 - opened

Hello Hugging Face Community,

I am currently working with the 6-layer Geneformer model from this Hugging Face repository, and I encountered the following error during validation:


RuntimeError Traceback (most recent call last)
Cell In[38], line 6
1 train_valid_id_split_dict = {"attr_key": "individual",
2 "train": train_ids,
3 "eval": eval_ids}
5 # 6 layer Geneformer: https://huggingface.co/ctheodoris/Geneformer/blob/main/model.safetensors
----> 6 all_metrics = cc.validate(model_directory="./",
7 prepared_input_data_file=f"{output_dir}/{output_prefix}_labeled_train.dataset",
8 id_class_dict_file=f"{output_dir}/{output_prefix}_id_class_dict.pkl",
9 output_directory=output_dir,
10 output_prefix=output_prefix,
11 split_id_dict=train_valid_id_split_dict)
12 # to optimize hyperparameters, set n_hyperopt_trials=100 (or alternative desired # of trials)

File ~/Geneformer-new/Geneformer/geneformer/classifier.py:785, in Classifier.validate(self, model_directory, prepared_input_data_file, id_class_dict_file, output_directory, output_prefix, split_id_dict, attr_to_split, attr_to_balance, gene_balance, max_trials, pval_threshold, save_eval_output, predict_eval, predict_trainer, n_hyperopt_trials, save_gene_split_datasets, debug_gene_split_datasets)
783 train_data = data.select(train_indices)
784 if n_hyperopt_trials == 0:
--> 785 trainer = self.train_classifier(
786 model_directory,
787 num_classes,
788 train_data,
789 eval_data,
790 ksplit_output_dir,
791 predict_trainer,
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Things I have tried:

Setting CUDA_LAUNCH_BLOCKING=1: this forces synchronous CUDA kernel launches, but I still hit the same assert and the stack trace was not noticeably more informative (see the snippet after this list for how I set the flags).

Switching to CPU: running the same validation on CPU completes without this issue, which leads me to believe the problem is related to GPU/CUDA-specific tensor operations.
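Roughly how I set these flags; the environment variables have to be set before torch is imported (e.g. in the first notebook cell), otherwise they have no effect on the already-initialized CUDA context:

```python
import os

# Set CUDA debug flags before torch is imported; setting them afterwards has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # synchronous kernel launches -> more accurate stack traces
os.environ["TORCH_USE_CUDA_DSA"] = "1"     # only meaningful if the PyTorch build has device-side assertions compiled in
# os.environ["CUDA_VISIBLE_DEVICES"] = ""  # uncomment to hide all GPUs and force the run onto CPU

import torch  # imported after the env vars so the settings take effect
```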

There may be a mismatch between the number of classes the model head was built with and the labels present in the train/validation/test datasets. I recommend checking the classes, since it could simply be that the model expects a different number of classes than what the dataset provides. I am also wondering what the output is when setting TORCH_USE_CUDA_DSA=1.
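A minimal sketch for checking this, assuming the prepared dataset stores its labels in a "label" column and reusing the output_dir / output_prefix variables from the validate() call above (adjust names and paths to your setup):

```python
# Hedged sketch: compare the classes the classifier was set up with against the
# label ids actually present in the prepared dataset.
import pickle
from datasets import load_from_disk

with open(f"{output_dir}/{output_prefix}_id_class_dict.pkl", "rb") as f:
    id_class_dict = pickle.load(f)

data = load_from_disk(f"{output_dir}/{output_prefix}_labeled_train.dataset")
label_ids = set(data["label"])  # assumes labels live in a "label" column

print("classes in id_class_dict:", len(id_class_dict))
print("distinct label ids      :", len(label_ids))
print("max label id            :", max(label_ids))
# A device-side assert from the loss typically fires when
# max(label_ids) >= the number of classes the model head was built with.
```

An out-of-range label id reaching the cross-entropy loss is a typical trigger for this particular assert on GPU, so if the counts above disagree, that is the first thing to fix.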

Closing for now; feel free to reopen if this is still not resolved.

ctheodoris changed discussion status to closed
