RuntimeError: CUDA error: device-side assert triggered during Geneformer validation
Hello Hugging Face Community,
I am currently working with the 6-layer Geneformer model from this Hugging Face repository, and I encountered the following error during validation:
RuntimeError Traceback (most recent call last)
Cell In[38], line 6
1 train_valid_id_split_dict = {"attr_key": "individual",
2 "train": train_ids,
3 "eval": eval_ids}
5 # 6 layer Geneformer: https://huggingface.co/ctheodoris/Geneformer/blob/main/model.safetensors
----> 6 all_metrics = cc.validate(model_directory="./",
7 prepared_input_data_file=f"{output_dir}/{output_prefix}_labeled_train.dataset",
8 id_class_dict_file=f"{output_dir}/{output_prefix}_id_class_dict.pkl",
9 output_directory=output_dir,
10 output_prefix=output_prefix,
11 split_id_dict=train_valid_id_split_dict)
12 # to optimize hyperparameters, set n_hyperopt_trials=100 (or alternative desired # of trials)
File ~/Geneformer-new/Geneformer/geneformer/classifier.py:785, in Classifier.validate(self, model_directory, prepared_input_data_file, id_class_dict_file, output_directory, output_prefix, split_id_dict, attr_to_split, attr_to_balance, gene_balance, max_trials, pval_threshold, save_eval_output, predict_eval, predict_trainer, n_hyperopt_trials, save_gene_split_datasets, debug_gene_split_datasets)
783 train_data = data.select(train_indices)
784 if n_hyperopt_trials == 0:
--> 785 trainer = self.train_classifier(
786 model_directory,
787 num_classes,
788 train_data,
789 eval_data,
790 ksplit_output_dir,
791 predict_trainer,
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Things I have tried:
Setting CUDA_LAUNCH_BLOCKING=1: this makes CUDA operations run synchronously, but I still hit the same error, and the stack trace was not much more informative (see the snippet after this list for how I set it).
Switching to CPU: Running on CPU works without this issue, which leads me to believe it might be related to GPU or CUDA-specific tensor operations.
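For reference, this is roughly how I enable the synchronous launches, a minimal sketch assuming the variable is set before torch initializes CUDA (it can equally be exported in the shell, e.g. CUDA_LAUNCH_BLOCKING=1 python script.py):

import os

# Must be set before torch creates its CUDA context so kernel launches run
# synchronously and the assert is reported at the call that triggered it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var on purpose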
There may be a mismatch between the number of classes the model expects and the labels in the validation/train/test datasets. I recommend checking the classes, since it could simply be that the model expects a different number of classes than what the dataset provides. I am also curious what the output is with device-side assertions enabled via TORCH_USE_CUDA_DSA.
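To check the class-count hypothesis, something along these lines may help: it compares the label ids stored in the prepared dataset against the number of classes implied by the id_class_dict pickle. The "label" column name and the pickle layout are assumptions on my part, so adjust them to whatever the Classifier actually wrote out; the paths mirror the validate() call above.

import pickle
from datasets import load_from_disk

# Placeholders -- use the same values as in your notebook.
output_dir = "./output"
output_prefix = "my_run"

dataset = load_from_disk(f"{output_dir}/{output_prefix}_labeled_train.dataset")
with open(f"{output_dir}/{output_prefix}_id_class_dict.pkl", "rb") as f:
    id_class_dict = pickle.load(f)

num_classes = len(id_class_dict)
# "label" is an assumed column name -- check dataset.column_names for the real one.
labels = set(dataset["label"])

print("classes in id_class_dict:", num_classes)
print("unique label ids in dataset:", sorted(labels))

# A device-side assert in the loss/gather kernels on GPU is often an
# out-of-range label id; every label should fall in [0, num_classes).
print("out-of-range label ids:", [l for l in labels if not 0 <= l < num_classes])

If any label id falls outside [0, num_classes), that alone can trigger the device-side assert on GPU while the CPU run surfaces a plain IndexError (or sometimes runs further before failing), which would fit the behavior you describe.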