Getting this error while trying to fine-tune

#2
by Rapidinnovation - opened

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Yeah, I've seen that a few times in multi-GPU setups with unquantised models. I'm afraid I don't know what causes it, but I don't believe it's specific to these files, as I've seen it with several Llama models. It might be a Transformers bug.
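
If it helps narrow things down, here's a rough sketch of the debugging step the error message itself suggests: setting `CUDA_LAUNCH_BLOCKING=1` so the stack trace points at the kernel that actually failed. The second half is purely an assumption on my part, not a diagnosis of this case: device-side asserts are often reported when an index is out of range (for example a token ID at or above the embedding size), so a quick range check on the batch can rule that out. The helper `find_out_of_range_ids` and the example values are hypothetical.

```python
import os

# Set this at the very top of the training script, before CUDA is
# initialised (to be safe, before importing torch), so kernel launches run
# synchronously and the reported stack trace matches the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_out_of_range_ids(token_ids, vocab_size):
    """Return the input token IDs that fall outside [0, vocab_size)."""
    return [t for t in token_ids if t < 0 or t >= vocab_size]


# Hypothetical values for illustration only; in a real run these would come
# from the tokenised input_ids and the size of the model's embedding table.
example_ids = [1, 15043, 32000, 2]
example_vocab_size = 32000

bad_ids = find_out_of_range_ids(example_ids, example_vocab_size)
print(bad_ids)  # prints [32000]
```

An empty list here would only rule out this one possible trigger. Running a single small batch on CPU is another way to get a readable Python error instead of the CUDA assert.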
