Error in ATen\native\cuda\IndexKernel.cu
The following error occurs and generation fails when an input exceeding the 8192-token context length is given during inference. Does anyone know what the solution is?
- error and output:
(llm) C:\code\llm>python gemma2-9b_test.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00, 3.23s/it]
c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [92,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [93,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [94,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [95,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "C:\code\llm\gemma2-9b_test.py", line 73, in
output = pipe(messages, **generation_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\text_generation.py", line 257, in call
return super().call(Chat(text_inputs), **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\base.py", line 1254, in call
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\base.py", line 1261, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\base.py", line 1161, in forward
model_outputs = self._forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\text_generation.py", line 349, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\generation\utils.py", line 1914, in generate
result = self._sample(
^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\generation\utils.py", line 2651, in _sample
outputs = self(
^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 1063, in forward
outputs = self.model(
^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 903, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 645, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 557, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\cache_utils.py", line 1071, in update
return update_fn(
^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\cache_utils.py", line 1046, in _static_update
k_out[:, :, cache_position] = key_states
~~~~~^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- pip packages:
accelerate 0.31.0
certifi 2024.6.2
charset-normalizer 3.3.2
colorama 0.4.6
einops 0.8.0
filelock 3.13.1
flash-attn 2.5.9.post1
fsspec 2024.2.0
huggingface-hub 0.23.4
idna 3.7
intel-openmp 2021.4.0
Jinja2 3.1.3
MarkupSafe 2.1.5
mkl 2021.4.0
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.3
packaging 24.1
pillow 10.2.0
pip 24.1.1
psutil 5.9.8
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
safetensors 0.4.3
setuptools 65.5.0
sympy 1.12
tbb 2021.11.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.42.2
typing_extensions 4.9.0
urllib3 2.2.1
wheel 0.43.0
- generation code:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datetime import datetime
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
HF_TOKEN = '*****'
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2-9b-it",
device_map="cuda",
torch_dtype="auto",
trust_remote_code=True,
token=HF_TOKEN,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it", token=HF_TOKEN)
messages = [
{"role": "user", "content": f"""\messages......"""},
]
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
)
generation_args = {
"max_new_tokens": 500,
"return_full_text": False,
"temperature": 0.2,
"do_sample": True,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
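For reference, here is a minimal sketch (not part of the original script) of how the tokenized prompt length could be checked before calling the pipeline. It assumes the same tokenizer, messages, and generation_args as above, uses the standard apply_chat_template API, and takes 8192 as Gemma 2's advertised context length:

# Hypothetical pre-flight check: measure the prompt length in tokens.
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
prompt_len = prompt_ids.shape[-1]
print(f"prompt tokens: {prompt_len}")
if prompt_len + generation_args["max_new_tokens"] > 8192:
    print("warning: prompt + new tokens exceed the 8192-token context length")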
Hey @koromatsu, thanks for your report! Could you upgrade transformers to the latest version (v4.42.3) and let me know if you still have the same issue?
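For reference, a quick way to confirm which transformers version is actually active in the environment after upgrading (standard attribute, shown only as a sanity check):

import transformers
print(transformers.__version__)  # should print 4.42.3 after the upgrade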
@lysandre I think this might be an issue with the sliding window. I can reproduce it on 4.42.3 whenever the combined input and output crosses the 4096-token threshold.
For example - tokenized input:
print(input_ids['input_ids'].size()[1])
returns: 3922
This works:
outputs = model.generate(**input_ids,
max_new_tokens=173,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
print(tokenizer.decode(outputs[0]))
This fails:
outputs = model.generate(**input_ids,
max_new_tokens=180,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
print(tokenizer.decode(outputs[0]))
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
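To make the threshold explicit, here is a hypothetical helper (added for illustration, assuming the 4096-token sliding window is the relevant limit) that reproduces the arithmetic of the two calls above:

# With a 3922-token prompt, 173 new tokens stays at 4095 (works),
# while 180 new tokens reaches 4102 (triggers the CUDA index assertion).
SLIDING_WINDOW = 4096  # assumed Gemma 2 sliding-window size

def crosses_sliding_window(prompt_len: int, max_new_tokens: int) -> bool:
    return prompt_len + max_new_tokens > SLIDING_WINDOW

print(crosses_sliding_window(3922, 173))  # False -> generation succeeds
print(crosses_sliding_window(3922, 180))  # True  -> assertion failure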
This occurs even with eager attention:
eager <class 'transformers.models.gemma2.modeling_gemma2.Gemma2Attention'>
I believe this is the root of many of the reported issues, e.g. https://huggingface.co/google/gemma-2-27b-it/discussions/9
This looks like a potential fix: https://huggingface.co/google/gemma-2-9b/discussions/8. Will try it and report back.
Digging deeper, that attempted fix was for flash attention and won't help with eager attention. Turning the cache off avoids the error, but generating a single output (sketched below) had been running for more than 5 minutes before I killed it; even if it works, it isn't practical (vs. ~12 seconds with the cache on and a slightly shorter output).
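For completeness, this is roughly what "turning the cache off" looks like, assuming use_cache is simply passed through to generate() (it is a standard generate kwarg); generation becomes extremely slow because every step recomputes attention over the full prefix:

# Sketch only: disabling the KV cache as a workaround test.
outputs = model.generate(
    **input_ids,
    max_new_tokens=180,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    use_cache=False,
)
print(tokenizer.decode(outputs[0]))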
I'm trying to figure out what the issue is with the sliding window and the cache, but it might be over my head.
Hi @koromatsu, the above error indicates an out-of-bounds access during the model's internal processing on the GPU. Could you please try again with device_map="auto", so the library automatically determines the best available devices (CPUs or GPUs) to load the model onto? Let us know if this helps! Thank you.
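For clarity, the suggested change applied to the original loading code would look roughly like this (only device_map differs from the script above):

# Sketch of the suggested change: let the library place the model automatically
# instead of forcing everything onto a single CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    device_map="auto",      # was device_map="cuda"
    torch_dtype="auto",
    trust_remote_code=True,
    token=HF_TOKEN,
)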