Error in ATen\native\cuda\IndexKernel.cu
The following error occurs and generation fails when an input exceeding the 8192-token context length is given during inference. Does anyone know what the solution is?
- error and output:
(llm) C:\code\llm>python gemma2-9b_test.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00, 3.23s/it]
c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py:577: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [92,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [93,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [94,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [16485,0,0], thread: [95,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "C:\code\llm\gemma2-9b_test.py", line 73, in
output = pipe(messages, **generation_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\text_generation.py", line 257, in call
return super().call(Chat(text_inputs), **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\base.py", line 1254, in call
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\base.py", line 1261, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\base.py", line 1161, in forward
model_outputs = self._forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\pipelines\text_generation.py", line 349, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\generation\utils.py", line 1914, in generate
result = self._sample(
^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\generation\utils.py", line 2651, in _sample
outputs = self(
^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 1063, in forward
outputs = self.model(
^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 903, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 645, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\models\gemma2\modeling_gemma2.py", line 557, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\cache_utils.py", line 1071, in update
return update_fn(
^^^^^^^^^^
File "c:\code\virtualens\llm\Lib\site-packages\transformers\cache_utils.py", line 1046, in _static_update
k_out[:, :, cache_position] = key_states
~~~~~^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- pip packages:
accelerate 0.31.0
certifi 2024.6.2
charset-normalizer 3.3.2
colorama 0.4.6
einops 0.8.0
filelock 3.13.1
flash-attn 2.5.9.post1
fsspec 2024.2.0
huggingface-hub 0.23.4
idna 3.7
intel-openmp 2021.4.0
Jinja2 3.1.3
MarkupSafe 2.1.5
mkl 2021.4.0
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.3
packaging 24.1
pillow 10.2.0
pip 24.1.1
psutil 5.9.8
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
safetensors 0.4.3
setuptools 65.5.0
sympy 1.12
tbb 2021.11.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.42.2
typing_extensions 4.9.0
urllib3 2.2.1
wheel 0.43.0
- generation code:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datetime import datetime
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
HF_TOKEN = '*****'
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2-9b-it",
device_map="cuda",
torch_dtype="auto",
trust_remote_code=True,
token=HF_TOKEN,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it", token=HF_TOKEN)
messages = [
{"role": "user", "content": f"""\messages......"""},
]
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
)
generation_args = {
"max_new_tokens": 500,
"return_full_text": False,
"temperature": 0.2,
"do_sample": True,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
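For reference, here is a minimal sketch (not part of the original script) of how the tokenized prompt length could be checked before calling the pipeline. It assumes the same tokenizer, messages, and generation_args as above, uses the standard apply_chat_template API, and takes 8192 as Gemma 2's advertised context length:

# Hypothetical pre-flight check: measure the prompt length in tokens.
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
prompt_len = prompt_ids.shape[-1]
print(f"prompt tokens: {prompt_len}")
if prompt_len + generation_args["max_new_tokens"] > 8192:
    print("warning: prompt + new tokens exceed the 8192-token context length")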
Hey @koromatsu, thanks for your report! Could you upgrade transformers to the latest version (v4.42.3) and let me know if you still have the same issue?
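For reference, a quick way to confirm which transformers version is actually active in the environment after upgrading (standard attribute, shown only as a sanity check):

import transformers
print(transformers.__version__)  # should print 4.42.3 after the upgrade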
@lysandre I think this might be an issue with the sliding window. I can reproduce it on 4.42.3 whenever the combined input and output crosses the 4096-token threshold.
For example - tokenized input:
print(input_ids['input_ids'].size()[1])
returns: 3922
This works:
outputs = model.generate(**input_ids,
max_new_tokens=173,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
print(tokenizer.decode(outputs[0]))
This fails:
outputs = model.generate(**input_ids,
max_new_tokens=180,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
print(tokenizer.decode(outputs[0]))
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...
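To make the threshold explicit, here is a hypothetical helper (added for illustration, assuming the 4096-token sliding window is the relevant limit) that reproduces the arithmetic of the two calls above:

# With a 3922-token prompt, 173 new tokens stays at 4095 (works),
# while 180 new tokens reaches 4102 (triggers the CUDA index assertion).
SLIDING_WINDOW = 4096  # assumed Gemma 2 sliding-window size

def crosses_sliding_window(prompt_len: int, max_new_tokens: int) -> bool:
    return prompt_len + max_new_tokens > SLIDING_WINDOW

print(crosses_sliding_window(3922, 173))  # False -> generation succeeds
print(crosses_sliding_window(3922, 180))  # True  -> assertion failure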
This occurs even with eager attention:
eager <class 'transformers.models.gemma2.modeling_gemma2.Gemma2Attention'>
I believe this is the root of many of the reported issues, e.g. https://huggingface.co/google/gemma-2-27b-it/discussions/9
This looks like a potential fix: https://huggingface.co/google/gemma-2-9b/discussions/8. Will try it and report back.
Digging deeper, that attempted fix was for flash attention and won't help with eager attention. Turning the cache off avoids the error, but generating a single output (sketched below) had been running for more than 5 minutes before I killed it; even if it works, it isn't practical (vs. ~12 seconds with the cache on and a slightly shorter output).
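For completeness, this is roughly what "turning the cache off" looks like, assuming use_cache is simply passed through to generate() (it is a standard generate kwarg); generation becomes extremely slow because every step recomputes attention over the full prefix:

# Sketch only: disabling the KV cache as a workaround test.
outputs = model.generate(
    **input_ids,
    max_new_tokens=180,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    use_cache=False,
)
print(tokenizer.decode(outputs[0]))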
I'm trying to figure out what the issue is with the sliding window and the cache, but it might be over my head.
Hi @koromatsu, the above error indicates an out-of-bounds access during the model's internal processing on the GPU. Could you please try again with device_map="auto", so the library automatically determines the best available devices (CPUs or GPUs) to load the model onto? Let us know if this helps! Thank you.
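For clarity, the suggested change applied to the original loading code would look roughly like this (only device_map differs from the script above):

# Sketch of the suggested change: let the library place the model automatically
# instead of forcing everything onto a single CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    device_map="auto",      # was device_map="cuda"
    torch_dtype="auto",
    trust_remote_code=True,
    token=HF_TOKEN,
)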