No operator found for `memory_efficient_attention_forward` error when setting use_memory_efficient_attention=True
#8, opened by xiewk
Hi,
When I load the model with use_memory_efficient_attention=True, inference fails with a "No operator found for memory_efficient_attention_forward" error. Do you know why?
My code is as below, which is basically a copy of your sample code with only very slight changes.
import torch.nn.functional as F
import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig
input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]
model_name_or_path = 'Alibaba-NLP/gte-large-en-v1.5'
revision = 'main'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, code_revision=revision)
model = AutoModel.from_pretrained(
    model_name_or_path,
    revision=revision,
    trust_remote_code=True,
    use_memory_efficient_attention=True,
    device_map='cuda',
)
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
with torch.autocast(device_type='cuda', dtype=torch.float16):
    outputs = model(**{
        k: v.to(model.device) if isinstance(v, torch.Tensor) else v
        for k, v in batch_dict.items()
    })
embeddings = outputs.last_hidden_state[:, 0]
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
My environment is Ubuntu 22.04 with an NVIDIA RTX 3050 Ti GPU and CUDA 12.2.
Versions of key packages:
python=3.10
torch==2.3.0
xformers==0.0.26.post1
transformers==4.40.1
The error message:
  File "/home/victor/.cache/huggingface/modules/transformers_modules/Alibaba-NLP/new-impl/b7ea01ba91f26ef946f8c25261151b13aa502268/modeling.py", line 499, in forward
    context_layer = self.memory_efficient_attention(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 268, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 392, in _memory_efficient_attention
    return _fMHA.apply(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 67, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 417, in _memory_efficient_attention_forward_requires_grad
    op = _dispatch_fw(inp, True)
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 125, in _dispatch_fw
    return _run_priority_list(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 65, in _run_priority_list
    raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
    query : shape=(4, 10, 16, 64) (torch.float16)
    key : shape=(4, 10, 16, 64) (torch.float16)
    value : shape=(4, 10, 16, 64) (torch.float16)
    attn_bias : <class 'torch.Tensor'>
    p : 0.0
`flshattF@v2.5.6` is not supported because:
    attn_bias type is <class 'torch.Tensor'>
`cutlassF` is not supported because:
    attn_bias.stride(-2) % 8 != 0 (attn_bias.stride() = (1600, 100, 10, 1))
    attn_bias.stride(-2) % 8 != 0 (attn_bias.stride() = (1600, 100, 10, 1))
    HINT: To use an `attn_bias` with a sequence length that is not a multiple of 8, you need to ensure memory is aligned by slicing a bigger tensor. Example: use `attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5]` instead of `torch.zeros([1, 1, 5, 5])`
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    dtype=torch.float16 (supported: {torch.float32})
    bias with non-zero stride not supported
    unsupported embed per head: 64
Hi, you could set unpad_inputs=True together with use_memory_efficient_attention=True, or set pad_to_multiple_of=8 when tokenizing.
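To see why padding the sequence length helps, here is a minimal pure-Python sketch of the stride arithmetic behind the `cutlassF` complaint. It assumes the attention bias is a contiguous tensor of shape (1, heads, seq_len, seq_len), which is a guess that happens to reproduce the strides in the error message; `contiguous_strides` and `round_up` are hypothetical helper names for illustration, not part of xformers or transformers.

```python
def contiguous_strides(shape):
    """Element strides of a contiguous (row-major) tensor with the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def round_up(n, multiple):
    """Round n up to the nearest multiple, as padding to a multiple of 8 would."""
    return -(-n // multiple) * multiple


heads = 16
seq_len = 10  # sequence length taken from the error message

# Assumed bias layout: (batch, heads, seq_len, seq_len), contiguous.
bias_strides = contiguous_strides((1, heads, seq_len, seq_len))
print(bias_strides)          # (1600, 100, 10, 1), same as in the error
print(bias_strides[-2] % 8)  # 2, so the alignment check fails

# Padding the sequence length up to a multiple of 8 makes the row stride aligned.
padded_len = round_up(seq_len, 8)
padded_strides = contiguous_strides((1, heads, padded_len, padded_len))
print(padded_len, padded_strides[-2] % 8)  # 16 0
```

With a length of 10 the row stride is 10, which is not divisible by 8; padded to 16 it is. That is presumably why passing pad_to_multiple_of=8 to the tokenizer call avoids the error.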
izhx changed discussion status to closed
And here is the output of a successful run.
All model checkpoint weights were used when initializing NewModel.
All the weights of NewModel were initialized from the model checkpoint at Alibaba-NLP/gte-large-en-v1.5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use NewModel for predictions without further training.
[[41.875, 77.125, 37.03125]]