Problems with flash-attention2
According to the model card, if you want faster inference using flash-attention2, you need to install these dependencies:
pip install packaging ninja
pip install flash-attn==v2.1.1 --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git@v2.1.1#subdirectory=csrc/rotary
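A quick sanity check (not from the model card, just a way to confirm the build produced a loadable extension) is to import the kernel interface directly:

import flash_attn
from flash_attn import flash_attn_func  # raises ImportError if the compiled CUDA extension cannot be loaded

print(flash_attn.__version__)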
Now flash-attention2 seems to be mandatory, because modeling_flash_llama.py contains:
try:
    from flash_attn.flash_attn_interface import (
        flash_attn_kvpacked_func,
        ...
    )
except ImportError:
    flash_attn_v2_installed = False
    raise ImportError('Please install Flash Attention: `pip install flash-attn --no-build-isolation`')
Question 1) Is it possible to use leo-hessianai-13b-chat (and 7b-chat) without flash-attention2? How?
When I run the above pip install commands with a recent torch version (2.2) and CUDA version (cu121), there is a library mismatch with flash-attn==v2.1.1 and the import fails with an undefined symbol:
import torch; from flash_attn import flash_attn_func
Traceback (most recent call last):
import flash_attn_2_cuda as flash_attn_cuda
site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: at::_ops::_pad_enum::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, long, c10::optional<double>)
(the error output was passed through c++filt)
Question 2) With what pytorch/cuda version was leo-hessianai built? Which versions of flash-attn2 can be used? Are you using cxx11abiFALSE or cxx11abiTRUE (see https://github.com/Dao-AILab/flash-attention/issues/457)?
I tried the latest flash-attn==2.5.1.post1 and a couple of earlier pytorch/cuda versions, but without success. (A similar issue is mentioned in https://github.com/Dao-AILab/flash-attention/issues/836, but I do not run Docker; another is mentioned in https://github.com/Dao-AILab/flash-attention/issues/667#issuecomment-1816039443, but I did not succeed with the versions mentioned there.)
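For anyone debugging the same undefined-symbol error, here is a small diagnostic sketch that prints the values the prebuilt flash-attn wheels are keyed on: the torch version, the CUDA version torch was built against, and the C++11 ABI setting.

import torch

# An "undefined symbol" at import time usually means the flash-attn wheel was
# built against a different torch/CUDA/ABI combination than the one installed.
print("torch version:", torch.__version__)
print("torch CUDA version:", torch.version.cuda)
print("cxx11 ABI:", torch.compiled_with_cxx11_abi())

The output can then be matched against the wheel names on the flash-attention release page, which encode exactly these three values.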
As of now, you just need to use trust_remote_code=False, or omit the argument entirely. This model is fully compatible with the flash attention implementation in transformers.
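For reference, a minimal loading sketch along those lines (the Hugging Face model id and the attn_implementation choice are assumptions, adjust as needed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LeoLM/leo-hessianai-13b-chat"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,                  # no trust_remote_code, so transformers' built-in Llama code is used
    torch_dtype=torch.bfloat16,
    device_map="auto",         # requires the accelerate package
    # "flash_attention_2" uses transformers' own integration (flash-attn must be installed);
    # "sdpa" or "eager" avoid flash-attn entirely, which also answers Question 1.
    attn_implementation="sdpa",
)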