Flash Attention error during inference

#7
by hackint0sh - opened

Getting the error:

AssertionError: Flash Attention is not available, but is needed for dense attention

Currently using an NVIDIA A10G GPU.

Libraries installed:
!pip install git+https://github.com/huggingface/transformers
!pip install tiktoken==0.6.0 triton==2.3.0

Other libraries:
!pip install einops accelerate bitsandbytes
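
If you want to confirm that the missing package is really the cause before reinstalling anything, a minimal check (my own snippet, not from the model repo) is to test whether flash_attn is importable:

# Minimal sketch: the assertion fires when the flash_attn package is missing.
import importlib.util

if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn is not installed; the dense attention path will raise this assertion")
else:
    print("flash-attn is importable")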

Currently running into this as well (running on 4 A100s).

Actively installing flash-attn to see if this fixes it (but I can't get ninja to work for a fast compile, so the build is slow).
https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features

(flash-attn needs CUDA, gcc, etc. to build)
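
For reference, the install commands from that README boil down to the following (the MAX_JOBS cap is the README's suggestion when the build runs low on RAM; adjust to your machine):

!pip install ninja                                        # ninja speeds up the flash-attn compile a lot
!MAX_JOBS=4 pip install flash-attn --no-build-isolation   # cap parallel build jobs if RAM is limited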

Was able to get flash-attention to install with the correct PyTorch version:
https://pytorch.org/get-started/previous-versions/#v210
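
In case it helps anyone else, the pinned install from that page looks like this (CUDA 11.8 wheels shown; pick the index URL that matches your driver, e.g. cu121):

!pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118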

Microsoft org

Did installing flash-attention fix the issue?

Yes. I am able to run the Phi-3-small model now. I was also able to get ninja installed, which reduced the flash-attention build time to under 3 minutes (12 cores, 120 GB RAM).
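
For anyone following along, this is roughly the load that works for me now; the repo id and dtype below are illustrative, so adjust them to the checkpoint you are using:

# Rough sketch; repo id, dtype and device_map are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-small-8k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # spread layers across the available GPUs
    trust_remote_code=True,   # the model ships custom modeling code
)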

Microsoft org

Awesome! Closing this out for now.
Adding torch SDPA attention to remove the hard dependency on flash-attention is something we can follow up on if this becomes a significant barrier to adoption. We're glad for your interest in Phi-3-small and hope you find it useful!
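
For context, once SDPA support lands, selecting it would just use the existing attn_implementation argument in transformers; the sketch below is illustrative only and does not work for Phi-3-small today:

# Illustrative only: Phi-3-small still requires flash-attention at the time of this thread.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-small-8k-instruct",   # repo id shown for illustration
    attn_implementation="sdpa",            # torch.nn.functional.scaled_dot_product_attention
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)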

bapatra changed discussion status to closed
