I'm running out of memory while generating on an RTX A5000 (24 GB)

#83
by xsanskarx - opened

It runs out of memory every time:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.callbacks import StdOutCallbackHandler
from langchain.chains import RetrievalQA

model_name_or_path = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1048,
    return_full_text=False,
    temperature=0.3,
    do_sample=True,
)

llm = HuggingFacePipeline(pipeline=pipe)
torch.cuda.empty_cache()

handler = StdOutCallbackHandler()

# ensemble_retriever and custom_prompt are built earlier in the notebook
qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever,
    callbacks=[handler],
    chain_type_kwargs={"prompt": custom_prompt},
    return_source_documents=True,
)

Install flash-attn
!pip install flash-attn --no-build-isolation

Add attn_implementation="flash_attention_2" to your AutoModelForCausalLM.from_pretrained arguments:
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True, attn_implementation="flash_attention_2")
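One caveat: the flash_attention_2 backend only works with fp16/bf16 weights, and from_pretrained loads in fp32 unless a dtype is passed, so it helps to set one explicitly. A sketch of the combined call, assuming bf16 is acceptable for your use case:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,              # flash-attn kernels require fp16 or bf16
    attn_implementation="flash_attention_2",
)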

Flash Attention helps reduce memory usage; it cut my VRAM usage by about 10 GB when quantizing with GPTQ.

I'm also facing a similar issue with a V100 GPU.

I tried attn_implementation="flash_attention_2" as suggested, but I am getting the following error:

ValueError: The current flash attention version does not support sliding window attention.

Based on my research you are supposed to install flash-attn separately, but I already did that (and restarted my kernel) and I am still getting the error.

Name: flash-attn
Version: 2.5.9.post1

from transformers.utils import is_flash_attn_2_available
is_flash_attn_2_available()
True
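As far as I know, is_flash_attn_2_available() mostly confirms that the package is installed at a compatible version and that CUDA is present; it does not confirm that the specific GPU can run the kernels. A quick hardware sanity check (flash-attn 2 needs an Ampere-class GPU, compute capability 8.0 or newer, while the V100 is 7.0):

import torch

# flash-attn 2 needs compute capability >= 8.0 (Ampere or newer)
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
print("flash-attn 2 supported:", (major, minor) >= (8, 0))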

Uninstall and reinstall flash-attn
pip list | grep flash
pip uninstall ...
pip install flash-attn --no-build-isolation

Looking at the Phi3 modeling implementation in transformers, the ValueError comes from a check on the installed flash-attn build rather than from the model itself: if flash-attn does not support sliding-window attention, Phi3FlashAttention2 refuses to run:

# Phi3FlashAttention2 attention does not support output_attentions

        if not _flash_supports_window_size:
            logger.warning_once(
                "The current flash attention version does not support sliding window attention. Please use `attn_implementation='eager'` or upgrade flash-attn library."
            )
            raise ValueError("The current flash attention version does not support sliding window attention.")
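For what it's worth, the _flash_supports_window_size flag that gates this error is, roughly, an introspection of the installed flash-attn API, so you can reproduce the check directly to see whether your build exposes sliding-window support (which I believe arrived around flash-attn 2.3). A sketch of that check:

import inspect
from flash_attn import flash_attn_func

# transformers-style check: sliding windows are usable only if flash_attn_func
# accepts a `window_size` argument
print("sliding-window support:", "window_size" in inspect.signature(flash_attn_func).parameters)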

Are you trying to run Phi3 causally or with some other method like Seq2Seq?

Thank you. I loaded an entirely new kernel and got that part resolved, but then discovered that my NVIDIA V100 GPU is not supported by Flash Attention.

I am using Phi3 causally for this. Well, I guess I will continue researching on my own and perhaps open a new conversation, as I don't want to hijack the OP's post.
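If the GPU simply cannot run flash-attn, the fallback the warning message itself suggests is to load with the eager attention implementation (SDPA is another option). A sketch for a V100, which also lacks native bf16 support:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,     # V100 has no native bf16, so use fp16
    attn_implementation="eager",   # or "sdpa"; both avoid the flash-attn requirement
)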

nguyenbh changed discussion status to closed
