Out of CPU memory in pipeline
Hi!
The model files themselves are small and should fit in 24 GB of GPU memory.
But if I try the suggested pipeline:
pipe = pipeline("text-generation", model="relaxml/Llama-2-70b-E8P-2Bit")
CPU memory usage grows until it reaches 128 GB, and then the process is killed for running out of memory (OOM).
Can I avoid this memory allocation?
I'm confused, what is this suggested pipeline? I don't think we have any code in our codebase that uses a pipeline() call.
Hm... On the "Model card" tab, it's the rightmost button above the "Downloads" chart: "Use in Transformers" :)
Do you have a working code example?
Yes, in our GitHub repo: https://github.com/Cornell-RelaxML/quip-sharp. We use a modified version of the modeling_llama.py file to handle our quantized linear layers, which is why calling the default pipeline() command without using our repo will not work.
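For a rough sense of the sizes involved, here is a back-of-the-envelope sketch of the weight footprint at different precisions. It assumes a nominal 70e9 parameters (taken from the "70b" in the model name) and ignores codebooks, embeddings, and activations. The helper name weight_footprint_gb is made up for illustration. Whether the stock loader actually materializes full-precision weights in CPU RAM is an assumption here, not something confirmed in this thread, but the numbers would be consistent with the 2-bit files fitting in 24 GB while an unquantized copy does not fit in 128 GB.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count implied by "70b" in the model name.
NUM_PARAMS = 70e9

for label, bits in [("2-bit quantized", 2), ("fp16", 16), ("fp32", 32)]:
    print(f"{label}: ~{weight_footprint_gb(NUM_PARAMS, bits):.1f} GB")
```

By this estimate the 2-bit weights come to about 17.5 GB, while fp16 and fp32 copies would need about 140 GB and 280 GB respectively.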
Thank you!