Out-of-memory when loading the model with vLLM
When I try to load the model using vLLM, it consumes all of my memory (128 GB) and throws an out-of-memory (OOM) error. The pipeline from the transformers library can be used instead, but the inference results are abnormal.
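Roughly, the load looks like the following; the checkpoint name and prompt are placeholders on my part, not the exact script:

```python
# Minimal sketch of the vLLM load that triggers the OOM.
# The checkpoint name and prompt are placeholders, not the exact command used.
from vllm import LLM, SamplingParams

llm = LLM(model="TIGER-Lab/MAmmoTH-7B-Mistral")  # assumed checkpoint id
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["What is 17 * 24?"], params)
print(outputs[0].outputs[0].text)
```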
Is this OOM on the CPU or the GPU? It should work fine. I have the inference code at https://github.com/TIGER-AI-Lab/MAmmoTH/blob/main/requirements.txt.
It is CPU OOM. Memory consumption keeps growing and eventually exhausts all of my 128 GB of RAM, which shouldn't be the case. Other MAmmoTH models don't exhibit this issue.
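A sketch of the kind of check that shows the growth, printing the process RSS around generate calls (psutil and the prompt batch are just for illustration; `llm` and `params` are from the sketch above):

```python
# Print the process resident set size (RSS) after each batch of generations
# to watch CPU memory climb. psutil and the batch contents are assumptions.
import os
import psutil

proc = psutil.Process(os.getpid())

def rss_gb() -> float:
    return proc.memory_info().rss / 1024 ** 3

prompts = ["What is 17 * 24?"] * 32  # placeholder batch
for step in range(5):
    llm.generate(prompts, params)
    print(f"batch {step}: RSS = {rss_gb():.1f} GB")
```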
I see. Normally inference won't take that much memory. Can you confirm whether it comes from vLLM or from the Mistral model itself (Hugging Face Transformers)? I think these two would be the main sources.
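Something along these lines, with plain Transformers only, should isolate it; the checkpoint id and the dtype/loading flags here are assumptions, not the exact settings from your run:

```python
# Load the checkpoint with Hugging Face Transformers alone (no vLLM) and check
# whether CPU memory still balloons. Checkpoint id and dtype are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TIGER-Lab/MAmmoTH-7B-Mistral"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # avoid materializing fp32 weights on CPU
    low_cpu_mem_usage=True,      # stream weights instead of building an extra copy
    device_map="auto",           # place layers on GPU if one is available
)

inputs = tokenizer("What is 17 * 24?", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```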
I tried using Hugging Face Transformers instead of vLLM, and I encountered the same out-of-memory issue.