CPU VS GPU computation time for Mixtral-8x7B-Instruct-v0.1
#85
by kmukeshreddy · opened
I created a base prompt, set a maximum token limit, and ran the same prompt on both a CPU and a GPU. To my surprise, the computation time was identical for both runs. Has anyone else encountered this, or does anyone have insight into why it might happen? (Typically, a GPU should perform these computations much faster than a CPU.)
If you are using Hugging Face Transformers, you must move both the model and the input IDs to CUDA; otherwise everything runs on the CPU regardless of what hardware is available.
Do it with `model.cuda()` for the model, and reassign the result for the inputs, e.g. `input_ids = input_ids.cuda()` (tensor `.cuda()` is not in-place).
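For reference, a minimal sketch of what that looks like with `transformers`; the model ID is taken from the thread title, and the prompt and `max_new_tokens` value are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# nn.Module.cuda() moves the parameters in place.
model.cuda()

prompt = "Explain the difference between CPU and GPU inference."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt")
# Tensor .cuda() returns a NEW tensor, so the reassignment is required.
input_ids = inputs["input_ids"].cuda()

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If either the model or the inputs stay on the CPU, you'll get a device-mismatch error or silent CPU execution, which is why the timings can look identical.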
Yes, the issue was that the full model did not fit in the GPUs I have.
Once I quantized the model, GPU inference is fast.
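For anyone landing here with the same problem, a minimal sketch of 4-bit loading with bitsandbytes might look like this; it assumes the `bitsandbytes` and `accelerate` packages are installed, and the quantization settings are illustrative rather than what was used in this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Illustrative 4-bit config; adjust to your hardware and quality needs.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # shards layers across the available GPUs
)
```

With `device_map="auto"` the model is already placed on the GPU(s), so inputs only need to be moved to the model's device before calling `generate`.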
kmukeshreddy changed discussion status to closed