CPU VS GPU computation time for Mixtral-8x7B-Instruct-v0.1
#85
by kmukeshreddy · opened
I created a base prompt, set a maximum token limit, and ran the same prompt on both a CPU and a GPU. To my surprise, the computation time was identical for both runs. Has anyone else encountered this, or does anyone have insight into why it might happen? (Typically, a GPU should perform these computations much faster than a CPU.)
If you are using Hugging Face Transformers, you must move both the model and the input IDs to CUDA; otherwise everything runs on the CPU regardless of what hardware is available.
Do it with `model.cuda()` for the model, and reassign the result for the inputs, e.g. `input_ids = input_ids.cuda()` (tensor `.cuda()` is not in-place).
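For reference, a minimal sketch of what that looks like with `transformers`; the model ID is taken from the thread title, and the prompt and `max_new_tokens` value are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# nn.Module.cuda() moves the parameters in place.
model.cuda()

prompt = "Explain the difference between CPU and GPU inference."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt")
# Tensor .cuda() returns a NEW tensor, so the reassignment is required.
input_ids = inputs["input_ids"].cuda()

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If either the model or the inputs stay on the CPU, you'll get a device-mismatch error or silent CPU execution, which is why the timings can look identical.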
Yes, the issue was that the full model did not fit in the GPUs I have.
Once I quantized the model, GPU inference is fast.
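For anyone landing here with the same problem, a minimal sketch of 4-bit loading with bitsandbytes might look like this; it assumes the `bitsandbytes` and `accelerate` packages are installed, and the quantization settings are illustrative rather than what was used in this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Illustrative 4-bit config; adjust to your hardware and quality needs.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # shards layers across the available GPUs
)
```

With `device_map="auto"` the model is already placed on the GPU(s), so inputs only need to be moved to the model's device before calling `generate`.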
kmukeshreddy changed discussion status to closed