Issue with multi-GPU inference.

#1
by eastwind - opened

Just tested the model, looks good. But it seems you have inherited an issue from the base Falcon: when running inference across multiple GPUs I get gibberish unless I pass use_cache=False to model.generate. Not sure why this happens.
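
For anyone hitting the same thing, a minimal sketch of the workaround (the generate kwarg is use_cache; the checkpoint name, dtype, and prompt below are placeholders rather than my exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # 16-bit weights, no quantisation
    device_map="auto",            # shard the layers across all visible GPUs
    trust_remote_code=True,       # may be needed for older Falcon checkpoints
)

prompt = "Write a short poem about GPUs."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    use_cache=False,              # disabling the KV cache avoids the multi-GPU gibberish
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```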

Same issue with the Open Assistant RLHF LLaMA model on multi-GPU. I can't test it right now, but if your flag fixes it, I think it's a bitsandbytes issue, because for me it only produced those errors with load_in_8bit=True.

use_cache=False does not fix it for that one :(

I didn't use quantisation for Falcon. I just loaded it across 4 V100s.

Eastwind? How were you able to harness all GPUs for a single prompt? I tried with an 8x A100 80GB machine and it only used 1 GPU, crashing the model for lack of memory. Can you share config and code? Pretty please...? THX.

I did it with device_map="auto". It works with the cache disabled. But for multi-GPU, I think the way to go is what the Falcon authors said when they replied to my original post: use the Hugging Face text-generation-inference hosting solution. I haven't tested it out, however.

Thx. Unfortunately that hosting solution is no longer available. Do you know what its configuration was?

See this issue that I also made, lol. I haven't tested it, but given that they have it working, I would assume it does work: https://github.com/huggingface/text-generation-inference/issues/417
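
In case it helps, a rough sketch of how that route could look, assuming a text-generation-inference server has already been launched for this model (e.g. with --num-shard set to the number of GPUs so the server shards the model itself); the host, port, and parameters below are placeholders:

```python
import requests

# Assumes a text-generation-inference server is running locally; host/port are placeholders.
TGI_URL = "http://127.0.0.1:8080/generate"

response = requests.post(
    TGI_URL,
    json={
        "inputs": "Write a short poem about GPUs.",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["generated_text"])
```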

Thanks :-)


Eastwind, could you please tell me how fast the inference was? Even for small prompts on the V100s it takes me a good minute to generate a response, and for longer prompts it crashes with a CUDA OOM error.


I've noticed Falcon 40B works fine in 16-bit (bfloat16) mode. When running it in 8-bit, it runs like garbage and is CPU bound; the performance is HORRIBLE in anything but 16/32-bit. Running in 16/32-bit, it uses my cards sequentially, driving each up to 90% utilization before popping to the next card. I've got 3x 48GB A6000 cards. When loaded in 16-bit, the model takes up about 30GB on each card, and during inference this can climb as high as 45GB per card. This model is resource hungry and does not operate well in 8-bit or 4-bit quantization at all.
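
For reference, these are roughly the two loading configurations being compared; the model id is a placeholder and this is only a sketch, not the exact code used above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b-instruct"  # placeholder

# 16-bit (bfloat16), sharded across the available GPUs -- the configuration
# reported above to work well (~30GB per card at load on 3x A6000).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 8-bit via bitsandbytes -- smaller footprint, but reported above to be
# much slower and CPU bound for this model.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```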

That is definitely in line with the performance I have seen on a 4x V100S cluster (128GB combined VRAM). Using 8-bit and 4-bit changes the model size but does not speed up inference (0.75 tokens per second).
