GPU requirement for hosting this model?

#9
by csgxy2022 - opened

I have two A100 GPUs and am trying to host this model, but I keep hitting an OOM error.

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=xxx" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model gradientai/Llama-3-8B-Instruct-Gradient-1048k \
    --tensor-parallel-size 2

I got:

torch.cuda.OutOfMemoryError: CUDA out of memory

I have no problem hosting the original Llama-3-8B-Instruct model.

Gradient AI org
edited May 4

The following should do the job for vLLM on 2× A100:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --shm-size=8g \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model gradientai/Llama-3-8B-Instruct-Gradient-1048k \
    --tensor-parallel-size 2 \
    --max-model-len 65536
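
Once the container is up, you can sanity-check the OpenAI-compatible endpoint with a quick request. This is just a minimal sketch; the prompt is arbitrary and the port and model name simply mirror the command above:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gradientai/Llama-3-8B-Instruct-Gradient-1048k",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
    }'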

vLLM is not optimal here; IIRC, serving a model with this hidden dimension at the full context length would require around ~1,000 GB of VRAM, which is why --max-model-len is capped above.
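
For a rough sense of scale, here is a back-of-the-envelope estimate of the fp16 KV cache alone. This is only a sketch, assuming the published Llama-3-8B config (32 layers, 8 KV heads via GQA, head dim 128) and ignoring weights, activations, and batching overhead:

# KV cache per token = 2 (K+V) * layers * kv_heads * head_dim * 2 bytes (fp16)
echo $(( 2 * 32 * 8 * 128 * 2 ))            # 131072 bytes, i.e. 128 KiB per token
echo $(( 2 * 32 * 8 * 128 * 2 * 65536 ))    # ~8.6 GB for a 64k context
echo $(( 2 * 32 * 8 * 128 * 2 * 1048576 ))  # ~137 GB for the full 1M context

Even before the ~16 GB of model weights and vLLM's activation/scheduling overhead, the full 1M-token cache does not fit comfortably on two A100s, while a 64k cap leaves plenty of headroom.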
