CUDA out-of-memory issue when deploying mistralai/Mixtral-8x7B-Instruct-v0.1 on AWS "ml.g5.48xlarge"

#139
by sonalisbapte - opened

Hello all,

I am a professional AI engineer. I am using the above LLM on JumpStart, and responses take about 5 seconds on average even after enabling all 8 GPUs provided by "ml.g5.48xlarge". My requirement is to reduce the response time further, e.g. to under a second.

For this purpose I planned to deploy mistralai/Mixtral-8x7B-Instruct-v0.1 with a custom inference.py file, setting the device to CUDA, on the same "ml.g5.48xlarge" EC2 instance on AWS. I am writing all the code in SageMaker.
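For context, below is a rough sketch of the handler in my inference.py. The model_fn / predict_fn names are the standard SageMaker PyTorch inference toolkit hooks; the assumption that the model is loaded through Hugging Face transformers, and the generation parameters, are illustrative rather than the exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def model_fn(model_dir):
    # Load tokenizer and model from the model artifact directory.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,
    ).to("cuda")  # device set to CUDA, as described above
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```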

Below is the error I am getting:
2024-02-14 T06:26:16,524 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 22.20 GiB total capacity; 1.88 GiB already allocated; 115.12 MiB free; 1.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Below are some options I have tried (see the sketch after this list for how they were applied):

  1. Setting PYTORCH_CUDA_ALLOC_CONF with max_split_size_mb,
    e.g. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:24";
    tried max_split_size_mb = 64, 128, 512, 1024.
  2. Setting PYTORCH_CUDA_ALLOC_CONF to the other memory-management options, e.g. "heuristic",
    as suggested in this blog:
    https://iamholumeedey007.medium.com/memory-management-using-pytorch-cuda-alloc-conf-dabe7adec130
  3. Calling torch.cuda.empty_cache() in the inference script, as suggested here:
    https://community.databricks.com/t5/machine-learning/torch-cuda-outofmemoryerror-cuda-out-of-memory/td-p/9651
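For reference, this is roughly how options 1 and 3 were applied in the script. The max_split_size_mb value shown is just one of the combinations tried, and the env var has to be set before torch initializes CUDA for it to take effect:

```python
import os

# Options 1 and 2: PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA
# allocation (i.e. before torch initializes CUDA), otherwise it has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # also tried 64, 128, 1024

import torch

# Option 3: release cached allocator blocks back to the driver. This frees the
# allocator's cache but not the memory still held by the model weights.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```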

Any kind of help or references would be really appreciated. Looking forward to it. Thanks

Same problem for me ...
