Memory Error While Fine-tuning AYA on 8 H100 GPUs
#23
by ArmanAsq - opened
Hello,
I am currently trying to fine-tune an AYA model on 8 H100 GPUs, but I'm encountering an out-of-memory error. My system has 640 GB of total GPU memory (8 × 80 GB), which I assumed would be sufficient for this task. I'm not using PEFT or LoRA, and my batch size is set to 1.
I'm wondering if anyone has encountered a similar issue and could provide some guidance. How many GPUs are typically recommended for this task? Any help would be greatly appreciated.
Thanks in advance!
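For context on why 640 GB in aggregate can still run out of memory: with plain data parallelism (DDP), every GPU holds a full replica of the weights, gradients, and optimizer states, so the binding constraint is the 80 GB on each card, not the 640 GB total. Below is a minimal back-of-envelope sketch, assuming the ~13B-parameter Aya-101 checkpoint and standard Adam mixed-precision full fine-tuning; both are assumptions, since the thread doesn't specify the checkpoint or training setup.

```python
# Rough per-GPU memory estimate for full fine-tuning with Adam in
# mixed precision. The 13B parameter count is an assumption (Aya-101);
# the thread does not confirm which Aya checkpoint is being tuned.

params = 13e9  # assumed model size in parameters

bytes_per_param = (
    2    # fp16/bf16 weights
    + 2  # fp16/bf16 gradients
    + 4  # fp32 master weights
    + 8  # Adam optimizer states (two fp32 moments)
)

model_states_gb = params * bytes_per_param / 1e9
print(f"Model states alone: ~{model_states_gb:.0f} GB")  # ~208 GB

# Without sharding, each GPU needs the full ~208 GB of model states
# (before activations), which cannot fit in a single 80 GB H100,
# so OOM occurs even with batch size 1. Sharding the states across
# the 8 GPUs (e.g. DeepSpeed ZeRO stage 3 or PyTorch FSDP) divides
# this by 8, to roughly 26 GB per GPU before activations.
```

Under those assumptions, sharding the model states across the 8 GPUs, enabling gradient checkpointing, or switching to PEFT/LoRA would each be standard ways to bring the per-GPU footprint under 80 GB.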
Hey @ArmanAsq
I think I answered your question on our Discord, so I'm closing this one for now :)
shivi changed discussion status to closed