Deployment Tips

by TrialAccountHF - opened May 2

May 2

Hi, I tried running this on an Inference Endpoint (2x A100 on AWS) and noticed that generation stopped mid-sentence after only 50 to 100 words. Do you have any suggestions for fixing this?
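One possible cause of a mid-sentence cutoff (an assumption; the thread does not confirm it) is a small per-request `max_new_tokens` limit, because generation simply stops once that many tokens have been produced. The sketch below assumes the endpoint serves Text Generation Inference (TGI), whose `/generate` route accepts a JSON body of the form `{"inputs": ..., "parameters": {"max_new_tokens": ...}}`, and that prompt tokens plus `max_new_tokens` must fit within the Max Number of Tokens (per Query) setting. `build_generate_payload` is a hypothetical helper, and the prompt's token count is passed in rather than computed with the model's tokenizer.

```python
import json


def build_generate_payload(prompt: str, prompt_tokens: int,
                           max_total_tokens: int, wanted_new_tokens: int) -> str:
    """Build a JSON body for a TGI-style /generate request (hypothetical helper).

    Assumes the server rejects requests where prompt tokens plus
    max_new_tokens exceed its per-query token limit, so the requested
    generation length is clamped to whatever budget remains.
    """
    budget = max_total_tokens - prompt_tokens
    if budget <= 0:
        raise ValueError("prompt alone exceeds the per-query token limit")
    max_new_tokens = min(wanted_new_tokens, budget)
    body = {
        "inputs": prompt,
        # Ask explicitly for a long answer instead of relying on a server default.
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return json.dumps(body)


# A 300-token prompt on an endpoint configured with 2048 tokens per query.
payload = build_generate_payload("Explain beam search.", prompt_tokens=300,
                                 max_total_tokens=2048, wanted_new_tokens=1024)
print(payload)
# {"inputs": "Explain beam search.", "parameters": {"max_new_tokens": 1024}}
```

Adding `"details": true` to `parameters` should make TGI report a `finish_reason` in the response; a value of `length` would suggest the token limit, not the model, is ending the answer.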

What do you recommend for the endpoint settings: Quantization, Max Input Length (per Query), Max Number of Tokens (per Query), Max Batch Prefill Tokens, and Max Batch Total Tokens?
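The token settings listed above presumably map onto TGI's launcher options (`--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, `--max-batch-total-tokens`; newer TGI releases rename the first to `--max-input-tokens`), and they constrain one another. The sketch below encodes ordering rules modelled on TGI's startup sanity checks; treat them as assumptions to verify against the deployed version. `EndpointTokenConfig` and `check_config` are hypothetical names, Quantization is not modelled, and the example numbers are illustrative, not recommendations for this model.

```python
from dataclasses import dataclass


@dataclass
class EndpointTokenConfig:
    """Token limits as labelled in the endpoint settings (hypothetical container)."""
    max_input_length: int          # Max Input Length (per Query)
    max_total_tokens: int          # Max Number of Tokens (per Query): prompt + generated
    max_batch_prefill_tokens: int  # Max Batch Prefill Tokens
    max_batch_total_tokens: int    # Max Batch Total Tokens


def check_config(cfg: EndpointTokenConfig) -> list[str]:
    """Return the violated constraints; an empty list means the config looks consistent.

    These rules are assumptions modelled on TGI's launcher checks.
    """
    problems = []
    if cfg.max_input_length >= cfg.max_total_tokens:
        problems.append("max_input_length must be < max_total_tokens")
    if cfg.max_batch_prefill_tokens < cfg.max_input_length:
        problems.append("max_batch_prefill_tokens must be >= max_input_length")
    if cfg.max_batch_prefill_tokens > cfg.max_batch_total_tokens:
        problems.append("max_batch_prefill_tokens must be <= max_batch_total_tokens")
    if cfg.max_batch_total_tokens < cfg.max_total_tokens:
        problems.append("max_batch_total_tokens must be >= max_total_tokens")
    return problems


cfg = EndpointTokenConfig(max_input_length=1024, max_total_tokens=2048,
                          max_batch_prefill_tokens=2048, max_batch_total_tokens=8192)
print(check_config(cfg))  # []
```

The gap between Max Number of Tokens and Max Input Length (1024 tokens in the example) is the room guaranteed for the answer when a prompt uses its full input allowance, so it is the setting to raise if replies are being cut short.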

MrHugs

Jul 1

@TrialAccountHF did you manage to get this working? As a first step, I tried using the model in a Colab notebook but ran out of disk space.
