Deployment Tips

#1
by TrialAccountHF - opened

Hi, I tried running this on an Inference Endpoint (2x A100 on AWS) and noticed that generation stopped mid-sentence after only 50 to 100 words. Do you have any suggestions for fixing this?

What settings do you recommend for Quantization, Max Input Length (per Query), Max Number of Tokens (per Query), Max Batch Prefill Tokens, and Max Batch Total Tokens?

@TrialAccountHF, did you manage to get this working? As a first step, I tried using the model in a Colab notebook but ran out of disk space.
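
On the four token settings asked about in the original post: I can't recommend specific values for this model, but here is a rough sketch of how those limits usually relate to each other, assuming the endpoint runs a text-generation-inference style backend. The `EndpointLimits` class, both helper functions, and the numbers are hypothetical and only for illustration; the mapping of UI labels to fields is also my assumption. One possible cause of output stopping after 50 to 100 words is a low per-request generation limit (for example a small `max_new_tokens`), so it may be worth checking how much room the settings leave for generated tokens.

```python
from dataclasses import dataclass


@dataclass
class EndpointLimits:
    """Hypothetical container for the four token settings from the question."""
    max_input_length: int          # Max Input Length (per Query)
    max_total_tokens: int          # Max Number of Tokens (per Query): prompt + generated
    max_batch_prefill_tokens: int  # Max Batch Prefill Tokens
    max_batch_total_tokens: int    # Max Batch Total Tokens


def check_limits(limits: EndpointLimits) -> list[str]:
    """Return a list of consistency problems (empty if none are found)."""
    problems = []
    # The prompt limit must leave at least one token for generation.
    if limits.max_input_length >= limits.max_total_tokens:
        problems.append("max_input_length must be smaller than max_total_tokens")
    # A single maximum-length prompt must fit into one prefill batch.
    if limits.max_batch_prefill_tokens < limits.max_input_length:
        problems.append("max_batch_prefill_tokens must be at least max_input_length")
    # A single maximum-length request must fit into one batch.
    if limits.max_batch_total_tokens < limits.max_total_tokens:
        problems.append("max_batch_total_tokens must be at least max_total_tokens")
    return problems


def max_generation_budget(limits: EndpointLimits, prompt_tokens: int) -> int:
    """Upper bound on newly generated tokens for a prompt of the given length."""
    if prompt_tokens > limits.max_input_length:
        raise ValueError("prompt is longer than max_input_length")
    return limits.max_total_tokens - prompt_tokens


# Illustrative values only, not a recommendation for this model.
limits = EndpointLimits(
    max_input_length=1024,
    max_total_tokens=2048,
    max_batch_prefill_tokens=2048,
    max_batch_total_tokens=8192,
)
print(check_limits(limits))                 # -> []
print(max_generation_budget(limits, 300))   # -> 1748
```

If the budget looks large enough, the next thing I would check is whether the request itself asks for enough new tokens, since a per-request limit can cut the output short regardless of the endpoint-level settings.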
