Inference API doesn't seem to support 100k context window
Hi,
I am trying to use HF's Inference API to interact with the model from a Gradio app. For larger inputs, I receive a validation error: "Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192". Is this a limitation of this HF implementation, or am I using the Inference API wrong? From the blog post I read that CodeLlama should support up to 100k tokens in the input. How can I achieve that with this model?
You have to extend the context window using RoPE scaling. If you deploy the model yourself with text-generation-inference (TGI), pass `--rope-scaling dynamic` when launching the server:
text-generation-launcher --model-id $MODEL_ID --rope-scaling dynamic --max-input-length 16384 --max-total-tokens 32768 --max-batch-prefill-tokens 16384 --hostname 0.0.0.0 --port 3000
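Once the server is up, point your client at it instead of the hosted API. A minimal sketch (the URL matches the `--hostname`/`--port` flags above; the prompt is a placeholder):

```python
from huggingface_hub import InferenceClient

# Target the self-hosted TGI server started with the command above.
client = InferenceClient(model="http://0.0.0.0:3000")

# Placeholder prompt; with --rope-scaling dynamic the server now accepts
# up to --max-input-length (16384) prompt tokens, 32768 tokens total.
prompt = open("some_big_file.py").read()

print(client.text_generation(prompt, max_new_tokens=1024))
```

Note the hosted Inference API itself doesn't expose these flags; you control them only when you run TGI yourself (or on a dedicated endpoint).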
I am also having this problem; I'm trying to use LangChain.
I'm having the same issue. Does anybody have any insight? Is this configurable, or is it a hard limit of the Inference API for this model?