Max output tokens?
I understand the input context length is 8k, but what about the output?
The output takes the same space as the input: the model fills the 8k-token window with its response. You could slide the context window to get more output, but then you lose context at the beginning.
For example, if your input is 100 tokens, you have ~7900 tokens left for the completion. But if your input is 7900 tokens, you have only ~100 tokens left for the response before your input starts getting trimmed. The model can only attend to 8k tokens at most.
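In other words, whatever the prompt consumes comes straight out of the budget left for generation. Here is a minimal sketch of that bookkeeping, assuming the Hugging Face `transformers` tokenizer for Meta-Llama-3-8B-Instruct and an 8192-token window (the prompt string is just a placeholder):

```python
# Sketch: how much room is left for the completion once the prompt is counted.
# Assumes access to the gated meta-llama/Meta-Llama-3-8B-Instruct repo.
from transformers import AutoTokenizer

CONTEXT_LEN = 8192  # total window shared by prompt + completion

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

prompt = "Explain attention in one paragraph."  # placeholder prompt
prompt_tokens = len(tokenizer(prompt)["input_ids"])

# Every token the prompt uses is no longer available for the output.
max_new_tokens = CONTEXT_LEN - prompt_tokens
print(f"prompt: {prompt_tokens} tokens, room left for output: {max_new_tokens}")
```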
May I ask where I can check this "8k context length" configuration for the llama3 model? Thanks!
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/config.json
look at max_position_embeddings
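If you'd rather read it programmatically than from the raw JSON, something like this should work (a sketch assuming the `transformers` library and access to the gated repo):

```python
# Sketch: reading the context length straight from the model config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(config.max_position_embeddings)  # 8192 for Llama 3
```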
Thanks for your detailed explanation! However, I am wondering whether the prompt length counts towards the 8000-token limit, as I see "system
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024 <Prompt...>" in my output.
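One way to check is to render the chat template yourself and count the tokens of the full formatted prompt, system header included (a sketch assuming the `transformers` tokenizer; the messages are placeholders):

```python
# Sketch: counting how many tokens the formatted prompt (system header included)
# already takes out of the context window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # placeholder
    {"role": "user", "content": "What is the capital of France?"},  # placeholder
]

# tokenize=False returns the exact string the model sees, special headers and all.
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
token_ids = tokenizer(formatted, add_special_tokens=False)["input_ids"]
print(formatted)
print(f"{len(token_ids)} prompt tokens already used out of the 8192-token window")
```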