Possibility that Claude/ChatGPT use similar techniques for adjusting the RoPE sampling rate?
This technique sounds almost "too easy" to give such a good result. I would bet that LLM leaders like Anthropic/OpenAI have already discovered it, or perhaps not.
Just like how the Stable Diffusion community discovered the "offset noise" technique, which works on the input noise distribution so that the model can extend to HDR images it was not pretrained on. The Stable Diffusion community has speculated that Midjourney v4 was already using this technique to achieve HDR-looking images.
But there is a paper pointing out that "offset noise" does not address the true underlying distribution of HDR images; it is just an easy trick.
I don't think RoPE sampling actually extends the capacity of the model in terms of PPL.
As the 8K model performs worse than the base model at 2048 tokens, and only reaches about the same PPL at 8192 tokens as the base model does at 2048 tokens (the base model's pretraining length).
Nonetheless, prompting techniques like CoT and ReAct, which make the model deliberately emit more tokens, seem to increase the model's N-shot abilities across many tasks. I think we need human evaluation to judge the actual benefits of RoPE sampling.
@Yhyu13 Hello, I think your confusion comes from relying on that chart to give you the answer. I encourage you to finetune a model and run a perplexity test on it yourself rather than rely on the chart. That chart is a very specific case and does not apply here; it is only meant to show that perplexity decreases as sequence length increases.
For example, refer to the next chart on the page:
Here you can see the technique works and does not perform any worse than the base model. For more in-depth benchmarking, refer to Meta AI's paper: https://arxiv.org/pdf/2306.15595.pdf
And, as always, you should run the test yourself to verify the results rather than relying on charts or tables that others have posted.
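A minimal sketch of such a perplexity test, assuming a Hugging Face causal LM and a plain-text held-out file; the model name, `eval.txt` path, and the 8192 context length are placeholders, not the exact setup behind the charts above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-rope-scaled-model"   # hypothetical checkpoint
context_length = 8192                   # sequence length to evaluate at

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

text = open("eval.txt").read()          # any long held-out text
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

nlls = []
# Slide over the text in non-overlapping windows of `context_length`
# and accumulate the (approximate) total negative log-likelihood.
for start in range(0, input_ids.size(1) - context_length, context_length):
    chunk = input_ids[:, start:start + context_length]
    with torch.no_grad():
        out = model(chunk, labels=chunk)
    nlls.append(out.loss * chunk.size(1))

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * context_length))
print(f"Perplexity at {context_length} tokens: {ppl.item():.2f}")
```

Running the same script at 2048 and 8192 tokens, for both the base and the finetuned model, gives a direct comparison without relying on anyone else's chart.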
> As the 8K model performs worse than the base model at 2048 tokens, and only reaches about the same PPL at 8192 tokens as the base model does at 2048 tokens (the base model's pretraining length).
That is because the scaling factor used there is for the purpose of extrapolation. To get results on par with the base model you need to use a factor of 0.5; 0.25 is used because there is minimal perplexity loss when extrapolating, but when the proper factor is used there is no loss.
The formula is `pretrained length / cutoff length`; in this case it is `2048 / 4096`, as 4096 was the cutoff used for my models. If you train with a cutoff of 8192, then you would use `2048 / 8192`, and you will be able to use an 8192 sequence length with no perplexity loss; you can even go to 16384 with minimal perplexity loss. Does that make sense?
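To make the ratio concrete, one common way it is applied is as linear position interpolation over the rotary embedding, as in the Meta paper linked above. The sketch below is illustrative only, not the exact code behind these models; the names (`pretrained_length`, `cutoff_length`, `scaled_rope`) and the split-halves RoPE convention are assumptions:

```python
import torch

pretrained_length = 2048   # context length of the base model
cutoff_length = 4096       # cutoff used during finetuning
scale = pretrained_length / cutoff_length   # 2048 / 4096 = 0.5

def scaled_rope(x, positions, base=10000.0):
    """Apply RoPE with positions multiplied by `scale`, so 4096
    finetuned positions are compressed into the base model's
    0..2048 range."""
    half_dim = x.shape[-1] // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, half_dim).float() / half_dim))
    # The only change from standard RoPE: positions are scaled down.
    angles = (positions[:, None].float() * scale) * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half_dim], x[..., half_dim:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Under the same formula, a cutoff of 8192 would give `2048 / 8192 = 0.25` as the lossless factor for that model.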