You are all happy that @meta-llama released Llama 3.
Then you are sad that it only has a context length of 8k.
Then you are happy that you can just scale Llama-3 to 96k with PoSE without training, only needing to modify max_position_embeddings and rope_theta.
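For illustration, a minimal sketch of that config tweak with Hugging Face Transformers might look like the snippet below; the 96k length and the rope_theta value are assumed placeholders, not settings taken from the post.

```python
# Sketch: load Llama-3-8B-Instruct with an enlarged context window by
# overriding max_position_embeddings and rope_theta in the config.
# The specific values below are illustrative assumptions.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

config = AutoConfig.from_pretrained(model_id)
config.max_position_embeddings = 96 * 1024   # target context length (assumed)
config.rope_theta = 8_000_000.0              # enlarged RoPE base (assumed value)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```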
But then you are sad that it only improves the model's long-context retrieval performance (i.e., finding needles) while hardly improving its long-context utilization capability (QA and summarization).
But then you are happy that the @GradientsTechnologies community has released the long-context Llama-3-8B-Instruct-262K (262k-1M+).
Now we have another paper, "Extending Llama-3's Context Ten-Fold Overnight".
The context length of Llama-3-8B-Instruct is extended from 8K to 80K using QLoRA fine-tuning.
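As a point of reference, a QLoRA setup in the Hugging Face ecosystem typically looks like the sketch below; the LoRA rank, target modules, and quantization options are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: QLoRA-style fine-tuning setup (4-bit base model + LoRA adapters).
# Hyperparameters are illustrative assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=32,                      # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```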
The training cycle is highly efficient, taking "only" 8 hours on a single 8xA800 (80G) GPU machine.
The model also preserves its original capability over short contexts.
The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4.
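The paper's data-generation pipeline is not yet released, so the following is only a hedged sketch of how one might prompt GPT-4 for such long-context samples; the prompt wording, the make_qa_sample helper, and the output format are all hypothetical.

```python
# Sketch: generating a synthetic long-context QA training sample with GPT-4.
# The prompt and chunking strategy are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()

def make_qa_sample(long_document: str) -> dict:
    """Ask GPT-4 for a question-answer pair grounded in the given document."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Write one question that requires reading the whole "
                        "document, followed by its answer."},
            {"role": "user", "content": long_document},
        ],
    )
    qa_text = response.choices[0].message.content
    # Training sample: the long document as context, the generated Q/A as target.
    return {"context": long_document, "qa": qa_text}
```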
The paper suggests that the context length could be extended far beyond 80K with more computation resources (alas, GPU-poor).
The team plans to publicly release all resources, including the data, model, data-generation pipeline, and training code, to facilitate future research from the community.