---
license: llama3
model-index:
- name: Llama-3-8B-ProLong-64k-Base
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 12.49
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 25.02
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 5.82
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 4.81
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 9.1
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 25.4
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
---

# princeton-nlp/Llama-3-8B-ProLong-64k-Base

Contributors: Tianyu Gao*, Alexander Wettig* (*equal contribution), Howard Yen, Danqi Chen

Contact: `{tianyug, awettig}@princeton.edu`

💡 ProLong stands for **Pr**incet**o**n **Long**-Context!

## The ProLong Series

- [princeton-nlp/Llama-3-8B-ProLong-64k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Base) ← you are here!
- [princeton-nlp/Llama-3-8B-ProLong-64k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Instruct)
- princeton-nlp/Llama-3-8B-ProLong-512k-Base (coming soon)
- princeton-nlp/Llama-3-8B-ProLong-512k-Instruct (coming soon)

## Features

- Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (original max length: 8K), we produce a long-context instruction-tuned model that can stably handle up to 64K tokens. We also have a version that can process up to 512K tokens.
- This model is trained as follows:
  - We train on 20B tokens of a carefully curated mixture of short and long data (max length 64K). This repo is the resulting base model.
  - For the 512K version, we continue training the base model for 5B more tokens, with a mixture of short, long (64K), and ultra-long (512K) data.
  - We then fine-tune both on [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) to regain chat ability.
- On a range of long-context tasks, our ProLong model achieves the top performance among models of similar size.
- We conducted extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. We will include more details in our upcoming technical report.

## Benchmarking results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/PPSuEMsUWIyrmrOV_88Xf.png)

You can find results for more tasks and models in this [spreadsheet](https://docs.google.com/spreadsheets/d/1qGzimBE8F896p1m7_yWHnjyGX7kpEAeyaT1h2iTbNzE/edit?usp=sharing). In these detailed results, we show that our model retains the original Llama-3's general LM performance (on tasks selected by the [HF Open LLM Leaderboard v1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard)). This is non-trivial in long-context fine-tuning and requires a careful selection of the fine-tuning data mixture and the training configuration.

Understanding long-context performance is tricky, as there is no consensus on what constitutes effective long-context evaluation or on how well existing benchmarks reflect real-world use cases. In this work, we curate a combination of existing and new tasks across both synthetic and natural datasets to demonstrate the strength of our model. We divide the tasks into the following categories:

- **Recall**: we use a synthetic JSON key-value retrieval task (lost-in-the-middle, [Liu et al., 2023](https://arxiv.org/pdf/2307.03172); ∞BENCH, [Zhang et al., 2024](https://arxiv.org/pdf/2402.13718)) to test the model’s ability to retrieve arbitrary information from the context. This is a more comprehensive and reliable version of [needle-in-a-haystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack).
- **Retrieval-augmented generation (RAG)**: we use existing open-domain question answering datasets in a multi-document QA format ([Liu et al., 2023](https://arxiv.org/pdf/2307.03172)). The datasets we select include NaturalQuestions ([Kwiatkowski et al., 2019](https://aclanthology.org/Q19-1026.pdf)), HotpotQA ([Yang et al., 2018](https://arxiv.org/pdf/1809.09600)), and PopQA ([Mallen et al., 2023](https://arxiv.org/pdf/2212.10511)). The gold document is placed at different positions to test “lost-in-the-middle”.
- **In-context learning (ICL)**: ICL tasks have been established as a way to evaluate long-context abilities ([Li et al., 2024](https://arxiv.org/pdf/2404.02060); [Bertsch et al., 2024](https://arxiv.org/pdf/2405.00200)). We follow [Bertsch et al., 2024](https://arxiv.org/pdf/2405.00200) and use the following five tasks: TREC, TREC-fine ([Hovy et al., 2001](https://aclanthology.org/H01-1069.pdf)), NLU ([Liu et al., 2019](https://arxiv.org/pdf/1903.05566)), Banking-77 ([Casanueva et al., 2020](https://aclanthology.org/2020.nlp4convai-1.5.pdf)), Clinc-150 ([Larson et al., 2019](https://aclanthology.org/2020.nlp4convai-1.5.pdf)).
- **Reranking**: Given a query and a number of passages retrieved by an off-the-shelf model, reranking requires the model to generate the IDs of the top-10 passages. This has been shown to be a realistic application ([Sun et al., 2023](https://arxiv.org/pdf/2304.09542)) and is also challenging, as it requires reasoning and comparison across documents. We use MSMARCO ([Bajaj et al., 2018](https://arxiv.org/pdf/1611.09268)) for this task.
- **Long-document QA/summarization**: These are the most straightforward applications. We selected some of the public tasks with the longest documents, including NarrativeQA ([Kočiský et al., 2017](https://arxiv.org/pdf/1712.07040)), Qasper ([Dasigi et al., 2021](https://arxiv.org/pdf/2105.03011)), QMSum ([Zhong et al., 2021](https://arxiv.org/pdf/2104.05938)), and Multi-LexSum ([Shen et al., 2022](https://arxiv.org/pdf/2206.10883)). As traditional evaluation metrics like ROUGE or F1 do not reflect performance well, we use GPT-4o to score the model output given the gold output and the question.

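
To make the “lost-in-the-middle” setup of the RAG category concrete, here is a minimal sketch of how a multi-document QA prompt could be assembled with the gold document at a chosen position. The function names, the `Document [i]` layout, and the prompt wording are illustrative assumptions, not the format used in our evaluation code.

```python
import random


def build_multidoc_prompt(question, gold_doc, distractors, gold_position):
    """Insert the gold document at index `gold_position` among the distractors.

    The "Document [i]: ..." layout is a hypothetical prompt format.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position must be between 0 and len(distractors)")
    docs = distractors[:gold_position] + [gold_doc] + distractors[gold_position:]
    body = "\n\n".join(f"Document [{i + 1}]: {doc}" for i, doc in enumerate(docs))
    return f"{body}\n\nQuestion: {question}\nAnswer:"


def sample_gold_position(num_distractors, rng):
    """Pick a uniformly random slot (front, middle, or end) for the gold document."""
    return rng.randint(0, num_distractors)


if __name__ == "__main__":
    rng = random.Random(0)
    distractors = ["Paris is in France.", "Cats are mammals.", "Water boils at 100 C."]
    position = sample_gold_position(len(distractors), rng)
    print(build_multidoc_prompt("Who wrote Hamlet?", "Hamlet was written by Shakespeare.",
                                distractors, position))
```

Sweeping `gold_position` over all slots, instead of sampling it, gives a per-position accuracy curve, which is one way to see whether a model neglects the middle of its context.
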
Details about our GPT-4o rubrics for the long-document QA/summarization tasks follow.

We use the following prompt to evaluate NarrativeQA and Qasper:

```
Please act as an impartial judge and evaluate the quality of the provided answer which attempts to answer the provided question based on a provided context.
Although you are not given the context, you will be given a set of correct answers that achieves full scores on all metrics, and you need to assess the provided answers using the correct answers.

Below is your grading rubric:

Fluency:
- Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete answers, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
- Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct answers.

Correctness:
- Score 0 (Incorrect): The answer does not agree with the provided correct answers at all.
- Score 1 (partly correct): Partly agree with one of the provided correct answers (for example, the question asks for a date and a person; the answer gets the date right but the person wrong).
- Score 2 (correct but not fully relevant): Fully agrees with one of the provided correct answers but mentions other completely irrelevant information. Note that extra details provided in the answer, even if not mentioned in the correct answers, should NOT be seen as irrelevant as long as they are relevant to the question to a reasonable extend.
- Score 3 (correct and relevant): Fully agrees with one of the provided correct answers and only provides information relevant to the question. Note that if the answer is longer than the correct answer, as long as everything in the answer is relevant to the question, it should still be given score 3. For example, if the correct answer is "the North Pole" and the answer is "They are headed for the North Pole", it should still be given a score of 3.

Now, read the following question, answer, and correct answers. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.

Question: {question}
Correct answers: {correct_answers}
Answer: {output}
```

For QMSum:

```
Please act as an impartial judge and evaluate the quality of the provided summary with respect to a summarization inquiry based on a meeting transcript.
Although you are not given the transcript, you will be given a reference summary that achieves full scores on all metrics, and you need to assess the provided summary using the reference one.

Below is your grading rubric:

Fluency:
- Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete sentences, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
- Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct summaries.

Correctness:
- Score 0 (Incorrect): The summary does not agree (have overlap) with the reference summary at all.
- Score 1 (<=30% correct): Covers less than 30% of the reference summary.
- Score 2 (<=80% correct): Covers 30%-80% of the reference summary.
- Score 3 (>80% correct, but not fully relevant): Covers more than 80% of the reference summary, but mentions other completely irrelevant information. Note that extra details provided in the summary, even if not mentioned in the reference summary, should NOT be seen as irrelevant as long as they are relevant to the query to a reasonable extend.
- Score 4 (>80% correct and relevant): Almost fully agrees with the reference and only provides information relevant to the question.

Now, read the following question, reference summary, and provided summary. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.

Question: {question}
Reference summary: {correct_answers}
Provided summary: {output}
```

For Multi-LexSum:

```
Please act as an impartial judge and evaluate the quality of the provided summary of a civil lawsuit. The summary is based on a set of legal documents, and it should contain a short description of the background, parties invovled, and the outcomes of the case.
You are not given the entirety of the legal documents, but you will be given with expert-written summaries to help you evaluate the quality of the provided summary. The expert-written summaries come in two forms: the short expert summary contains all the relevant information that the provided summary should contain, and the long expert summary contains other relevant information that the provided summary may or may not contain.

Below is your grading rubric:

Fluency:
- Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete answers, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
- Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct answers.

Correctness:
- Score 0 (Incorrect): The summary does not agree with the information provided in the expert summaries at all. The summary either does not contain any information or only contains irrelevant or incorrect information.
    - Examples:
        - Expert short summary: "This case is about an apprenticeship test that had a disparate impact on Black apprenticeship applicants."
        - Provided summary: "This case is about a lawsuit filed by the EEOC against a company for discrimination against Asian employees."
- Score 1 (<=30% correct): Covers less than 30% of the expert short summary.
- Score 2 (<=80% correct): Covers 30%-80% of the expert short summary.
- Score 3 (>80% correct, but irrelevant or incorrect information found): Covers more than 80% of the expert short summary, but mentions other completely irrelevant information or incorrect information.
    - Irrelevant information is those that are not relelvant to the case and are not found in the expert short/long summaries.
    - Incorrect information is those that are factually incorrect or in conflict with the expert summaries.
- Score 4 (>80% correct and relevant): The provided summary contains almost all major points found in the expert short summary and does not contain any irrelevant information.

Now, read the provided summary and expert summaries, and evaluate the summary using the rubric. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.

Expert long summary: {long_expert_summary}
Export short summary: {short_expert_summary}
Provided summary: {output}
```

We get both a fluency score (0/1) and a correctness score (0-3 for QA and 0-4 for summarization). The final score is fluency * correctness (think of fluency as a “prerequisite”), normalized to 0-100.

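
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not our evaluation code: the function names are hypothetical, and we assume here that "normalized to 0-100" means dividing by the maximum correctness score of the task type (3 for QA, 4 for summarization).

```python
import json
import re

# Maximum correctness score per task type, taken from the rubrics above.
MAX_CORRECTNESS = {"qa": 3, "summarization": 4}


def parse_judge_scores(judge_output):
    """Return (fluency, correctness) from the last JSON score object in the judge's reply."""
    # The judge reasons first and emits the JSON object afterwards, so scan from the end.
    for raw in reversed(re.findall(r"\{[^{}]*\}", judge_output)):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and {"fluency", "correctness"} <= parsed.keys():
            return int(parsed["fluency"]), int(parsed["correctness"])
    raise ValueError("no score object found in judge output")


def final_score(fluency, correctness, task_type):
    """Fluency acts as a gate: a fluency of 0 zeroes out the correctness score."""
    max_correctness = MAX_CORRECTNESS[task_type]
    if fluency not in (0, 1) or not 0 <= correctness <= max_correctness:
        raise ValueError("score outside the rubric's range")
    # Assumed normalization: scale by the task's maximum correctness score.
    return 100.0 * fluency * correctness / max_correctness


if __name__ == "__main__":
    reply = 'The answer is fluent but only partly right. {"fluency": 1, "correctness": 2}'
    fluency, correctness = parse_judge_scores(reply)
    print(final_score(fluency, correctness, "qa"))
```

Because fluency multiplies the correctness score, a repetitive or truncated answer scores 0 even when it contains the right information.
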
Note that we are still actively developing our evaluation and the results/tasks are subject to change. We plan to include a more systematic evaluation in our technical report. The evaluation code will be available [here](https://github.com/princeton-nlp/ProLong).

Some more details about the evaluation:

- All evaluation context lengths are measured with the Llama-2 tokenizer to accommodate models with smaller vocabularies.
- For JSON KV and RAG, we randomly sample the positions of the target key-value pairs or passages to test “lost-in-the-middle”.
- For ICL, we use abstract labels (0, 1, 2, 3, …) instead of natural language labels ([Pan et al., 2023](https://arxiv.org/pdf/2305.09731)) to evaluate models’ ability to learn new tasks.
- We use greedy decoding for all models/tasks.

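
As an illustration of the JSON key-value recall task with a randomly sampled target position, the sketch below builds one synthetic example. The use of UUID strings for keys and values, the prompt wording, and the substring-based check are assumptions made for this example and may differ from our evaluation code.

```python
import json
import random
import uuid


def make_json_kv_example(num_pairs, rng):
    """Build one key-value retrieval example whose target sits at a random position."""
    pairs = [
        (str(uuid.UUID(int=rng.getrandbits(128))), str(uuid.UUID(int=rng.getrandbits(128))))
        for _ in range(num_pairs)
    ]
    # Sampling the index uniformly covers the start, middle, and end of the context.
    target_index = rng.randrange(num_pairs)
    target_key, target_value = pairs[target_index]
    context = json.dumps(dict(pairs), indent=1)
    # Hypothetical prompt wording.
    prompt = f'JSON data:\n{context}\n\nKey: "{target_key}"\nCorresponding value:'
    return {"prompt": prompt, "answer": target_value, "target_index": target_index}


def contains_answer(prediction, answer):
    """Lenient check: the gold value must appear verbatim in the model output."""
    return answer in prediction


if __name__ == "__main__":
    example = make_json_kv_example(num_pairs=5, rng=random.Random(0))
    print(example["prompt"])
    print("target index:", example["target_index"])
```

Increasing `num_pairs` lengthens the context, so the same generator can produce examples at any target length once the prompt is measured with a tokenizer.
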

## Efficient training techniques

We integrate several efficient training techniques in producing our models:

- We use the variable-length attention of [FlashAttention-2 (Dao et al., 2023)](https://github.com/Dao-AILab/flash-attention) and stop attention across document boundaries. We combine variable-length attention with smart batching (batching sequences of similar lengths in one step) and achieve a significant speedup.
- To handle Llama-3’s large vocabulary and avoid the memory overhead of materializing a huge logit matrix, we compute the cross-entropy loss in chunks of 8192 tokens.
- For training the 512K model, we adapt [DeepSpeed-Ulysses (Jacobs et al., 2023)](https://www.deepspeed.ai/tutorials/ds-sequence/) for sequence parallelism.

## Stage 1: long-context training

We trained [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) for 20B tokens on the data mixture shown in the figure and table below.

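
Returning briefly to the chunked cross-entropy loss listed under the efficient training techniques: the idea can be illustrated with a small, dependency-free sketch. This is a toy forward-pass illustration in plain Python, not our training code; the function names are hypothetical. It shows that projecting and scoring a few tokens at a time gives the same mean loss as building the full tokens-by-vocabulary logit matrix. A real training implementation would also have to handle the backward pass chunk by chunk, which this sketch does not attempt.

```python
import math
import random


def project(hidden_rows, weight):
    """Compute logits for the given tokens: (tokens x dim) times (vocab x dim) transposed."""
    return [[sum(h * w for h, w in zip(h_row, w_row)) for w_row in weight]
            for h_row in hidden_rows]


def nll_sum(logit_rows, targets):
    """Sum of per-token negative log-likelihoods using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logit_rows, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total


def chunked_cross_entropy(hidden, weight, targets, chunk_size=8192):
    """Mean cross entropy where only chunk_size x vocab logits exist at any time."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = project(hidden[start:stop], weight)
        total += nll_sum(chunk_logits, targets[start:stop])
    return total / len(hidden)


def full_cross_entropy(hidden, weight, targets):
    """Reference version that materializes every logit at once."""
    return nll_sum(project(hidden, weight), targets) / len(hidden)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens, dim, vocab = 10, 4, 7
    hidden = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
    weight = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    targets = [rng.randrange(vocab) for _ in range(tokens)]
    print(chunked_cross_entropy(hidden, weight, targets, chunk_size=3))
    print(full_cross_entropy(hidden, weight, targets))
```

With a 64K-token sequence and a vocabulary of roughly 128K entries, the full logit matrix has billions of entries, which is why processing 8192 tokens at a time matters.
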
![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/JMKoID3e6Xd7MfJtfM5Nm.png)

| Data source | Notes |
|:----- |:----- |
| [Books](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Only documents longer than 64K tokens |
| [Textbooks](https://arxiv.org/pdf/2402.11111) | Chapters concatenated by book and topic |
| [The Stack V1](https://huggingface.co/datasets/bigcode/the-stack) | Source files concatenated by repo; only documents longer than 64K tokens |
| [StackExchange](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | |
| [Tulu-v2](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | |
| [Wikipedia](https://allenai.github.io/dolma/) | |
| [Arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | |
| [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) | |
| [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) | |
| [FineWeb-EDU](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) | |

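
The table above notes that The Stack V1 source files are concatenated by repository, with a length filter applied. A minimal sketch of that preprocessing step is shown below. The file-header comment, the path-sorted ordering, the whitespace token counter, and the reading of 64K as 65,536 tokens are all assumptions made for illustration; they are not taken from our data pipeline.

```python
from collections import defaultdict


def concat_by_repo(files, separator="\n\n"):
    """Join all files of each repository into one long document.

    `files` is an iterable of (repo_name, file_path, content) tuples.
    """
    by_repo = defaultdict(list)
    for repo, path, content in files:
        by_repo[repo].append((path, content))
    documents = {}
    for repo, entries in by_repo.items():
        entries.sort()  # assumed ordering: alphabetical by file path
        documents[repo] = separator.join(f"# {path}\n{content}" for path, content in entries)
    return documents


def keep_long_documents(documents, count_tokens, min_tokens=64 * 1024):
    """Keep only documents strictly longer than `min_tokens` tokens."""
    return {name: doc for name, doc in documents.items() if count_tokens(doc) > min_tokens}


if __name__ == "__main__":
    files = [
        ("repo-a", "src/main.py", "print('hello')"),
        ("repo-a", "README.md", "A tiny demo repository."),
        ("repo-b", "lib.py", "x = 1"),
    ]
    documents = concat_by_repo(files)
    # Whitespace splitting stands in for a real tokenizer; the threshold is lowered for the demo.
    long_docs = keep_long_documents(documents, lambda text: len(text.split()), min_tokens=5)
    print(sorted(long_docs))
```

Concatenating at the repository level yields documents whose files reference one another, which gives the model a reason to use distant context.
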

For The Stack V1, we concatenate all the files from the same repo (a strategy introduced by [DeepSeek Coder; Guo et al., 2024](https://github.com/deepseek-ai/DeepSeek-Coder)). For The Stack V1 and Books, we only keep documents that are longer than 64K tokens.

We use the following hyperparameters:

| Hyperparameter | Value |
|:------- |:------- |
| Batch size | 4M tokens |
| Peak learning rate | 1e-5 |
| Scheduling | 5% warmup, cosine decay till 10% peak learning rate |
| Total #tokens | 20B |
| RoPE theta | 8M |

In our preliminary experiments, we found that:

- One of the challenges in long-context training is preserving the general LM performance.
  - At the beginning of training, the general LM performance degrades, potentially due to optimizer state warmup, data mixture mismatch, and the length extension.
  - The right RoPE theta, the right data mixture, longer training, and a low learning rate all help alleviate this problem.
- Using variable-length attention and stopping attention across document boundaries helps preserve the general LM performance.
- Other warmup schemes, like progressively increasing the length (similar to [LWM; Liu et al., 2024](https://arxiv.org/pdf/2402.08268)), do not seem to provide more benefit in our experiments.

We will release more details of our ablations in our technical report!

## Stage 2: instruction tuning

We conduct supervised fine-tuning (SFT) on our base long-context model. In our preliminary experiments, we found that using [UltraChat (Ding et al., 2023)](https://huggingface.co/datasets/stingning/ultrachat) leads to the best long-context results (among [UltraChat](https://huggingface.co/datasets/stingning/ultrachat), [Tulu (Wang et al., 2023)](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture), and [ShareGPT](https://sharegpt.com/)). Note that this only reflects the performance on our benchmark and does not represent the overall quality of those datasets.

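
Both the Stage 1 table above and the SFT table below list the same schedule, "5% warmup, cosine decay till 10% peak learning rate". A minimal sketch of that schedule follows. The linear warmup shape, the function name, and the step arithmetic (20B tokens at 4M tokens per batch giving 5,000 steps, assuming decimal units) are our assumptions for this example rather than details stated in the tables.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_frac=0.05, final_frac=0.10):
    """Warm up to peak_lr (linearly, by assumption), then cosine-decay to final_frac * peak_lr."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    floor = final_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Stage 1 under the stated assumptions: 20B tokens / 4M tokens per batch = 5,000 steps.
    total_steps = 20_000_000_000 // 4_000_000
    for step in (0, 249, 2_500, 5_000):
        print(step, learning_rate(step, total_steps, peak_lr=1e-5))
```

Under the same assumptions, the SFT stage (1B tokens, peak learning rate 2e-5) would correspond to `total_steps = 250` and `peak_lr = 2e-5`.
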

The hyperparameters we used for SFT are as follows:

| Hyperparameter | Value |
|:------- |:------- |
| Batch size | 4M tokens |
| Peak learning rate | 2e-5 |
| Scheduling | 5% warmup, cosine decay till 10% peak learning rate |
| Total #tokens | 1B |

- Synthetic data: we also experimented with several strategies to generate long, synthetic chat data, but they have not yet improved upon our UltraChat-fine-tuned chat models. The synthetic data strategies we tried include (1) using a paragraph of a long book/repo to generate question-answer pairs; (2) using hierarchical methods to summarize a long book; (3) turning the previous synthetic long QA data into a RAG format.

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_princeton-nlp__Llama-3-8B-ProLong-64k-Base).

| Metric |Value|
|-------------------|----:|
|Avg. |13.77|
|IFEval (0-Shot) |12.49|
|BBH (3-Shot) |25.02|
|MATH Lvl 5 (4-Shot)| 5.82|
|GPQA (0-shot) | 4.81|
|MuSR (0-shot) | 9.10|
|MMLU-PRO (5-shot) |25.40|