---
license: llama3
model-index:
- name: Llama-3-8B-ProLong-64k-Base
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 12.49
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 25.02
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 5.82
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 4.81
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 9.1
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 25.4
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-64k-Base
      name: Open LLM Leaderboard
---
# princeton_nlp/Llama-3-8B-ProLong-64k-Base
Contributors: Tianyu Gao*, Alexander Wettig* (*equal contribution), Howard Yen, Danqi Chen
Contact: `{tianyug, awettig}@princeton.edu`
💡 ProLong stands for **Pr**incet**o**n **Long**-Context!
## The ProLong Series
- [princeton_nlp/Llama-3-8B-ProLong-64k-Base](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Base) ← you are here!
- [princeton_nlp/Llama-3-8B-ProLong-64k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Instruct)
- princeton_nlp/Llama-3-8B-ProLong-512k-Base (soon-to-come)
- princeton_nlp/Llama-3-8B-ProLong-512k-Instruct (soon-to-come)
## Features
- Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (original max length: 8K), we produce a long-context instruction-tuned model that can stably handle up to 64K tokens. We also have a version that can process up to 512K tokens.
- This model is trained on
  - 20B tokens of a carefully curated data mixture of short and long data (max length 64K). This repo is the base model.
  - For the 512K version, we continue training the base model for 5B more tokens, with a mixture of short, long (64K), and ultra long (512K) data.
  - Then we fine-tuned them on [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) to regain chat ability.
- On a range of long-context tasks, our ProLong model achieves the top performance among models of similar sizes.
- We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. We will include more details in our upcoming technical report.
## Benchmarking results
![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/PPSuEMsUWIyrmrOV_88Xf.png)
You can find results for more tasks and models in this [spreadsheet](https://docs.google.com/spreadsheets/d/1qGzimBE8F896p1m7_yWHnjyGX7kpEAeyaT1h2iTbNzE/edit?usp=sharing). In these detailed results, we show that our model retains the original Llama-3's general LM performance (on tasks selected by the [HF Open LLM Leaderboard v1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard)). This is non-trivial in long-context fine-tuning and requires a careful selection of the fine-tuning data mixture and the training configurations.
Understanding long-context performance is tricky, as there is no consensus on what constitutes effective long-context evaluation or how well existing benchmarks reflect real-world use cases. In this work, we curate a combination of existing and new tasks across both synthetic and natural datasets to demonstrate the strength of our model.
We divide the tasks into the following categories:
- **Recall**: we use a synthetic JSON key-value retrieval task (lost-in-the-middle, [Liu et al., 2023](https://arxiv.org/pdf/2307.03172); ∞BENCH, [Zhang et al., 2024](https://arxiv.org/pdf/2402.13718)) to test the model’s ability to retrieve arbitrary information from the context. This is a more comprehensive and reliable version of [needle-in-a-haystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack).
- **Retrieval-augmented generation (RAG)**: we use existing open-domain question answering datasets in a multi-document QA format ([Liu et al., 2023](https://arxiv.org/pdf/2307.03172)). The datasets we select include NaturalQuestions ([Kwiatkowski et al., 2019](https://aclanthology.org/Q19-1026.pdf)), HotpotQA ([Yang et al., 2018](https://arxiv.org/pdf/1809.09600)), and PopQA ([Mallen et al., 2023](https://arxiv.org/pdf/2212.10511)). The gold document is placed at different positions to test “lost-in-the-middle”.
- **In-context learning (ICL)**: ICL tasks have been established to evaluate long-context abilities ([Li et al., 2024](https://arxiv.org/pdf/2404.02060); [Bertsch et al., 2024](https://arxiv.org/pdf/2405.00200)). We follow [Bertsch et al., 2024](https://arxiv.org/pdf/2405.00200) and use the following five tasks: TREC, TREC-fine ([Hovy et al., 2001](https://aclanthology.org/H01-1069.pdf)), NLU ([Liu et al., 2019](https://arxiv.org/pdf/1903.05566)), Banking-77 ([Casanueva et al., 2020](https://aclanthology.org/2020.nlp4convai-1.5.pdf)), Clinc-150 ([Larson et al., 2019](https://aclanthology.org/2020.nlp4convai-1.5.pdf)).
- **Reranking**: Given a query and a number of retrieved passages (by an off-the-shelf model), reranking requires the model to generate the IDs of the top-10 passages. This has been shown to be a realistic application ([Sun et al., 2023](https://arxiv.org/pdf/2304.09542)) and is also challenging, as it requires reasoning/comparison across documents. We use MSMARCO ([Bajaj et al., 2018](https://arxiv.org/pdf/1611.09268)) for this task.
- **Long-document QA/summarization**: These are the most straightforward applications. We selected some of the public tasks that have the longest documents, including NarrativeQA ([Kočiský et al., 2017](https://arxiv.org/pdf/1712.07040)), Qasper ([Dasigi et al., 2021](https://arxiv.org/pdf/2105.03011)), QMSum ([Zhong et al., 2021](https://arxiv.org/pdf/2104.05938)), and Multi-LexSum ([Shen et al., 2022](https://arxiv.org/pdf/2206.10883)). As traditional evaluation metrics like ROUGE or F1 do not reflect performance well, we use GPT-4o to score the model output given the gold output and the question.
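
To make the Recall setup more concrete, here is a minimal sketch of how a synthetic JSON key-value retrieval example could be built with the target pair at a chosen position in the context. The helper name `build_kv_example`, the UUID-style keys and values, and the prompt wording are all illustrative assumptions, not the exact evaluation code.

```python
import json
import random
import uuid


def build_kv_example(num_pairs, target_index, seed=0):
    """Build one synthetic JSON key-value retrieval example.

    Hypothetical helper: keys and values are random UUID-style strings, and the
    pair the model must retrieve sits at `target_index` in the serialized JSON.
    """
    rng = random.Random(seed)

    def rand_id():
        # Seeded UUID-style string so that examples are reproducible.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = [(rand_id(), rand_id()) for _ in range(num_pairs)]
    target_key, target_value = pairs[target_index]

    # dict preserves insertion order, so the target keeps its position.
    context = json.dumps(dict(pairs), indent=1)
    prompt = (
        f"JSON data:\n{context}\n\n"
        f'Key: "{target_key}"\nCorresponding value:'
    )
    return prompt, target_key, target_value
```

Drawing `target_index` uniformly from `range(num_pairs)` for each example spreads the targets over the whole context, which is what probes "lost-in-the-middle" behavior. The number of pairs controls the context length.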
Below are the details of our GPT-4o rubrics for the long-document QA/summarization tasks.
We use the following prompt to evaluate NarrativeQA and Qasper:
```
Please act as an impartial judge and evaluate the quality of the provided answer which attempts to answer the provided question based on a provided context.
Although you are not given the context, you will be given a set of correct answers that achieves full scores on all metrics, and you need to assess the provided answers using the correct answers.
Below is your grading rubric:
Fluency:
- Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete answers, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
- Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct answers.
Correctness:
- Score 0 (Incorrect): The answer does not agree with the provided correct answers at all.
- Score 1 (partly correct): Partly agree with one of the provided correct answers (for example, the question asks for a date and a person; the answer gets the date right but the person wrong).
- Score 2 (correct but not fully relevant): Fully agrees with one of the provided correct answers but mentions other completely irrelevant information. Note that extra details provided in the answer, even if not mentioned in the correct answers, should NOT be seen as irrelevant as long as they are relevant to the question to a reasonable extend.
- Score 3 (correct and relevant): Fully agrees with one of the provided correct answers and only provides information relevant to the question. Note that if the answer is longer than the correct answer, as long as everything in the answer is relevant to the question, it should still be given score 3. For example, if the correct answer is "the North Pole" and the answer is "They are headed for the North Pole", it should still be given a score of 3.
Now, read the following question, answer, and correct answers. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.
Question: {question}
Correct answers: {correct_answers}
Answer: {output}
```
For QMSum:
```
Please act as an impartial judge and evaluate the quality of the provided summary with respect to a summarization inquiry based on a meeting transcript.
Although you are not given the transcript, you will be given a reference summary that achieves full scores on all metrics, and you need to assess the provided summary using the reference one.
Below is your grading rubric:
Fluency:
- Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete sentences, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
- Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct summaries.
Correctness:
- Score 0 (Incorrect): The summary does not agree (have overlap) with the reference summary at all.
- Score 1 (<=30% correct): Covers less than 30% of the reference summary.
- Score 2 (<=80% correct): Covers 30%-80% of the reference summary.
- Score 3 (>80% correct, but not fully relevant): Covers more than 80% of the reference summary, but mentions other completely irrelevant information. Note that extra details provided in the summary, even if not mentioned in the reference summary, should NOT be seen as irrelevant as long as they are relevant to the query to a reasonable extend.
- Score 4 (>80% correct and relevant): Almost fully agrees with the reference and only provides information relevant to the question.
Now, read the following question, reference summary, and provided summary. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.
Question: {question}
Reference summary: {correct_answers}
Provided summary: {output}
```
For Multi-LexSum:
```
Please act as an impartial judge and evaluate the quality of the provided summary of a civil lawsuit. The summary is based on a set of legal documents, and it should contain a short description of the background, parties invovled, and the outcomes of the case.
You are not given the entirety of the legal documents, but you will be given with expert-written summaries to help you evaluate the quality of the provided summary. The expert-written summaries come in two forms: the short expert summary contains all the relevant information that the provided summary should contain, and the long expert summary contains other relevant information that the provided summary may or may not contain.
Below is your grading rubric:
Fluency:
- Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete answers, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
- Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct answers.
Correctness:
- Score 0 (Incorrect): The summary does not agree with the information provided in the expert summaries at all. The summary either does not contain any information or only contains irrelevant or incorrect information.
- Examples:
- Expert short summary: "This case is about an apprenticeship test that had a disparate impact on Black apprenticeship applicants."
- Provided summary: "This case is about a lawsuit filed by the EEOC against a company for discrimination against Asian employees."
- Score 1 (<=30% correct): Covers less than 30% of the expert short summary.
- Score 2 (<=80% correct): Covers 30%-80% of the expert short summary.
- Score 3 (>80% correct, but irrelevant or incorrect information found): Covers more than 80% of the expert short summary, but mentions other completely irrelevant information or incorrect information.
- Irrelevant information is those that are not relelvant to the case and are not found in the expert short/long summaries.
- Incorrect information is those that are factually incorrect or in conflict with the expert summaries.
- Score 4 (>80% correct and relevant): The provided summary contains almost all major points found in the expert short summary and does not contain any irrelevant information.
Now, read the provided summary and expert summaries, and evaluate the summary using the rubric. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.
Expert long summary: {long_expert_summary}
Export short summary: {short_expert_summary}
Provided summary: {output}
```
We get both a fluency score (0/1) and a correctness score (0-3 for QA and 0-4 for summarization). The final score is fluency * correctness (think of fluency as a “prerequisite”), normalized to 0-100.
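
As a minimal sketch of this scoring rule, the snippet below extracts the JSON verdict from a judge response and combines the two scores. The function names are hypothetical, and returning `None` for a response without a parseable verdict is an assumption, since the handling of such cases is not specified here.

```python
import json
import re


def parse_judge_scores(response):
    """Return the last flat JSON object in `response` that has both score fields.

    The judge reasons step by step first, so the verdict is expected at the end.
    Returns None if no such object is found (an assumed convention).
    """
    candidates = re.findall(r"\{[^{}]*\}", response)
    for candidate in reversed(candidates):
        try:
            scores = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if "fluency" in scores and "correctness" in scores:
            return scores
    return None


def final_score(fluency, correctness, max_correctness):
    """Fluency (0/1) gates correctness; the product is rescaled to 0-100.

    `max_correctness` is 3 for the QA rubric and 4 for the summarization rubrics.
    """
    return 100.0 * fluency * correctness / max_correctness
```

For example, a QA answer judged `{"fluency": 1, "correctness": 2}` scores 100 * 1 * 2 / 3, roughly 66.7, while any answer with fluency 0 scores 0 regardless of its correctness.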
Some more details about the evaluation:
- All evaluation context lengths are measured with the Llama-2 tokenizer to accommodate models with smaller vocabularies.
- For JSON KV and RAG, we randomly sample positions of the target key-value pairs or the passages to test “lost-in-the-middle”.
- For ICL, we use abstract labels (0,1,2,3…) instead of natural language labels ([Pan et al., 2023](https://arxiv.org/pdf/2305.09731)) to evaluate models’ ability to learn new tasks.
- We use greedy decoding for all models/tasks.
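
The abstract-label setup for ICL can be illustrated with a short sketch. The `Text:`/`Label:` template, the function name, and the choice to number labels in order of first appearance are assumptions made for illustration, not the exact prompt format used in the evaluation.

```python
def build_icl_prompt(demos, query):
    """Format demonstrations with abstract integer labels instead of label names.

    `demos` is a list of (text, label_name) pairs. Each distinct label name is
    mapped to an integer in order of first appearance (an illustrative choice).
    """
    label_to_id = {}
    blocks = []
    for text, label in demos:
        # Assign the next unused integer the first time a label is seen.
        label_id = label_to_id.setdefault(label, len(label_to_id))
        blocks.append(f"Text: {text}\nLabel: {label_id}")
    # The query is left unlabeled for the model to complete.
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks), label_to_id
```

Because the label names never appear in the prompt, the model cannot rely on their meaning and has to infer the mapping from the demonstrations alone.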