metadata
license: llama3
model-index:
  - name: Llama-3-8B-ProLong-512k-Instruct
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 41.3
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 28.44
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 4.46
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 2.24
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 11.66
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 24.9
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
          name: Open LLM Leaderboard

princeton-nlp/Llama-3-8B-ProLong-512k-Instruct

Contributors: Tianyu Gao*, Alexander Wettig* (*equal contribution), Howard Yen, Danqi Chen

Contact: {tianyug, awettig}@princeton.edu

💡 ProLong stands for Princeton Long-Context!

The ProLong Models

Features

  • Based on meta-llama/Meta-Llama-3-8B-Instruct (original max length: 8K), we produce a long-context instruction-tuned model that can stably handle up to 64K tokens. We also have a version that can process up to 512K tokens.
  • This model is trained in the following steps:
    • We first train on 20B tokens of a carefully curated mixture of short and long data (max length 64K). You can find our base model here.
    • For the 512K version, we continue training the base model for 20B more tokens, with a mixture of short, long (64K), and ultra-long (512K) data. You can find the corresponding base model here.
    • We then fine-tune the base models on UltraChat to regain chat ability.
  • On a range of long-context tasks, our ProLong models achieve better or comparable performance relative to models of similar size, including Llama 3.1 (128K context window), which was trained on 20x more long-context data (800B vs. 40B).
  • ProLong is also the first open-source model to effectively support a 512K context window.
  • We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. We will include more details in our technical report (coming soon).

Benchmarking results

64K result:

[Figure: benchmark results at 64K context]

512K result:

[Figure: benchmark results at 512K context]

You can find results for more tasks and models in this spreadsheet. In these detailed results, we show that our model retains the original Llama-3's general LM performance (on tasks selected by the HF Open LLM Leaderboard v1). This is non-trivial in long-context fine-tuning and requires a careful selection of the fine-tuning data mixture and the training configuration.

Understanding long-context performance is tricky, as there is no consensus on what constitutes an effective long-context evaluation or how well existing benchmarks reflect real-world use cases. In this work, we curate a combination of existing and new tasks across both synthetic and natural datasets to demonstrate the strength of our model.

We divide the tasks into the following categories:

  • Recall: we use a synthetic JSON key-value retrieval task (lost-in-the-middle, Liu et al., 2023; ∞BENCH, Zhang et al., 2024) to test the model’s ability to retrieve arbitrary information from the context. This is a more comprehensive and reliable version of needle-in-a-haystack.

  • Retrieval-augmented generation (RAG): we use existing open-domain question answering datasets in a multi-document QA format (Liu et al., 2023). The datasets we select include NaturalQuestions (Kwiatkowski et al., 2019), HotpotQA (Yang et al., 2018), and PopQA (Mallen et al., 2023). The gold document is placed at different positions to test “lost-in-the-middle”.

  • In-context learning (ICL): ICL tasks have been established to evaluate long-context abilities (Li et al., 2024; Bertsch et al., 2024). We follow Bertsch et al., 2024 and use the following five tasks: TREC, TREC-fine (Hovy et al., 2001), NLU (Liu et al., 2019), Banking-77 (Casanueva et al., 2020), Clinc-150 (Larson et al., 2019).

  • Reranking: Given a query and a number of retrieved passages (by an off-the-shelf model), reranking requires the model to generate the IDs of the top-10 passages. This has been shown to be a realistic application (Sun et al., 2023) and is also challenging, as it requires reasoning/comparison across documents. We use MSMARCO (Bajaj et al., 2018) for this task.

  • Long-document QA/summarization: These are the most straightforward applications. We selected some of the public tasks that have the longest documents, including NarrativeQA (Kočiský et al., 2017), Qasper (Dasigi et al., 2021), QMSum (Zhong et al., 2021), and Multi-LexSum (Shen et al., 2022). As traditional evaluation metrics like ROUGE or F1 do not reflect performance well, we use GPT-4o to score the model output given the gold output and the question.

    Details of our GPT-4o rubrics for the long-document QA/summarization tasks:
      We use the following prompt to evaluate NarrativeQA and Qasper:
      
      ```
      Please act as an impartial judge and evaluate the quality of the provided answer which attempts to answer the provided question based on a provided context.
      Although you are not given the context, you will be given a set of correct answers that achieves full scores on all metrics, and you need to assess the provided answers using the correct answers.
      
      Below is your grading rubric:
      
      Fluency:
      - Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete answers, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
      - Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct answers.
      
      Correctness:
      - Score 0 (Incorrect): The answer does not agree with the provided correct answers at all.
      - Score 1 (partly correct): Partly agree with one of the provided correct answers (for example, the question asks for a date and a person; the answer gets the date right but the person wrong).
      - Score 2 (correct but not fully relevant): Fully agrees with one of the provided correct answers but mentions other completely irrelevant information. Note that extra details provided in the answer, even if not mentioned in the correct answers, should NOT be seen as irrelevant as long as they are relevant to the question to a reasonable extend.
      - Score 3 (correct and relevant): Fully agrees with one of the provided correct answers and only provides information relevant to the question. Note that if the answer is longer than the correct answer, as long as everything in the answer is relevant to the question, it should still be given score 3. For example, if the correct answer is "the North Pole" and the answer is "They are headed for the North Pole", it should still be given a score of 3.
      
      Now, read the following question, answer, and correct answers. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.
      
      Question: {question}
      Correct answers: {correct_answers}
      Answer: {output}
      ```
      
      For QMSum:
      
      ```
      Please act as an impartial judge and evaluate the quality of the provided summary with respect to a summarization inquiry based on a meeting transcript.
      Although you are not given the transcript, you will be given a reference summary that achieves full scores on all metrics, and you need to assess the provided summary using the reference one.
      
      Below is your grading rubric:
      
      Fluency:
      - Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete sentences, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
      - Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct summaries.
      
      Correctness:
      - Score 0 (Incorrect): The summary does not agree (have overlap) with the reference summary at all.
      - Score 1 (<=30% correct): Covers less than 30% of the reference summary.
      - Score 2 (<=80% correct): Covers 30%-80% of the reference summary.
      - Score 3 (>80% correct, but not fully relevant): Covers more than 80% of the reference summary, but mentions other completely irrelevant information. Note that extra details provided in the summary, even if not mentioned in the reference summary, should NOT be seen as irrelevant as long as they are relevant to the query to a reasonable extend.
      - Score 4 (>80% correct and relevant): Almost fully agrees with the reference and only provides information relevant to the question.
      
      Now, read the following question, reference summary, and provided summary. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.
      
      Question: {question}
      Reference summary: {correct_answers}
      Provided summary: {output}
      ```
      
      For Multi-LexSum:
      
      ```
      Please act as an impartial judge and evaluate the quality of the provided summary of a civil lawsuit. The summary is based on a set of legal documents, and it should contain a short description of the background, parties invovled, and the outcomes of the case.
      You are not given the entirety of the legal documents, but you will be given with expert-written summaries to help you evaluate the quality of the provided summary. The expert-written summaries come in two forms: the short expert summary contains all the relevant information that the provided summary should contain, and the long expert summary contains other relevant information that the provided summary may or may not contain.
      
      Below is your grading rubric:
      
      Fluency:
      - Score 0 (incoherent, repetitive, or incomplete): Incoherent sentences, repetitive sentences (even if not by exact words), incomplete answers, or gibberish. Note that even if the answer is coherent, if it is repetitive or incomplete, it should be given a score of 0.
      - Score 1 (coherent, non-repetitive answer): Coherent, non-repetitive, fluent, grammatically correct answers.
      
      Correctness:
      - Score 0 (Incorrect): The summary does not agree with the information provided in the expert summaries at all. The summary either does not contain any information or only contains irrelevant or incorrect information.
        - Examples:
          - Expert short summary: "This case is about an apprenticeship test that had a disparate impact on Black apprenticeship applicants."
          - Provided summary: "This case is about a lawsuit filed by the EEOC against a company for discrimination against Asian employees."
      - Score 1 (<=30% correct): Covers less than 30% of the expert short summary.
      - Score 2 (<=80% correct): Covers 30%-80% of the expert short summary.
      - Score 3 (>80% correct, but irrelevant or incorrect information found): Covers more than 80% of the expert short summary, but mentions other completely irrelevant information or incorrect information.
        - Irrelevant information is those that are not relelvant to the case and are not found in the expert short/long summaries.
        - Incorrect information is those that are factually incorrect or in conflict with the expert summaries.
      - Score 4 (>80% correct and relevant): The provided summary contains almost all major points found in the expert short summary and does not contain any irrelevant information.
      
      Now, read the provided summary and expert summaries, and evaluate the summary using the rubric. First think step-by-step and provide your reasoning and assessment on the answer. Then output your score in the following json format: {{"fluency": 0, "correctness": 1}}.
      
      Expert long summary: {long_expert_summary}
      
      Export short summary: {short_expert_summary}
      
      Provided summary: {output}
      ```
      
      We get both a fluency score (0/1) and a correctness score (0-3 for QA and 0-4 for summarization). The final score is fluency * correctness (think of fluency as a “prerequisite”), normalized to 0-100.
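
The score combination described above can be sketched in a few lines. This is only an illustration: the function name `final_score` is hypothetical, and it assumes that "normalized to 0-100" means dividing by the maximum correctness score (3 for QA, 4 for summarization) and multiplying by 100.

```python
def final_score(fluency: int, correctness: int, max_correctness: int) -> float:
    """Combine the judge's two scores into one 0-100 value.

    Assumption: normalization divides by the maximum correctness score
    (3 for QA, 4 for summarization).
    """
    if fluency not in (0, 1):
        raise ValueError("fluency must be 0 or 1")
    if not 0 <= correctness <= max_correctness:
        raise ValueError("correctness is out of range")
    # Fluency acts as a gate: a non-fluent answer scores 0 regardless of correctness.
    return 100.0 * fluency * correctness / max_correctness


print(final_score(1, 3, 4))  # summarization, fluent, correctness 3 of 4 -> 75.0
print(final_score(0, 4, 4))  # not fluent, so the score is 0.0
print(final_score(1, 3, 3))  # QA, fluent and fully correct -> 100.0
```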

Note that we are still actively developing our evaluation and the results/tasks are subject to change. We plan to include a more systematic evaluation in our technical report. The evaluation code will be available here.

Some more details about the evaluation:
  • All evaluation context lengths are measured with the Llama-2 tokenizer to accommodate models with smaller vocabularies.
  • For JSON KV and RAG, we randomly sample the positions of the target key-value pairs or the passages to test “lost-in-the-middle”.
  • For ICL, we use abstract labels (0,1,2,3…) instead of natural language labels (Pan et al., 2023) to evaluate models’ ability to learn new tasks.
  • We use greedy decoding for all models/tasks.
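
To make the JSON KV setup concrete, here is a minimal sketch of how one such example with a randomly positioned target could be built. It is not the actual evaluation code: the function name `make_kv_example`, the use of UUID strings as keys and values, and the prompt wording are all assumptions for illustration.

```python
import json
import random
import uuid


def make_kv_example(num_pairs: int, seed: int):
    """Build one synthetic key-value retrieval example (hypothetical format)."""
    rng = random.Random(seed)
    # Random UUID-like strings, so the answer cannot be guessed from prior knowledge.
    pairs = [
        (str(uuid.UUID(int=rng.getrandbits(128))), str(uuid.UUID(int=rng.getrandbits(128))))
        for _ in range(num_pairs)
    ]
    # The target position is sampled uniformly, so it can fall at the start,
    # the middle, or the end of the context.
    target_index = rng.randrange(num_pairs)
    target_key, target_value = pairs[target_index]
    context = json.dumps(dict(pairs))
    prompt = f"{context}\n\nKey: {target_key}\nCorresponding value:"
    return prompt, target_value, target_index


prompt, answer, position = make_kv_example(num_pairs=50, seed=0)
print(position, answer)
```

Increasing `num_pairs` lengthens the context; averaging accuracy over many seeds covers many target positions.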

Efficient training techniques

We integrate several efficient training techniques in producing our models:

  • We use the variable-length attention of FlashAttention-2 (Dao et al., 2023) and stop attention across document boundaries. We combine variable-length attention with smart batching (batching sequences with similar lengths in one step) and achieve a significant speedup.
  • To handle Llama-3’s large vocabulary and avoid the memory overhead of materializing a huge logit matrix, we compute the cross-entropy loss in chunks of 8192 tokens.
  • For training the 512K model, we adapt DeepSpeed-Ulysses (Jacobs et al., 2023) for sequence parallelism.
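
The chunked loss idea can be illustrated with a small pure-Python sketch. It is not the training implementation: the names (`chunked_cross_entropy`, `lm_head`) and the toy sizes are assumptions, and a real implementation would run on tensors and would also need to handle the backward pass chunk by chunk. The point shown is that only one chunk of logits exists at a time, while the result matches the unchunked loss.

```python
import math
import random


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def chunked_cross_entropy(hiddens, lm_head, targets, chunk_size=8192):
    """Mean cross-entropy over all tokens, computed one chunk of tokens at a time."""
    total = 0.0
    for start in range(0, len(hiddens), chunk_size):
        h_chunk = hiddens[start:start + chunk_size]
        t_chunk = targets[start:start + chunk_size]
        # Logits for this chunk only: (tokens in chunk) x (vocabulary size).
        logits = [[sum(a * b for a, b in zip(h, row)) for row in lm_head] for h in h_chunk]
        total += sum(logsumexp(row) - row[t] for row, t in zip(logits, t_chunk))
        # `logits` is rebound on the next iteration, so the full matrix never exists.
    return total / len(hiddens)


# Toy check: chunking does not change the loss value.
rng = random.Random(0)
hidden_dim, vocab_size, num_tokens = 4, 11, 10
hiddens = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_tokens)]
lm_head = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(vocab_size)]
targets = [rng.randrange(vocab_size) for _ in range(num_tokens)]

full = chunked_cross_entropy(hiddens, lm_head, targets, chunk_size=num_tokens)
chunked = chunked_cross_entropy(hiddens, lm_head, targets, chunk_size=3)
print(math.isclose(full, chunked))
```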

Stage 1: long-context training (64K)

We used the following data mixture and trained meta-llama/Meta-Llama-3-8B-Instruct for 20B tokens. Notably, we use 40% high-quality short-context data, 30% 64K book data and 30% 64K code data.

[Figure: Stage 1 data mixture]

| Data source | Notes |
|---|---|
| Books | Only 64K tokens |
| Textbooks | Chapters concat. by book and topic |
| The Stack V1 | Source files concat. by repo; only 64K tokens |
| StackExchange | |
| Tulu-v2 | |
| Wikipedia | |
| Arxiv | |
| OpenWebMath | |
| FineWeb | |
| FineWeb-EDU | |

For The Stack V1, we concatenate all the files from the same repo (a strategy introduced by DeepSeek Coder; Guo et al., 2024). For The Stack V1 and books, we only keep documents that are longer than 64K tokens.
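
A rough sketch of the repo-level concatenation and length filter follows. The function name `concat_by_repo`, the `(repo, path, text)` input format, the path-sorted file order, the newline separator, and reading "64K" as 64 * 1024 tokens are all assumptions; the actual preprocessing may differ.

```python
from collections import defaultdict


def concat_by_repo(files, count_tokens, min_tokens=64 * 1024):
    """Join all files of a repo into one document; keep only long documents.

    `files` is an iterable of (repo, path, text) tuples and `count_tokens`
    is any callable returning the token count of a string (both hypothetical).
    """
    by_repo = defaultdict(list)
    for repo, path, text in files:
        by_repo[repo].append((path, text))

    docs = {}
    for repo, items in by_repo.items():
        # Assumed ordering: files sorted by path, separated by a newline.
        doc = "\n".join(text for _, text in sorted(items))
        if count_tokens(doc) > min_tokens:
            docs[repo] = doc
    return docs


# Toy usage with a whitespace "tokenizer" and a tiny threshold.
files = [
    ("repo_a", "x.py", "one two three"),
    ("repo_a", "y.py", "four five six"),
    ("repo_b", "z.py", "tiny"),
]
print(concat_by_repo(files, lambda s: len(s.split()), min_tokens=5))
```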

We use the following hyperparameters:

| Name | Hyperparameter |
|---|---|
| Batch size | 4M tokens |
| Peak learning rate | 1e-5 |
| Scheduling | 5% warmup, cosine decay till 10% peak learning rate |
| Total #tokens | 20B |
| RoPE theta | 8M |
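
For illustration, the schedule in the table can be written as a small function. This is a sketch under assumptions: linear warmup, "4M" and "20B" read as decimal counts (giving 5000 optimizer steps), and the name `learning_rate` are not taken from the training code.

```python
import math

TOTAL_TOKENS = 20_000_000_000  # 20B, assumed decimal
BATCH_TOKENS = 4_000_000       # 4M, assumed decimal
total_steps = TOTAL_TOKENS // BATCH_TOKENS
print(total_steps)  # 5000


def learning_rate(step, total_steps, peak_lr=1e-5, warmup_frac=0.05, final_frac=0.10):
    """Linear warmup (assumed) followed by cosine decay to final_frac * peak_lr."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * final_frac
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 249, 2500, total_steps):
    print(step, learning_rate(step, total_steps))
```

The rate reaches the peak at the end of warmup (step 249 here) and ends at 10% of the peak.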

In our preliminary experiments, we found that

  • One of the challenges in long-context training is to preserve the general LM performance.
    • At the beginning of training, the general LM performance degrades, potentially due to optimizer state warmup, data mixture mismatch, and the length extension.
    • We found that the right RoPE theta, the right data mixture, longer training, and a low learning rate together help alleviate this problem.
    • Using variable length attention and stopping attention across document boundaries helps preserve the general LM performance.
    • Other warmup schemes like progressively increasing the length (similar to LWM; Liu et al., 2024) do not seem to provide more benefit in our experiments.
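
Stopping attention across document boundaries can be visualized with an explicit mask. The sketch below is illustrative only: FlashAttention-2's variable-length interface works from cumulative sequence lengths rather than a dense mask, and the helper names here are hypothetical. It shows which positions may attend to which when several documents are packed into one sequence.

```python
from itertools import accumulate


def cu_seqlens(doc_lengths):
    """Cumulative document boundaries, e.g. [2, 3] -> [0, 2, 5]."""
    return [0] + list(accumulate(doc_lengths))


def document_causal_mask(doc_lengths):
    """mask[i][j] is True if position i may attend to position j."""
    bounds = cu_seqlens(doc_lengths)
    total = bounds[-1]
    doc_id = [0] * total
    for d, (start, end) in enumerate(zip(bounds, bounds[1:])):
        for i in range(start, end):
            doc_id[i] = d
    # Causal (j <= i) and restricted to the same packed document.
    return [[j <= i and doc_id[i] == doc_id[j] for j in range(total)] for i in range(total)]


# Two documents of lengths 2 and 3 packed into one sequence of 5 tokens.
for row in document_causal_mask([2, 3]):
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal and lower-triangular: the first token of the second document attends only to itself.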

We will release more details of our ablations in our technical report!

Stage 2: long-context training (512K)

For our 512K model, we continued training the previous model for another 20B tokens. The data mixture is similar: 40% short data and 60% long data. The long data consists of 15% 64K code, 15% 512K code, 25% 64K book, and 5% 512K book (all percentages are of the full mixture); these ratios largely follow the natural distribution of the code and book data.

[Figure: Stage 2 data mixture]
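
As a quick sanity check on these proportions, the sketch below verifies that they sum to 100% and converts them into per-source token budgets. It assumes the percentages are fractions of training tokens and that "20B" is a decimal count; the dictionary keys are illustrative names only.

```python
STAGE2_MIXTURE = {
    "short": 0.40,
    "code_64k": 0.15,
    "code_512k": 0.15,
    "book_64k": 0.25,
    "book_512k": 0.05,
}
STAGE2_TOKENS = 20_000_000_000  # 20B, assumed decimal


def token_budget(mixture, total_tokens):
    """Split a token budget across sources according to the mixture fractions."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture fractions must sum to 1")
    return {name: round(frac * total_tokens) for name, frac in mixture.items()}


budget = token_budget(STAGE2_MIXTURE, STAGE2_TOKENS)
for name, tokens in budget.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
```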

| Data source | Notes |
|---|---|
| Books | Only 64K tokens |
| Textbooks | Chapters concat. by book and topic |
| The Stack V1 | Source files concat. by repo; only 64K tokens |
| StackExchange | |
| Tulu-v2 | |
| Wikipedia | |
| Arxiv | |
| OpenWebMath | |
| FineWeb | |
| FineWeb-EDU | |

Stage 3: instruction tuning

We conduct supervised fine-tuning (SFT) on our base long-context model. In our preliminary experiments, we found that using UltraChat (Ding et al., 2023) leads to the best long-context results (among UltraChat, Tulu (Wang et al., 2023), and ShareGPT). Note that this only reflects the performance on our benchmark and does not represent the overall quality of those datasets. The hyperparameters we used for SFT are as follows:

| Name | Hyperparameter |
|---|---|
| Batch size | 4M tokens |
| Peak learning rate | 2e-5 |
| Scheduling | 5% warmup, cosine decay till 10% peak learning rate |
| Total #tokens | 1B |

  • Synthetic data: we also experiment with several strategies to generate long, synthetic chat data, but they have not yet helped to improve upon our UltraChat-fine-tuned chat models. The synthetic data strategies we tried include (1) using a paragraph of a long book/repo to generate question-answer pairs; (2) using hierarchical methods to summarize a long book; (3) turning the previous synthetic long QA data into a RAG format.

Citation

If you find our model useful, please cite it as follows:

@misc{gao2024prolong,
   title={ProLong Long-Context Language Model Series},
   author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
   year={2024},
   url="https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-64k-Instruct"
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 18.83 |
| IFEval (0-Shot) | 41.30 |
| BBH (3-Shot) | 28.44 |
| MATH Lvl 5 (4-Shot) | 4.46 |
| GPQA (0-shot) | 2.24 |
| MuSR (0-shot) | 11.66 |
| MMLU-PRO (5-shot) | 24.90 |
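
The reported average is consistent with an unweighted mean of the six benchmark scores, as the short check below shows. The equal weighting is an assumption inferred from the numbers above.

```python
scores = {
    "IFEval (0-Shot)": 41.30,
    "BBH (3-Shot)": 28.44,
    "MATH Lvl 5 (4-Shot)": 4.46,
    "GPQA (0-shot)": 2.24,
    "MuSR (0-shot)": 11.66,
    "MMLU-PRO (5-shot)": 24.90,
}

# Unweighted mean over the six benchmarks (assumed weighting).
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # 18.83
```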