---
license: apache-2.0
library_name: transformers
tags:
  - finetune
  - dpo
  - chatml
base_model:
  - InferenceIllusionist/Excalibur-7b
datasets:
  - Intel/orca_dpo_pairs
model-index:
  - name: Excalibur-7b-DPO
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 70.9
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=InferenceIllusionist/Excalibur-7b-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 87.93
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=InferenceIllusionist/Excalibur-7b-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 65.46
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=InferenceIllusionist/Excalibur-7b-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 70.82
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=InferenceIllusionist/Excalibur-7b-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 82.48
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=InferenceIllusionist/Excalibur-7b-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 65.43
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=InferenceIllusionist/Excalibur-7b-DPO
          name: Open LLM Leaderboard
---

# Excalibur-7b-DPO

An initial foray into the world of fine-tuning. The goal of this release was to amplify the quality of the original model's responses, in particular for vision use cases*

GGUFs are available here.

## Notes & Methodology

- [Excalibur-7b](https://huggingface.co/InferenceIllusionist/Excalibur-7b) fine-tuned with Direct Preference Optimization (DPO) using the [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs) dataset (a minimal training sketch follows this list)
- This is a quick experiment to determine the impact of DPO fine-tuning on the Excalibur-7b base model
- Ran for a little over an hour on a single A100
- Fine-tuning succeeded in making the model more conversational and well-rounded
- Benchmark scores increased in the following categories versus the base Excalibur-7b:
  - ARC: 69.71 -> 70.9
  - HellaSwag: 87.56 -> 87.93
  - TruthfulQA: 67.24 -> 70.82
  - Average: 73.6 -> 73.84
- Precision: bfloat16
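
As referenced above, the following is a minimal sketch of what such a DPO run can look like with the [TRL](https://github.com/huggingface/trl) library. The hyperparameters (beta, batch size, epochs) and the exact TRL version are illustrative assumptions, not the settings used for this release.

```python
# Hedged sketch of DPO fine-tuning Excalibur-7b on Intel/orca_dpo_pairs with TRL.
# Hyperparameters below are assumptions for illustration only.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "InferenceIllusionist/Excalibur-7b"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# Intel/orca_dpo_pairs rows carry "system", "question", "chosen", "rejected";
# DPOTrainer expects "prompt", "chosen", "rejected" columns.
dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
dataset = dataset.map(lambda row: {"prompt": row["question"]})

config = DPOConfig(
    output_dir="excalibur-7b-dpo",
    beta=0.1,                       # assumed DPO temperature
    per_device_train_batch_size=4,  # assumed
    num_train_epochs=1,             # assumed
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions take tokenizer= here
)
trainer.train()
```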

## Sample Question - Vision

*Requires an additional mmproj file. Two mmproj options for vision functionality are available inside this repo.

Select the GGUF file of your choice in Koboldcpp as usual, then choose the mmproj file in the LLaVA mmproj field of the model submenu.
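
The Koboldcpp route above is GUI-based. For reference, the same GGUF and mmproj pair can also be loaded programmatically; the sketch below uses llama-cpp-python rather than Koboldcpp, and the file names, context size, and image-URL form are assumptions that may vary with your download and library version.

```python
# Hedged sketch: loading a quantized GGUF plus mmproj projector with llama-cpp-python.
# File names are placeholders for whichever quant and projector you downloaded.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")  # placeholder
llm = Llama(
    model_path="Excalibur-7b-DPO.Q4_K_M.gguf",  # placeholder quant name
    chat_handler=chat_handler,
    n_ctx=4096,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                # A data: URI or http(s) URL may also work, depending on version.
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
)
print(response["choices"][0]["message"]["content"])
```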

## Prompt Format

- For best results, use ChatML as the prompt format; Alpaca may also work. A short example of building a ChatML prompt follows.
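
The snippet below builds a ChatML prompt with the transformers chat-template API. It assumes the repo's tokenizer ships a ChatML chat_template; if not, the `<|im_start|>` / `<|im_end|>` blocks can be written by hand.

```python
# Hedged example: rendering a ChatML prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InferenceIllusionist/Excalibur-7b-DPO")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what DPO fine-tuning changes in one sentence."},
]

# Renders <|im_start|>role ... <|im_end|> blocks and leaves an open
# assistant turn for generation.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```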

## Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 73.84 |
| AI2 Reasoning Challenge (25-Shot) | 70.90 |
| HellaSwag (10-Shot)               | 87.93 |
| MMLU (5-Shot)                     | 65.46 |
| TruthfulQA (0-shot)               | 70.82 |
| Winogrande (5-shot)               | 82.48 |
| GSM8k (5-shot)                    | 65.43 |