---
language:
  - en
license: apache-2.0
library_name: transformers
datasets:
  - Open-Orca/OpenOrca
metrics:
  - accuracy
pipeline_tag: question-answering
model-index:
  - name: YetAnother_Open-Llama-3B-LoRA-OpenOrca
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 25.94
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 25.76
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 24.65
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 0
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 50.83
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 0
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca
          name: Open LLM Leaderboard
---

Model Details

Model Description

  • Developed by: Andron00e
  • Language(s) (NLP): English
  • Libraries: PyTorch, transformers, peft
  • License: apache-2.0
  • Finetuned from model: openlm-research/open_llama_3b
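
Usage

A minimal usage sketch in Python, assuming this repository hosts a PEFT (LoRA) adapter to be applied on top of the openlm-research/open_llama_3b base model (the exact repository layout is an assumption, not confirmed by the card):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "openlm-research/open_llama_3b"
# Assumption: this repo id hosts the LoRA adapter weights.
adapter_id = "Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(model, adapter_id)

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))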

Training Data

The model was fine-tuned on the Open-Orca/OpenOrca dataset: https://huggingface.co/datasets/Open-Orca/OpenOrca
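
A short sketch of loading this dataset with the Hugging Face datasets library; streaming is used here because the full train split is large:

from datasets import load_dataset

# Stream the OpenOrca dataset to avoid downloading the full corpus at once.
ds = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())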

Evaluation

Evaluation of the model was carried out using EleutherAI's lm-evaluation-harness (see the eval-harness citation below); a minimal invocation sketch follows.
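
A minimal sketch of such an evaluation run via the harness's Python API (assuming lm-eval version 0.4+; the model argument here points at the base model, which is an assumption — the leaderboard runs use 10-shot for HellaSwag):

import lm_eval

# Evaluate on HellaSwag, 10-shot, as in the Open LLM Leaderboard setup.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openlm-research/open_llama_3b",  # assumption: base model id
    tasks=["hellaswag"],
    num_fewshot=10,
)
print(results["results"]["hellaswag"])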

Testing Data

The HellaSwag benchmark dataset.

Metrics

Accuracy (acc) and normalized accuracy (acc_norm).

Results and Model Examination

Task       Version  Metric    Value   Stderr
hellaswag  0        acc       0.4899  0.0050
                    acc_norm  0.6506  0.0048

Citations

@software{openlm2023openllama,
  author    = {Geng, Xinyang and Liu, Hao},
  title     = {OpenLLaMA: An Open Reproduction of LLaMA},
  month     = may,
  year      = 2023,
  url       = {https://github.com/openlm-research/open_llama}
}
@software{eval-harness,
  author       = {Gao, Leo and
                  Tow, Jonathan and
                  Biderman, Stella and
                  Black, Sid and
                  DiPofi, Anthony and
                  Foster, Charles and
                  Golding, Laurence and
                  Hsu, Jeffrey and
                  McDonell, Kyle and
                  Muennighoff, Niklas and
                  Phang, Jason and
                  Reynolds, Laria and
                  Tang, Eric and
                  Thite, Anish and
                  Wang, Ben and
                  Wang, Kevin and
                  Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628}
}

Model Card Authors and Contact

Andron00e

Open LLM Leaderboard Evaluation Results

Detailed results can be found on the Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Andron00e/YetAnother_Open-Llama-3B-LoRA-OpenOrca

Metric                             Value
Avg.                               21.20
AI2 Reasoning Challenge (25-Shot)  25.94
HellaSwag (10-Shot)                25.76
MMLU (5-Shot)                      24.65
TruthfulQA (0-shot)                 0.00
Winogrande (5-shot)                50.83
GSM8k (5-shot)                      0.00
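
The Avg. row is the unweighted mean of the six benchmark scores; a quick arithmetic check in Python:

scores = [25.94, 25.76, 24.65, 0.00, 50.83, 0.00]
print(round(sum(scores) / len(scores), 2))  # 21.2, matching the reported Avg. of 21.20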