---
language:
  - en
license: apache-2.0
datasets:
  - Locutusque/TM-DATA-V2
  - LLM360/TxT360
  - mlfoundations/dclm-baseline-1.0
  - Skylion007/openwebtext
  - JeanKaddour/minipile
  - eminorhan/gutenberg_en
model-index:
  - name: TinyMistral-248M-v3
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 16.39
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 1.78
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 0
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 0
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 5.15
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 1.47
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
          name: Open LLM Leaderboard
---

Still in training. Trained on roughly 21 billion tokens so far.
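
For quick experimentation, the checkpoint can be loaded with the standard `transformers` causal-LM API. A minimal sketch; the repo id `M4-ai/TinyMistral-248M-v3` is taken from the leaderboard links in this card:

```python
# Minimal generation sketch, assuming the standard transformers causal-LM API
# and the repo id used by the leaderboard links ("M4-ai/TinyMistral-248M-v3").
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "M4-ai/TinyMistral-248M-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# This is a base model mid-training: expect raw continuations,
# not instruction following.
inputs = tokenizer("The printing press changed publishing because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```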

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| Open LLM Leaderboard | N/A | | | | | | |
| - arc_challenge | 1 | none | 25 | acc | 0.2005 | ± | 0.0117 |
| | | none | 25 | acc_norm | 0.2406 | ± | 0.0125 |
| - gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0083 | ± | 0.0025 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - hellaswag | 1 | none | 10 | acc | 0.2724 | ± | 0.0044 |
| | | none | 10 | acc_norm | 0.2838 | ± | 0.0045 |
| - mmlu | 2 | none | | acc | 0.2290 | ± | 0.0035 |
| - humanities | 2 | none | | acc | 0.2380 | ± | 0.0062 |
| - formal_logic | 1 | none | 5 | acc | 0.2460 | ± | 0.0385 |
| - high_school_european_history | 1 | none | 5 | acc | 0.1818 | ± | 0.0301 |
| - high_school_us_history | 1 | none | 5 | acc | 0.2647 | ± | 0.0310 |
| - high_school_world_history | 1 | none | 5 | acc | 0.2911 | ± | 0.0296 |
| - international_law | 1 | none | 5 | acc | 0.2149 | ± | 0.0375 |
| - jurisprudence | 1 | none | 5 | acc | 0.2685 | ± | 0.0428 |
| - logical_fallacies | 1 | none | 5 | acc | 0.2209 | ± | 0.0326 |
| - moral_disputes | 1 | none | 5 | acc | 0.2457 | ± | 0.0232 |
| - moral_scenarios | 1 | none | 5 | acc | 0.2369 | ± | 0.0142 |
| - philosophy | 1 | none | 5 | acc | 0.1865 | ± | 0.0221 |
| - prehistory | 1 | none | 5 | acc | 0.1975 | ± | 0.0222 |
| - professional_law | 1 | none | 5 | acc | 0.2432 | ± | 0.0110 |
| - world_religions | 1 | none | 5 | acc | 0.3099 | ± | 0.0355 |
| - other | 2 | none | | acc | 0.2375 | ± | 0.0076 |
| - business_ethics | 1 | none | 5 | acc | 0.3200 | ± | 0.0469 |
| - clinical_knowledge | 1 | none | 5 | acc | 0.2226 | ± | 0.0256 |
| - college_medicine | 1 | none | 5 | acc | 0.1965 | ± | 0.0303 |
| - global_facts | 1 | none | 5 | acc | 0.1800 | ± | 0.0386 |
| - human_aging | 1 | none | 5 | acc | 0.3004 | ± | 0.0308 |
| - management | 1 | none | 5 | acc | 0.1942 | ± | 0.0392 |
| - marketing | 1 | none | 5 | acc | 0.2735 | ± | 0.0292 |
| - medical_genetics | 1 | none | 5 | acc | 0.3000 | ± | 0.0461 |
| - miscellaneous | 1 | none | 5 | acc | 0.2478 | ± | 0.0154 |
| - nutrition | 1 | none | 5 | acc | 0.2222 | ± | 0.0238 |
| - professional_accounting | 1 | none | 5 | acc | 0.2021 | ± | 0.0240 |
| - professional_medicine | 1 | none | 5 | acc | 0.1912 | ± | 0.0239 |
| - virology | 1 | none | 5 | acc | 0.2590 | ± | 0.0341 |
| - social sciences | 2 | none | | acc | 0.2203 | ± | 0.0075 |
| - econometrics | 1 | none | 5 | acc | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 5 | acc | 0.2020 | ± | 0.0286 |
| - high_school_government_and_politics | 1 | none | 5 | acc | 0.1865 | ± | 0.0281 |
| - high_school_macroeconomics | 1 | none | 5 | acc | 0.2205 | ± | 0.0210 |
| - high_school_microeconomics | 1 | none | 5 | acc | 0.2143 | ± | 0.0267 |
| - high_school_psychology | 1 | none | 5 | acc | 0.1908 | ± | 0.0168 |
| - human_sexuality | 1 | none | 5 | acc | 0.2672 | ± | 0.0388 |
| - professional_psychology | 1 | none | 5 | acc | 0.2386 | ± | 0.0172 |
| - public_relations | 1 | none | 5 | acc | 0.1727 | ± | 0.0362 |
| - security_studies | 1 | none | 5 | acc | 0.2367 | ± | 0.0272 |
| - sociology | 1 | none | 5 | acc | 0.2488 | ± | 0.0306 |
| - us_foreign_policy | 1 | none | 5 | acc | 0.2600 | ± | 0.0441 |
| - stem | 2 | none | | acc | 0.2157 | ± | 0.0073 |
| - abstract_algebra | 1 | none | 5 | acc | 0.2200 | ± | 0.0416 |
| - anatomy | 1 | none | 5 | acc | 0.1778 | ± | 0.0330 |
| - astronomy | 1 | none | 5 | acc | 0.1908 | ± | 0.0320 |
| - college_biology | 1 | none | 5 | acc | 0.2778 | ± | 0.0375 |
| - college_chemistry | 1 | none | 5 | acc | 0.2200 | ± | 0.0416 |
| - college_computer_science | 1 | none | 5 | acc | 0.2100 | ± | 0.0409 |
| - college_mathematics | 1 | none | 5 | acc | 0.2100 | ± | 0.0409 |
| - college_physics | 1 | none | 5 | acc | 0.2157 | ± | 0.0409 |
| - computer_security | 1 | none | 5 | acc | 0.2700 | ± | 0.0446 |
| - conceptual_physics | 1 | none | 5 | acc | 0.2638 | ± | 0.0288 |
| - electrical_engineering | 1 | none | 5 | acc | 0.2483 | ± | 0.0360 |
| - elementary_mathematics | 1 | none | 5 | acc | 0.2037 | ± | 0.0207 |
| - high_school_biology | 1 | none | 5 | acc | 0.1774 | ± | 0.0217 |
| - high_school_chemistry | 1 | none | 5 | acc | 0.2020 | ± | 0.0282 |
| - high_school_computer_science | 1 | none | 5 | acc | 0.2500 | ± | 0.0435 |
| - high_school_mathematics | 1 | none | 5 | acc | 0.2148 | ± | 0.0250 |
| - high_school_physics | 1 | none | 5 | acc | 0.2053 | ± | 0.0330 |
| - high_school_statistics | 1 | none | 5 | acc | 0.1481 | ± | 0.0242 |
| - machine_learning | 1 | none | 5 | acc | 0.3125 | ± | 0.0440 |
| - truthfulqa_gen | 3 | none | 0 | bleu_acc | 0.2362 | ± | 0.0149 |
| | | none | 0 | bleu_diff | -1.0138 | ± | 0.2569 |
| | | none | 0 | bleu_max | 7.9522 | ± | 0.4088 |
| | | none | 0 | rouge1_acc | 0.2595 | ± | 0.0153 |
| | | none | 0 | rouge1_diff | -1.9129 | ± | 0.4349 |
| | | none | 0 | rouge1_max | 21.7885 | ± | 0.7307 |
| | | none | 0 | rouge2_acc | 0.1200 | ± | 0.0114 |
| | | none | 0 | rouge2_diff | -1.9771 | ± | 0.3475 |
| | | none | 0 | rouge2_max | 9.0199 | ± | 0.5842 |
| | | none | 0 | rougeL_acc | 0.2570 | ± | 0.0153 |
| | | none | 0 | rougeL_diff | -1.8812 | ± | 0.4185 |
| | | none | 0 | rougeL_max | 19.6284 | ± | 0.6850 |
| - truthfulqa_mc1 | 2 | none | 0 | acc | 0.1983 | ± | 0.0140 |
| - truthfulqa_mc2 | 2 | none | 0 | acc | 0.3861 | ± | 0.0147 |
| - winogrande | 1 | none | 5 | acc | 0.4972 | ± | 0.0141 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| - mmlu | 2 | none | | acc | 0.2290 | ± | 0.0035 |
| - humanities | 2 | none | | acc | 0.2380 | ± | 0.0062 |
| - other | 2 | none | | acc | 0.2375 | ± | 0.0076 |
| - social sciences | 2 | none | | acc | 0.2203 | ± | 0.0075 |
| - stem | 2 | none | | acc | 0.2157 | ± | 0.0073 |
| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| agieval_nous | 0 | none | | acc_norm | 0.2133 | ± | 0.0081 |
| - agieval_aqua_rat | 1 | none | 0 | acc | 0.2047 | ± | 0.0254 |
| | | none | 0 | acc_norm | 0.1969 | ± | 0.0250 |
| - agieval_logiqa_en | 1 | none | 0 | acc | 0.2043 | ± | 0.0158 |
| | | none | 0 | acc_norm | 0.2304 | ± | 0.0165 |
| - agieval_lsat_ar | 1 | none | 0 | acc | 0.1739 | ± | 0.0250 |
| | | none | 0 | acc_norm | 0.1957 | ± | 0.0262 |
| - agieval_lsat_lr | 1 | none | 0 | acc | 0.1549 | ± | 0.0160 |
| | | none | 0 | acc_norm | 0.1608 | ± | 0.0163 |
| - agieval_lsat_rc | 1 | none | 0 | acc | 0.1636 | ± | 0.0226 |
| | | none | 0 | acc_norm | 0.2119 | ± | 0.0250 |
| - agieval_sat_en | 1 | none | 0 | acc | 0.2670 | ± | 0.0309 |
| | | none | 0 | acc_norm | 0.2621 | ± | 0.0307 |
| - agieval_sat_en_without_passage | 1 | none | 0 | acc | 0.2670 | ± | 0.0309 |
| | | none | 0 | acc_norm | 0.2621 | ± | 0.0307 |
| - agieval_sat_math | 1 | none | 0 | acc | 0.2182 | ± | 0.0279 |
| | | none | 0 | acc_norm | 0.2318 | ± | 0.0285 |
| arc_challenge | 1 | none | 0 | acc | 0.1945 | ± | 0.0116 |
| | | none | 0 | acc_norm | 0.2372 | ± | 0.0124 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.3861 | ± | 0.0147 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| agieval_nous | 0 | none | | acc_norm | 0.2133 | ± | 0.0081 |
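
The tables above match the output format of EleutherAI's lm-evaluation-harness, so the numbers should be reproducible along these lines. A sketch, assuming the harness's Python API; task names and shot counts mirror the rows above, while the exact harness version and settings used here are not documented:

```python
# Reproduction sketch with lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=M4-ai/TinyMistral-248M-v3",
    tasks=["arc_challenge"],  # e.g. swap in "mmlu", "agieval_nous", ...
    num_fewshot=25,           # 25-shot for arc_challenge, per the first table
    batch_size=8,
)
print(results["results"]["arc_challenge"])
```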

# Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3).

| Metric | Value |
|--------|------:|
| Avg. | 4.13 |
| IFEval (0-Shot) | 16.39 |
| BBH (3-Shot) | 1.78 |
| MATH Lvl 5 (4-Shot) | 0.00 |
| GPQA (0-shot) | 0.00 |
| MuSR (0-shot) | 5.15 |
| MMLU-PRO (5-shot) | 1.47 |
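
The reported average is just the unweighted mean of the six benchmark scores, as a quick check confirms:

```python
# Quick check: the leaderboard "Avg." is the plain mean of the six scores above.
scores = [16.39, 1.78, 0.00, 0.00, 5.15, 1.47]
print(round(sum(scores) / len(scores), 2))  # -> 4.13
```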