Master-Yi-9B
license: apache-2.0

Model Description

Master is a collection of LLMs trained on human-collected seed questions, with the answers regenerated by a mixture of high-performance open-source LLMs.

Master-Yi-9B is trained using ORPO (Odds Ratio Preference Optimization). The model shows strong reasoning ability on coding and math questions.

Quantized Version: Here

Master-Yi-9B-Vision: Coming Soon


Prompt Template

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
What is the meaning of life?<|im_end|>
<|im_start|>assistant

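The template above follows the ChatML layout. As a minimal sketch, assuming you want to assemble the prompt string by hand instead of calling `tokenizer.apply_chat_template`, the helper below reproduces it. The function name `build_chatml_prompt` is hypothetical and not part of any library.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts in the ChatML layout.

    Hypothetical helper for illustration; not part of transformers.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the meaning of life?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

In practice the tokenizer's own chat template is the safer choice, since it stays in sync with the special tokens the model was trained with.
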
Examples

[Image: example model output]

[Image: example model output]

Inference Code

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID assumed from this model card's location on the Hugging Face Hub.
model_id = "qnguyen3/Master-Yi-9B"

# device_map="auto" places the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "What is the meaning of life?"
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": prompt},
]
# Render the messages with the model's chat template (see Prompt Template).
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes input_ids and attention_mask
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,  # temperature only takes effect when sampling
    temperature=0.25,
)
# Drop the prompt tokens so only the newly generated reply remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Benchmarks

Nous Benchmark:

|   Model    |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|------------|------:|------:|---------:|-------:|------:|
|Master-Yi-9B|  43.55|  71.48|     48.54|   41.43|  51.25|

AGIEval

|             Task             |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |35.83|±  |  3.01|
|                              |       |acc_norm|31.89|±  |  2.93|
|agieval_logiqa_en             |      0|acc     |38.25|±  |  1.91|
|                              |       |acc_norm|37.79|±  |  1.90|
|agieval_lsat_ar               |      0|acc     |23.04|±  |  2.78|
|                              |       |acc_norm|20.43|±  |  2.66|
|agieval_lsat_lr               |      0|acc     |48.04|±  |  2.21|
|                              |       |acc_norm|42.75|±  |  2.19|
|agieval_lsat_rc               |      0|acc     |61.34|±  |  2.97|
|                              |       |acc_norm|52.79|±  |  3.05|
|agieval_sat_en                |      0|acc     |79.13|±  |  2.84|
|                              |       |acc_norm|72.33|±  |  3.12|
|agieval_sat_en_without_passage|      0|acc     |44.17|±  |  3.47|
|                              |       |acc_norm|42.72|±  |  3.45|
|agieval_sat_math              |      0|acc     |52.27|±  |  3.38|
|                              |       |acc_norm|47.73|±  |  3.38|

Average: 43.55%
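
The reported AGIEval average matches the unweighted mean of the `acc_norm` rows in the table above. The card does not state the aggregation rule, so treat it as inferred from the numbers. A short sketch that reproduces it:

```python
# acc_norm scores (percent) copied from the AGIEval table above.
agieval_acc_norm = {
    "agieval_aqua_rat": 31.89,
    "agieval_logiqa_en": 37.79,
    "agieval_lsat_ar": 20.43,
    "agieval_lsat_lr": 42.75,
    "agieval_lsat_rc": 52.79,
    "agieval_sat_en": 72.33,
    "agieval_sat_en_without_passage": 42.72,
    "agieval_sat_math": 47.73,
}

# Unweighted mean over the eight tasks.
agieval_average = sum(agieval_acc_norm.values()) / len(agieval_acc_norm)
print(f"AGIEval average: {agieval_average:.2f}%")  # AGIEval average: 43.55%
```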

GPT4All

|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |54.95|±  |  1.45|
|             |       |acc_norm|58.70|±  |  1.44|
|arc_easy     |      0|acc     |82.28|±  |  0.78|
|             |       |acc_norm|81.10|±  |  0.80|
|boolq        |      1|acc     |86.15|±  |  0.60|
|hellaswag    |      0|acc     |59.16|±  |  0.49|
|             |       |acc_norm|77.53|±  |  0.42|
|openbookqa   |      0|acc     |37.40|±  |  2.17|
|             |       |acc_norm|44.00|±  |  2.22|
|piqa         |      0|acc     |79.00|±  |  0.95|
|             |       |acc_norm|80.25|±  |  0.93|
|winogrande   |      0|acc     |72.61|±  |  1.25|

Average: 71.48%
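
The GPT4All average is consistent with taking `acc_norm` for each task where it is reported and falling back to `acc` otherwise (boolq and winogrande). This rule is inferred from the table rather than stated in the card. A minimal sketch:

```python
# Scores (percent) copied from the GPT4All table above.
gpt4all_scores = {
    "arc_challenge": {"acc": 54.95, "acc_norm": 58.70},
    "arc_easy": {"acc": 82.28, "acc_norm": 81.10},
    "boolq": {"acc": 86.15},
    "hellaswag": {"acc": 59.16, "acc_norm": 77.53},
    "openbookqa": {"acc": 37.40, "acc_norm": 44.00},
    "piqa": {"acc": 79.00, "acc_norm": 80.25},
    "winogrande": {"acc": 72.61},
}


def headline_score(metrics):
    """Prefer acc_norm when the task reports it, otherwise use acc."""
    return metrics["acc_norm"] if "acc_norm" in metrics else metrics["acc"]


headline = [headline_score(m) for m in gpt4all_scores.values()]
gpt4all_average = sum(headline) / len(headline)
print(f"GPT4All average: {gpt4all_average:.2f}%")  # GPT4All average: 71.48%
```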

TruthfulQA

|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |33.05|±  |  1.65|
|             |       |mc2   |48.54|±  |  1.54|

Average: 48.54%

Bigbench

|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|54.74|±  |  3.62|
|bigbench_date_understanding                     |      0|multiple_choice_grade|68.02|±  |  2.43|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|40.31|±  |  3.06|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|30.36|±  |  2.43|
|                                                |       |exact_str_match      | 2.23|±  |  0.78|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|26.00|±  |  1.96|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|20.71|±  |  1.53|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|44.00|±  |  2.87|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|35.00|±  |  2.14|
|bigbench_navigate                               |      0|multiple_choice_grade|58.40|±  |  1.56|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|61.80|±  |  1.09|
|bigbench_ruin_names                             |      0|multiple_choice_grade|42.41|±  |  2.34|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|31.56|±  |  1.47|
|bigbench_snarks                                 |      0|multiple_choice_grade|55.25|±  |  3.71|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|69.37|±  |  1.47|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|27.70|±  |  1.42|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|21.36|±  |  1.16|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|14.69|±  |  0.85|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|44.00|±  |  2.87|

Average: 41.43%
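
The Bigbench average agrees with the unweighted mean of the 18 `multiple_choice_grade` rows, leaving out the `exact_str_match` row for `bigbench_geometric_shapes`. That reading is inferred from the numbers, not documented in the card. A sketch of the calculation:

```python
# multiple_choice_grade scores (percent), in the order of the table above.
bigbench_mc_grades = [
    54.74, 68.02, 40.31, 30.36, 26.00, 20.71,
    44.00, 35.00, 58.40, 61.80, 42.41, 31.56,
    55.25, 69.37, 27.70, 21.36, 14.69, 44.00,
]

# exact_str_match (2.23) is intentionally not part of the mean.
bigbench_average = sum(bigbench_mc_grades) / len(bigbench_mc_grades)
print(f"Bigbench average: {bigbench_average:.2f}%")  # Bigbench average: 41.43%
```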

Average score: 51.25%

OpenLLM Benchmark:

|   Model    | ARC |HellaSwag|MMLU |TruthfulQA|Winogrande|GSM8K|Average|
|------------|----:|--------:|----:|---------:|---------:|----:|------:|
|Master-Yi-9B| 61.6|    79.89|69.95|     48.59|     77.35|67.48|  67.48|

ARC

|    Task     |Version|       Metric       |    Value    |   |Stderr|
|-------------|------:|--------------------|-------------|---|------|
|arc_challenge|      1|acc,none            |         0.59|   |      |
|             |       |acc_stderr,none     |         0.01|   |      |
|             |       |acc_norm,none       |         0.62|   |      |
|             |       |acc_norm_stderr,none|         0.01|   |      |
|             |       |alias               |arc_challenge|   |      |

Average: 61.6%

HellaSwag

|  Task   |Version|       Metric       |  Value  |   |Stderr|
|---------|------:|--------------------|---------|---|------|
|hellaswag|      1|acc,none            |     0.61|   |      |
|         |       |acc_stderr,none     |        0|   |      |
|         |       |acc_norm,none       |     0.80|   |      |
|         |       |acc_norm_stderr,none|        0|   |      |
|         |       |alias               |hellaswag|   |      |

Average: 79.89%

MMLU

|                  Task                  |Version|    Metric     |                 Value                 |   |Stderr|
|----------------------------------------|-------|---------------|---------------------------------------|---|------|
|mmlu                                    |N/A    |acc,none       |                                    0.7|   |      |
|                                        |       |acc_stderr,none|                                      0|   |      |
|                                        |       |alias          |mmlu                                   |   |      |
|mmlu_abstract_algebra                   |      0|alias          |  - abstract_algebra                   |   |      |
|                                        |       |acc,none       |0.46                                   |   |      |
|                                        |       |acc_stderr,none|0.05                                   |   |      |
|mmlu_anatomy                            |      0|alias          |  - anatomy                            |   |      |
|                                        |       |acc,none       |0.64                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_astronomy                          |      0|alias          |  - astronomy                          |   |      |
|                                        |       |acc,none       |0.77                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_business_ethics                    |      0|alias          |  - business_ethics                    |   |      |
|                                        |       |acc,none       |0.76                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_clinical_knowledge                 |      0|alias          |  - clinical_knowledge                 |   |      |
|                                        |       |acc,none       |0.71                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_college_biology                    |      0|alias          |  - college_biology                    |   |      |
|                                        |       |acc,none       |0.82                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_college_chemistry                  |      0|alias          |  - college_chemistry                  |   |      |
|                                        |       |acc,none       |0.52                                   |   |      |
|                                        |       |acc_stderr,none|0.05                                   |   |      |
|mmlu_college_computer_science           |      0|alias          |  - college_computer_science           |   |      |
|                                        |       |acc,none       |0.56                                   |   |      |
|                                        |       |acc_stderr,none|0.05                                   |   |      |
|mmlu_college_mathematics                |      0|alias          |  - college_mathematics                |   |      |
|                                        |       |acc,none       |0.44                                   |   |      |
|                                        |       |acc_stderr,none|0.05                                   |   |      |
|mmlu_college_medicine                   |      0|alias          |  - college_medicine                   |   |      |
|                                        |       |acc,none       |0.72                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_college_physics                    |      0|alias          |  - college_physics                    |   |      |
|                                        |       |acc,none       |0.45                                   |   |      |
|                                        |       |acc_stderr,none|0.05                                   |   |      |
|mmlu_computer_security                  |      0|alias          |  - computer_security                  |   |      |
|                                        |       |acc,none       |0.81                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_conceptual_physics                 |      0|alias          |  - conceptual_physics                 |   |      |
|                                        |       |acc,none       |0.74                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_econometrics                       |      0|alias          |  - econometrics                       |   |      |
|                                        |       |acc,none       |0.65                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_electrical_engineering             |      0|alias          |  - electrical_engineering             |   |      |
|                                        |       |acc,none       |0.72                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_elementary_mathematics             |      0|alias          |  - elementary_mathematics             |   |      |
|                                        |       |acc,none       |0.62                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_formal_logic                       |      0|alias          |  - formal_logic                       |   |      |
|                                        |       |acc,none       |0.57                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_global_facts                       |      0|alias          |  - global_facts                       |   |      |
|                                        |       |acc,none       |0.46                                   |   |      |
|                                        |       |acc_stderr,none|0.05                                   |   |      |
|mmlu_high_school_biology                |      0|alias          |  - high_school_biology                |   |      |
|                                        |       |acc,none       |0.86                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_high_school_chemistry              |      0|alias          |  - high_school_chemistry              |   |      |
|                                        |       |acc,none       |0.67                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_high_school_computer_science       |      0|alias          |  - high_school_computer_science       |   |      |
|                                        |       |acc,none       |0.84                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_high_school_european_history       |      0|alias          |  - high_school_european_history       |   |      |
|                                        |       |acc,none       |0.82                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_high_school_geography              |      0|alias          |  - high_school_geography              |   |      |
|                                        |       |acc,none       |0.86                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_high_school_government_and_politics|      0|alias          |  - high_school_government_and_politics|   |      |
|                                        |       |acc,none       |0.90                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_high_school_macroeconomics         |      0|alias          |  - high_school_macroeconomics         |   |      |
|                                        |       |acc,none       |0.75                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_high_school_mathematics            |      0|alias          |  - high_school_mathematics            |   |      |
|                                        |       |acc,none       |0.43                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_high_school_microeconomics         |      0|alias          |  - high_school_microeconomics         |   |      |
|                                        |       |acc,none       |0.86                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_high_school_physics                |      0|alias          |  - high_school_physics                |   |      |
|                                        |       |acc,none       |0.45                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_high_school_psychology             |      0|alias          |  - high_school_psychology             |   |      |
|                                        |       |acc,none       |0.87                                   |   |      |
|                                        |       |acc_stderr,none|0.01                                   |   |      |
|mmlu_high_school_statistics             |      0|alias          |  - high_school_statistics             |   |      |
|                                        |       |acc,none       |0.68                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_high_school_us_history             |      0|alias          |  - high_school_us_history             |   |      |
|                                        |       |acc,none       |0.85                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_high_school_world_history          |      0|alias          |  - high_school_world_history          |   |      |
|                                        |       |acc,none       |0.85                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_human_aging                        |      0|alias          |  - human_aging                        |   |      |
|                                        |       |acc,none       |0.76                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_human_sexuality                    |      0|alias          |  - human_sexuality                    |   |      |
|                                        |       |acc,none       |0.78                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_humanities                         |N/A    |alias          | - humanities                          |   |      |
|                                        |       |acc,none       |0.63                                   |   |      |
|                                        |       |acc_stderr,none|0.01                                   |   |      |
|mmlu_international_law                  |      0|alias          |  - international_law                  |   |      |
|                                        |       |acc,none       |0.79                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_jurisprudence                      |      0|alias          |  - jurisprudence                      |   |      |
|                                        |       |acc,none       |0.79                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_logical_fallacies                  |      0|alias          |  - logical_fallacies                  |   |      |
|                                        |       |acc,none       |0.80                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_machine_learning                   |      0|alias          |  - machine_learning                   |   |      |
|                                        |       |acc,none       |0.52                                   |   |      |
|                                        |       |acc_stderr,none|0.05                                   |   |      |
|mmlu_management                         |      0|alias          |  - management                         |   |      |
|                                        |       |acc,none       |0.83                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_marketing                          |      0|alias          |  - marketing                          |   |      |
|                                        |       |acc,none       |0.89                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_medical_genetics                   |      0|alias          |  - medical_genetics                   |   |      |
|                                        |       |acc,none       |0.78                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_miscellaneous                      |      0|alias          |  - miscellaneous                      |   |      |
|                                        |       |acc,none       |0.85                                   |   |      |
|                                        |       |acc_stderr,none|0.01                                   |   |      |
|mmlu_moral_disputes                     |      0|alias          |  - moral_disputes                     |   |      |
|                                        |       |acc,none       |0.75                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_moral_scenarios                    |      0|alias          |  - moral_scenarios                    |   |      |
|                                        |       |acc,none       |0.48                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_nutrition                          |      0|alias          |  - nutrition                          |   |      |
|                                        |       |acc,none       |0.77                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_other                              |N/A    |alias          | - other                               |   |      |
|                                        |       |acc,none       |0.75                                   |   |      |
|                                        |       |acc_stderr,none|0.01                                   |   |      |
|mmlu_philosophy                         |      0|alias          |  - philosophy                         |   |      |
|                                        |       |acc,none       |0.78                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_prehistory                         |      0|alias          |  - prehistory                         |   |      |
|                                        |       |acc,none       |0.77                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_professional_accounting            |      0|alias          |  - professional_accounting            |   |      |
|                                        |       |acc,none       |0.57                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_professional_law                   |      0|alias          |  - professional_law                   |   |      |
|                                        |       |acc,none       |0.50                                   |   |      |
|                                        |       |acc_stderr,none|0.01                                   |   |      |
|mmlu_professional_medicine              |      0|alias          |  - professional_medicine              |   |      |
|                                        |       |acc,none       |0.71                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_professional_psychology            |      0|alias          |  - professional_psychology            |   |      |
|                                        |       |acc,none       |0.73                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_public_relations                   |      0|alias          |  - public_relations                   |   |      |
|                                        |       |acc,none       |0.76                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_security_studies                   |      0|alias          |  - security_studies                   |   |      |
|                                        |       |acc,none       |0.78                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_social_sciences                    |N/A    |alias          | - social_sciences                     |   |      |
|                                        |       |acc,none       |0.81                                   |   |      |
|                                        |       |acc_stderr,none|0.01                                   |   |      |
|mmlu_sociology                          |      0|alias          |  - sociology                          |   |      |
|                                        |       |acc,none       |0.86                                   |   |      |
|                                        |       |acc_stderr,none|0.02                                   |   |      |
|mmlu_stem                               |N/A    |alias          | - stem                                |   |      |
|                                        |       |acc,none       |0.65                                   |   |      |
|                                        |       |acc_stderr,none|0.01                                   |   |      |
|mmlu_us_foreign_policy                  |      0|alias          |  - us_foreign_policy                  |   |      |
|                                        |       |acc,none       |0.92                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |
|mmlu_virology                           |      0|alias          |  - virology                           |   |      |
|                                        |       |acc,none       |0.58                                   |   |      |
|                                        |       |acc_stderr,none|0.04                                   |   |      |
|mmlu_world_religions                    |      0|alias          |  - world_religions                    |   |      |
|                                        |       |acc,none       |0.82                                   |   |      |
|                                        |       |acc_stderr,none|0.03                                   |   |      |

Average: 69.95%

TruthfulQA

|     Task     |Version|        Metric         |      Value      |   |Stderr|
|--------------|-------|-----------------------|-----------------|---|------|
|truthfulqa    |N/A    |bleu_acc,none          |             0.45|   |      |
|              |       |bleu_acc_stderr,none   |             0.02|   |      |
|              |       |rouge1_acc,none        |             0.45|   |      |
|              |       |rouge1_acc_stderr,none |             0.02|   |      |
|              |       |rouge2_diff,none       |             0.92|   |      |
|              |       |rouge2_diff_stderr,none|             1.07|   |      |
|              |       |bleu_max,none          |            23.77|   |      |
|              |       |bleu_max_stderr,none   |             0.81|   |      |
|              |       |rouge2_acc,none        |             0.38|   |      |
|              |       |rouge2_acc_stderr,none |             0.02|   |      |
|              |       |acc,none               |             0.41|   |      |
|              |       |acc_stderr,none        |             0.01|   |      |
|              |       |rougeL_diff,none       |             1.57|   |      |
|              |       |rougeL_diff_stderr,none|             0.93|   |      |
|              |       |rougeL_acc,none        |             0.46|   |      |
|              |       |rougeL_acc_stderr,none |             0.02|   |      |
|              |       |bleu_diff,none         |             1.38|   |      |
|              |       |bleu_diff_stderr,none  |             0.75|   |      |
|              |       |rouge2_max,none        |            33.01|   |      |
|              |       |rouge2_max_stderr,none |             1.05|   |      |
|              |       |rouge1_diff,none       |             1.72|   |      |
|              |       |rouge1_diff_stderr,none|             0.92|   |      |
|              |       |rougeL_max,none        |            45.25|   |      |
|              |       |rougeL_max_stderr,none |             0.92|   |      |
|              |       |rouge1_max,none        |            48.29|   |      |
|              |       |rouge1_max_stderr,none |             0.90|   |      |
|              |       |alias                  |truthfulqa       |   |      |
|truthfulqa_gen|      3|bleu_max,none          |            23.77|   |      |
|              |       |bleu_max_stderr,none   |             0.81|   |      |
|              |       |bleu_acc,none          |             0.45|   |      |
|              |       |bleu_acc_stderr,none   |             0.02|   |      |
|              |       |bleu_diff,none         |             1.38|   |      |
|              |       |bleu_diff_stderr,none  |             0.75|   |      |
|              |       |rouge1_max,none        |            48.29|   |      |
|              |       |rouge1_max_stderr,none |             0.90|   |      |
|              |       |rouge1_acc,none        |             0.45|   |      |
|              |       |rouge1_acc_stderr,none |             0.02|   |      |
|              |       |rouge1_diff,none       |             1.72|   |      |
|              |       |rouge1_diff_stderr,none|             0.92|   |      |
|              |       |rouge2_max,none        |            33.01|   |      |
|              |       |rouge2_max_stderr,none |             1.05|   |      |
|              |       |rouge2_acc,none        |             0.38|   |      |
|              |       |rouge2_acc_stderr,none |             0.02|   |      |
|              |       |rouge2_diff,none       |             0.92|   |      |
|              |       |rouge2_diff_stderr,none|             1.07|   |      |
|              |       |rougeL_max,none        |            45.25|   |      |
|              |       |rougeL_max_stderr,none |             0.92|   |      |
|              |       |rougeL_acc,none        |             0.46|   |      |
|              |       |rougeL_acc_stderr,none |             0.02|   |      |
|              |       |rougeL_diff,none       |             1.57|   |      |
|              |       |rougeL_diff_stderr,none|             0.93|   |      |
|              |       |alias                  | - truthfulqa_gen|   |      |
|truthfulqa_mc1|      2|acc,none               |             0.33|   |      |
|              |       |acc_stderr,none        |             0.02|   |      |
|              |       |alias                  | - truthfulqa_mc1|   |      |
|truthfulqa_mc2|      2|acc,none               |             0.49|   |      |
|              |       |acc_stderr,none        |             0.02|   |      |
|              |       |alias                  | - truthfulqa_mc2|   |      |

Average: 48.59%

Winogrande

|   Task   |Version|    Metric     |  Value   |   |Stderr|
|----------|------:|---------------|----------|---|------|
|winogrande|      1|acc,none       |      0.77|   |      |
|          |       |acc_stderr,none|      0.01|   |      |
|          |       |alias          |winogrande|   |      |

Average: 77.35%

GSM8K

|Task |Version|              Metric               |Value|   |Stderr|
|-----|------:|-----------------------------------|-----|---|------|
|gsm8k|      3|exact_match,strict-match           | 0.67|   |      |
|     |       |exact_match_stderr,strict-match    | 0.01|   |      |
|     |       |exact_match,flexible-extract       | 0.68|   |      |
|     |       |exact_match_stderr,flexible-extract| 0.01|   |      |
|     |       |alias                              |gsm8k|   |      |

Average: 67.48%

Average score: 67.48%
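
The overall OpenLLM score equals the GSM8K score, which can look like a copy-paste slip. Recomputing the unweighted mean of the six benchmark scores from the summary row shows the two values coincide after rounding. A quick check:

```python
from statistics import mean

# Per-benchmark scores (percent) from the OpenLLM summary row.
openllm_scores = {
    "ARC": 61.6,
    "HellaSwag": 79.89,
    "MMLU": 69.95,
    "TruthfulQA": 48.59,
    "Winogrande": 77.35,
    "GSM8K": 67.48,
}

openllm_average = mean(openllm_scores.values())
print(f"OpenLLM average: {openllm_average:.2f}%")  # OpenLLM average: 67.48%
```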