Added Nous Eval-Scores
README.md CHANGED
@@ -61,4 +61,72 @@ pipeline = transformers.pipeline(

```python
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

# 🏆 Evaluation Scores

## Nous

| Model                                                                                         | AGIEval | TruthfulQA | Bigbench |
|-----------------------------------------------------------------------------------------------|--------:|-----------:|---------:|
| [yuvraj17/Llama3-8B-Instruct-Slerp](https://huggingface.co/yuvraj17/Llama3-8B-Instruct-Slerp) |   38.32 |      57.15 |    43.91 |
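
The per-task tables below look like raw output from the EleutherAI lm-evaluation-harness for the Nous benchmark suite. As a rough guide to reproducing numbers of this kind, here is a minimal sketch using the harness's Python API; the model id is taken from the table above, but the harness version, task identifiers, dtype, and batch size are assumptions, and the BigBench tasks further down may only exist under these exact names in the older harness fork commonly used for this suite.

```python
# Hypothetical reproduction sketch, not the exact command behind the scores above.
# Assumes EleutherAI lm-evaluation-harness >= 0.4 (`pip install lm-eval`); task names
# can differ between harness versions (e.g. truthfulqa_mc vs truthfulqa_mc1/mc2).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=yuvraj17/Llama3-8B-Instruct-Slerp,dtype=bfloat16",
    tasks=[
        "agieval_aqua_rat", "agieval_logiqa_en", "agieval_lsat_ar",
        "agieval_lsat_lr", "agieval_lsat_rc", "agieval_sat_en",
        "agieval_sat_en_without_passage", "agieval_sat_math",
        "truthfulqa_mc2",
    ],
    batch_size=8,
)

# Each entry maps a task name to its metric dictionary (acc, acc_norm, mc2, ...).
for task, metrics in results["results"].items():
    print(task, metrics)
```

The harness also ships a command-line entry point that runs the same tasks; the Python API is shown here only to keep the sketch self-contained.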

### AGIEval

| Task                           | Version | Metric   | Value |   | Stderr |
|--------------------------------|--------:|----------|------:|---|-------:|
| agieval_aqua_rat               |       0 | acc      | 23.62 | ± |   2.67 |
|                                |         | acc_norm | 22.05 | ± |   2.61 |
| agieval_logiqa_en              |       0 | acc      | 27.50 | ± |   1.75 |
|                                |         | acc_norm | 31.80 | ± |   1.83 |
| agieval_lsat_ar                |       0 | acc      | 21.30 | ± |   2.71 |
|                                |         | acc_norm | 20.87 | ± |   2.69 |
| agieval_lsat_lr                |       0 | acc      | 35.29 | ± |   2.12 |
|                                |         | acc_norm | 37.65 | ± |   2.15 |
| agieval_lsat_rc                |       0 | acc      | 42.01 | ± |   3.01 |
|                                |         | acc_norm | 39.78 | ± |   2.99 |
| agieval_sat_en                 |       0 | acc      | 55.83 | ± |   3.47 |
|                                |         | acc_norm | 50.49 | ± |   3.49 |
| agieval_sat_en_without_passage |       0 | acc      | 36.89 | ± |   3.37 |
|                                |         | acc_norm | 34.95 | ± |   3.33 |
| agieval_sat_math               |       0 | acc      | 29.55 | ± |   3.08 |
|                                |         | acc_norm | 28.64 | ± |   3.05 |

**Average score**: 33.28%
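
The AGIEval figure appears to be the unweighted mean of the acc_norm column above; a quick sanity check under that assumption:

```python
# acc_norm values copied from the AGIEval table; the average is assumed to be a plain mean.
acc_norm = [22.05, 31.80, 20.87, 37.65, 39.78, 50.49, 34.95, 28.64]
print(f"{sum(acc_norm) / len(acc_norm):.2f}")  # -> 33.28
```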

### TruthfulQA

| Task          | Version | Metric | Value |   | Stderr |
|---------------|--------:|--------|------:|---|-------:|
| truthfulqa_mc |       1 | mc1    | 33.54 | ± |   1.65 |
|               |         | mc2    | 49.78 | ± |   1.53 |

**Average score**: 49.78%

### BigBench

| Task                                              | Version | Metric                | Value |   | Stderr |
|---------------------------------------------------|--------:|-----------------------|------:|---|-------:|
| bigbench_causal_judgement                         |       0 | multiple_choice_grade | 47.89 | ± |   3.63 |
| bigbench_date_understanding                       |       0 | multiple_choice_grade | 39.02 | ± |   2.54 |
| bigbench_disambiguation_qa                        |       0 | multiple_choice_grade | 33.72 | ± |   2.95 |
| bigbench_geometric_shapes                         |       0 | multiple_choice_grade | 20.61 | ± |   2.14 |
| bigbench_logical_deduction_five_objects           |       0 | multiple_choice_grade | 31.40 | ± |   2.08 |
| bigbench_logical_deduction_seven_objects          |       0 | multiple_choice_grade | 23.71 | ± |   1.61 |
| bigbench_logical_deduction_three_objects          |       0 | multiple_choice_grade | 47.00 | ± |   2.89 |
| bigbench_movie_recommendation                     |       0 | multiple_choice_grade | 27.40 | ± |   1.99 |
| bigbench_navigate                                 |       0 | multiple_choice_grade | 50.10 | ± |   1.58 |
| bigbench_reasoning_about_colored_objects          |       0 | multiple_choice_grade | 38.40 | ± |   1.09 |
| bigbench_ruin_names                               |       0 | multiple_choice_grade | 27.23 | ± |   2.11 |
| bigbench_salient_translation_error_detection      |       0 | multiple_choice_grade | 25.45 | ± |   1.38 |
| bigbench_snarks                                   |       0 | multiple_choice_grade | 46.41 | ± |   3.72 |
| bigbench_sports_understanding                     |       0 | multiple_choice_grade | 50.30 | ± |   1.59 |
| bigbench_temporal_sequences                       |       0 | multiple_choice_grade | 37.30 | ± |   1.53 |
| bigbench_tracking_shuffled_objects_five_objects   |       0 | multiple_choice_grade | 21.36 | ± |   1.16 |
| bigbench_tracking_shuffled_objects_seven_objects  |       0 | multiple_choice_grade | 17.14 | ± |   0.90 |
| bigbench_tracking_shuffled_objects_three_objects  |       0 | multiple_choice_grade | 47.00 | ± |   2.89 |

**Average score**: 35.38%