Update README.md
README.md
CHANGED
@@ -5,4 +5,48 @@ datasets:
base_model:
- meta-llama/Llama-3.2-1B-Instruct
library_name: transformers
---
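
The metadata above declares `transformers` as the library and `meta-llama/Llama-3.2-1B-Instruct` as the base model. As a minimal loading sketch (the repo id `your-org/your-finetuned-llama` is a hypothetical placeholder for wherever this fine-tune is published):

```python
# Minimal sketch; "your-org/your-finetuned-llama" is a hypothetical placeholder
# for this model's repo id. Requires `transformers`, `torch`, and `accelerate`
# (for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/your-finetuned-llama"  # placeholder, not a real repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "In one sentence, what does tinyMMLU measure?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```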
# This model's benchmark results

| Tasks           |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks   |    N/A|                |      |           |   |      |   |      |
| - tinyArc       |      0|none            |    25|acc_norm   |↑  |0.4253|±  |   N/A|
| - tinyGSM8k     |      0|flexible-extract|     5|exact_match|↑  |0.3768|±  |   N/A|
|                 |       |strict-match    |     5|exact_match|↑  |0.3768|±  |   N/A|
| - tinyHellaswag |      0|none            |    10|acc_norm   |↑  |0.5379|±  |   N/A|
| - tinyMMLU      |      0|none            |     0|acc_norm   |↑  |0.4483|±  |   N/A|
| - tinyTruthfulQA|      0|none            |     0|acc        |↑  |0.4217|±  |   N/A|
| - tinyWinogrande|      0|none            |     5|acc_norm   |↑  |0.5366|±  |   N/A|

# Original `meta-llama/Llama-3.2-1B-Instruct` benchmark results

| Tasks           |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks   |    N/A|                |      |           |   |      |   |      |
| - tinyArc       |      0|none            |    25|acc_norm   |↑  |0.4145|±  |   N/A|
| - tinyGSM8k     |      0|flexible-extract|     5|exact_match|↑  |0.3412|±  |   N/A|
|                 |       |strict-match    |     5|exact_match|↑  |0.3412|±  |   N/A|
| - tinyHellaswag |      0|none            |    10|acc_norm   |↑  |0.5335|±  |   N/A|
| - tinyMMLU      |      0|none            |     0|acc_norm   |↑  |0.4298|±  |   N/A|
| - tinyTruthfulQA|      0|none            |     0|acc        |↑  |0.4288|±  |   N/A|
| - tinyWinogrande|      0|none            |     5|acc_norm   |↑  |0.5366|±  |   N/A|

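Both tables follow the Markdown output format of EleutherAI's `lm-evaluation-harness`. The exact command used for these runs is not recorded in this card, but comparable tables can be regenerated with the `tinyBenchmarks` task group along the lines of the sketch below (`your-org/your-finetuned-llama` is again a placeholder, and the tiny tasks may need extra dependencies such as the `tinyBenchmarks` package):

```python
# Sketch only: assumes `pip install lm-eval` and that both checkpoints are
# accessible; the first repo id is a placeholder for this fine-tune.
import lm_eval
from lm_eval.utils import make_table

for repo_id in [
    "your-org/your-finetuned-llama",     # placeholder for this fine-tune
    "meta-llama/Llama-3.2-1B-Instruct",  # original base model
]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={repo_id},dtype=bfloat16",
        tasks=["tinyBenchmarks"],  # expands to tinyArc, tinyGSM8k, tinyHellaswag, ...
        batch_size=8,
    )
    print(repo_id)
    print(make_table(results))  # Markdown tables like the ones above
```
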
Below is a side-by-side comparison of the two result sets. For each task, the better value on that metric is highlighted in **bold**:

| Task                      | This model | Original   | Better     |
|---------------------------|------------|------------|------------|
| tinyArc (acc_norm)        | **0.4253** | 0.4145     | This model |
| tinyGSM8k (exact_match)   | **0.3768** | 0.3412     | This model |
| tinyHellaswag (acc_norm)  | **0.5379** | 0.5335     | This model |
| tinyMMLU (acc_norm)       | **0.4483** | 0.4298     | This model |
| tinyTruthfulQA (acc)      | 0.4217     | **0.4288** | Original   |
| tinyWinogrande (acc_norm) | 0.5366     | 0.5366     | Tie        |

### Observations

1. **This model outperforms the original on four tasks** (tinyArc, tinyGSM8k, tinyHellaswag, tinyMMLU).
2. **The original outperforms this model on one task** (tinyTruthfulQA).
3. One task is a **tie** (tinyWinogrande).

Given these comparisons, **this model's results are stronger overall**, since it scores higher on the majority of tasks. The only exception is tinyTruthfulQA, where the original scores slightly better; on tinyWinogrande the two models tie.
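
The per-task comparison above is simple arithmetic over the two tables; as a small self-contained check, it can be recomputed from the reported scores:

```python
# Scores copied verbatim from the two benchmark tables above:
# (this model, original meta-llama/Llama-3.2-1B-Instruct)
scores = {
    "tinyArc (acc_norm)":        (0.4253, 0.4145),
    "tinyGSM8k (exact_match)":   (0.3768, 0.3412),
    "tinyHellaswag (acc_norm)":  (0.5379, 0.5335),
    "tinyMMLU (acc_norm)":       (0.4483, 0.4298),
    "tinyTruthfulQA (acc)":      (0.4217, 0.4288),
    "tinyWinogrande (acc_norm)": (0.5366, 0.5366),
}

for task, (ours, orig) in scores.items():
    delta = ours - orig
    better = "tie" if delta == 0 else ("this model" if delta > 0 else "original")
    print(f"{task:28s} {ours:.4f} vs {orig:.4f}  delta={delta:+.4f}  better: {better}")
```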