---
datasets:
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
library_name: transformers
---

# This model's benchmark results

| Tasks            |Version| Filter         |n-shot| Metric    |   |Value |   |Stderr|
|------------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks    |    N/A|                |      |           |   |      |   |      |
| - tinyArc        |      0|none            |    25|acc_norm   |↑  |0.4253|±  |   N/A|
| - tinyGSM8k      |      0|flexible-extract|     5|exact_match|↑  |0.3768|±  |   N/A|
|                  |       |strict-match    |     5|exact_match|↑  |0.3768|±  |   N/A|
| - tinyHellaswag  |      0|none            |    10|acc_norm   |↑  |0.5379|±  |   N/A|
| - tinyMMLU       |      0|none            |     0|acc_norm   |↑  |0.4483|±  |   N/A|
| - tinyTruthfulQA |      0|none            |     0|acc        |↑  |0.4217|±  |   N/A|
| - tinyWinogrande |      0|none            |     5|acc_norm   |↑  |0.5366|±  |   N/A|
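
These tables follow the output format of EleutherAI's `lm-evaluation-harness`. As a minimal sketch (assuming `lm-eval` with its `tinyBenchmarks` task group is installed; the checkpoint path below is a placeholder, since this repository's ID is not shown here), a run like this produces numbers in the same format:

```python
# Minimal sketch: run the tinyBenchmarks suite with EleutherAI's lm-evaluation-harness.
# Assumes `pip install lm-eval`; "path/to/this-model" is a placeholder checkpoint path.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face transformers backend
    model_args="pretrained=path/to/this-model",  # placeholder; swap in the real repo ID
    tasks=["tinyBenchmarks"],                    # group: tinyArc, tinyGSM8k, tinyHellaswag,
                                                 # tinyMMLU, tinyTruthfulQA, tinyWinogrande
    batch_size=8,
)

# Per-task metrics (acc_norm, exact_match, ...), as shown in the table above.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The n-shot values in the tables (25 for tinyArc, 10 for tinyHellaswag, 5 for tinyGSM8k and tinyWinogrande, 0 elsewhere) appear to be the task-group defaults, so they do not need to be passed explicitly.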

# Original `meta-llama/Llama-3.2-1B-Instruct` benchmark results

| Tasks            |Version| Filter         |n-shot| Metric    |   |Value |   |Stderr|
|------------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks    |    N/A|                |      |           |   |      |   |      |
| - tinyArc        |      0|none            |    25|acc_norm   |↑  |0.4145|±  |   N/A|
| - tinyGSM8k      |      0|flexible-extract|     5|exact_match|↑  |0.3412|±  |   N/A|
|                  |       |strict-match    |     5|exact_match|↑  |0.3412|±  |   N/A|
| - tinyHellaswag  |      0|none            |    10|acc_norm   |↑  |0.5335|±  |   N/A|
| - tinyMMLU       |      0|none            |     0|acc_norm   |↑  |0.4298|±  |   N/A|
| - tinyTruthfulQA |      0|none            |     0|acc        |↑  |0.4288|±  |   N/A|
| - tinyWinogrande |      0|none            |     5|acc_norm   |↑  |0.5366|±  |   N/A|
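
Since the card declares `library_name: transformers`, the checkpoint loads like any other Llama-architecture causal LM. A short usage sketch follows; the repo ID in it is a hypothetical placeholder, not this model's actual ID:

```python
# Sketch: load and query the model with transformers.
# "user/model-id" is a hypothetical placeholder for this repository's ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "user/model-id"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Llama 3.2 Instruct checkpoints ship a chat template; apply it before generating.
messages = [{"role": "user", "content": "What is 17 * 23?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```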

Below is a side-by-side comparison of the two result sets. For each task, the higher (i.e., better) value is highlighted in **bold**:

| Task                      | This model | Original   | Better?    |
|---------------------------|------------|------------|------------|
| tinyArc (acc_norm)        | **0.4253** | 0.4145     | this model |
| tinyGSM8k (exact_match)   | **0.3768** | 0.3412     | this model |
| tinyHellaswag (acc_norm)  | **0.5379** | 0.5335     | this model |
| tinyMMLU (acc_norm)       | **0.4483** | 0.4298     | this model |
| tinyTruthfulQA (acc)      | 0.4217     | **0.4288** | original   |
| tinyWinogrande (acc_norm) | 0.5366     | 0.5366     | tie        |
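
As a quick sanity check, the per-task winner (and an overall average) can be recomputed from the values above; this sketch simply hard-codes the numbers from the two tables:

```python
# Sketch: recompute the per-task winner and the average score delta.
# Scores are copied verbatim from the two benchmark tables above.
scores = {
    "tinyArc (acc_norm)":        (0.4253, 0.4145),
    "tinyGSM8k (exact_match)":   (0.3768, 0.3412),
    "tinyHellaswag (acc_norm)":  (0.5379, 0.5335),
    "tinyMMLU (acc_norm)":       (0.4483, 0.4298),
    "tinyTruthfulQA (acc)":      (0.4217, 0.4288),
    "tinyWinogrande (acc_norm)": (0.5366, 0.5366),
}

for task, (this, orig) in scores.items():
    verdict = "this model" if this > orig else "original" if orig > this else "tie"
    print(f"{task:27s} {this:.4f} vs {orig:.4f} -> {verdict}")

# Average difference across the six tasks; positive favors this model.
avg_delta = sum(this - orig for this, orig in scores.values()) / len(scores)
print(f"average delta: {avg_delta:+.4f}")  # ≈ +0.0104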

Averaged over the six tasks, this model scores about 0.01 higher, consistent with the observations below.

### Observations
1. **This model outperforms the original on four tasks** (tinyArc, tinyGSM8k, tinyHellaswag, tinyMMLU).
2. **The original outperforms this model on one task** (tinyTruthfulQA).
3. One task is a **tie** (tinyWinogrande).

Given these comparisons, **this model's results are stronger overall**, with higher scores on four of the six tasks. The only exceptions are tinyTruthfulQA, where the original scores slightly better, and tinyWinogrande, where the two are tied.