Update README.md
README.md
CHANGED
@@ -5,4 +5,48 @@ datasets:
base_model:
- meta-llama/Llama-3.2-1B-Instruct
library_name: transformers
---
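
The metadata above declares `transformers` as the library and `meta-llama/Llama-3.2-1B-Instruct` as the base model. As a minimal loading sketch (the repo id `your-org/your-finetuned-llama` is a hypothetical placeholder for wherever this fine-tune is published):

```python
# Minimal sketch; "your-org/your-finetuned-llama" is a hypothetical placeholder
# for this model's repo id. Requires `transformers`, `torch`, and `accelerate`
# (for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/your-finetuned-llama"  # placeholder, not a real repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "In one sentence, what does tinyMMLU measure?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```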
# This model's benchmark results

| Tasks           |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks   |    N/A|                |      |           |   |      |   |      |
| - tinyArc       |      0|none            |    25|acc_norm   |↑  |0.4253|±  |   N/A|
| - tinyGSM8k     |      0|flexible-extract|     5|exact_match|↑  |0.3768|±  |   N/A|
|                 |       |strict-match    |     5|exact_match|↑  |0.3768|±  |   N/A|
| - tinyHellaswag |      0|none            |    10|acc_norm   |↑  |0.5379|±  |   N/A|
| - tinyMMLU      |      0|none            |     0|acc_norm   |↑  |0.4483|±  |   N/A|
| - tinyTruthfulQA|      0|none            |     0|acc        |↑  |0.4217|±  |   N/A|
| - tinyWinogrande|      0|none            |     5|acc_norm   |↑  |0.5366|±  |   N/A|

# Original `meta-llama/Llama-3.2-1B-Instruct` benchmark results

| Tasks           |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks   |    N/A|                |      |           |   |      |   |      |
| - tinyArc       |      0|none            |    25|acc_norm   |↑  |0.4145|±  |   N/A|
| - tinyGSM8k     |      0|flexible-extract|     5|exact_match|↑  |0.3412|±  |   N/A|
|                 |       |strict-match    |     5|exact_match|↑  |0.3412|±  |   N/A|
| - tinyHellaswag |      0|none            |    10|acc_norm   |↑  |0.5335|±  |   N/A|
| - tinyMMLU      |      0|none            |     0|acc_norm   |↑  |0.4298|±  |   N/A|
| - tinyTruthfulQA|      0|none            |     0|acc        |↑  |0.4288|±  |   N/A|
| - tinyWinogrande|      0|none            |     5|acc_norm   |↑  |0.5366|±  |   N/A|

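Both tables follow the Markdown output format of EleutherAI's `lm-evaluation-harness`. The exact command used for these runs is not recorded in this card, but comparable tables can be regenerated with the `tinyBenchmarks` task group along the lines of the sketch below (`your-org/your-finetuned-llama` is again a placeholder, and the tiny tasks may need extra dependencies such as the `tinyBenchmarks` package):

```python
# Sketch only: assumes `pip install lm-eval` and that both checkpoints are
# accessible; the first repo id is a placeholder for this fine-tune.
import lm_eval
from lm_eval.utils import make_table

for repo_id in [
    "your-org/your-finetuned-llama",     # placeholder for this fine-tune
    "meta-llama/Llama-3.2-1B-Instruct",  # original base model
]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={repo_id},dtype=bfloat16",
        tasks=["tinyBenchmarks"],  # expands to tinyArc, tinyGSM8k, tinyHellaswag, ...
        batch_size=8,
    )
    print(repo_id)
    print(make_table(results))  # Markdown tables like the ones above
```
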
Below is a side-by-side comparison of the two result sets. For each task, the better value on that metric is highlighted in **bold**:

| Task                      | This model | Original   | Better     |
|---------------------------|------------|------------|------------|
| tinyArc (acc_norm)        | **0.4253** | 0.4145     | This model |
| tinyGSM8k (exact_match)   | **0.3768** | 0.3412     | This model |
| tinyHellaswag (acc_norm)  | **0.5379** | 0.5335     | This model |
| tinyMMLU (acc_norm)       | **0.4483** | 0.4298     | This model |
| tinyTruthfulQA (acc)      | 0.4217     | **0.4288** | Original   |
| tinyWinogrande (acc_norm) | 0.5366     | 0.5366     | Tie        |

### Observations

1. **This model outperforms the original on four tasks** (tinyArc, tinyGSM8k, tinyHellaswag, tinyMMLU).
2. **The original outperforms this model on one task** (tinyTruthfulQA).
3. One task is a **tie** (tinyWinogrande).

Given these comparisons, **this model's results are stronger overall**, since it scores higher on the majority of tasks. The only exception is tinyTruthfulQA, where the original scores slightly better; on tinyWinogrande the two models tie.
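
The per-task comparison above is simple arithmetic over the two tables; as a small self-contained check, it can be recomputed from the reported scores:

```python
# Scores copied verbatim from the two benchmark tables above:
# (this model, original meta-llama/Llama-3.2-1B-Instruct)
scores = {
    "tinyArc (acc_norm)":        (0.4253, 0.4145),
    "tinyGSM8k (exact_match)":   (0.3768, 0.3412),
    "tinyHellaswag (acc_norm)":  (0.5379, 0.5335),
    "tinyMMLU (acc_norm)":       (0.4483, 0.4298),
    "tinyTruthfulQA (acc)":      (0.4217, 0.4288),
    "tinyWinogrande (acc_norm)": (0.5366, 0.5366),
}

for task, (ours, orig) in scores.items():
    delta = ours - orig
    better = "tie" if delta == 0 else ("this model" if delta > 0 else "original")
    print(f"{task:28s} {ours:.4f} vs {orig:.4f}  delta={delta:+.4f}  better: {better}")
```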