	---
	license: apache-2.0
	datasets:
	- OpenRLHF/prompt-collection-v0.1
	base_model:
	- meta-llama/Llama-3.2-1B-Instruct
	library_name: transformers
	---

	# This model's benchmark results

	| Tasks            | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
	|------------------|---------|------------------|-------:|-------------|---|-------:|---|--------|
	| tinyBenchmarks   | N/A     |                  |        |             |   |        |   |        |
	| - tinyArc        | 0       | none             |     25 | acc_norm    | ↑ | 0.4253 | ± | N/A    |
	| - tinyGSM8k      | 0       | flexible-extract |      5 | exact_match | ↑ | 0.3768 | ± | N/A    |
	|                  |         | strict-match     |      5 | exact_match | ↑ | 0.3768 | ± | N/A    |
	| - tinyHellaswag  | 0       | none             |     10 | acc_norm    | ↑ | 0.5379 | ± | N/A    |
	| - tinyMMLU       | 0       | none             |      0 | acc_norm    | ↑ | 0.4483 | ± | N/A    |
	| - tinyTruthfulQA | 0       | none             |      0 | acc         | ↑ | 0.4217 | ± | N/A    |
	| - tinyWinogrande | 0       | none             |      5 | acc_norm    | ↑ | 0.5366 | ± | N/A    |

	# Original `meta-llama/Llama-3.2-1B-Instruct` benchmark results

	| Tasks            | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
	|------------------|---------|------------------|-------:|-------------|---|-------:|---|--------|
	| tinyBenchmarks   | N/A     |                  |        |             |   |        |   |        |
	| - tinyArc        | 0       | none             |     25 | acc_norm    | ↑ | 0.4145 | ± | N/A    |
	| - tinyGSM8k      | 0       | flexible-extract |      5 | exact_match | ↑ | 0.3412 | ± | N/A    |
	|                  |         | strict-match     |      5 | exact_match | ↑ | 0.3412 | ± | N/A    |
	| - tinyHellaswag  | 0       | none             |     10 | acc_norm    | ↑ | 0.5335 | ± | N/A    |
	| - tinyMMLU       | 0       | none             |      0 | acc_norm    | ↑ | 0.4298 | ± | N/A    |
	| - tinyTruthfulQA | 0       | none             |      0 | acc         | ↑ | 0.4288 | ± | N/A    |
	| - tinyWinogrande | 0       | none             |      5 | acc_norm    | ↑ | 0.5366 | ± | N/A    |

	Below is a side-by-side comparison of the two result sets. For each task, the higher value (i.e., "better" on that metric) is highlighted in bold:

	| Task                      | This model | Original   | Better?    |
	|---------------------------|------------|------------|------------|
	| tinyArc (acc_norm)        | **0.4253** | 0.4145     | this model |
	| tinyGSM8k (exact_match)   | **0.3768** | 0.3412     | this model |
	| tinyHellaswag (acc_norm)  | **0.5379** | 0.5335     | this model |
	| tinyMMLU (acc_norm)       | **0.4483** | 0.4298     | this model |
	| tinyTruthfulQA (acc)      | 0.4217     | **0.4288** | original   |
	| tinyWinogrande (acc_norm) | 0.5366     | 0.5366     | tie        |

	### Observations
	1. This model outperforms the original on four tasks (tinyArc, tinyGSM8k, tinyHellaswag, tinyMMLU).
	2. The original outperforms this model on one task (tinyTruthfulQA).
	3. One task is a tie (tinyWinogrande).

	Given these comparisons, this model is stronger overall because it scores higher on four of the six tasks. The exceptions are tinyTruthfulQA, where the original scores slightly higher (0.4288 vs. 0.4217), and tinyWinogrande, where both models tie at 0.5366. No standard errors are reported (Stderr is N/A), so the statistical significance of these differences is not established.
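
The comparison above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original evaluation: the scores are copied by hand from the two tables, and the names `SCORES`, `compare`, and the `tol` tie threshold are hypothetical helpers introduced only for illustration. The unweighted mean difference is likewise an illustrative summary, not a metric reported by the benchmark.

```python
# Scores copied from the two result tables: (this model, original base model).
SCORES = {
    "tinyArc (acc_norm)": (0.4253, 0.4145),
    "tinyGSM8k (exact_match)": (0.3768, 0.3412),
    "tinyHellaswag (acc_norm)": (0.5379, 0.5335),
    "tinyMMLU (acc_norm)": (0.4483, 0.4298),
    "tinyTruthfulQA (acc)": (0.4217, 0.4288),
    "tinyWinogrande (acc_norm)": (0.5366, 0.5366),
}


def compare(scores, tol=1e-9):
    """Return (task, delta, verdict) rows; delta = this model minus original.

    `tol` is an assumed threshold below which two scores count as a tie.
    """
    rows = []
    for task, (this, orig) in scores.items():
        delta = this - orig
        if abs(delta) <= tol:
            verdict = "tie"
        elif delta > 0:
            verdict = "this"
        else:
            verdict = "orig"
        rows.append((task, round(delta, 4), verdict))
    return rows


rows = compare(SCORES)

# Count how many tasks each side wins, plus ties.
tally = {v: sum(1 for _, _, x in rows if x == v) for v in ("this", "orig", "tie")}

# Unweighted mean of the per-task differences (illustrative summary only).
mean_delta = round(sum(d for _, d, _ in rows) / len(rows), 4)

for task, delta, verdict in rows:
    print(f"{task:<28} {delta:+.4f}  {verdict}")
print(tally, mean_delta)
```

The tally matches the observations listed above: four tasks favor this model, one favors the original, and one is a tie. Because every tiny benchmark reports `N/A` for Stderr, the per-task differences should be read as point estimates only.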
