---
license: llama3
inference: false
---

# Description
4-bit quantization of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) using GPTQ. We use the config below for quantization and evaluation, with [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) as the calibration data. The code is available in [this repository](https://github.com/IST-DASLab/marlin/tree/2f6d7c10e124b3c5fa29ff8d77d568bd7af3274c/gptq).

```yaml
bits: 4
damp_percent: 0.01
desc_act: true
exllama_config:
  version: 2
group_size: 128
quant_method: gptq
static_groups: false
sym: true
true_sequential: true
```
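
To make `bits`, `group_size`, and `sym` concrete, here is a minimal sketch of symmetric 4-bit group-wise quantization using plain round-to-nearest. It is not GPTQ: GPTQ additionally corrects rounding error with second-order statistics from the calibration data, and the `desc_act`, `true_sequential`, and `damp_percent` options are not modeled here. The function names and the midpoint zero-point convention are assumptions made for illustration.

```python
# Illustrative round-to-nearest sketch, NOT the GPTQ algorithm itself.
BITS = 4                 # `bits` from the config
GROUP_SIZE = 128         # `group_size` from the config
MAXQ = 2 ** BITS - 1     # largest integer code: 15
ZERO = (MAXQ + 1) // 2   # assumed fixed midpoint zero-point for `sym: true`: 8


def quantize_group(weights):
    """Quantize one group of floats; returns (integer codes, scale)."""
    absmax = max(abs(w) for w in weights)
    if absmax == 0.0:
        # All-zero group: every weight maps to the zero-point.
        return [ZERO] * len(weights), 1.0
    # Symmetric grid: the range [-absmax, absmax] is spread over MAXQ steps.
    scale = 2.0 * absmax / MAXQ
    codes = [min(MAXQ, max(0, round(w / scale) + ZERO)) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [scale * (c - ZERO) for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into consecutive groups, one scale per group."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


# Hypothetical usage: a 256-element row yields two groups of 128 codes.
example_row = [0.01 * ((i % 17) - 8) for i in range(256)]
example_groups = quantize_row(example_row)
print(len(example_groups), "groups")
```

Each group stores 128 four-bit codes plus one scale, so a smaller `group_size` tracks local weight ranges more closely at the cost of storing more scales.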

## Evaluations

Below is a comprehensive evaluation of this model and a comparison with [casperhansen/llama-3-8b-instruct-awq](https://huggingface.co/casperhansen/llama-3-8b-instruct-awq), both obtained using [mosaicml/llm-foundry](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval).

| model_name | core_average | world_knowledge | commonsense_reasoning | language_understanding | symbolic_problem_solving | reading_comprehension |
| :---------------------------------------- | -----------: | --------------: | --------------------: | ---------------------: | -----------------------: | --------------------: |
| ISTA-DASLab/Llama-3-8B-Instruct-GPTQ-4bit | 0.552944 | 0.584061 | 0.547598 | 0.663904 | 0.431017 | 0.538141 |
| casperhansen/llama-3-8b-instruct-awq | 0.531504 | 0.557663 | 0.528201 | 0.657211 | 0.391476 | 0.522971 |
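
The reported `core_average` values are consistent with an unweighted mean of the five category scores. The short check below recomputes them from the numbers in the summary table; the dictionary layout and the `core_average` helper are illustrative assumptions, not part of llm-foundry.

```python
from statistics import mean

# Category scores copied from the summary table.
CATEGORY_SCORES = {
    "ISTA-DASLab/Llama-3-8B-Instruct-GPTQ-4bit": {
        "world_knowledge": 0.584061,
        "commonsense_reasoning": 0.547598,
        "language_understanding": 0.663904,
        "symbolic_problem_solving": 0.431017,
        "reading_comprehension": 0.538141,
    },
    "casperhansen/llama-3-8b-instruct-awq": {
        "world_knowledge": 0.557663,
        "commonsense_reasoning": 0.528201,
        "language_understanding": 0.657211,
        "symbolic_problem_solving": 0.391476,
        "reading_comprehension": 0.522971,
    },
}

# core_average values as reported in the same table.
REPORTED_CORE_AVERAGE = {
    "ISTA-DASLab/Llama-3-8B-Instruct-GPTQ-4bit": 0.552944,
    "casperhansen/llama-3-8b-instruct-awq": 0.531504,
}


def core_average(scores):
    """Unweighted mean over the category scores (assumed aggregation)."""
    return mean(scores.values())


for name, scores in CATEGORY_SCORES.items():
    recomputed = core_average(scores)
    print(f"{name}: recomputed={recomputed:.6f} reported={REPORTED_CORE_AVERAGE[name]:.6f}")
```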

| Category | Benchmark | Subtask | GPTQ accuracy | AWQ accuracy | Few-shot setting |
| :----------------------- | :--------------------------- | :---------------------------------- | ------------: | -----------: | :-------------- |
| symbolic_problem_solving | gsm8k | | 0.721759 | 0.59818 | 0-shot |
| commonsense_reasoning | copa | | 0.85 | 0.84 | 0-shot |
| commonsense_reasoning | commonsense_qa | | 0.78706 | 0.782146 | 0-shot |
| commonsense_reasoning | piqa | | 0.784004 | 0.781828 | 0-shot |
| commonsense_reasoning | bigbench_strange_stories | | 0.764368 | 0.752874 | 0-shot |
| commonsense_reasoning | bigbench_strategy_qa | | 0.680647 | 0.659677 | 0-shot |
| language_understanding | lambada_openai | | 0.716476 | 0.717834 | 0-shot |
| language_understanding | hellaswag | | 0.750647 | 0.753137 | 0-shot |
| reading_comprehension | coqa | | 0.198797 | 0.109733 | 0-shot |
| reading_comprehension | boolq | | 0.8263 | 0.836391 | 0-shot |
| world_knowledge | triviaqa_sm_sub | | 0.590667 | 0.511333 | 3-shot |
| world_knowledge | jeopardy | Average | 0.4975 | 0.489451 | 3-shot |
| world_knowledge | | american_history | 0.535109 | 0.544794 | 3-shot |
| world_knowledge | | literature | 0.622449 | 0.626531 | 3-shot |
| world_knowledge | | science | 0.420168 | 0.390756 | 3-shot |
| world_knowledge | | word_origins | 0.293151 | 0.271233 | 3-shot |
| world_knowledge | | world_history | 0.616622 | 0.613941 | 3-shot |
| world_knowledge | bigbench_qa_wikidata | | 0.684366 | 0.644358 | 3-shot |
| world_knowledge | arc_easy | | 0.808923 | 0.808081 | 3-shot |
| world_knowledge | arc_challenge | | 0.571672 | 0.571672 | 3-shot |
| commonsense_reasoning | siqa | | 0.827533 | 0.814227 | 3-shot |
| language_understanding | winograd | | 0.871795 | 0.860806 | 3-shot |
| symbolic_problem_solving | bigbench_operators | | 0.547619 | 0.552381 | 3-shot |
| reading_comprehension | squad | | 0.581552 | 0.58789 | 3-shot |
| symbolic_problem_solving | svamp | | 0.68 | 0.57 | 5-shot |
| world_knowledge | mmlu | Average | 0.668279 | 0.645874 | 5-shot |
| world_knowledge | | abstract_algebra | 0.29 | 0.33 | 5-shot |
| world_knowledge | | anatomy | 0.681481 | 0.651852 | 5-shot |
| world_knowledge | | astronomy | 0.703947 | 0.671053 | 5-shot |
| world_knowledge | | business_ethics | 0.67 | 0.68 | 5-shot |
| world_knowledge | | clinical_knowledge | 0.750943 | 0.701887 | 5-shot |
| world_knowledge | | college_biology | 0.784722 | 0.729167 | 5-shot |
| world_knowledge | | college_chemistry | 0.47 | 0.46 | 5-shot |
| world_knowledge | | college_computer_science | 0.56 | 0.54 | 5-shot |
| world_knowledge | | college_mathematics | 0.36 | 0.28 | 5-shot |
| world_knowledge | | college_medicine | 0.653179 | 0.635838 | 5-shot |
| world_knowledge | | college_physics | 0.5 | 0.431373 | 5-shot |
| world_knowledge | | computer_security | 0.78 | 0.75 | 5-shot |
| world_knowledge | | conceptual_physics | 0.548936 | 0.557447 | 5-shot |
| world_knowledge | | econometrics | 0.45614 | 0.482456 | 5-shot |
| world_knowledge | | electrical_engineering | 0.668966 | 0.586207 | 5-shot |
| world_knowledge | | elementary_mathematics | 0.439153 | 0.417989 | 5-shot |
| world_knowledge | | formal_logic | 0.47619 | 0.412698 | 5-shot |
| world_knowledge | | global_facts | 0.37 | 0.41 | 5-shot |
| world_knowledge | | high_school_biology | 0.790323 | 0.754839 | 5-shot |
| world_knowledge | | high_school_chemistry | 0.581281 | 0.507389 | 5-shot |
| world_knowledge | | high_school_computer_science | 0.71 | 0.74 | 5-shot |
| world_knowledge | | high_school_european_history | 0.745455 | 0.775758 | 5-shot |
| world_knowledge | | high_school_geography | 0.823232 | 0.823232 | 5-shot |
| world_knowledge | | high_school_government_and_politics | 0.917098 | 0.875648 | 5-shot |
| world_knowledge | | high_school_macroeconomics | 0.635897 | 0.620513 | 5-shot |
| world_knowledge | | high_school_mathematics | 0.407407 | 0.392593 | 5-shot |
| world_knowledge | | high_school_microeconomics | 0.726891 | 0.714286 | 5-shot |
| world_knowledge | | high_school_physics | 0.423841 | 0.410596 | 5-shot |
| world_knowledge | | high_school_psychology | 0.842202 | 0.838532 | 5-shot |
| world_knowledge | | high_school_statistics | 0.592593 | 0.513889 | 5-shot |
| world_knowledge | | high_school_us_history | 0.852941 | 0.852941 | 5-shot |
| world_knowledge | | high_school_world_history | 0.843882 | 0.831224 | 5-shot |
| world_knowledge | | human_aging | 0.717489 | 0.713004 | 5-shot |
| world_knowledge | | human_sexuality | 0.763359 | 0.70229 | 5-shot |
| world_knowledge | | international_law | 0.793388 | 0.77686 | 5-shot |
| world_knowledge | | jurisprudence | 0.814815 | 0.768519 | 5-shot |
| world_knowledge | | logical_fallacies | 0.754601 | 0.773006 | 5-shot |
| world_knowledge | | machine_learning | 0.553571 | 0.508929 | 5-shot |
| world_knowledge | | management | 0.84466 | 0.834951 | 5-shot |
| world_knowledge | | marketing | 0.92735 | 0.888889 | 5-shot |
| world_knowledge | | medical_genetics | 0.81 | 0.78 | 5-shot |
| world_knowledge | | miscellaneous | 0.825032 | 0.799489 | 5-shot |
| world_knowledge | | moral_disputes | 0.739884 | 0.722543 | 5-shot |
| world_knowledge | | moral_scenarios | 0.437989 | 0.38324 | 5-shot |
| world_knowledge | | nutrition | 0.764706 | 0.735294 | 5-shot |
| world_knowledge | | philosophy | 0.733119 | 0.713826 | 5-shot |
| world_knowledge | | prehistory | 0.719136 | 0.719136 | 5-shot |
| world_knowledge | | professional_accounting | 0.475177 | 0.485816 | 5-shot |
| world_knowledge | | professional_law | 0.480443 | 0.449153 | 5-shot |
| world_knowledge | | professional_medicine | 0.709559 | 0.676471 | 5-shot |
| world_knowledge | | professional_psychology | 0.694444 | 0.676471 | 5-shot |
| world_knowledge | | public_relations | 0.7 | 0.6 | 5-shot |
| world_knowledge | | security_studies | 0.730612 | 0.718367 | 5-shot |
| world_knowledge | | sociology | 0.830846 | 0.845771 | 5-shot |
| world_knowledge | | us_foreign_policy | 0.86 | 0.85 | 5-shot |
| world_knowledge | | virology | 0.542169 | 0.518072 | 5-shot |
| world_knowledge | | world_religions | 0.812865 | 0.795322 | 5-shot |
| symbolic_problem_solving | bigbench_dyck_languages | | 0.086 | 0.045 | 5-shot |
| language_understanding | winogrande | | 0.764009 | 0.759274 | 5-shot |
| symbolic_problem_solving | agi_eval_lsat_ar | | 0.3 | 0.278261 | 5-shot |
| symbolic_problem_solving | simple_arithmetic_nospaces | | 0.466 | 0.458 | 5-shot |
| symbolic_problem_solving | simple_arithmetic_withspaces | | 0.502 | 0.496 | 5-shot |
| reading_comprehension | agi_eval_lsat_rc | | 0.731343 | 0.708955 | 5-shot |
| reading_comprehension | agi_eval_lsat_lr | | 0.554902 | 0.560784 | 5-shot |
| reading_comprehension | agi_eval_sat_en | | 0.81068 | 0.805825 | 5-shot |
| world_knowledge | arc_challenge | | 0.582765 | 0.591297 | 25-shot |
| commonsense_reasoning | openbook_qa | | 0.478 | 0.468 | 10-shot |
| language_understanding | hellaswag | | 0.769468 | 0.771062 | 10-shot |
| | bigbench_cs_algorithms | | 0.715151 | 0.687879 | 10-shot |
| symbolic_problem_solving | bigbench_elementary_math_qa | | 0.533569 | 0.530922 | 1-shot |
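
To see where the two quantized models differ most, the per-benchmark rows can be parsed and ranked by the GPTQ minus AWQ accuracy gap. The sketch below does this for four rows copied from the table; `parse_row` and `summarize` are hypothetical helper names, and the same approach should extend to the full table.

```python
# Four sample rows copied from the per-benchmark table.
SAMPLE_ROWS = """\
| symbolic_problem_solving | gsm8k | | 0.721759 | 0.59818 | 0-shot |
| reading_comprehension | coqa | | 0.198797 | 0.109733 | 0-shot |
| reading_comprehension | boolq | | 0.8263 | 0.836391 | 0-shot |
| world_knowledge | arc_challenge | | 0.571672 | 0.571672 | 3-shot |
"""


def parse_row(line):
    """Split one markdown table row into a dict with a computed delta."""
    # Drop the outer pipes, then split on the inner ones.
    cells = [cell.strip() for cell in line.strip()[1:-1].split("|")]
    category, benchmark, subtask, gptq, awq, shots = cells
    return {
        "category": category,
        "benchmark": benchmark,
        "subtask": subtask,
        "gptq": float(gptq),
        "awq": float(awq),
        "shots": shots,
        "delta": float(gptq) - float(awq),  # positive means GPTQ is higher
    }


def summarize(rows):
    """Count rows where GPTQ is higher, AWQ is higher, or both are equal."""
    return {
        "gptq_higher": sum(r["delta"] > 0 for r in rows),
        "awq_higher": sum(r["delta"] < 0 for r in rows),
        "ties": sum(r["delta"] == 0 for r in rows),
    }


rows = [parse_row(line) for line in SAMPLE_ROWS.splitlines() if line.strip()]
for row in sorted(rows, key=lambda r: r["delta"], reverse=True):
    print(f'{row["benchmark"]:<14} {row["shots"]:<7} delta={row["delta"]:+.6f}')
print(summarize(rows))
```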