	---
	language:
	- en
	license: llama3
	tags:
	- axolotl
	base_model: meta-llama/Meta-Llama-3-8B
	datasets:
	- BEE-spoke-data/KI-smorgasbord_fw-small
	pipeline_tag: text-generation
	model-index:
	- name: Llama-3-6.3b-v0.1
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: IFEval (0-Shot)
	      type: HuggingFaceH4/ifeval
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: inst_level_strict_acc and prompt_level_strict_acc
	      value: 10.44
	      name: strict accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Llama-3-6.3b-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: BBH (3-Shot)
	      type: BBH
	      args:
	        num_few_shot: 3
	    metrics:
	    - type: acc_norm
	      value: 18.68
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Llama-3-6.3b-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MATH Lvl 5 (4-Shot)
	      type: hendrycks/competition_math
	      args:
	        num_few_shot: 4
	    metrics:
	    - type: exact_match
	      value: 1.51
	      name: exact match
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Llama-3-6.3b-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GPQA (0-shot)
	      type: Idavidrein/gpqa
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 4.47
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Llama-3-6.3b-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MuSR (0-shot)
	      type: TAUR-Lab/MuSR
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 6.15
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Llama-3-6.3b-v0.1
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU-PRO (5-shot)
	      type: TIGER-Lab/MMLU-Pro
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 20.44
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Llama-3-6.3b-v0.1
	      name: Open LLM Leaderboard
	---



	# Llama-3-6.3b-v0.1

	This is a layer pruning experiment based on the original Llama-3-8B:

	- 8 layers pruned with [PruneMe](https://github.com/pszemraj/PruneMe/tree/upgrades)/MergeKit
	- layers selected using [BEE-spoke-data/fineweb-100k_en-med](https://hf.co/datasets/BEE-spoke-data/fineweb-100k_en-med)
	- brief subsequent continued pretraining @ ctx 4096
	- data: 10k rows of FineWeb (different from the pruning data) + some curated data
	- wandb [here](https://wandb.ai/pszemraj/llama3-pruning)
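
The "6.3b" in the model name can be checked with simple layer arithmetic. The sketch below counts parameters for a Llama-style decoder before and after dropping 8 layers. It assumes the publicly documented Llama-3-8B dimensions (32 layers, hidden size 4096, MLP size 14336, 32 query heads with 8 KV heads, vocabulary 128256, untied input/output embeddings); none of these values are stated in this card, and `llama_param_count` is an illustrative helper, not part of PruneMe or MergeKit.

```python
def llama_param_count(
    num_layers: int,
    hidden: int = 4096,         # assumed: Llama-3-8B hidden size
    intermediate: int = 14336,  # assumed: MLP inner size
    num_heads: int = 32,        # assumed: number of query heads
    num_kv_heads: int = 8,      # assumed: grouped-query KV heads
    vocab: int = 128256,        # assumed: tokenizer vocabulary size
) -> int:
    """Count weights in a Llama-style decoder (no biases, untied embeddings)."""
    head_dim = hidden // num_heads
    kv_dim = num_kv_heads * head_dim
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * hidden * intermediate                   # gate, up, down projections
    norms = 2 * hidden                                # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * vocab * hidden                   # input embeddings + lm_head
    final_norm = hidden
    return num_layers * per_layer + embeddings + final_norm


full = llama_param_count(32)        # assumed original depth
pruned = llama_param_count(32 - 8)  # 8 layers pruned, as listed above
print(f"{full / 1e9:.2f}B -> {pruned / 1e9:.2f}B")  # 8.03B -> 6.29B
```

Under these assumptions each decoder layer holds about 218M parameters, so removing 8 of them accounts for the drop from roughly 8.0B to roughly 6.3B.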

	## quick eval


	hf (pretrained=pszemraj/Llama-3-6.3b-v0.1,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

	|    Tasks     |Version|Filter|n-shot|  Metric  |Value |   |Stderr|
	|--------------|------:|------|-----:|----------|-----:|---|-----:|
	|arc_easy      |      1|none  |     0|acc       |0.7109|±  |0.0093|
	|              |       |none  |     0|acc_norm  |0.6843|±  |0.0095|
	|boolq         |      2|none  |     0|acc       |0.7920|±  |0.0071|
	|lambada_openai|      1|none  |     0|perplexity|4.5411|±  |0.1073|
	|              |       |none  |     0|acc       |0.6734|±  |0.0065|
	|openbookqa    |      1|none  |     0|acc       |0.3000|±  |0.0205|
	|              |       |none  |     0|acc_norm  |0.4140|±  |0.0220|
	|piqa          |      1|none  |     0|acc       |0.7443|±  |0.0102|
	|              |       |none  |     0|acc_norm  |0.7530|±  |0.0101|
	|winogrande    |      1|none  |     0|acc       |0.7127|±  |0.0127|
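
A run like the one above could probably be reproduced from Python with EleutherAI's lm-evaluation-harness. This is a minimal sketch, assuming a v0.4-style install that exposes `lm_eval.simple_evaluate`; the exact harness version used for the table is not stated here. The task list and model arguments are copied from the header line and table, while `build_model_args` and `run_quick_eval` are illustrative helper names.

```python
# Tasks taken from the table above.
QUICK_EVAL_TASKS = [
    "arc_easy",
    "boolq",
    "lambada_openai",
    "openbookqa",
    "piqa",
    "winogrande",
]


def build_model_args(
    repo_id: str = "pszemraj/Llama-3-6.3b-v0.1", dtype: str = "bfloat16"
) -> str:
    """Build the comma-separated model_args string shown in the eval header."""
    return f"pretrained={repo_id},trust_remote_code=True,dtype={dtype}"


def run_quick_eval() -> dict:
    """Download the model and run the zero-shot tasks (needs a GPU and network)."""
    import lm_eval  # imported lazily so the helpers above work without it

    output = lm_eval.simple_evaluate(
        model="hf",
        model_args=build_model_args(),
        tasks=QUICK_EVAL_TASKS,
        batch_size=1,
    )
    return output["results"]
```

Calling `run_quick_eval()` returns a per-task dictionary of metrics. Small differences from the table are possible across harness versions and hardware.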


	## Details

	[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
	<details><summary>See axolotl config</summary>

	axolotl version: `0.4.0`
	```yaml
	base_model: pszemraj/llama-3-prune_8
	model_type: LlamaForCausalLM
	tokenizer_type: AutoTokenizer

	strict: false
	seed: 80085

	# dataset
	datasets:
	  - path: BEE-spoke-data/KI-smorgasbord_fw-small
	    type: completion # format from earlier
	    field: text # Optional[str] default: text, field to use for completion data
	val_set_size: 0.015

	sequence_len: 4096
	sample_packing: true
	pad_to_sequence_len: false
	train_on_inputs: false
	group_by_length: false

	# WANDB
	wandb_project: llama3-pruning
	wandb_entity: pszemraj
	wandb_watch: gradients
	wandb_name: Llama-3-6.3b-v0.1
	hub_model_id: pszemraj/Llama-3-6.3b-v0.1
	hub_strategy: every_save

	gradient_accumulation_steps: 16
	micro_batch_size: 1
	num_epochs: 1
	optimizer: adamw_torch_fused # paged_adamw_32bit
	weight_decay: 0.05
	lr_scheduler: cosine
	learning_rate: 4e-5
	warmup_ratio: 0.1

	load_in_8bit: false
	load_in_4bit: false
	bfloat16: true
	tf32: true

	flash_attention: true
	torch_compile: true # requires >= torch 2.0, may sometimes cause problems
	torch_compile_backend: inductor # Optional[str]
	gradient_checkpointing: true
	gradient_checkpointing_kwargs:
	  use_reentrant: false

	# hyperparams for freq of evals, saving, etc
	evals_per_epoch: 5
	saves_per_epoch: 3
	save_safetensors: true
	save_total_limit: 1
	output_dir: ./output-axolotl/output-model-6.3b
	logging_steps: 8

	deepspeed:

	special_tokens:
	  pad_token: <|end_of_text|>

	```

	</details><br>
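
As a rough guide to the batch settings in the config, the sketch below works out how many tokens feed each optimizer step. It assumes a single device, since the GPU count is not stated in this card, and treats `sequence_len` as an upper bound on tokens per packed sample. `tokens_per_optimizer_step` is an illustrative helper, not an axolotl function.

```python
def tokens_per_optimizer_step(
    micro_batch_size: int,
    gradient_accumulation_steps: int,
    sequence_len: int,
    num_devices: int = 1,  # assumption: device count is not given in the card
) -> int:
    """Upper bound on tokens per optimizer step with fully packed sequences."""
    return micro_batch_size * gradient_accumulation_steps * sequence_len * num_devices


# Values copied from the axolotl config above.
tokens = tokens_per_optimizer_step(
    micro_batch_size=1,
    gradient_accumulation_steps=16,
    sequence_len=4096,
)
print(tokens)  # 65536
```

With more than one device the figure scales linearly, so the true global batch may be a multiple of this number.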

	### Training results

	| Training Loss | Epoch  | Step | Validation Loss |
	|:-------------:|:------:|:----:|:---------------:|
	| No log        | 0.0006 | 1    | 7.8100          |
	| 2.2782        | 0.2002 | 320  | 2.3728          |
	| 2.2699        | 0.4004 | 640  | 2.3265          |
	| 2.3761        | 0.6006 | 960  | 2.2849          |
	| 2.2448        | 0.8008 | 1280 | 2.2702          |
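
The validation losses are easier to compare as perplexities. The sketch below converts them, assuming the logged loss is the mean per-token cross-entropy in nats (the usual convention for Hugging Face trainers, but not confirmed in this card).

```python
import math

# Validation loss by step, copied from the table above.
val_loss_by_step = {
    1: 7.8100,
    320: 2.3728,
    640: 2.3265,
    960: 2.2849,
    1280: 2.2702,
}

# Perplexity is exp(loss) when loss is mean cross-entropy in nats (assumed).
perplexity_by_step = {
    step: math.exp(loss) for step, loss in val_loss_by_step.items()
}

for step, ppl in perplexity_by_step.items():
    print(f"step {step:>4}: val loss {val_loss_by_step[step]:.4f} -> perplexity {ppl:.2f}")
```

Under that assumption, the final checkpoint in the table sits at a validation perplexity of about 9.7, down from the thousands immediately after pruning.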

	---
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_pszemraj__Llama-3-6.3b-v0.1)

	|      Metric       |Value|
	|-------------------|----:|
	|Avg.               |10.28|
	|IFEval (0-Shot)    |10.44|
	|BBH (3-Shot)       |18.68|
	|MATH Lvl 5 (4-Shot)| 1.51|
	|GPQA (0-shot)      | 4.47|
	|MuSR (0-shot)      | 6.15|
	|MMLU-PRO (5-shot)  |20.44|
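
The `Avg.` row appears to be the unweighted mean of the six benchmark scores. A quick check, assuming that is how the leaderboard aggregates them:

```python
from statistics import mean

# Scores copied from the table above.
scores = {
    "IFEval (0-Shot)": 10.44,
    "BBH (3-Shot)": 18.68,
    "MATH Lvl 5 (4-Shot)": 1.51,
    "GPQA (0-shot)": 4.47,
    "MuSR (0-shot)": 6.15,
    "MMLU-PRO (5-shot)": 20.44,
}

average = round(mean(scores.values()), 2)
print(average)  # 10.28
```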