---
license: apache-2.0
library_name: transformers
tags:
- dpo
datasets:
- argilla/distilabel-intel-orca-dpo-pairs
base_model: sethuiyer/Chikuma_10.7B
pipeline_tag: text-generation
model-index:
- name: distilabled_Chikuma_10.7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 66.38
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 85.14
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.7
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 59.2
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 79.4
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.38
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
---
# Chikuma_10.7B - V2 (Enhanced with DPO) [For Experiments]
<p align="center">
<img src="https://huggingface.co/sethuiyer/distilabled_Chikuma_10.7B/resolve/main/chikuma_v2.webp" height="256px" alt="Chikuma">
</p>
This model is the **DPO fine-tuned version** of [Chikuma_10.7B](https://huggingface.co/sethuiyer/Chikuma_10.7B), a depth-upscaled merge of:
* [sethuiyer/SynthIQ-7b](https://huggingface.co/sethuiyer/SynthIQ-7b)
* [openchat/openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106)
The name "Chikuma" is inspired by the [Chikuma River](https://en.wikipedia.org/wiki/Shinano_River), the longest in Japan, known for its continuous flow and meandering path.
This metaphorically represents the model's depth, fluidity, and adaptability in processing and understanding language.
# Dataset Used for Fine-Tuning
Dataset: [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs)
The filtered dataset contained roughly 3,000 samples, but they were high quality (as judged by `chosen_score`).
The following filters were applied to the original dataset:
```python
from datasets import load_dataset

# Load the preference-pair dataset used for DPO
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only decisive, high-quality preference pairs, and drop samples that
# overlap with the GSM8K training set (to avoid benchmark contamination)
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
```
# Chat Template
The chat template for Chikuma_10.7B - V2 is a modified version of ChatML, optimized for improved interaction and engagement:
```
<|im_start|>GPT4 Correct system:
{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>
<|im_start|>GPT4 Correct user:
{user}<|im_end|>
<|im_start|>GPT4 Correct Assistant:
{assistant}<|im_end|>
```
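For illustration, here is a minimal sketch of rendering this template by hand for a single turn; `format_chikuma_prompt` is a hypothetical helper, and in practice `tokenizer.apply_chat_template` (see Usage below) applies the template for you:

```python
def format_chikuma_prompt(system: str, user: str) -> str:
    """Render a single-turn conversation in Chikuma_10.7B-V2's modified ChatML format."""
    return (
        "<|im_start|>GPT4 Correct system:\n"
        f"{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>\n"
        "<|im_start|>GPT4 Correct user:\n"
        f"{user}<|im_end|>\n"
        "<|im_start|>GPT4 Correct Assistant:\n"
    )

print(format_chikuma_prompt("You are a helpful assistant chatbot.", "Who invented LLMs?"))
```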
## Nous Benchmark Evaluation
| Model                         | AGIEval   | GPT4All   | TruthfulQA | Bigbench  | Average   |
|-------------------------------|-----------|-----------|------------|-----------|-----------|
| SynthIQ-7b                    | 42.67     | 73.71     | 56.51      | 44.59     | 54.37     |
| openchat/openchat-3.5-0106    | **44.17** | 73.72     | 52.53      | 44.40     | 53.71     |
| Chikuma_10.7B                 | 42.41     | 73.41     | 56.69      | 43.50     | 54.00     |
| **Chikuma_10.7B_v2**          | 42.77     | **73.81** | **58.83**  | **44.83** | **55.06** |
# OpenLLM Leaderboard
| Benchmark Name | Performance |
|----------------|-------------|
| ARC | 66.38 |
| HellaSwag | 85 |
| MMLU | 65.27 |
| TruthfulQA | 58.83 |
| Winogrande | 78.77 |
| GSM8K | 63.68 |
| **Average** | **69.65** |
### Training Environment
- Hardware: A single A100 80GB GPU on RunPod, used for approximately 1.5 hours.
- Training Script: Accessible via [Google Colab Notebook](https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE?usp=sharing). Special thanks to [mlabonne](https://huggingface.co/mlabonne) for providing the template.
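For orientation, here is a minimal sketch of what the DPO setup looks like, assuming `trl`'s `DPOTrainer` (the early-2024 API) with placeholder hyperparameters; the linked notebook is the authoritative script:

```python
# A sketch only: the hyperparameters below are illustrative placeholders, and
# `dataset` is the filtered preference set from above, reformatted into the
# "prompt"/"chosen"/"rejected" columns that DPOTrainer expects.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "sethuiyer/Chikuma_10.7B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

training_args = TrainingArguments(
    output_dir="./chikuma-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    max_steps=200,
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # trl keeps a frozen copy of `model` as the reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    beta=0.1,              # strength of the KL pull toward the reference model
)
trainer.train()
```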
## Usage
```python
import transformers
from transformers import AutoTokenizer

model_name = "sethuiyer/Chikuma_10.7B_v2"

# Load the tokenizer, which carries the chat template shown above
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a text-generation pipeline (loads the model weights onto the GPU)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    device="cuda",
)

# Format the conversation with the chat template and generate a reply
messages = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "Who invented LLMs?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
sequences = pipeline(prompt, max_new_tokens=512)
print(sequences[0]["generated_text"])
```
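Note: since the system prompt instructs the model to close answers with `<|end_of_turn|>`, it may also help to pass that token as a stop/EOS sequence during decoding; this mirrors the openchat lineage of the base merge, though the card does not spell it out.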
## Acknowledgements
A heartfelt appreciation goes to the vibrant open-source community, particularly:

* The Intel team for publishing a great open dataset and showing how well it worked in the first place.
* Teknium and NousResearch for their awesome work and models.
* Maxime for sharing such great resources.
* Argilla for publishing [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs).
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_sethuiyer__distilabled_Chikuma_10.7B).
| Metric |Value|
|---------------------------------|----:|
|Avg. |68.87|
|AI2 Reasoning Challenge (25-Shot)|66.38|
|HellaSwag (10-Shot) |85.14|
|MMLU (5-Shot) |64.70|
|TruthfulQA (0-shot) |59.20|
|Winogrande (5-shot) |79.40|
|GSM8k (5-shot) |58.38|