---
license: apache-2.0
library_name: transformers
tags:
- dpo
datasets:
- argilla/distilabel-intel-orca-dpo-pairs
base_model: sethuiyer/Chikuma_10.7B
pipeline_tag: text-generation
model-index:
- name: distilabled_Chikuma_10.7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 66.38
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 85.14
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.7
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 59.2
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 79.4
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.38
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/distilabled_Chikuma_10.7B
      name: Open LLM Leaderboard
---
# Chikuma_10.7B - V2 (Enhanced with DPO) [For Experiments]
<p align="center">
<img src="https://huggingface.co/sethuiyer/distilabled_Chikuma_10.7B/resolve/main/chikuma_v2.webp" height="256px" alt="Chikuma">
</p>
This model is the **DPO fine-tuned version** of [Chikuma_10.7B](https://huggingface.co/sethuiyer/Chikuma_10.7B), a depth-upscaled merge of:
* [sethuiyer/SynthIQ-7b](https://huggingface.co/sethuiyer/SynthIQ-7b)
* [openchat/openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106)
The name "Chikuma" is inspired by the [Chikuma River](https://en.wikipedia.org/wiki/Shinano_River), the longest in Japan, known for its continuous flow and meandering path.
This metaphorically represents the model's depth, fluidity, and adaptability in processing and understanding language.
# Dataset Used for Fine-Tuning
Dataset: [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs)
The filtered dataset contained roughly 3,000 samples, but they were high quality (as judged by `chosen_score`).
The following filters were applied to the original dataset:
```python
from datasets import load_dataset

# Load the preference-pair dataset used for DPO
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only decisive, high-quality preference pairs, and drop samples that
# overlap with the GSM8K training set (to avoid benchmark contamination)
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
```
# Chat Template
The chat template for Chikuma_10.7B - V2 is a modified version of ChatML, optimized for improved interaction and engagement:
```
<|im_start|>GPT4 Correct system:
{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>
<|im_start|>GPT4 Correct user:
{user}<|im_end|>
<|im_start|>GPT4 Correct Assistant:
{assistant}<|im_end|>
```
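For illustration, here is a minimal sketch of rendering this template by hand for a single turn; `format_chikuma_prompt` is a hypothetical helper, and in practice `tokenizer.apply_chat_template` (see Usage below) applies the template for you:

```python
def format_chikuma_prompt(system: str, user: str) -> str:
    """Render a single-turn conversation in Chikuma_10.7B-V2's modified ChatML format."""
    return (
        "<|im_start|>GPT4 Correct system:\n"
        f"{system} Always use <|end_of_turn|> when you want to end the answer. <|im_end|>\n"
        "<|im_start|>GPT4 Correct user:\n"
        f"{user}<|im_end|>\n"
        "<|im_start|>GPT4 Correct Assistant:\n"
    )

print(format_chikuma_prompt("You are a helpful assistant chatbot.", "Who invented LLMs?"))
```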
## Nous Benchmark Evaluation
| Model                         | AGIEval   | GPT4All   | TruthfulQA | Bigbench  | Average   |
|-------------------------------|-----------|-----------|------------|-----------|-----------|
| SynthIQ-7b                    | 42.67     | 73.71     | 56.51      | 44.59     | 54.37     |
| openchat/openchat-3.5-0106    | **44.17** | 73.72     | 52.53      | 44.40     | 53.71     |
| Chikuma_10.7B                 | 42.41     | 73.41     | 56.69      | 43.50     | 54.00     |
| **Chikuma_10.7B_v2**          | 42.77     | **73.81** | **58.83**  | **44.83** | **55.06** |
# OpenLLM Leaderboard
| Benchmark Name | Performance |
|----------------|-------------|
| ARC | 66.38 |
| HellaSwag | 85 |
| MMLU | 65.27 |
| TruthfulQA | 58.83 |
| Winogrande | 78.77 |
| GSM8K | 63.68 |
| **Average** | **69.65** |
### Training Environment
- Hardware: A single A100 80GB GPU on RunPod, used for approximately 1.5 hours.
- Training Script: Accessible via [Google Colab Notebook](https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE?usp=sharing). Special thanks to [mlabonne](https://huggingface.co/mlabonne) for providing the template.
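For orientation, here is a minimal sketch of what the DPO setup looks like, assuming `trl`'s `DPOTrainer` (the early-2024 API) with placeholder hyperparameters; the linked notebook is the authoritative script:

```python
# A sketch only: the hyperparameters below are illustrative placeholders, and
# `dataset` is the filtered preference set from above, reformatted into the
# "prompt"/"chosen"/"rejected" columns that DPOTrainer expects.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "sethuiyer/Chikuma_10.7B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

training_args = TrainingArguments(
    output_dir="./chikuma-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    max_steps=200,
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # trl keeps a frozen copy of `model` as the reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    beta=0.1,              # strength of the KL pull toward the reference model
)
trainer.train()
```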
## Usage
```python
import transformers
from transformers import AutoTokenizer

model_name = "sethuiyer/Chikuma_10.7B_v2"

# Load the tokenizer, which carries the chat template shown above
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a text-generation pipeline (loads the model weights onto the GPU)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    device="cuda",
)

# Format the conversation with the chat template and generate a reply
messages = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "Who invented LLMs?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
sequences = pipeline(prompt, max_new_tokens=512)
print(sequences[0]["generated_text"])
```
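Note: since the system prompt instructs the model to close answers with `<|end_of_turn|>`, it may also help to pass that token as a stop/EOS sequence during decoding; this mirrors the openchat lineage of the base merge, though the card does not spell it out.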
## Acknowledgements
A heartfelt appreciation goes to the vibrant open-source community, particularly:

* The Intel team for publishing a great open dataset and showing how well it worked in the first place.
* Teknium and NousResearch for their awesome work and models.
* Maxime for sharing such great resources.
* Argilla for publishing [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs).
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_sethuiyer__distilabled_Chikuma_10.7B).
| Metric |Value|
|---------------------------------|----:|
|Avg. |68.87|
|AI2 Reasoning Challenge (25-Shot)|66.38|
|HellaSwag (10-Shot) |85.14|
|MMLU (5-Shot) |64.70|
|TruthfulQA (0-shot) |59.20|
|Winogrande (5-shot) |79.40|
|GSM8k (5-shot) |58.38|