Create README.md
## Introduction

**DistilQwen2-7B** is a distilled version of **Qwen2-7B-Instruct**, built to transfer the capabilities of stronger LLMs into a smaller model. The distillation data draws on a diverse range of sources, including well-known open-source collections such as Magpie, OpenHermes, and MAmmoTH2, as well as proprietary synthetic datasets.

The training data consists primarily of instructions in Chinese and English. To improve the quality and diversity of the instruction data, we applied a difficulty scoring system and task-related resampling.

For difficulty scoring, we followed the LLM-as-a-Judge paradigm, using the teacher model to rate responses on accuracy, relevance, helpfulness, and level of detail. We then computed the Model Fitting Difficulty (MFD) score of each instruction by subtracting the student model's score from the teacher model's score. A higher MFD score means the student falls further short of the teacher on that instruction, so the instruction is more valuable for distillation training. This let us remove low-difficulty instructions from the training set and focus on more challenging and informative examples.

This curation and scoring process is what allows **DistilQwen2-7B** to reach high performance after distillation.

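
The MFD-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual training pipeline: it assumes the MFD score is the judge's score for the teacher response minus its score for the student response (so higher means harder for the student), and the function names, score values, and `min_mfd` cutoff are all hypothetical.

```python
def mfd_score(teacher_score: float, student_score: float) -> float:
    """Assumed definition: teacher judge score minus student judge score."""
    return teacher_score - student_score


def filter_by_mfd(items, min_mfd=1.0):
    """Keep instructions whose MFD score reaches `min_mfd` (a hypothetical cutoff)."""
    return [
        item for item in items
        if mfd_score(item["teacher_score"], item["student_score"]) >= min_mfd
    ]


# Made-up judge scores on a 1-10 scale, for illustration only.
examples = [
    {"instruction": "Translate 'hello' into French.", "teacher_score": 9.0, "student_score": 9.0},
    {"instruction": "Prove that sqrt(2) is irrational.", "teacher_score": 9.0, "student_score": 5.0},
    {"instruction": "Summarize this contract clause.", "teacher_score": 8.0, "student_score": 7.5},
]

kept = filter_by_mfd(examples, min_mfd=1.0)
print([item["instruction"] for item in kept])
```

With these made-up scores, only the proof instruction survives the filter, because it is the only one where the student trails the teacher by at least the cutoff.
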
## Quick Start

The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate a response.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "alibaba-pai/DistilQwen2-7B-Instruct"

# device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template as plain text
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the same device as the model
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # input_ids and attention_mask
    max_new_tokens=2048,
)
# Drop the prompt tokens so that only the newly generated tokens remain
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

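
The list comprehension near the end of the snippet relies on `generate` returning, for each sequence, the prompt tokens followed by the newly generated ones. The sketch below isolates that trimming step, with plain Python lists standing in for tensors; the token IDs are made up for illustration.

```python
# Made-up token IDs; real ones come from the tokenizer and the model.
prompt_ids = [[101, 7, 8, 9]]                    # one prompt in the batch
full_output_ids = [[101, 7, 8, 9, 42, 43, 2]]    # prompt followed by new tokens

# Slice off as many leading tokens as the prompt contained.
new_token_ids = [
    output[len(prompt):] for prompt, output in zip(prompt_ids, full_output_ids)
]
print(new_token_ids)  # [[42, 43, 2]]
```

Without this step, decoding the output would repeat the rendered prompt in front of the model's answer.
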
## Evaluation

We used single-turn instructions from MT-Bench as input for Qwen2-1.5B-Instruct and Qwen2-7B-Instruct, and used GPT-4 Turbo to evaluate the changes in the level of detail and truthfulness of the responses to our model's revised instructions.

| Model | AlpacaEval 2.0 (length-controlled) | MT-Bench | MT-Bench (single) | IFEval (instruction-level loose) | IFEval (prompt-level strict) |
|------|-----------------------------------|----------|-------------------|---------------------------|------------------------|
| Qwen2-1.5B-Instruct | 5.22 | 5.85 | 6.45 | 41.37 | 28.10 |
| DistilQwen2-1.5B-Instruct | 8.28 | 6.42 | 7.12 | 49.76 | 36.04 |
| Qwen2-7B-Instruct | 24.33 | 8.27 | 8.68 | 66.67 | 52.31 |
| DistilQwen2-7B-Instruct | 25.35 | 8.40 | 9.03 | 71.46 | 60.26 |

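
To make the table easier to read, the absolute gain of each distilled model over its base model can be computed directly from the reported scores. The sketch below copies the numbers from the table; the shortened metric labels and the `gains` helper are illustrative names, not part of any evaluation toolkit.

```python
# Scores copied from the table above, in column order.
metrics = [
    "AlpacaEval 2.0 (LC)",
    "MT-Bench",
    "MT-Bench (single)",
    "IFEval (loose)",
    "IFEval (strict)",
]
scores = {
    "1.5B": {
        "base": [5.22, 5.85, 6.45, 41.37, 28.10],
        "distilled": [8.28, 6.42, 7.12, 49.76, 36.04],
    },
    "7B": {
        "base": [24.33, 8.27, 8.68, 66.67, 52.31],
        "distilled": [25.35, 8.40, 9.03, 71.46, 60.26],
    },
}


def gains(size):
    """Absolute gain of the distilled model over its base model, per metric."""
    row = scores[size]
    return {
        metric: round(distilled - base, 2)
        for metric, base, distilled in zip(metrics, row["base"], row["distilled"])
    }


for size in scores:
    print(size, gains(size))
```

On these numbers every metric improves at both model sizes; for example, the 7B model gains 7.95 points on the strict IFEval metric.
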
## Citation

If you find our work helpful, please cite it!

```
@misc{TAPIR,
      title={Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning},
      author={Yuanhao Yue and Chengyu Wang and Jun Huang and Peng Wang},
      year={2024},
      eprint={2405.13448},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.13448},
}
```