---
license: apache-2.0
---
<img src="https://huggingface.co/datasets/hkust-nlp/deita-images/resolve/main/logo-final.png" alt="Deita banner" width="800" style="margin-left: auto; margin-right: auto; display: block;"/>
# Model Card for Deita Complexity Scorer
[GitHub](https://github.com/hkust-nlp/deita) | [Paper](https://arxiv.org/abs/2312.15685)
Deita is an open-source project designed to facilitate **Automatic Data Selection** for instruction tuning in Large Language Models (LLMs).
The Deita Complexity Scorer is a tool for automatically annotating the instruction complexity of SFT (supervised fine-tuning) data.
## Model description
- **Model type:** A model fine-tuned to automatically annotate instruction complexity
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** Llama-1-13b-hf
### Model Sources
- **Repository:** https://github.com/hkust-nlp/deita
- **Model Family:** Other models and the dataset can be found in the [Deita collection](https://huggingface.co/collections/hkust-nlp/deita-6569c198c174808d94cf5bd4).
## Performance
| Model | Align | Data Size | MT-Bench | AlpacaEval(%) | OpenLLM (Avg.) |
|------------------------------------------------|-----------|------------|----------|---------------|----------------|
| **Proprietary Models** | | | | | |
| GPT-4-Turbo | ? | -- | 9.32 | 97.70 | -- |
| GPT-4 | SFT + PPO | -- | 8.99 | 95.03 | -- |
| Claude-2 | SFT + PPO | -- | 8.06 | 91.36 | -- |
| GPT-3.5-turbo | SFT + PPO | -- | 7.94 | 89.37 | -- |
| **Open-sourced Models based on LLaMA-1-13B** | | | | | |
| LIMA | SFT | 1K SFT | 4.29 | 41.98 | 59.82 |
| WizardLM-13B | SFT | 70K SFT | 6.35 | 75.31 | 58.96 |
| Vicuna-13B-v1.3 | SFT | 125K SFT | 6.39 | 82.11 | 60.01 |
| Random | SFT | 10K SFT | 6.03 | 71.52 | 60.14 |
| DEITA-LLaMA1-13B-v1.0-sft | SFT | 10K SFT | 6.60 | 78.01 | 64.27 |
| **Open-sourced Models based on LLaMA-2-13B** | | | | | |
| Tulu-2-13B | SFT | 326K SFT | 6.70 | 78.90 | -- |
| Tulu-2-13B+DPO | SFT + DPO | 326K SFT + 60K DPO | 7.00 | 89.50 | -- |
| LLaMA2-13B-Chat | SFT + PPO | -- | 6.65 | 81.09 | -- |
| WizardLM-13B-v1.2 | SFT | >70K SFT | 7.09 | 89.17 | -- |
| Vicuna-13B-v1.5 | SFT | 125K SFT | 6.57 | 78.80 | 61.63 |
| Random | SFT | 10K SFT | 5.78 | 65.19 | 61.32 |
| DEITA-LLaMA2-13B-v1.0-sft | SFT | 10K SFT | 6.79 | 81.09 | 62.71 |
| **Open-sourced Models based on Mistral-7B** | | | | | |
| Mistral-7B-Instruct-v0.1 | -- | -- | 6.84 | 69.65 | 60.45 |
| Zephyr-7B-sft | SFT | 200K SFT | 5.32 | 75.12 | 60.93 |
| $\text{Zephyr-7B-}\beta$ | SFT + DPO | 200K SFT + 60K DPO | 7.34 | 90.60 | 66.36 |
| OpenChat-3.5                                   | C-RLFT    | >70K C-RLFT | 7.81     | 88.51         | --             |
| Starling-7B                                    | C-RLFT + APA | >70K C-RLFT + 183K APA | 8.09     | 91.99         | --             |
| Random | SFT | 10K SFT | 5.89 | 56.90 | 61.72 |
| DEITA-7B-v1.0-sft (6K) | SFT | 6K SFT | 7.22 | 80.78 | 64.94 |
| DEITA-7B-v1.0-sft (10K) | SFT | 10K SFT | 7.32 | 81.67 | 64.00 |
| DEITA-7B-v1.0 | SFT + DPO | 6K SFT + 10K DPO | 7.55 | 90.06 | 69.86 |
## Usage
Please use the following code to score the complexity of an instruction:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
from scipy.special import softmax
model_name = "hkust-nlp/deita-complexity-scorer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
def infer_complexity(model, tokenizer, input_text):
    complexity_template = ("You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: ")
    user_input = complexity_template.format(instruction=input_text)
    input_ids = tokenizer.encode(user_input, return_tensors="pt")
    max_length = 512
    outputs = model.generate(input_ids, max_length=max_length, num_return_sequences=1, return_dict_in_generate=True, output_scores=True)
    # logits over the vocabulary for the first generated token
    logprobs_list = outputs.scores[0][0]
    # LLaMA tokenizer ids of the digit tokens "1"-"6"
    id2score = {
        29896: "1",
        29906: "2",
        29941: "3",
        29946: "4",
        29945: "5",
        29953: "6"
    }
    score_template = np.array([1, 2, 3, 4, 5, 6])
    # collect the logits of the six candidate score tokens
    score_logits = np.array([logprobs_list[k].item() for k in id2score])
    # softmax over the six candidates, then take the expected score
    score_npy = softmax(score_logits, axis=0)
    score_npy = score_npy * score_template
    score_npy = np.sum(score_npy, axis=0)
    return score_npy
# example input
input_text = "write a performance review for a junior data scientist"
complexity_score = infer_complexity(model, tokenizer, input_text)
print(complexity_score)
```
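Since the scorer returns a single scalar per instruction, it can be applied over an SFT dataset to rank or filter examples by complexity. Below is a minimal sketch of that use; the `score_dataset` helper and the threshold of 3.0 are illustrative choices and not part of the released Deita pipeline.

```python
def score_dataset(model, tokenizer, instructions, threshold=3.0):
    """Score a list of instructions and keep those above a complexity threshold.

    Illustrative helper only; the threshold value is an arbitrary example.
    """
    kept = []
    for instruction in instructions:
        score = infer_complexity(model, tokenizer, instruction)
        if score >= threshold:
            kept.append((instruction, float(score)))
    return kept

# example: keep only the more complex instructions
instructions = [
    "say hello",
    "write a performance review for a junior data scientist",
]
selected = score_dataset(model, tokenizer, instructions)
print(selected)
```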
## Citation
If you find the content of this project helpful, please cite our paper as follows:
```
@misc{liu2023what,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
year={2023},
eprint={2312.15685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```