lewtun HF staff committed on
Commit 0f4b871
1 Parent(s): e981ffb

Update README.md

Files changed (1)
  1. README.md +15 -6
README.md CHANGED
@@ -40,18 +40,18 @@ Zephyr is a series of language models that are trained to act as helpful assista
 
 ## Performance
 
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/6200d0a443eb0913fa2df7cc/raxvt5ma16d7T23my34WC.png)
+At the time of release, Zephyr-7B-β is the highest ranked 7B chat model on the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:
 
-| Model | Size | Align | MT-Bench (score) | AlpacaEval (win %) |
+| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
 |-------------|-----|----|---------------|--------------|
 | StableLM-Tuned-α | 7B| dSFT |2.75| -|
 | MPT-Chat | 7B |dSFT |5.42| -|
 | Xwin-LMv0.1 | 7B| dPPO| 6.19| 87.83|
 | Mistral-Instructv0.1 | 7B| - | 6.84 |-|
 | Zephyr-7b-α |7B| dDPO| 6.88| -|
-| **Zephyr-7b-β** |7B| dDPO| 7.34| 90.60|
+| **Zephyr-7b-β** 🪁 | **7B** | **dDPO** | **7.34** | **90.60** |
 | Falcon-Instruct | 40B |dSFT |5.17 |45.71|
-| Guanaco 65B | SFT |6.41| 71.80|
+| Guanaco | 65B | SFT |6.41| 71.80|
 | Llama2-Chat | 70B |RLHF |6.86| 92.66|
 | Vicuna v1.3 | 33B |dSFT |7.12 |88.99|
 | WizardLM v1.0 | 70B |dSFT |7.71 |-|
@@ -60,6 +60,13 @@ Zephyr is a series of language models that are trained to act as helpful assista
 | Claude 2 | - |RLHF |8.06| 91.36|
 | GPT-4 | -| RLHF |8.99| 95.28|
 
+In particular, on several categories of MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B:
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6200d0a443eb0913fa2df7cc/raxvt5ma16d7T23my34WC.png)
+
+However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models, and more research is needed to close the gap.
+
+
 ## Intended uses & limitations
 
 The model was initially fine-tuned on a filtered and preprocessed version of the [`UltraChat`](https://huggingface.co/datasets/stingning/ultrachat) dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT.
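Since the model is tuned to act as a chat assistant, the natural way to query it is through the tokenizer's chat template and the `transformers` text-generation pipeline. Below is a minimal sketch, assuming the released checkpoint id `HuggingFaceH4/zephyr-7b-beta`; the message contents and sampling settings are illustrative:

```python
# Minimal sketch (not part of this commit): chatting with Zephyr-7B-β
# via the transformers pipeline. Assumes a GPU with enough memory for
# the model in bfloat16.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Zephyr was trained on dialogues, so we pass a message list and let the
# tokenizer's chat template build the prompt string.
messages = [
    {"role": "system", "content": "You are a friendly, helpful assistant."},
    {"role": "user", "content": "Explain DPO training in one sentence."},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True,
               temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

The system message can be used to steer the persona of the responses.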
@@ -108,9 +115,8 @@ It is also unknown what the size and composition of the corpus was used to train
 
 ## Training and evaluation data
 
-This model is a fine-tuned version of [HuggingFaceH4/mistral-7b-ift](https://huggingface.co/HuggingFaceH4/mistral-7b-ift) TODO on the ultrafeedback dataset. It achieves the following results on the evaluation set:
+During DPO training, this model achieves the following results on the evaluation set:
 
-It achieves the following results on the evaluation set:
 - Loss: 0.7496
 - Rewards/chosen: -4.5221
 - Rewards/rejected: -8.3184
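The `Rewards/*` entries are the implicit DPO rewards: β times the gap between the policy's and the reference model's log-probability of a response. A hedged sketch of how such metrics can be derived (function and variable names are illustrative, and β = 0.1 is an assumed value, not taken from this commit):

```python
# Hedged sketch (not the training code): deriving DPO-style metrics
# such as Rewards/chosen, Rewards/margins and Rewards/accuracies from
# sequence log-probabilities.
import torch
import torch.nn.functional as F

BETA = 0.1  # assumed DPO temperature

def dpo_metrics(policy_chosen_logps: torch.Tensor,
                policy_rejected_logps: torch.Tensor,
                ref_chosen_logps: torch.Tensor,
                ref_rejected_logps: torch.Tensor) -> dict:
    # Implicit reward of a response: beta * (log pi(y|x) - log pi_ref(y|x))
    rewards_chosen = BETA * (policy_chosen_logps - ref_chosen_logps)
    rewards_rejected = BETA * (policy_rejected_logps - ref_rejected_logps)
    margins = rewards_chosen - rewards_rejected
    # DPO loss: negative log-sigmoid of the reward margin
    loss = -F.logsigmoid(margins).mean()
    # Fraction of pairs where the chosen response out-scores the rejected one
    accuracies = (rewards_chosen > rewards_rejected).float().mean()
    return {
        "loss": loss,
        "rewards/chosen": rewards_chosen.mean(),
        "rewards/rejected": rewards_rejected.mean(),
        "rewards/margins": margins.mean(),
        "rewards/accuracies": accuracies,
    }
```

`Rewards/margins` is then the mean gap between chosen and rejected rewards, and `Rewards/accuracies` the fraction of pairs where the chosen response scores higher.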
@@ -140,6 +146,9 @@ The following hyperparameters were used during training:
 
 ### Training results
 
+The table below shows the full set of DPO training metrics:
+
+
 | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
 |:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
 | 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |
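For orientation, metrics like those in the table are what TRL's `DPOTrainer` logs during preference training. The sketch below shows how such a run could be wired up; the model ids, dataset, and hyperparameter values are assumptions for illustration, not the exact recipe behind this commit:

```python
# Hedged sketch of a DPO fine-tuning setup with TRL's DPOTrainer.
# Checkpoint ids, dataset, and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_model_id = "HuggingFaceH4/mistral-7b-sft-beta"  # assumed dSFT starting point
model = AutoModelForCausalLM.from_pretrained(sft_model_id)
ref_model = AutoModelForCausalLM.from_pretrained(sft_model_id)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

# Preference pairs; DPOTrainer expects "prompt", "chosen" and "rejected"
# columns as plain strings, so in practice the chat template is applied
# to this dataset first.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                             split="train_prefs")

training_args = TrainingArguments(
    output_dir="zephyr-7b-dpo",
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,  # common DPO temperature; an assumption here
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```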
 