Update README.md
README.md
CHANGED
@@ -19,7 +19,7 @@ FuseChat-3.0: Preference Optimization for Implicit Model Fusion
|
|
19 |
<img src="FuseChat-3.0.png" width=70%/>
|
20 |
</div>
|
21 |
|
22 |
-
We present FuseChat-3.0, a series of models crafted to enhance performance by integrating the strengths of multiple source LLMs into more compact target LLMs. To achieve this fusion, we utilized four powerful source LLMs: Gemma-2-27B-
|
23 |
|
24 |
|
25 |
|
@@ -32,7 +32,7 @@ FuseChat-3.0, however, takes a different approach by enhancing a single LLM thro
|
|
32 |
|
33 |
Our IMF method follows a three-stage process aimed at effectively transferring capabilities from source LLMs to a target LLM. First, during **dataset construction**, we sample N responses from each of the source LLMs and annotate these responses using an external reward model. Second, in the **supervised fine-tuning (SFT)** stage, we fine-tune the target model using the best responses, which not only enhances the target model's capabilities but also helps mitigate the distributional gap between the source and target models. Finally, in the **direct preference optimization (DPO)** stage, we optimize the target model by using the best and worst responses from the source models as preference pairs, further enhancing the target model's performance. The complete pipeline is detailed in the following sections.
|
34 |
|
35 |
-
## Dataset
|
36 |
### Prompt Selection
|
37 |
Our datasets were designed to enhance the model's instruction-following, general conversation, mathematics, coding, and Chinese-language capabilities. We selected data from open-source community datasets, applying targeted filtering and preprocessing. Key datasets and filtering criteria included:
|
38 |
|
@@ -41,8 +41,8 @@ Our datasets were designed to enhance model's instruction following, general con
|
|
41 |
- **Coding**: Curated from [leetcode](https://huggingface.co/datasets/greengerong/leetcode) and [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), retaining prompts with test cases.
|
42 |
- **Chinese Language**: Integrated [alpaca_gpt4_zh](https://huggingface.co/datasets/llamafactory/alpaca_gpt4_zh) and [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese), filtering out code and math prompts to retain approximately 10,000 high-quality samples.
|
43 |
|
44 |
-
### Sampling
|
45 |
-
For each dataset's prompts, we synthesized responses mainly from four different series of source models, specifically [Gemma-2-27b-
|
46 |
|
47 |
- **Instruction Following & General Conversation**: We sampled each prompt five times from all the source models.
|
48 |
- **Mathematics**: We retained the responses generated by Llama-3.1-405B-Instruct from the original dataset (OpenMathInstruct-2) and additionally sampled responses using [Qwen-2.5-Math-72B-Instruct](https://huggingface.co/Qwen/Qwen-2.5-Math-72B-Instruct).
|
@@ -58,7 +58,7 @@ The sampling parameters for different models are detailed in Table below.
|
|
58 |
</tr>
|
59 |
|
60 |
<tr>
|
61 |
-
<td>Gemma-2-27b-
|
62 |
<td>Temp 0.8 Top-p 0.95</td>
|
63 |
</tr>
|
64 |
|
@@ -80,7 +80,7 @@ The sampling parameters for different models are detailed in Table below.
|
|
80 |
|
81 |
</table>
|
82 |
|
83 |
-
###
|
84 |
- **Instruction Following**: To assign RM scores to the five responses generated by each source model, we employed [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) for annotation. We then divided the annotated data into SFT and DPO datasets using a 4:6 ratio. For the SFT phase, we selected the responses with the highest RM scores. During the DPO phase, we paired responses from the same source model, designating those with the highest RM scores as positive samples and those with the lowest RM scores as negative samples. We ensured that the RM score difference between the positive and negative samples in each pair ranged from 0.01 to 0.1.
|
85 |
- **Mathematics**: We first annotated the responses from all source models for correctness by comparing them with the gold labels, and scored them with the RM scores provided by ArmoRM. We then divided the dataset into an SFT subset and a DPO subset. In the SFT phase, we incorporated responses that were correct and had the highest RM scores. This selection ensured that the fine-tuning process was based on high-quality responses that aligned closely with the desired outcomes. For the DPO phase, we constructed paired samples from the same source model. The positive samples consisted of correct answers with the highest RM scores, while the negative samples were incorrect answers with the lowest RM scores. To ensure meaningful comparisons during optimization, we kept the RM score difference between the positive and negative samples in each pair within the range of 0.01 to 0.1.
|
86 |
- **Coding**: We employed a dual-scoring system comprising correctness scores and RM scores for coding evaluation. The correctness scores assessed whether the code passed both static analysis and test cases, ensuring functional accuracy. The RM scores were used for preference evaluation, gauging the quality of responses based on predefined criteria. During the SFT phase, we included responses that not only passed all test cases but also achieved the highest RM scores. This selection ensured that the model was fine-tuned on exemplary code that met both correctness and preference standards. In the DPO phase, we contrasted positive samples—high-scoring responses that passed the tests—with negative samples—low-scoring responses that failed the tests. This comparison aimed to optimize the model's ability to prefer higher-quality code during training. We excluded any instances where all model responses failed to meet the testing criteria. This exclusion was necessary to maintain the integrity of the evaluation process, as such cases did not provide meaningful data for assessing and improving the model's performance.
|
@@ -172,7 +172,7 @@ Our final dataset comprised 158,784 total entries, with 94,539 entries for the S
|
|
172 |
|
173 |
</table>
|
174 |
|
175 |
-
## Training
|
176 |
The implicit model fusion process involves a two-stage training pipeline comprising Supervised Fine-Tuning (SFT) to mitigate distribution discrepancies between target and source LLMs, and Direct Preference Optimization (DPO) for learning preferences from multiple source LLMs.
|
177 |
|
178 |
### SFT
|
@@ -196,7 +196,7 @@ We used [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) as our fine-tu
|
|
196 |
</tr>
|
197 |
|
198 |
<tr>
|
199 |
-
<td>Gemma-2-9B-
|
200 |
<td>2e-6</td>
|
201 |
</tr>
|
202 |
|
@@ -238,7 +238,7 @@ Different models' hyperparameters are shown in the table below.
|
|
238 |
</tr>
|
239 |
|
240 |
<tr>
|
241 |
-
<td>FuseChat-Gemma-2-
|
242 |
<td>5e-7</td>
|
243 |
<td>0.01</td>
|
244 |
<td>No</td>
|
@@ -267,7 +267,9 @@ We include more details and release our evaluation code at [FuseEval](https://gi
|
|
267 |
|
268 |
The evaluation results for the five series of fused models are as follows, showing that our FuseChat-3.0 models achieved varying degrees of improvement across different target models. With Llama-3.1-8B-Instruct as the target model, our fusion model **FuseChat-Llama-3.1-8B-Instruct achieved an average performance improvement of 6.8 points across 14 benchmarks. Notably, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively**. Additionally, FuseChat-Llama-3.1-8B-Instruct outperformed AllenAI's recently released Llama-3.1-Tulu-3-8B model on all benchmarks except GSM8K and GPQA-Diamond. These results demonstrate the effectiveness of FuseChat-3.0.
|
269 |
|
|
|
270 |
### FuseChat-Llama-3.2-3B-Instruct Performance
|
|
|
271 |
<table class="js-sort-table table hidden">
|
272 |
<tr>
|
273 |
<td class="js-sort-string"><strong>Benchmarks</strong></td>
|
@@ -280,7 +282,7 @@ The evaluation results of five series fused models are as follows, showing that
|
|
280 |
<td style="white-space: nowrap;">AlpacaEval-2 (LC %)</td>
|
281 |
<td>21.4</td>
|
282 |
<td>31.1</td>
|
283 |
-
<td><strong>54</strong></td>
|
284 |
</tr>
|
285 |
|
286 |
<tr>
|
@@ -292,23 +294,23 @@ The evaluation results of five series fused models are as follows, showing that
|
|
292 |
|
293 |
<tr>
|
294 |
<td>MT-Bench</td>
|
295 |
-
<td>6.
|
296 |
-
<td>7.
|
297 |
-
<td><strong>7.
|
298 |
</tr>
|
299 |
|
300 |
<tr>
|
301 |
<td>AlignBench v1.1</td>
|
302 |
-
<td>3.
|
303 |
<td>5.5</td>
|
304 |
-
<td><strong>5.
|
305 |
</tr>
|
306 |
|
307 |
<tr>
|
308 |
<td>GSM8K</td>
|
309 |
-
<td>82</td>
|
310 |
<td><strong>82.8</strong></td>
|
311 |
-
<td>82</td>
|
312 |
</tr>
|
313 |
|
314 |
<tr>
|
@@ -321,8 +323,8 @@ The evaluation results of five series fused models are as follows, showing that
|
|
321 |
<tr>
|
322 |
<td>AMC23</td>
|
323 |
<td>22.5</td>
|
324 |
-
<td>20</td>
|
325 |
-
<td><strong>35</strong></td>
|
326 |
</tr>
|
327 |
|
328 |
<tr>
|
@@ -343,7 +345,7 @@ The evaluation results of five series fused models are as follows, showing that
|
|
343 |
<td>MMLU-redux</td>
|
344 |
<td>58.5</td>
|
345 |
<td>58.2</td>
|
346 |
-
<td><strong>59</strong></td>
|
347 |
</tr>
|
348 |
|
349 |
<tr>
|
@@ -355,7 +357,7 @@ The evaluation results of five series fused models are as follows, showing that
|
|
355 |
|
356 |
<tr>
|
357 |
<td>HumanEval</td>
|
358 |
-
<td>61</td>
|
359 |
<td><strong>62.8</strong></td>
|
360 |
<td>60.4</td>
|
361 |
</tr>
|
@@ -371,7 +373,7 @@ The evaluation results of five series fused models are as follows, showing that
|
|
371 |
<td>LiveCodeBench<br>2408-2411</td>
|
372 |
<td>8.3</td>
|
373 |
<td>7.1</td>
|
374 |
-
<td><strong>9</strong></td>
|
375 |
</tr>
|
376 |
|
377 |
<tr>
|
@@ -382,8 +384,7 @@ The evaluation results of five series fused models are as follows, showing that
|
|
382 |
</tr>
|
383 |
</table>
|
384 |
|
385 |
-
|
386 |
-
## BibTeX
|
387 |
```
|
388 |
@article{yang2024wrpo,
|
389 |
title={Weighted-Reward Preference Optimization for Implicit Model Fusion},
|
|
|
19 |
<img src="FuseChat-3.0.png" width=70%/>
|
20 |
</div>
|
21 |
|
22 |
+
We present FuseChat-3.0, a series of models crafted to enhance performance by integrating the strengths of multiple source LLMs into more compact target LLMs. To achieve this fusion, we utilized four powerful source LLMs: Gemma-2-27B-It, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For the target LLMs, we employed three widely used smaller models (Llama-3.1-8B-Instruct, Gemma-2-9B-It, and Qwen-2.5-7B-Instruct) along with two even more compact models (Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct). The implicit model fusion process involves a two-stage training pipeline comprising Supervised Fine-Tuning (SFT) to mitigate distribution discrepancies between target and source LLMs, and Direct Preference Optimization (DPO) for learning preferences from multiple source LLMs. The resulting FuseChat-3.0 models demonstrated substantial improvements in tasks related to general conversation, instruction following, mathematics, and coding. Notably, when Llama-3.1-8B-Instruct served as the target LLM, our fusion approach achieved an average improvement of 6.8 points across 14 benchmarks. Moreover, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively. We have released the [FuseChat-3.0](https://huggingface.co/FuseAI) models on Hugging Face; stay tuned for the forthcoming dataset and code.
|
23 |
|
24 |
|
25 |
|
|
|
32 |
|
33 |
Our IMF method follows a three-stage process aimed at effectively transferring capabilities from source LLMs to a target LLM. First, during **dataset construction**, we sample N responses from each of the source LLMs and annotate these responses using an external reward model. Second, in the **supervised fine-tuning (SFT)** stage, we fine-tune the target model using the best responses, which not only enhances the target model's capabilities but also helps mitigate the distributional gap between the source and target models. Finally, in the **direct preference optimization (DPO)** stage, we optimize the target model by using the best and worst responses from the source models as preference pairs, further enhancing the target model's performance. The complete pipeline is detailed in the following sections.
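
For reference, the DPO stage described above is commonly written with the standard DPO objective, sketched below. This section does not state that FuseChat-3.0 uses exactly this form, so treat it as an illustration rather than the authors' loss. The symbols are assumptions introduced here: $x$ is a prompt, $y_w$ and $y_l$ are the best and worst responses in a preference pair drawn from the dataset $\mathcal{D}$, $\pi_\theta$ is the target model being optimized, $\pi_{\mathrm{ref}}$ is a frozen reference model (assumed here to be the SFT checkpoint), $\beta$ controls how strongly the policy is tied to the reference, and $\sigma$ is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of $y_w$ relative to $y_l$ under $\pi_\theta$, measured against the reference model.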
|
34 |
|
35 |
+
## Dataset
|
36 |
### Prompt Selection
|
37 |
Our datasets were designed to enhance the model's instruction-following, general conversation, mathematics, coding, and Chinese-language capabilities. We selected data from open-source community datasets, applying targeted filtering and preprocessing. Key datasets and filtering criteria included:
|
38 |
|
|
|
41 |
- **Coding**: Curated from [leetcode](https://huggingface.co/datasets/greengerong/leetcode) and [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), retaining prompts with test cases.
|
42 |
- **Chinese Language**: Integrated [alpaca_gpt4_zh](https://huggingface.co/datasets/llamafactory/alpaca_gpt4_zh) and [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese), filtering out code and math prompts to retain approximately 10,000 high-quality samples.
|
43 |
|
44 |
+
### Response Sampling
|
45 |
+
For each dataset's prompts, we synthesized responses mainly from four different series of source models, specifically [Gemma-2-27b-It](https://huggingface.co/google/gemma-2-27b-it), [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407), [Qwen-2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct), and [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct).
|
46 |
|
47 |
- **Instruction Following & General Conversation**: We sampled each prompt five times from all the source models.
|
48 |
- **Mathematics**: We retained the responses generated by Llama-3.1-405B-Instruct from the original dataset (OpenMathInstruct-2) and additionally sampled responses using [Qwen-2.5-Math-72B-Instruct](https://huggingface.co/Qwen/Qwen-2.5-Math-72B-Instruct).
|
|
|
58 |
</tr>
|
59 |
|
60 |
<tr>
|
61 |
+
<td>Gemma-2-27b-It</td>
|
62 |
<td>Temp 0.8 Top-p 0.95</td>
|
63 |
</tr>
|
64 |
|
|
|
80 |
|
81 |
</table>
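
To make the sampling settings in the table concrete, here is a minimal, library-free sketch of temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. The defaults mirror the `Temp 0.8 Top-p 0.95` entry listed for Gemma-2-27b-It. The function name `sample_top_p` and the toy logits are hypothetical, and this is not the authors' sampling code, which is not shown in this README.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Draw one token index using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of most-likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample among the kept tokens; random.choices normalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [4.0, 3.5, 1.0, -2.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
# Five independent draws, echoing "sampled each prompt five times".
draws = [sample_top_p(toy_logits, rng=rng) for _ in range(5)]
```

With these toy logits, the two most likely tokens already cover more than 95% of the probability mass, so the two low-scoring tokens are never drawn.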
|
82 |
|
83 |
+
### Data Construction
|
84 |
- **Instruction Following**: To assign RM scores to the five responses generated by each source model, we employed [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) for annotation. We then divided the annotated data into SFT and DPO datasets using a 4:6 ratio. For the SFT phase, we selected the responses with the highest RM scores. During the DPO phase, we paired responses from the same source model, designating those with the highest RM scores as positive samples and those with the lowest RM scores as negative samples. We ensured that the RM score difference between the positive and negative samples in each pair ranged from 0.01 to 0.1.
|
85 |
- **Mathematics**: We first annotated the responses from all source models for correctness by comparing them with the gold labels, and scored them with the RM scores provided by ArmoRM. We then divided the dataset into an SFT subset and a DPO subset. In the SFT phase, we incorporated responses that were correct and had the highest RM scores. This selection ensured that the fine-tuning process was based on high-quality responses that aligned closely with the desired outcomes. For the DPO phase, we constructed paired samples from the same source model. The positive samples consisted of correct answers with the highest RM scores, while the negative samples were incorrect answers with the lowest RM scores. To ensure meaningful comparisons during optimization, we kept the RM score difference between the positive and negative samples in each pair within the range of 0.01 to 0.1.
|
86 |
- **Coding**: We employed a dual-scoring system comprising correctness scores and RM scores for coding evaluation. The correctness scores assessed whether the code passed both static analysis and test cases, ensuring functional accuracy. The RM scores were used for preference evaluation, gauging the quality of responses based on predefined criteria. During the SFT phase, we included responses that not only passed all test cases but also achieved the highest RM scores. This selection ensured that the model was fine-tuned on exemplary code that met both correctness and preference standards. In the DPO phase, we contrasted positive samples—high-scoring responses that passed the tests—with negative samples—low-scoring responses that failed the tests. This comparison aimed to optimize the model's ability to prefer higher-quality code during training. We excluded any instances where all model responses failed to meet the testing criteria. This exclusion was necessary to maintain the integrity of the evaluation process, as such cases did not provide meaningful data for assessing and improving the model's performance.
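
The selection rules in the bullets above can be sketched as follows. This is an illustrative reading, not the released pipeline: the `Response` fields and the function names `pick_sft_response` and `build_dpo_pair` are hypothetical. The 0.01 to 0.1 score-gap filter is stated for instruction following and mathematics only, so for coding the bounds would need to be widened.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    source_model: str               # which source LLM produced the text
    text: str
    rm_score: float                 # reward-model score (ArmoRM in the README)
    correct: Optional[bool] = None  # set for math/coding; None when no label exists


def pick_sft_response(responses):
    """Return the highest-scoring response, skipping ones labeled incorrect."""
    pool = [r for r in responses if r.correct is not False]
    return max(pool, key=lambda r: r.rm_score) if pool else None


def build_dpo_pair(responses, min_gap=0.01, max_gap=0.1):
    """Return (chosen, rejected) for one source model, or None if no valid pair exists."""
    # With labels: positives must be correct and negatives incorrect.
    # Without labels (correct is None): every response may play either role.
    positives = [r for r in responses if r.correct is not False]
    negatives = [r for r in responses if r.correct is not True]
    if not positives or not negatives:
        return None
    chosen = max(positives, key=lambda r: r.rm_score)
    rejected = min(negatives, key=lambda r: r.rm_score)
    gap = chosen.rm_score - rejected.rm_score
    return (chosen, rejected) if min_gap <= gap <= max_gap else None


# Hypothetical scores for five samples from a single source model.
demo = [Response("model-a", f"answer {i}", s)
        for i, s in enumerate([0.12, 0.15, 0.19, 0.14, 0.17])]
best = pick_sft_response(demo)
pair = build_dpo_pair(demo)
```

Here `best` is the response scored 0.19, and `pair` matches it against the response scored 0.12, because the gap of about 0.07 falls inside the allowed range. A pair with a gap outside that range is dropped rather than clipped.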
|
|
|
172 |
|
173 |
</table>
|
174 |
|
175 |
+
## Training
|
176 |
The implicit model fusion process involves a two-stage training pipeline comprising Supervised Fine-Tuning (SFT) to mitigate distribution discrepancies between target and source LLMs, and Direct Preference Optimization (DPO) for learning preferences from multiple source LLMs.
|
177 |
|
178 |
### SFT
|
|
|
196 |
</tr>
|
197 |
|
198 |
<tr>
|
199 |
+
<td>Gemma-2-9B-It</td>
|
200 |
<td>2e-6</td>
|
201 |
</tr>
|
202 |
|
|
|
238 |
</tr>
|
239 |
|
240 |
<tr>
|
241 |
+
<td>FuseChat-Gemma-2-9B-SFT</td>
|
242 |
<td>5e-7</td>
|
243 |
<td>0.01</td>
|
244 |
<td>No</td>
|
|
|
267 |
|
268 |
The evaluation results for the five series of fused models are as follows, showing that our FuseChat-3.0 models achieved varying degrees of improvement across different target models. With Llama-3.1-8B-Instruct as the target model, our fusion model **FuseChat-Llama-3.1-8B-Instruct achieved an average performance improvement of 6.8 points across 14 benchmarks. Notably, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively**. Additionally, FuseChat-Llama-3.1-8B-Instruct outperformed AllenAI's recently released Llama-3.1-Tulu-3-8B model on all benchmarks except GSM8K and GPQA-Diamond. These results demonstrate the effectiveness of FuseChat-3.0.
|
269 |
|
270 |
+
|
271 |
### FuseChat-Llama-3.2-3B-Instruct Performance
|
272 |
+
|
273 |
<table class="js-sort-table table hidden">
|
274 |
<tr>
|
275 |
<td class="js-sort-string"><strong>Benchmarks</strong></td>
|
|
|
282 |
<td style="white-space: nowrap;">AlpacaEval-2 (LC %)</td>
|
283 |
<td>21.4</td>
|
284 |
<td>31.1</td>
|
285 |
+
<td><strong>54.0</strong></td>
|
286 |
</tr>
|
287 |
|
288 |
<tr>
|
|
|
294 |
|
295 |
<tr>
|
296 |
<td>MT-Bench</td>
|
297 |
+
<td>6.9</td>
|
298 |
+
<td>7.3</td>
|
299 |
+
<td><strong>7.7</strong></td>
|
300 |
</tr>
|
301 |
|
302 |
<tr>
|
303 |
<td>AlignBench v1.1</td>
|
304 |
+
<td>3.8</td>
|
305 |
<td>5.5</td>
|
306 |
+
<td><strong>5.9</strong></td>
|
307 |
</tr>
|
308 |
|
309 |
<tr>
|
310 |
<td>GSM8K</td>
|
311 |
+
<td>82.0</td>
|
312 |
<td><strong>82.8</strong></td>
|
313 |
+
<td>82.0</td>
|
314 |
</tr>
|
315 |
|
316 |
<tr>
|
|
|
323 |
<tr>
|
324 |
<td>AMC23</td>
|
325 |
<td>22.5</td>
|
326 |
+
<td>20.0</td>
|
327 |
+
<td><strong>35.0</strong></td>
|
328 |
</tr>
|
329 |
|
330 |
<tr>
|
|
|
345 |
<td>MMLU-redux</td>
|
346 |
<td>58.5</td>
|
347 |
<td>58.2</td>
|
348 |
+
<td><strong>59.0</strong></td>
|
349 |
</tr>
|
350 |
|
351 |
<tr>
|
|
|
357 |
|
358 |
<tr>
|
359 |
<td>HumanEval</td>
|
360 |
+
<td>61.0</td>
|
361 |
<td><strong>62.8</strong></td>
|
362 |
<td>60.4</td>
|
363 |
</tr>
|
|
|
373 |
<td>LiveCodeBench<br>2408-2411</td>
|
374 |
<td>8.3</td>
|
375 |
<td>7.1</td>
|
376 |
+
<td><strong>9.0</strong></td>
|
377 |
</tr>
|
378 |
|
379 |
<tr>
|
|
|
384 |
</tr>
|
385 |
</table>
|
386 |
|
387 |
+
## Citation
|
|
|
388 |
```
|
389 |
@article{yang2024wrpo,
|
390 |
title={Weighted-Reward Preference Optimization for Implicit Model Fusion},
|