instruction-pretrain committed: Update README.md

README.md

This repo contains the **context-based instruction synthesizer** used in our paper.

We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of *Instruction Pre-Training*. ***Instruction Pre-Training* outperforms *Vanilla Pre-training* in both general pre-training from scratch and domain-adaptive continual pre-training.** In pre-training from scratch, *Instruction Pre-Training* not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, *Instruction Pre-Training* enables Llama3-8B to be comparable to or even outperform Llama3-70B.

<p align='center'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/vRdsFIVQptbNaGiZ18Lih.png" width="400">
</p>
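
To make the core idea concrete, the sketch below shows one hypothetical way a raw document could be augmented with synthesized instruction-response pairs before pre-training; the field names and the concatenation template are illustrative, not the paper's exact format.

```python
# Hypothetical augmentation of a raw document with synthesized
# instruction-response pairs; the template is illustrative only.
raw_text = "Photosynthesis converts light energy into chemical energy that is stored in glucose."

synthesized_pairs = [
    {
        "instruction": "What form of energy does photosynthesis produce?",
        "response": "Chemical energy stored in glucose.",
    },
]

# Concatenate the raw text with its instruction-response pairs into a single
# training sequence, so the model is pre-trained on both together.
augmented_example = raw_text + "\n\n" + "\n\n".join(
    f"Q: {pair['instruction']}\nA: {pair['response']}" for pair in synthesized_pairs
)
print(augmented_example)
```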

## Synthesize Instruction-Response Pairs from Any Raw Corpora
We conduct multitask fine-tuning on a language model to develop an instruction synthesizer capable of generating instruction-response pairs from any raw text.

<p align='center'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
</p>
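
To make the synthesizer's training signal concrete, each fine-tuning example can be viewed as a mapping from a raw context to the instruction-response pairs grounded in it; the schema below is a hypothetical sketch rather than the released data format.

```python
# Hypothetical shape of one multitask fine-tuning example for the synthesizer:
# the input is a raw context, the target is its instruction-response pairs.
# Field names and serialization are illustrative, not the repo's schema.
example = {
    "context": "Mount Everest, at 8,849 metres, is Earth's highest mountain above sea level.",
    "qa_pairs": [
        {"q": "How tall is Mount Everest?", "a": "8,849 metres."},
        {"q": "Which mountain is Earth's highest above sea level?", "a": "Mount Everest."},
    ],
}

# Serialize into an (input, target) pair for causal-LM fine-tuning.
input_text = example["context"]
target_text = "\n".join(f"Q: {p['q']}\nA: {p['a']}" for p in example["qa_pairs"])
print(input_text, target_text, sep="\n---\n")
```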

The fine-tuning data are available at [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection).
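
One way to inspect that collection locally (a sketch that only assumes it is a standard Hugging Face dataset repo) is to download a snapshot and browse its files before wiring it into a training script.

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the fine-tuning data collection and list the files it contains;
# the subset layout is whatever the dataset repo provides.
local_dir = snapshot_download(
    repo_id="instruction-pretrain/ft-instruction-synthesizer-collection",
    repo_type="dataset",
)
for path in sorted(Path(local_dir).rglob("*")):
    if path.is_file():
        print(path.relative_to(local_dir))
```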

To prompt the synthesizer to generate instruction-response pairs based on a given raw text:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
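
# The checkpoint id and the plain text-in prompting below are assumptions made
# for illustration; check the model card for the exact prompt template the
# synthesizer expects.
model_name = "instruction-pretrain/instruction-synthesizer"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A raw text passage to condition the synthesizer on.
raw_text = (
    "The immune system is a network of biological processes that protects "
    "an organism from diseases. It detects and responds to a wide variety "
    "of pathogens."
)

# The synthesizer is fine-tuned to turn a raw context into instruction-response
# pairs, so we generate a continuation and decode only the new tokens.
inputs = tokenizer(raw_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=400, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The checkpoint name, prompt format, and decoding settings above are assumptions rather than the repo's official usage; the model card and the paper's code release are the authoritative references.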