Text Generation · Transformers · Safetensors · English · mistral · text-generation-inference · Inference Endpoints
instruction-pretrain committed (verified)
Commit 4eab7f5 · 1 Parent(s): 2ada9cd

Update README.md

Files changed (1)
README.md +5 -3
README.md CHANGED
````diff
@@ -9,17 +9,19 @@ This repo contains the **context-based instruction synthesizer** used in our pap
 We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of *Instruction Pre-Training*. ***Instruction Pre-Training* outperforms *Vanilla Pre-training* in both general pre-training from scratch and domain-adaptive continued pre-training.** In pre-training from scratch, *Instruction Pre-Training* not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, *Instruction Pre-Training* enables Llama3-8B to be comparable to or even outperform Llama3-70B.
 
 <p align='center'>
- <img src="./hf_intro.png" width="400">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/vRdsFIVQptbNaGiZ18Lih.png" width="400">
 </p>
 
 ## Synthesize Instruction-Response Pairs from Any Raw Corproa
 We conduct multitask fine-tuning on a language model to develop an instruction synthesizer capable of generating instruction-response pairs from any raw text.
 
 <p align='center'>
- <img src="./hf_synthesizer.png" width="700">
+ <img src="./https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
 </p>
 
- An example script to prompt the synthesizer to generate instruction-response pairs based on the given raw text is:
+ The fine-tuning data are available at [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
+
+ To prompt the synthesizer to generate instruction-response pairs based on a given raw text:
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
````
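The diff hunk ends just after the import line, so the usage example itself is not shown here. As a rough illustration of how such a synthesizer might be prompted with the standard `transformers` API, here is a minimal sketch. The model id `instruction-pretrain/instruction-synthesizer`, the example raw text, and the prompt layout are assumptions for illustration, not the repository's official template; see the full README in the repo for the exact usage.

```python
# Minimal sketch, not the repository's official script; assumptions are marked below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "instruction-pretrain/instruction-synthesizer"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any raw text can serve as the context for synthesis (this passage is made up).
raw_text = (
    "Photosynthesis converts light energy into chemical energy stored in glucose. "
    "It takes place in the chloroplasts of plant cells and releases oxygen as a by-product."
)

# Hypothetical prompt layout: hand the raw text to the model and let it continue with
# instruction-response pairs; the model card defines the actual template to use.
inputs = tokenizer(raw_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    do_sample=False,  # greedy decoding for reproducible pairs
)

# Decode only the newly generated tokens (the synthesized instruction-response pairs).
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

In practice one would batch many raw passages, parse the generations into (instruction, response) pairs, and mix them back into the pre-training corpus, which is the augmentation step the paper describes.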