arxiv:2406.14491

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Published on Jun 20
· Submitted by daixuancheng on Jun 21
#2 Paper of the day

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.

Community

Paper author · Paper submitter

🤗 We share our data and models with example usages, feel free to open any issues or discussions! 🤗

Hi, what is the difference between instruction fine-tuning and instruction pre-training (in terms of training) as discussed in the paper, aside from the fact that in IFT we normally use parameter-efficient techniques like LoRA to update only a portion of the parameters?


Hi, thanks for your interest. Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as vanilla pre-training. In our experiments with instruction tuning, we tune all the parameters, but I think PEFT methods would also be applicable!
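
For illustration, here is a minimal sketch (not the released code) of what "same settings, different data" means: the causal-LM training step is identical in both cases, and only the input text changes. gpt2 is a stand-in model, and the Q:/A: delimiters are illustrative rather than the paper's exact format.

```python
# Minimal sketch: the causal-LM training step is identical for vanilla and
# instruction pre-training; only the text fed into it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the real LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

def pretraining_step(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt")
    # Next-token prediction: labels are simply the input ids themselves.
    return model(**batch, labels=batch["input_ids"]).loss

loss_vanilla = pretraining_step("Some raw corpus text ...")
loss_instruct = pretraining_step(
    "Some raw corpus text ...\n\nQ: <synthesized instruction>\nA: <synthesized response>"
)
```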


Hi - is the dataset you released the 200M one?


Hi,

This is the dataset we use to train the instruction synthesizer. We've been thinking about how to upload the pre-training data (including the 200M instruction-response pairs), but the dataset is too large🤔.

Hi, thanks for your work! I got a few interesting insights from this.

  1. Based on your results, can I say that we can replace "pre-train on the raw corpora -> instruction tuning" with "instruction tuning" directly? I do not see large differences between typical instruction tuning and your proposed instruction pre-training, except that in your approach you train directly on the instruction-response pairs.

  2. Do you perform any verification on the generated instruction-response pairs?


Hi,

Thanks for your question!

Q1: Can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly?

This is a promising approach worth trying. However, it may come with two limitations:

  1. Lack of a Knowledge Source: Instruction Pre-Training does not train on the instruction-response pairs alone. Instead, it trains on the concatenation of the raw text and the synthesized pairs, framing each example as context-based task completion (e.g., reading comprehension) so that the model learns the knowledge embedded in the raw text (see the sketch after this list). Vanilla instruction tuning (without the raw text) tends to teach the pre-trained base model to follow instructions rather than to learn new knowledge.

For example, in the following image, vanilla pre-training trains on the raw texts, instruction tuning trains on instruction-response pairs, whereas instruction pre-training trains on the instruction-augmented texts.
[Image: example data formats for vanilla pre-training, instruction tuning, and instruction pre-training]

  2. Data Limitation: As far as I know, the existing datasets for instruction tuning are significantly smaller than those available for pre-training.
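
To make the contrast in point 1 concrete, here is an illustrative sketch of the two data formats; the paper's exact templating may differ.

```python
# Illustrative contrast of the two data formats (templating is made up here).
raw_text = "The mitochondrion generates most of the cell's supply of ATP."
instruction = "According to the text, what does the mitochondrion generate?"
response = "Most of the cell's supply of ATP."

# Vanilla instruction tuning: the pair stands alone, so there is no knowledge
# source -- the model mostly learns the instruction-following format.
tuning_example = f"Instruction: {instruction}\nResponse: {response}"

# Instruction Pre-Training: the pair is appended to the raw text, turning the
# example into context-based task completion (reading comprehension), so the
# model also absorbs the knowledge carried by the text itself.
pretraining_example = f"{raw_text}\n\nInstruction: {instruction}\nResponse: {response}"
```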

Q2: Do you perform any verification on the generated instruction-response pairs?

Yes, in Section 5 of our paper, we checked the synthesized pairs in terms of context relevance, response accuracy, and task diversity.
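
As an aside (this is not the Section 5 procedure itself, which is in the paper): a crude, illustrative grounding check one could run over synthesized (context, instruction, response) triples before training might look like this.

```python
# Toy grounding filter -- a cheap proxy, not the paper's evaluation.

def rough_context_relevance(context: str, instruction: str, response: str) -> float:
    """Fraction of content words from the pair that also appear in the
    context: a rough proxy for 'is this pair grounded in the text?'."""
    ctx_words = set(context.lower().split())
    pair_words = [w for w in f"{instruction} {response}".lower().split() if len(w) > 3]
    return sum(w in ctx_words for w in pair_words) / max(len(pair_words), 1)

triples = [  # toy data for illustration
    ("Photosynthesis converts light into chemical energy.",
     "What does photosynthesis convert light into?",
     "Chemical energy."),
]
flagged = [t for t in triples if rough_context_relevance(*t) < 0.3]  # arbitrary threshold
```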

[Image: Section 5 quality evaluation of the synthesized pairs]

In Table 3, the result of Instruct PT on Med is 61.3 and on Fin is 74.7, but in Table 4 the results are reversed. Is this a mistake?


Thanks for your careful review. The domain names in Table 4 should be reversed.

Have you given any thought to training a version on a higher ratio of code?

I am interested in finding a way to generate more up-to-date datasets on code libraries, examples, etc., since many of today's LLMs rely on quite dated knowledge and thus often generate code with deprecated libraries and patterns. Just wondering if a more code-tuned version of this might do the trick.


Hi,

Thanks for your suggestion. You could try using our instruction synthesizer to generate instruction-response pairs based on new code materials, such as updated code textbooks. We've observed that the instruction synthesizer can generate relevant code tasks when the input text is related to the coding domain. This approach might help in creating more up-to-date datasets for code libraries and examples.
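
As a rough sketch of that workflow: the checkpoint id and the plain raw-text-in interface below are assumptions, so consult the repo (https://github.com/microsoft/LMOps) for the actual prompt format and usage.

```python
# Sketch only: the model id and input format are assumptions, not confirmed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "instruction-pretrain/instruction-synthesizer"  # assumed HF id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_code_doc = (
    "# hypothetical migration note\n"
    "mylib.fetch() is deprecated as of v2.0; use mylib.client.get() instead."
)

# Feed fresh code documentation in; decode only the newly generated tokens,
# which should contain synthesized instruction-response pairs.
inputs = tokenizer(new_code_doc, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```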


Models citing this paper 14


Datasets citing this paper 2

Spaces citing this paper 1

Collections including this paper 23