arxiv:2406.14491

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Published on Jun 20
· Submitted by daixuancheng on Jun 21
#2 Paper of the day

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.

Community

Paper author · Paper submitter

🤗 We share our data and models with example usages, feel free to open any issues or discussions! 🤗

Hi, what is the difference between instruction fine-tuning and instruction pre-training (in terms of training) as discussed in the paper, aside from the fact that in IFT we normally use parameter-efficient techniques like LoRA to update only a portion of the parameters?


Hi, thanks for your interest. Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as vanilla pre-training. In our experiments with instruction tuning, we tune all the parameters, but I think PEFT methods would also be applicable!
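
For illustration, here is a minimal sketch (not the released code) of what "same settings, different data" means: the causal-LM training step is identical in both cases, and only the input text changes. gpt2 is a stand-in model, and the Q:/A: delimiters are illustrative rather than the paper's exact format.

```python
# Minimal sketch: the causal-LM training step is identical for vanilla and
# instruction pre-training; only the text fed into it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the real LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

def pretraining_step(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt")
    # Next-token prediction: labels are simply the input ids themselves.
    return model(**batch, labels=batch["input_ids"]).loss

loss_vanilla = pretraining_step("Some raw corpus text ...")
loss_instruct = pretraining_step(
    "Some raw corpus text ...\n\nQ: <synthesized instruction>\nA: <synthesized response>"
)
```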


Hi - is the dataset you released the 200M one?


Hi,

This is the dataset we use to train the instruction synthesizer. We've been thinking about how to upload the pre-training data (including the 200M instruction-response pairs), but the dataset is too large🤔.

Hi, thanks for your work! I got a few interesting insights from this.

  1. Based on your results, can I say that we can replace "pre-train on the raw corpora -> instruction tuning" with "instruction tuning" directly? I do not see large differences between typical instruction tuning and your proposed instruction pre-training, except that in your approach you train directly on the instruction-response pairs.

  2. Do you perform any verification on the generated instruction-response pairs?


Hi,

Thanks for your question!

Q1: Can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly?

This is a promising approach worth trying. However, it may come with two limitations:

  1. Lack of a Knowledge Source: Instruction Pre-Training does not train on the instruction-response pairs alone. Instead, it trains on the concatenation of the raw text and the synthesized pairs, framing each example as context-based task completion (e.g., reading comprehension) so that the model learns the knowledge embedded in the raw text (see the sketch after this list). Vanilla instruction tuning (without the raw text) tends to teach the pre-trained base model to follow instructions rather than to learn new knowledge.

For example, in the following image, vanilla pre-training trains on the raw texts, instruction tuning trains on instruction-response pairs, whereas instruction pre-training trains on the instruction-augmented texts.
[Image: example data formats for vanilla pre-training, instruction tuning, and instruction pre-training]

  2. Data Limitation: As far as I know, the existing datasets for instruction tuning are significantly smaller than those available for pre-training.
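
To make the contrast in point 1 concrete, here is an illustrative sketch of the two data formats; the paper's exact templating may differ.

```python
# Illustrative contrast of the two data formats (templating is made up here).
raw_text = "The mitochondrion generates most of the cell's supply of ATP."
instruction = "According to the text, what does the mitochondrion generate?"
response = "Most of the cell's supply of ATP."

# Vanilla instruction tuning: the pair stands alone, so there is no knowledge
# source -- the model mostly learns the instruction-following format.
tuning_example = f"Instruction: {instruction}\nResponse: {response}"

# Instruction Pre-Training: the pair is appended to the raw text, turning the
# example into context-based task completion (reading comprehension), so the
# model also absorbs the knowledge carried by the text itself.
pretraining_example = f"{raw_text}\n\nInstruction: {instruction}\nResponse: {response}"
```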

Q2: Do you perform any verification on the generated instruction-response pairs?

Yes, in Section 5 of our paper, we checked the synthesized pairs in terms of context relevance, response accuracy, and task diversity.
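
As an aside (this is not the Section 5 procedure itself, which is in the paper): a crude, illustrative grounding check one could run over synthesized (context, instruction, response) triples before training might look like this.

```python
# Toy grounding filter -- a cheap proxy, not the paper's evaluation.

def rough_context_relevance(context: str, instruction: str, response: str) -> float:
    """Fraction of content words from the pair that also appear in the
    context: a rough proxy for 'is this pair grounded in the text?'."""
    ctx_words = set(context.lower().split())
    pair_words = [w for w in f"{instruction} {response}".lower().split() if len(w) > 3]
    return sum(w in ctx_words for w in pair_words) / max(len(pair_words), 1)

triples = [  # toy data for illustration
    ("Photosynthesis converts light into chemical energy.",
     "What does photosynthesis convert light into?",
     "Chemical energy."),
]
flagged = [t for t in triples if rough_context_relevance(*t) < 0.3]  # arbitrary threshold
```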

[Image: Section 5 quality evaluation of the synthesized pairs]

In Table 3, the result of Instruct PT on Med is 61.3 and on Fin is 74.7, but in Table 4 the results are reversed. Is this a mistake?


Thanks for your careful review. The domain names in Table 4 should be reversed.

Have you given any thought to training a version on a higher ratio of code?

I am interested in finding a way to generate more up-to-date datasets on code libraries, examples, etc., since many of today's LLMs rely on quite dated knowledge and thus often generate code with deprecated libraries and patterns. Just wondering if a more code-tuned version of this might do the trick.


Hi,

Thanks for your suggestion. You could try using our instruction synthesizer to generate instruction-response pairs based on new code materials, such as updated code textbooks. We've observed that the instruction synthesizer can generate relevant code tasks when the input text is related to the coding domain. This approach might help in creating more up-to-date datasets for code libraries and examples.
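
As a rough sketch of that workflow: the checkpoint id and the plain raw-text-in interface below are assumptions, so consult the repo (https://github.com/microsoft/LMOps) for the actual prompt format and usage.

```python
# Sketch only: the model id and input format are assumptions, not confirmed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "instruction-pretrain/instruction-synthesizer"  # assumed HF id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_code_doc = (
    "# hypothetical migration note\n"
    "mylib.fetch() is deprecated as of v2.0; use mylib.client.get() instead."
)

# Feed fresh code documentation in; decode only the newly generated tokens,
# which should contain synthesized instruction-response pairs.
inputs = tokenizer(new_code_doc, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```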


Models citing this paper 14


Datasets citing this paper 2

Spaces citing this paper 1

Collections including this paper 23