datasets:
- EleutherAI/pile
- Open-Orca/OpenOrca
- GAIR/lima
- WizardLM/WizardLM_evol_instruct_V2_196k
language:
- en
license: llama3
tags:
- biology
- medical
Instruction Pre-Training: Language Models are Supervised Multitask Learners
This repo contains the biomedicine model developed from Llama3-8B in our paper Instruction Pre-Training: Language Models are Supervised Multitask Learners.
We explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. Instruction Pre-Training outperforms Vanilla Pre-training in both general pre-training from scratch and domain-adaptive continual pre-training. In pre-training from scratch, Instruction Pre-Training not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B.
Resources
🤗 We share our data and models with example usages, feel free to open any issues or discussions! 🤗
- Context-Based Instruction Synthesizer: instruction-synthesizer
- Fine-Tuning Data for the Synthesizer: ft-instruction-synthesizer-collection
- General Models Pre-Trained from Scratch:
- Domain-Specific Models Pre-Trained from Llama3-8B:
Domain-Adaptive Continued Pre-Training
Following AdaptLLM, we augment the domain-specific raw corpora with instruction-response pairs generated by our context-based instruction synthesizer.
For example, to chat with the biomedicine-Llama3-8B model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("instruction-pretrain/medicine-Llama3-8B")
tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/medicine-Llama3-8B")
# Put your input here, NO prompt template is required
user_input = '''Question: Which of the following is an example of monosomy?
Options:
- 46,XX
- 47,XXX
- 69,XYY
- 45,X
Please provide your choice first and then provide explanations if possible.'''
inputs = tokenizer(user_input, return_tensors="pt", add_special_tokens=True).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=400)[0]
answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)
print(pred)
Citation
If you find our work helpful, please cite us:
@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}