license: afl-3.0
reStructured Pre-training (RST)
official repository, paper, easter eggs
RST is a new paradigm for language pre-training, which
- unifies 26 different types of signal from 10 data sources (Rotten Tomatoes, Daily Mail, Wikipedia, Wikidata, wikiHow, WordNet, arXiv, etc.) in a structured way and pre-trains a single monolithic model on them,
- surpasses strong competitors (e.g., T0) on 52 of 55 popular datasets spanning a variety of NLP tasks (classification, IE, retrieval, generation, etc.),
- achieves superior performance on the National College Entrance Examination (Gaokao-English, 高考-英语): it scores 40 points higher than the average student and 15 points higher than GPT-3 while using 1/16 of the parameters. In particular, Qin scores 138.5 (out of a full mark of 150) on the 2018 English exam.
In such a pre-training paradigm,
- Data-centric Pre-training: the role of data will be re-emphasized, and model pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing
- Pre-training over JSON instead of TEXT: a good storage mechanism should not only be able to cache a large amount of data but also make that data easy to access.
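As a rough illustration of the "JSON instead of TEXT" idea, the sketch below stores one signal as a structured record and renders it into a text-to-text pair only at access time. The field names (`source`, `signal`, `review`, `rating`), the sample review, and the `to_prompt` helper are all hypothetical and chosen for illustration; the `TEXT: ... QUERY: ...` layout mirrors the usage example in the "Have a try?" section and may not match the exact templates used for pre-training.

```python
import json

# Hypothetical structured record for a (review, rating) signal.
# The review text is a made-up sample, not taken from the released data.
record = {
    "source": "rotten_tomatoes",
    "signal": "sentiment",
    "review": "A warm, funny film with a terrific cast.",
    "rating": "positive",
}


def to_prompt(rec):
    """Render a stored record into an (input, target) text pair."""
    query = 'Is this review "positive" or "negative"'
    return f"TEXT: {rec['review']} QUERY: {query}", rec["rating"]


# Storing: the record round-trips through JSON without losing its fields.
stored = json.dumps(record)
restored = json.loads(stored)

# Accessing: the same record is turned into a prompt on demand.
source_text, target_text = to_prompt(restored)
print(source_text)
print(target_text)
```

Keeping the fields separate until access time means the same record could be rendered with a different query (for example, asking for the review given the rating) without re-collecting the data.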
Model Description
We release all models introduced in our paper, covering 13 different application scenarios. Each model contains 11 billion parameters.
Model | Description | Recommended Application |
---|---|---|
rst-all-11b | Trained with all the signals below except signals that are used to train Gaokao models | All applications below (specialized models are recommended first if high performance is preferred) |
rst-fact-retrieval-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym, wikiHow category hierarchy, Wikidata relation, Wikidata entity typing, Paperswithcode entity typing | Knowledge-intensive tasks, information extraction tasks, fact checking |
rst-summarization-11b | Trained with the following signals: DailyMail summary, Paperswithcode summary, arXiv summary, wikiHow summary | Summarization or other general generation tasks, meta-evaluation (e.g., BARTScore) |
rst-temporal-reasoning-11b | Trained with the following signals: DailyMail temporal information, wikiHow procedure | Temporal reasoning, relation extraction, event-based extraction |
rst-information-extraction-11b | Trained with the following signals: Paperswithcode entity, Paperswithcode entity typing, Wikidata entity typing, Wikidata relation, Wikipedia entity | Named entity recognition, relation extraction and other general IE tasks in the news, scientific or other domains |
rst-intent-detection-11b | Trained with the following signals: wikiHow goal-step relation | Intent prediction, event prediction |
rst-topic-classification-11b | Trained with the following signals: DailyMail category, arXiv category, wikiHow text category, Wikipedia section title | General text classification |
rst-word-sense-disambiguation-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym | Word sense disambiguation, part-of-speech tagging, general IE tasks, common sense reasoning |
rst-natural-language-inference-11b | Trained with the following signals: ConTRoL dataset, DREAM dataset, LogiQA dataset, RACE & RACE-C dataset, ReClor dataset, DailyMail temporal information | Natural language inference, multiple-choice question answering, reasoning |
rst-sentiment-classification-11b | Trained with the following signals: Rotten Tomatoes sentiment, Wikipedia sentiment | Sentiment classification, emotion classification |
rst-gaokao-rc-11b | Trained with multiple-choice QA datasets that are used to train the T0pp model | General multiple-choice question answering |
rst-gaokao-cloze-11b | Trained with manually crafted cloze datasets | General cloze filling |
rst-gaokao-writing-11b | Trained with example essays from past Gaokao-English exams and grammar error correction signals | Essay writing, story generation, grammar error correction and other text generation tasks |
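The model names in the table follow a regular `rst-<scenario>-11b` pattern, so a small lookup helper can pick a checkpoint by scenario. This is a sketch under one assumption: only `XLab/rst-all-11b` appears in this README, and the helper assumes the other checkpoints are published under the same `XLab/` namespace. The `checkpoint_for` function is hypothetical.

```python
# Scenario names copied from the "Model" column of the table above.
SCENARIOS = [
    "all",
    "fact-retrieval",
    "summarization",
    "temporal-reasoning",
    "information-extraction",
    "intent-detection",
    "topic-classification",
    "word-sense-disambiguation",
    "natural-language-inference",
    "sentiment-classification",
    "gaokao-rc",
    "gaokao-cloze",
    "gaokao-writing",
]


def checkpoint_for(scenario):
    """Return the assumed Hub identifier for a scenario listed in the table."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    # Assumption: every checkpoint lives under the XLab/ namespace.
    return f"XLab/rst-{scenario}-11b"


print(checkpoint_for("summarization"))
```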
Have a try?
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("XLab/rst-all-11b")
model = AutoModelForSeq2SeqLM.from_pretrained("XLab/rst-all-11b")
inputs = tokenizer.encode("TEXT: this is the best cast iron skillet you will ever buy. QUERY: Is this review \"positive\" or \"negative\"", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
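To query several inputs at once, the prompt can be built by a small helper and the batch passed through the tokenizer together. This is a minimal sketch: `build_prompt` and `generate_answers` are hypothetical helpers, the `TEXT: ... QUERY: ...` layout is inferred from the single example above, and the second review is a made-up sample. `generate_answers` expects the `tokenizer` and `model` objects loaded as shown above and is not called here, since the checkpoint has 11 billion parameters.

```python
def build_prompt(text, query):
    """Format one input in the TEXT/QUERY layout used in the example above."""
    return f"TEXT: {text.strip()} QUERY: {query.strip()}"


def generate_answers(prompts, tokenizer, model, max_new_tokens=16):
    """Run a batch of prompts through an already-loaded seq2seq model."""
    # Pad so that prompts of different lengths fit in one tensor.
    batch = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


query = 'Is this review "positive" or "negative"'
reviews = [
    "this is the best cast iron skillet you will ever buy.",
    "the handle snapped off after a week.",  # made-up sample
]
prompts = [build_prompt(review, query) for review in reviews]
for prompt in prompts:
    print(prompt)
```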
Data for reStructured Pre-training
This dataset is a precious treasure, containing a variety of naturally occurring signals. Any downstream task you can think of (e.g., the college entrance exam mentioned in the RST paper) can benefit from being pre-trained on some of our provided signals. We spent several months collecting the following 29 signal types, accounting for a total of 46,926,447 data samples. We hope this dataset will be a valuable asset for everyone in natural language processing research.
We provide the collected signals through DataLab. For efficiency, we provide at most 50,000 samples for each signal type. If you want all the samples we collected, please fill out this form. More specifically, we collected the following signals.
We will be happy :smiley: to know if the resource is helpful for your work, and please cite our work :blush:
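A minimal sketch of loading one signal, assuming DataLab is installed with `pip install datalabs` and exposes `load_dataset` from the `datalabs` package. The `load_dataset("rst", ...)` call is taken from the "Use in DataLab" column of the table below; the `load_signal` wrapper and the `KNOWN_CONFIGS` subset are hypothetical conveniences.

```python
# A few config names copied from the "Use in DataLab" column of the table.
KNOWN_CONFIGS = {
    "rotten_tomatoes_sentiment",
    "daily_mail_summary",
    "wikihow_goal_step",
    "wordnet_pos",
    "qa_race",
    "arxiv_category",
}


def load_signal(config_name):
    """Load one RST signal through DataLab, rejecting unknown names early."""
    if config_name not in KNOWN_CONFIGS:
        raise ValueError(f"unknown signal config: {config_name!r}")
    # Assumption: the DataLab package is importable as `datalabs`.
    # Imported lazily so the name check works without the dependency.
    from datalabs import load_dataset

    return load_dataset("rst", config_name)
```

Calling `load_signal("wordnet_pos")` would then download that signal; the structure of the returned object depends on the DataLab version and is not shown here.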
Mine | Signal | #Sample | Use in DataLab | Some Applications |
---|---|---|---|---|
Rotten Tomatoes | (review, rating) | 5,311,109 | load_dataset("rst", "rotten_tomatoes_sentiment") | Sentiment classification |
Daily Mail | (text, category) | 899,904 | load_dataset("rst", "daily_mail_category") | Topic classification |
Daily Mail | (title, text, summary) | 1,026,616 | load_dataset("rst", "daily_mail_summary") | Summarization; Sentence expansion |
Daily Mail | (text, events) | 1,006,412 | load_dataset("rst", "daily_mail_temporal") | Temporal reasoning |
Wikidata | (entity, entity_type, text) | 2,214,274 | load_dataset("rst", "wikidata_entity") | Entity typing |
Wikidata | (subject, object, relation, text) | 1,526,674 | load_dataset("rst", "wikidata_relation") | Relation extraction; Fact retrieval |
wikiHow | (text, category) | 112,109 | load_dataset("rst", "wikihow_text_category") | Topic classification |
wikiHow | (low_category, high_category) | 4,868 | load_dataset("rst", "wikihow_category_hierarchy") | Relation extraction; Commonsense reasoning |
wikiHow | (goal, steps) | 47,956 | load_dataset("rst", "wikihow_goal_step") | Intent detection |
wikiHow | (text, summary) | 703,278 | load_dataset("rst", "wikihow_summary") | Summarization; Sentence expansion |
wikiHow | (goal, first_step, second_step) | 47,787 | load_dataset("rst", "wikihow_procedure") | Temporal reasoning |
wikiHow | (question, description, answer, related_questions) | 47,705 | load_dataset("rst", "wikihow_question") | Question generation |
Wikipedia | (text, entities) | 22,231,011 | load_dataset("rst", "wikipedia_entities") | Entity recognition |
Wikipedia | (texts, titles) | 3,296,225 | load_dataset("rst", "wikipedia_sections") | Summarization |
WordNet | (word, sentence, pos) | 27,123 | load_dataset("rst", "wordnet_pos") | Part-of-speech tagging |
WordNet | (word, sentence, meaning, possible_meanings) | 27,123 | load_dataset("rst", "wordnet_meaning") | Word sense disambiguation |
WordNet | (word, sentence, synonyms) | 17,804 | load_dataset("rst", "wordnet_synonym") | Paraphrasing |
WordNet | (word, sentence, antonyms) | 6,408 | load_dataset("rst", "wordnet_antonym") | Negation |
ConTRoL | (premise, hypothesis, label) | 8,323 | load_dataset("rst", "qa_control") | Natural language inference |
DREAM | (context, question, options, answer) | 9,164 | load_dataset("rst", "qa_dream") | Reading comprehension |
LogiQA | (context, question, options, answer) | 7,974 | load_dataset("rst", "qa_logiqa") | Reading comprehension |
ReClor | (context, question, options, answer) | 5,138 | load_dataset("rst", "qa_reclor") | Reading comprehension |
RACE | (context, question, options, answer) | 44,880 | load_dataset("rst", "qa_race") | Reading comprehension |
RACE-C | (context, question, options, answer) | 5,093 | load_dataset("rst", "qa_race_c") | Reading comprehension |
TriviaQA | (context, question, answer) | 46,636 | load_dataset("rst", "qa_triviaqa") | Reading comprehension |
Arxiv | (text, category) | 1,696,348 | load_dataset("rst", "arxiv_category") | Topic classification |
Arxiv | (text, summary) | 1,696,348 | load_dataset("rst", "arxiv_summary") | Summarization; Sentence expansion |
Paperswithcode | (text, entities, datasets, methods, tasks, metrics) | 4,731,233 | load_dataset("rst", "paperswithcode_entity") | Entity recognition |
Paperswithcode | (text, summary) | 120,924 | load_dataset("rst", "paperswithcode_summary") | Summarization; Sentence expansion |
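The per-signal counts in the table can be checked against the totals quoted earlier, and combined with the 50,000-sample limit to estimate how much data is directly downloadable. The sketch below copies the counts from the table; reading the limit as a simple per-config cap of `min(count, 50_000)` is an assumption about how it is applied.

```python
# Sample counts copied from the "#Sample" column, keyed by DataLab config name.
SAMPLE_COUNTS = {
    "rotten_tomatoes_sentiment": 5_311_109,
    "daily_mail_category": 899_904,
    "daily_mail_summary": 1_026_616,
    "daily_mail_temporal": 1_006_412,
    "wikidata_entity": 2_214_274,
    "wikidata_relation": 1_526_674,
    "wikihow_text_category": 112_109,
    "wikihow_category_hierarchy": 4_868,
    "wikihow_goal_step": 47_956,
    "wikihow_summary": 703_278,
    "wikihow_procedure": 47_787,
    "wikihow_question": 47_705,
    "wikipedia_entities": 22_231_011,
    "wikipedia_sections": 3_296_225,
    "wordnet_pos": 27_123,
    "wordnet_meaning": 27_123,
    "wordnet_synonym": 17_804,
    "wordnet_antonym": 6_408,
    "qa_control": 8_323,
    "qa_dream": 9_164,
    "qa_logiqa": 7_974,
    "qa_reclor": 5_138,
    "qa_race": 44_880,
    "qa_race_c": 5_093,
    "qa_triviaqa": 46_636,
    "arxiv_category": 1_696_348,
    "arxiv_summary": 1_696_348,
    "paperswithcode_entity": 4_731_233,
    "paperswithcode_summary": 120_924,
}

CAP = 50_000  # assumed per-config download limit

total = sum(SAMPLE_COUNTS.values())
available = sum(min(count, CAP) for count in SAMPLE_COUNTS.values())

print(f"signal types: {len(SAMPLE_COUNTS)}")
print(f"total samples: {total:,}")
print(f"samples under the assumed cap: {available:,}")
```

The 29 rows sum to the 46,926,447 samples stated above.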
BibTeX for Citation Info
@article{yuan2022restructured,
title={reStructured Pre-training},
author={Yuan, Weizhe and Liu, Pengfei},
journal={arXiv preprint arXiv:2206.11147},
year={2022}
}