---
license: mit
---

# mPMR: A Multilingual Pre-trained Machine Reader at Scale
|
Multilingual Pre-trained Machine Reader (mPMR) is a multilingual extension of PMR. mPMR is pre-trained with 18 million Machine Reading Comprehension (MRC) examples constructed from Wikipedia hyperlinks. It was introduced in the paper *mPMR: A Multilingual Pre-trained Machine Reader at Scale* by Weiwen Xu, Xin Li, Wai Lam, and Lidong Bing, and first released in [this repository](https://github.com/DAMO-NLP-SG/PMR).

This model is initialized with xlm-roberta-base and further pre-trained with an MRC objective.
|
## Model description

The model is pre-trained with distantly labeled data using a learning objective called Wiki Anchor Extraction (WAE).

Specifically, we constructed a large volume of general-purpose and high-quality MRC-style training data based on Wikipedia anchors (i.e., hyperlinked texts). For each Wikipedia anchor, we composed a pair of correlated articles. One side of the pair is the Wikipedia article that contains detailed descriptions of the hyperlinked entity, which we defined as the definition article. The other side of the pair is the article that mentions the specific anchor text, which we defined as the mention article. We composed an MRC-style training instance in which the anchor is the answer, the surrounding passage of the anchor in the mention article is the context, and the definition of the anchor entity in the definition article is the query.

Based on the above data, we then introduced a novel WAE problem as the pre-training task of mPMR. In this task, mPMR determines whether the context and the query are relevant. If so, mPMR extracts the answer from the context that satisfies the query description.
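For concreteness, the sketch below shows what a single WAE-style pre-training instance could look like. The field names and example texts are hypothetical and do not reflect the exact format of the released pre-training data.

```python
# Illustrative sketch of one WAE-style pre-training instance.
# Field names and example texts are hypothetical, not the released data format.

context = (
    "Most integrated circuits are fabricated on wafers made of silicon, "
    "which is purified before processing."
)
anchor = "silicon"  # the hyperlinked text (anchor) in the mention article

wae_instance = {
    # Query: definition of the anchor entity, taken from its definition article.
    "query": "Silicon is a chemical element with the symbol Si and atomic number 14.",
    # Context: the passage surrounding the anchor in the mention article.
    "context": context,
    # Answer: the anchor span inside the context (character offsets).
    "answer": {
        "text": anchor,
        "start": context.index(anchor),
        "end": context.index(anchor) + len(anchor),
    },
    # WAE relevance label: 1 if the query describes the entity mentioned in the context.
    "is_relevant": 1,
}

# A negative pair would reuse the same context with the definition of a different
# entity as the query; the expected behaviour is then is_relevant = 0 and no span.
```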
|
During fine-tuning, we unified downstream NLU tasks in our MRC formulation. These tasks typically fall into four categories:

1. Span extraction with pre-defined labels (e.g., NER): each task label is treated as a query used to search for the corresponding answers in the input text (context).
2. Span extraction with natural questions (e.g., EQA): the question is treated as the query for answer extraction from the given passage (context).
3. Sequence classification with pre-defined task labels, such as sentiment analysis: each task label is used as a query for the input text (context).
4. Sequence classification with natural questions on multiple choices, such as multi-choice QA (MCQA): the concatenation of the question and one choice is treated as the query for the given passage (context).

Then, in the output space, we tackle span extraction problems by predicting the probability of each context span being the answer. We tackle sequence classification problems by conducting relevance classification on `[CLS]` (extracting `[CLS]` if relevant).
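As a rough illustration of this unified formulation, the sketch below assembles (query, context) pairs for each of the four task types. The query templates and helper names are hypothetical and are not taken from the official fine-tuning code.

```python
# Illustrative sketch of casting downstream NLU tasks into the MRC format above.
# Query templates and function names are hypothetical; see the official repo for
# the exact fine-tuning code.

def build_ner_example(label: str, text: str):
    """Span extraction with pre-defined labels (e.g., NER):
    the task label becomes the query, the input text is the context."""
    query = f'Find all "{label}" entities in the text.'  # hypothetical template
    return {"query": query, "context": text, "target": "spans"}

def build_eqa_example(question: str, passage: str):
    """Span extraction with natural questions (e.g., EQA)."""
    return {"query": question, "context": passage, "target": "spans"}

def build_classification_example(label: str, text: str):
    """Sequence classification with pre-defined labels (e.g., sentiment):
    each label is probed as a query; [CLS] is extracted if the pair is relevant."""
    return {"query": f"The sentiment of the text is {label}.", "context": text, "target": "cls"}

def build_mcqa_example(question: str, choice: str, passage: str):
    """Sequence classification with natural questions (e.g., MCQA):
    the query is the concatenation of the question and one candidate choice."""
    return {"query": f"{question} {choice}", "context": passage, "target": "cls"}

# Example: probing the PERSON label on a sentence.
print(build_ner_example("PERSON", "Weiwen Xu and Xin Li proposed mPMR."))
```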
|
## Model variations

Two versions of the model are released. The details are:

| Model | Backbone | #params |
|------------|-----------|----------|
| [**mPMR-base** (this checkpoint)](https://huggingface.co/DAMO-NLP-SG/mPMR-base) | [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | 270M |
| [mPMR-large](https://huggingface.co/DAMO-NLP-SG/mPMR-large) | [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) | 550M |
|
## Intended uses & limitations

The models need to be fine-tuned on downstream task data. During fine-tuning, no task-specific layer is required.
|
### How to use

You can try the code from [this repo](https://github.com/DAMO-NLP-SG/mPMR).
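As a starting point, the sketch below loads the mPMR-base checkpoint with the standard `transformers` classes and encodes a (query, context) pair. This assumes the checkpoint is compatible with the stock XLM-RoBERTa encoder classes; the MRC extraction head and the full fine-tuning pipeline are provided in the repository above.

```python
# Minimal sketch: load the mPMR-base backbone with Hugging Face transformers.
# Assumes the checkpoint loads with the standard XLM-RoBERTa classes; the
# MRC-specific head and fine-tuning scripts live in the official repo.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("DAMO-NLP-SG/mPMR-base")
model = AutoModel.from_pretrained("DAMO-NLP-SG/mPMR-base")

# Encode a (query, context) pair; the exact input format used by mPMR is
# defined in the official fine-tuning code.
query = "Silicon is a chemical element with the symbol Si and atomic number 14."
context = "Most integrated circuits are fabricated on wafers made of silicon."
inputs = tokenizer(query, context, return_tensors="pt")

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```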
|
### BibTeX entry and citation info

```bibtex
@article{xu2022clozing,
  title={From Clozing to Comprehending: Retrofitting Pre-trained Language Model to Pre-trained Machine Reader},
  author={Xu, Weiwen and Li, Xin and Zhang, Wenxuan and Zhou, Meng and Bing, Lidong and Lam, Wai and Si, Luo},
  journal={arXiv preprint arXiv:2212.04755},
  year={2022}
}

@inproceedings{xu2022mpmr,
  title = "{mPMR}: A Multilingual Pre-trained Machine Reader at Scale",
  author = "Xu, Weiwen and Li, Xin and Lam, Wai and Bing, Lidong",
  booktitle = "The 61st Annual Meeting of the Association for Computational Linguistics",
  year = "2023"
}
```