Model Card for EntityCS-39-PEP_MS_MLM-xlmr-base

This model was trained on the EntityCS corpus, an English Wikipedia corpus in which entities are replaced with their equivalents in other languages.
The corpus is available at https://huggingface.co/huawei-noah/entity_cs; see the link for more details.

Firstly, we employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
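
As a rough illustration of the 80-10-10 scheme described above, the following Python sketch corrupts a list of subwords. The function name `mlm_mask`, the toy vocabulary, and the use of `None` for positions that are not predicted are assumptions made for this example; this is not the training code.

```python
import random

MASK = "[MASK]"

def mlm_mask(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original subword at predicted positions
    and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        # Each subword becomes a masking candidate with probability select_prob.
        if rng.random() >= select_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: Random subword
        else:
            corrupted.append(tok)                # 10%: Same (left unchanged)
    return corrupted, labels

if __name__ == "__main__":
    toy_vocab = ["cat", "dog", "tree", "river"]  # hypothetical vocabulary
    print(mlm_mask("the quick brown fox jumps".split(), toy_vocab,
                   rng=random.Random(0)))
```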

To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different languages. Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (WEP) and Partial Entity Prediction (PEP).

In WEP, motivated by Sun et al. (2019) where whole word masking is also adopted, we consider all the words (and consequently subwords) inside an entity as masking candidates. Then, 80% of the time we mask every subword inside an entity, and 20% of the time we keep the subwords intact. Note that, as our goal is to predict the entire masked entity, we do not allow replacing with Random subwords, since it can introduce noise and result in the model predicting incorrect entities. After entities are masked, we remove the entity indicators <e>, </e> from the sentences before feeding them to the model.
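
The WEP procedure can be sketched as follows. This is a minimal example under stated assumptions: the input is a flat list of subwords with `<e>` and `</e>` markers, `p` is the probability that an entity becomes a masking candidate (1.0 here, meaning all entities), and entities that are kept intact are still treated as prediction targets. The function name `wep_mask` is hypothetical.

```python
import random

MASK = "[MASK]"

def wep_mask(tokens, p=1.0, rng=None):
    """Whole Entity Prediction sketch.

    Returns (corrupted_tokens, labels) with the <e>, </e> markers removed.
    labels[i] is the original subword at predicted positions, else None.
    """
    rng = rng or random.Random()
    out, labels = [], []
    inside, action = False, None
    for tok in tokens:
        if tok == "<e>":
            inside = True
            # Decide once per entity, so all of its subwords share one fate.
            if rng.random() >= p:
                action = "skip"   # entity is not a masking candidate
            elif rng.random() < 0.8:
                action = "mask"   # 80%: mask every subword of the entity
            else:
                action = "same"   # 20%: keep the subwords intact
            continue
        if tok == "</e>":
            inside = False
            continue
        if not inside or action == "skip":
            out.append(tok)
            labels.append(None)
        elif action == "mask":
            out.append(MASK)
            labels.append(tok)
        else:
            # Assumption: intact entity subwords are still predicted.
            out.append(tok)
            labels.append(tok)
    return out, labels

if __name__ == "__main__":
    sentence = ["The", "<e>", "Eif", "fel", "Tower", "</e>", "is", "in",
                "<e>", "Paris", "</e>"]
    print(wep_mask(sentence, rng=random.Random(0)))
```

Because no Random replacement is applied, every entity in the output is either fully masked or fully intact.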

For PEP, we also consider all entities as masking candidates. In contrast to WEP, we do not force the subwords belonging to one entity to be either all masked or all unmasked. Instead, each individual entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. The first, PEP_MRS, corresponds to the conventional 80-10-10 masking strategy, where 10% of the candidates are replaced with Random subwords and the other 10% are kept unchanged. In the second setting, PEP_MS, we remove the 10% Random subword substitution, i.e. we predict only the 80% masked subwords and the 10% Same subwords from the masking candidates. In the third setting, PEP_M, we further remove the 10% Same subword prediction, essentially predicting only the masked subwords.
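
The three PEP variants differ only in how the non-masked 20% of candidates are handled. The sketch below shows the per-subword decision for one entity subword; the function name `pep_corrupt`, the variant labels `"MRS"`, `"MS"`, `"M"`, and the `(input, label)` return convention are assumptions made for illustration.

```python
import random

MASK = "[MASK]"

def pep_corrupt(subword, vocab, variant, rng):
    """Partial Entity Prediction sketch for a single entity subword.

    Returns (input_subword, label); label is None when the position
    is not a prediction target.
    """
    r = rng.random()
    if r < 0.8:
        return MASK, subword                 # 80%: masked and predicted
    if variant in ("MRS", "MS") and r < 0.9:
        return subword, subword              # 10%: Same, still predicted
    if variant == "MRS":
        return rng.choice(vocab), subword    # 10%: Random, predicted
    # PEP_MS drops the Random case; PEP_M drops both Random and Same.
    return subword, None                     # left unchanged, not predicted

if __name__ == "__main__":
    toy_vocab = ["cat", "dog", "tree"]  # hypothetical vocabulary
    rng = random.Random(0)
    for variant in ("MRS", "MS", "M"):
        print(variant, [pep_corrupt(s, toy_vocab, variant, rng)
                        for s in ["Eif", "fel", "Tower"]])
```

Under this reading, PEP_MS (the objective used for this model) computes the loss on roughly 90% of the entity subwords, and PEP_M on roughly 80%.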

Prior work has shown that combining Entity Prediction with MLM is effective for cross-lingual transfer (Jiang et al., 2020), so we also investigate combining the Entity Prediction objectives with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the entity masking probability (p) to 50% to keep the overall masking percentage roughly the same. This results in the following objectives: WEP + MLM, PEP_MRS + MLM, PEP_MS + MLM, PEP_M + MLM.

This model was trained with the PEP_MS + MLM objective on the EntityCS corpus with 39 languages.

Languages: English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao

Model Details

Training Details

We start from the XLM-R-base model and train for 1 epoch on 8 Nvidia V100 32GB GPUs. We set the per-GPU batch size to 16 and the gradient accumulation steps to 2, resulting in an effective batch size of 256 (16 × 2 × 8). To speed up training, we use fp16 mixed precision. We use the sampling strategy proposed by Conneau and Lample (2019), where high-resource languages are down-sampled and low-resource languages are sampled more frequently. We only train the embedding layer and the last two layers of the model. We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
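
Two of the numbers above can be checked with a short sketch: the effective batch size, and the effect of exponent-smoothed language sampling in the style of Conneau and Lample (2019), where a language with share p_i of the data is sampled with probability proportional to p_i^alpha. The function names, the sentence counts, and the value `alpha=0.5` are illustrative assumptions; this card does not state the exponent that was used.

```python
def effective_batch_size(per_gpu_batch, grad_accum_steps, num_gpus):
    """Number of examples contributing to one optimizer update."""
    return per_gpu_batch * grad_accum_steps * num_gpus

def sampling_probs(sentence_counts, alpha=0.5):
    """Exponent-smoothed sampling: q_i is proportional to p_i ** alpha.

    With alpha < 1, high-resource languages are down-sampled and
    low-resource languages are sampled more often than their raw share.
    """
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha
               for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

if __name__ == "__main__":
    print(effective_batch_size(per_gpu_batch=16, grad_accum_steps=2, num_gpus=8))
    # Hypothetical sentence counts, for illustration only.
    counts = {"high_resource": 1_000_000, "low_resource": 10_000}
    print(sampling_probs(counts))
```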

This checkpoint corresponds to the one with the lowest perplexity on the validation set.

Usage

The current model can be used for further fine-tuning on downstream tasks. In the paper, we focused on entity-related tasks, such as NER, Word Sense Disambiguation and Slot Filling.

Alternatively, it can be used directly (without fine-tuning) for probing tasks that require predicting missing words, such as X-FACTR.
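
For a quick probing-style check, a masked-word query could look like the sketch below. It assumes that the `transformers` library is installed and that the checkpoint is published on the Hugging Face Hub under the identifier stored in `MODEL_ID`, which is inferred from the title of this card and not confirmed here. The helper names and the example sentence are hypothetical.

```python
# Assumed Hub identifier, derived from the model card title.
MODEL_ID = "huawei-noah/EntityCS-39-PEP_MS_MLM-xlmr-base"

def build_prompt(template, mask_token):
    """Swap the generic [MASK] placeholder for the tokenizer's mask token."""
    return template.replace("[MASK]", mask_token)

def predict_missing_word(template, top_k=5):
    """Return the top_k (subword, score) predictions for the masked position."""
    # Imported lazily so the helper above works without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    prompt = build_prompt(template, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(prompt, top_k=top_k)]

if __name__ == "__main__":
    # Downloads the checkpoint on first use.
    for token, score in predict_missing_word("Paris is the capital of [MASK]."):
        print(f"{token}\t{score:.3f}")
```

Note that a single mask token predicts a single subword, so answers that span several subwords need more than one mask.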

How to Get Started with the Model

Use the code in the following repository to get started with the model: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS

Citation

BibTeX:

@inproceedings{whitehouse-etal-2022-entitycs,
    title = "{E}ntity{CS}: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching",
    author = "Whitehouse, Chenxi  and
      Christopoulou, Fenia  and
      Iacobacci, Ignacio",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.499",
    pages = "6698--6714"
}

Model Card Contact

Fenia Christopoulou