Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/microsoft/Multilingual-MiniLM-L12-H384/README.md
README.md
ADDED
@@ -0,0 +1,89 @@
---
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
tags:
- text-classification
license: mit
---

## MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation

MiniLM is a distilled model from the paper "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)".

For preprocessing, training, and full details of MiniLM, see the [original MiniLM repository](https://github.com/microsoft/unilm/blob/master/minilm/).

Please note: This checkpoint uses `BertModel` with `XLMRobertaTokenizer`, so `AutoTokenizer` won't work with this checkpoint!
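
A minimal sketch of loading the checkpoint with the two classes named in the note above. It assumes the `transformers` and `sentencepiece` packages are installed and that the Hub is reachable. The helper name `load_minilm` is hypothetical, and the import sits inside the function only so that defining the helper does not require the library or trigger a download.

```python
MODEL_ID = "microsoft/Multilingual-MiniLM-L12-H384"


def load_minilm(model_id: str = MODEL_ID):
    """Load the tokenizer and encoder with explicit classes instead of AutoTokenizer."""
    # Imported here so the helper can be defined without transformers installed.
    from transformers import BertModel, XLMRobertaTokenizer

    # The tokenizer is the XLM-R SentencePiece tokenizer ...
    tokenizer = XLMRobertaTokenizer.from_pretrained(model_id)
    # ... while the weights follow the BERT architecture.
    model = BertModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `tokenizer, model = load_minilm()` and then `model(**tokenizer("Hello world", return_tensors="pt"))` should return hidden states whose last dimension is 384, matching the hidden size listed for this checkpoint.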

### Multilingual Pretrained Model
- Multilingual-MiniLMv1-L12-H384: 12-layer, 384-hidden, 12-heads, 21M Transformer parameters, 96M embedding parameters
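
The two parameter counts above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is an illustration, not the authors' accounting: it assumes a standard BERT-style layer with a feed-forward size of 4x the hidden size, a vocabulary of 250,002 entries (the XLM-R base vocabulary size), 512 positions, and 2 token types, and it ignores the pooler. The function names are hypothetical.

```python
def transformer_params(layers: int, hidden: int, ffn_mult: int = 4) -> int:
    """Count weights and biases in a stack of BERT-style encoder layers."""
    # Query, key, value, and output projections, each hidden x hidden plus a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Assumed intermediate (feed-forward) size: ffn_mult * hidden.
    ffn_size = ffn_mult * hidden
    feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden)
    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * (2 * hidden)
    return layers * (attention + feed_forward + layer_norms)


def embedding_params(vocab: int, hidden: int,
                     max_positions: int = 512, type_vocab: int = 2) -> int:
    """Count token, position, and token-type embeddings plus their LayerNorm."""
    return vocab * hidden + max_positions * hidden + type_vocab * hidden + 2 * hidden


minilm_transformer = transformer_params(layers=12, hidden=384)
minilm_embeddings = embedding_params(vocab=250_002, hidden=384)  # assumed vocab size

print(f"Transformer: {minilm_transformer / 1e6:.1f}M")  # Transformer: 21.3M
print(f"Embeddings:  {minilm_embeddings / 1e6:.1f}M")   # Embeddings:  96.2M
```

Under the same assumptions, `transformer_params(layers=12, hidden=768)` gives about 85M, which matches the figure listed for mBERT and XLM-R Base in the tables. Most of the model's size therefore sits in the multilingual embedding matrix rather than in the Transformer layers.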

Multilingual MiniLM uses the same tokenizer as XLM-R, but its Transformer architecture is the same as BERT's. We provide fine-tuning code for XNLI based on [huggingface/transformers](https://github.com/huggingface/transformers). To fine-tune multilingual MiniLM, replace `run_xnli.py` in transformers with [ours](https://github.com/microsoft/unilm/blob/master/minilm/examples/run_xnli.py).

We evaluate multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).

#### Cross-Lingual Natural Language Inference - [XNLI](https://arxiv.org/abs/1809.05053)

We evaluate our model on cross-lingual transfer from English to other languages. Following [Conneau et al. (2019)](https://arxiv.org/abs/1911.02116), we select the best single model on the joint dev set of all the languages.
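
One way to read "best single model on the joint dev set" is to pick the single checkpoint with the highest accuracy averaged over all languages, rather than a separate checkpoint per language. The sketch below illustrates that reading only. The function name, checkpoint names, and accuracy values are hypothetical placeholders, not results from the paper, and it assumes every language has a dev set of the same size so that the mean of per-language accuracies equals the pooled accuracy.

```python
from statistics import mean


def select_best_checkpoint(dev_accuracy: dict[str, dict[str, float]]) -> str:
    """Return the checkpoint with the highest mean dev accuracy over all languages."""
    return max(dev_accuracy, key=lambda name: mean(dev_accuracy[name].values()))


# Hypothetical per-language dev accuracies (illustrative values, not real results).
dev_accuracy = {
    "checkpoint-1500": {"en": 0.80, "fr": 0.72, "sw": 0.60},
    "checkpoint-3000": {"en": 0.79, "fr": 0.74, "sw": 0.63},
}

best = select_best_checkpoint(dev_accuracy)
print(best)  # checkpoint-3000
```

In this toy input the selected checkpoint is slightly worse on English but better on average, which is the trade-off that joint selection makes.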

| Model | #Layers | #Hidden | #Transformer Parameters | Average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|-------|---------|---------|-------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| [mBERT](https://github.com/google-research/bert) | 12 | 768 | 85M | 66.3 | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
| [XLM-100](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) | 16 | 1280 | 315M | 70.7 | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
| [XLM-R Base](https://arxiv.org/abs/1911.02116) | 12 | 768 | 85M | 74.5 | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
| **mMiniLM-L12xH384** | 12 | 384 | 21M | 71.1 | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |

The following command fine-tunes the **12**-layer multilingual MiniLM on XNLI.

```bash
# run fine-tuning on XNLI
DATA_DIR=/{path_of_data}/
OUTPUT_DIR=/{path_of_fine-tuned_model}/
MODEL_PATH=/{path_of_pre-trained_model}/

python ./examples/run_xnli.py --model_type minilm \
  --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
  --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
  --tokenizer_name xlm-roberta-base \
  --config_name ${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 128 \
  --learning_rate 5e-5 \
  --num_train_epochs 5 \
  --per_gpu_eval_batch_size 32 \
  --weight_decay 0.001 \
  --warmup_steps 500 \
  --save_steps 1500 \
  --logging_steps 1500 \
  --eval_all_checkpoints \
  --language en \
  --fp16 \
  --fp16_opt_level O2
```

#### Cross-Lingual Question Answering - [MLQA](https://arxiv.org/abs/1910.07475)

Following [Lewis et al. (2019b)](https://arxiv.org/abs/1910.07475), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.
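
The card does not say how early stopping was implemented. As one common variant, the sketch below keeps the evaluation step with the best English dev F1 and stops once a fixed number of evaluations pass without improvement. The function name, the `patience` parameter, and the F1 history are hypothetical and only illustrate the idea.

```python
def early_stopping_step(dev_f1_history: list[float], patience: int = 3) -> int:
    """Return the index of the best evaluation seen before training would stop."""
    if not dev_f1_history:
        raise ValueError("dev_f1_history must contain at least one score")
    best_index = 0
    for index, score in enumerate(dev_f1_history):
        if score > dev_f1_history[best_index]:
            best_index = index  # new best dev F1: remember this evaluation
        elif index - best_index >= patience:
            break  # no improvement for `patience` evaluations: stop here
    return best_index


# Hypothetical dev F1 after each evaluation (illustrative values, not real results).
history = [60.1, 62.0, 61.5, 61.9, 61.0, 63.0]
print(early_stopping_step(history))  # 1
```

With `patience=3` the loop stops at the fifth evaluation, so the later, higher score is never reached. A larger patience trades extra training time for a lower chance of stopping too early.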

The table reports F1 scores.

| Model | #Layers | #Hidden | #Transformer Parameters | Average | en | es | de | ar | hi | vi | zh |
|-------|---------|---------|-------------------------|---------|------|------|------|------|------|------|------|
| [mBERT](https://github.com/google-research/bert) | 12 | 768 | 85M | 57.7 | 77.7 | 64.3 | 57.9 | 45.7 | 43.8 | 57.1 | 57.5 |
| [XLM-15](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) | 12 | 1024 | 151M | 61.6 | 74.9 | 68.0 | 62.2 | 54.8 | 48.8 | 61.4 | 61.1 |
| [XLM-R Base](https://arxiv.org/abs/1911.02116) (Reported) | 12 | 768 | 85M | 62.9 | 77.8 | 67.2 | 60.8 | 53.0 | 57.9 | 63.1 | 60.2 |
| [XLM-R Base](https://arxiv.org/abs/1911.02116) (Our fine-tuned) | 12 | 768 | 85M | 64.9 | 80.3 | 67.0 | 62.7 | 55.0 | 60.4 | 66.5 | 62.3 |
| **mMiniLM-L12xH384** | 12 | 384 | 21M | 63.2 | 79.4 | 66.1 | 61.2 | 54.9 | 58.5 | 63.1 | 59.0 |
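
The card does not define the Average column. Assuming it is the unweighted mean of the per-language scores, the mMiniLM-L12xH384 rows of the XNLI and MLQA tables can be checked as follows. The scores are copied from those two rows, and the helper name `macro_average` is hypothetical.

```python
from statistics import mean

# Per-language scores copied from the mMiniLM-L12xH384 rows of the two tables.
xnli_mminilm = {
    "en": 81.5, "fr": 74.8, "es": 75.7, "de": 72.9, "el": 73.0,
    "bg": 74.5, "ru": 71.3, "tr": 69.7, "ar": 68.8, "vi": 72.1,
    "th": 67.8, "zh": 70.0, "hi": 66.2, "sw": 63.3, "ur": 64.2,
}
mlqa_mminilm = {
    "en": 79.4, "es": 66.1, "de": 61.2, "ar": 54.9,
    "hi": 58.5, "vi": 63.1, "zh": 59.0,
}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over languages, rounded to one decimal like the tables."""
    return round(mean(list(scores.values())), 1)


print(macro_average(xnli_mminilm))  # 71.1
print(macro_average(mlqa_mminilm))  # 63.2
```

Both values agree with the Average column for mMiniLM-L12xH384. Rows taken from other papers may have been averaged differently, so this check is only claimed for the rows shown.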

### Citation

If you find MiniLM useful in your research, please cite the following paper:

```bibtex
@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```