nazneen committed on
Commit
a893c2b
1 Parent(s): 160deb7

model documentation

Files changed (1)
  1. README.md +236 -2
README.md CHANGED
@@ -1,2 +1,236 @@
1
- # MiniLMv2
2
- This is a MiniLMv2 model from: [https://github.com/microsoft/unilm](https://github.com/microsoft/unilm/tree/master/minilm)
1
+ ---
2
+ language:
3
+ - en
4
+
5
+ ---
6
+ # Model Card for MiniLMv2
7
+ **Small and fast pre-trained models for language understanding and generation**
8
+
9
+
10
+
11
+
12
+ # Model Details
13
+
14
+ ## Model Description
15
+
16
+ **MiniLM v2**: the pre-trained models for the paper entitled "[MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers](https://arxiv.org/abs/2012.15828)". We generalize deep self-attention distillation in MiniLMv1 by using self-attention relation distillation for task-agnostic compression of pre-trained Transformers. The proposed method eliminates the restriction on the number of student’s attention heads. Our monolingual and multilingual small models distilled from different base and large size teacher models achieve competitive performance.
17
+
18
+
19
+ - **Developed by:** More information needed
20
+ - **Shared by [Optional]:** More information needed
21
+ - **Model type:** Language model
22
+ - **Language(s) (NLP):** en
23
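+ - **License:** More information needed. The source repository follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct), which is a conduct policy rather than a license.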
24
+ - **Related Models:** More information needed
25
+ - **Parent Model:** xlm-roberta
26
+ - **Resources for more information:**
27
+ - [GitHub Repo](https://github.com/microsoft/unilm)
28
+ - [Associated Paper](https://arxiv.org/abs/2012.15828)
29
+
30
+
31
+ # Uses
32
+
33
+ ## Direct Use
34
+
35
+ More information is needed.
36
+
37
+ ## Downstream Use [Optional]
38
+
39
+ More information is needed
40
+
41
+ ## Out-of-Scope Use
42
+
43
+
44
+ The model should not be used to intentionally create hostile or alienating environments for people.
45
+
46
+ # Bias, Risks, and Limitations
47
+
48
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
49
+
50
+
51
+ ## Recommendations
52
+
53
+
54
+
55
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
56
+
57
+
58
+ # Training Details
59
+
60
+ ## Training Data
61
+
62
+ Following [Lewis et al. (2019b)](https://arxiv.org/abs/1910.07475), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.
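
The early-stopping step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' training code: the `patience` parameter, the function name, and the per-epoch dev scores are all hypothetical, and the card does not say which stopping criterion was actually used.

```python
from typing import Iterable, Optional, Tuple


def early_stopping_epoch(
    dev_scores: Iterable[float], patience: int = 2
) -> Tuple[Optional[int], float]:
    """Return (best_epoch, best_score), stopping once `patience`
    consecutive epochs fail to improve on the best dev score."""
    best_epoch: Optional[int] = None
    best_score = float("-inf")
    epochs_without_gain = 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            # New best dev score: remember it and reset the counter.
            best_epoch, best_score = epoch, score
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # Later epochs are never evaluated.
    return best_epoch, best_score


# Hypothetical dev F1 per epoch, for illustration only (not real results).
hypothetical_dev_f1 = [60.0, 63.5, 64.1, 63.9, 63.7, 65.0]
best_epoch, best_f1 = early_stopping_epoch(hypothetical_dev_f1, patience=2)
print(best_epoch, best_f1)
```

With this rule the checkpoint from the best epoch is kept, and training halts before the final score in the list is ever computed.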
63
+
64
+ ## Training Procedure
65
+
66
+
67
+ ### Preprocessing
68
+
69
+ More information needed
70
+
71
+ ### Speeds, Sizes, Times
72
+
73
+ We compress XLMR-Large into 12-layer and 6-layer models with a hidden size of 384 and report the zero-shot performance on the XNLI and MLQA test sets.
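
To give a feel for the size of these student models, the sketch below estimates the number of Transformer (encoder) parameters from the layer count and hidden size. It assumes a standard BERT-style layer with a feed-forward width of four times the hidden size and bias terms on every projection. That layout is an assumption for illustration and is not stated in this card. Embedding parameters are excluded.

```python
def transformer_params(num_layers: int, hidden_size: int, ffn_multiplier: int = 4) -> int:
    """Estimate encoder parameters for a BERT-style Transformer (embeddings excluded)."""
    h = hidden_size
    ffn = ffn_multiplier * h  # assumed feed-forward width
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (h * h + h)
    # Two dense layers: h -> ffn and ffn -> h, each with a bias vector.
    feed_forward = (h * ffn + ffn) + (ffn * h + h)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * h)
    return num_layers * (attention + feed_forward + layer_norms)


for layers in (12, 6):
    count = transformer_params(layers, 384)
    print(f"L{layers}xH384: {count / 1e6:.1f}M Transformer parameters")
```

Under these assumptions the 12-layer, 384-hidden configuration comes to roughly 21M Transformer parameters, which agrees with the figure listed for mMiniLM-L12xH384 in the XNLI table of this card.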
74
+
75
+ **[English] Pre-trained Models**
76
+
77
+ | Model | Teacher Model | Speedup | #Param | MNLI-m (Acc) | SQuAD 2.0 (F1) |
78
+ |------------------------------------------------------------------------------|-----------------------|-----------|-----------|--------------|----------------|
79
+ | **[L6xH768 MiniLMv2](https://1drv.ms/u/s!AjHn0yEmKG8qiyUqoRUc6P1t0mk0)** | RoBERTa-Large | 2.0x | 81M | 87.0 | 81.6 |
80
+ | **[L12xH384 MiniLMv2](https://1drv.ms/u/s!AjHn0yEmKG8qiyM5cFiv7ew6uOO1)** | RoBERTa-Large | 2.7x | 41M | 86.9 | 82.3 |
81
+ | **[L6xH384 MiniLMv2](https://1drv.ms/u/s!AjHn0yEmKG8qix6eX4PZbP2_N2MO)** | RoBERTa-Large | 5.3x | 30M | 84.4 | 76.4 |
82
+ | [L6xH768 MiniLMv2](https://1drv.ms/u/s!AjHn0yEmKG8qix8oZl0UtY-KnJtY) | BERT-Large Uncased | 2.0x | 66M | 85.0 | 77.7 |
83
+ | [L6xH384 MiniLMv2](https://1drv.ms/u/s!AjHn0yEmKG8qix0MfI2hDsmK20cY) | BERT-Large Uncased | 5.3x | 22M | 83.0 | 74.3 |
84
+ | [L6xH768 MiniLMv2](https://1drv.ms/u/s!AjHn0yEmKG8qiyLmvLxXOSgpTxxm) | BERT-Base Uncased | 2.0x | 66M | 84.2 | 76.3 |
85
+ | [L6xH384 MiniLMv2](https://1drv.ms/u/s!AjHn0yEmKG8qiyQSo9sRDP4t3_jT) | BERT-Base Uncased | 5.3x | 22M | 82.8 | 72.9 |
86
+
87
+
88
+ # Evaluation
89
+
90
+
91
+ ## Testing Data, Factors & Metrics
92
+
93
+ ### Testing Data
94
+
95
+ #### Fine-tuning on NLU tasks
96
+ MiniLM has the same Transformer architecture as BERT. For NLU tasks, the PyTorch versions of our models can be loaded using the BERT code in [huggingface/transformers](https://github.com/huggingface/transformers). The config file must be replaced with MiniLM's.
97
+
98
+ We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.
99
+
100
+
101
+ | Model | #Param | SQuAD 2.0 | MNLI-m | SST-2 | QNLI | CoLA | RTE | MRPC | QQP |
102
+ |---------------------------------------------------|--------|-----------|--------|-------|------|------|------|------|------|
103
+ | [BERT-Base](https://arxiv.org/pdf/1810.04805.pdf) | 109M | 76.8 | 84.5 | 93.2 | 91.7 | 58.9 | 68.6 | 87.3 | 91.3 |
104
+ | **MiniLM-L12xH384** | 33M | 81.7 | 85.7 | 93.0 | 91.5 | 58.5 | 73.3 | 89.5 | 91.3 |
105
+ | **MiniLM-L6xH384** | 22M | 75.6 | 83.3 | 91.5 | 90.5 | 47.5 | 68.8 | 88.9 | 90.6 |
106
+
107
+
108
+ ### Factors
109
+
110
+ More information needed
111
+
112
+ ### Metrics
113
+ We evaluate the multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).
114
+
115
+ #### Cross-Lingual Natural Language Inference - [XNLI](https://arxiv.org/abs/1809.05053)
116
+
117
+ We evaluate our model on cross-lingual transfer from English to other languages. Following [Conneau et al. (2019)](https://arxiv.org/abs/1911.02116), we select the best single model on the joint dev set of all the languages.
118
+
119
+ | Model | #Layers | #Hidden | #Transformer Parameters | Average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
120
+ |---------------------------------------------------------------------------------------------|---------|---------|-------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
121
+ | [mBERT](https://github.com/google-research/bert) | 12 | 768 | 85M | 66.3 | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
122
+ | [XLM-100](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) | 16 | 1280 | 315M | 70.7 | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
123
+ | [XLM-R Base](https://arxiv.org/abs/1911.02116) | 12 | 768 | 85M | 74.5 | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
124
+ | **mMiniLM-L12xH384** | 12 | 384 | 21M | 71.1 | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |
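
The selection rule described above the table (pick the single best model on the joint dev set of all languages) can be sketched as below. The checkpoint names and dev accuracies are hypothetical, and using the unweighted mean across languages as the joint score is an assumption. The second part only re-derives the Average column for the mMiniLM row from the per-language numbers in the table.

```python
from statistics import mean
from typing import Dict


def select_best_checkpoint(dev_accuracy: Dict[str, Dict[str, float]]) -> str:
    """Return the checkpoint with the highest mean dev accuracy across languages."""
    return max(dev_accuracy, key=lambda name: mean(dev_accuracy[name].values()))


# Hypothetical per-language dev accuracies, for illustration only.
hypothetical_dev_accuracy = {
    "checkpoint_a": {"en": 82.0, "fr": 74.0, "sw": 60.0},
    "checkpoint_b": {"en": 81.0, "fr": 75.0, "sw": 63.0},
}
best = select_best_checkpoint(hypothetical_dev_accuracy)
print(best)

# Per-language XNLI scores for mMiniLM-L12xH384, copied from the table above.
mminilm_row = [81.5, 74.8, 75.7, 72.9, 73.0, 74.5, 71.3, 69.7,
               68.8, 72.1, 67.8, 70.0, 66.2, 63.3, 64.2]
print(round(mean(mminilm_row), 1))
```

A single checkpoint is chosen for all languages, so a model that is slightly weaker on English can still win if it transfers better to the other languages.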
125
+
126
+
127
+
128
+
129
+ ## Results
130
+
131
+ We present question generation results following the same [data split](https://github.com/xinyadu/nqg/tree/master/data) as in [(Du et al., 2017)](https://arxiv.org/pdf/1705.00106.pdf).
132
+
133
+ | Model | #Param | BLEU-4 | METEOR | ROUGE-L |
134
+ |-------------------------------|------------|-----------|-------------|-------------|
135
+ | **MiniLM-L12xH384** | 33M | 21.07 | 24.09 | 49.14 |
136
+ | **MiniLM-L6xH384** | 22M | 20.31 | 23.43 | 48.21 |
137
+
138
+ We also report the results following the data split as in [(Zhao et al., 2018)](https://aclweb.org/anthology/D18-1424), which uses the reversed dev-test setup.
139
+
140
+ | Model | #Param | BLEU-4 | METEOR | ROUGE-L |
141
+ |-------------------------------|------------|-----------|-------------|-------------|
142
+ | **MiniLM-L12xH384** | 33M | 23.27 | 25.15 | 50.60 |
143
+ | **MiniLM-L6xH384** | 22M | 22.01 | 24.24 | 49.51 |
144
+
145
+
146
+
147
+ # Model Examination
148
+
149
+ More information needed
150
+
151
+ # Environmental Impact
152
+
153
+
154
+
155
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
156
+
157
+ - **Hardware Type:** More information needed
158
+ - **Hours used:** More information needed
159
+ - **Cloud Provider:** More information needed
160
+ - **Compute Region:** More information needed
161
+ - **Carbon Emitted:** More information needed
162
+
163
+ # Technical Specifications [optional]
164
+
165
+ ## Model Architecture and Objective
166
+
167
+ More information needed
168
+
169
+ ## Compute Infrastructure
170
+
171
+ More information needed
172
+
173
+ ### Hardware
174
+
175
+ More information needed
176
+
177
+ ### Software
178
+ More information needed
179
+
180
+ # Citation
181
+
182
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
183
+
184
+ **BibTeX:**
185
+
186
+ If you find MiniLM useful in your research, please cite the following paper:
187
+
188
+ ``` latex
189
+ @misc{wang2020minilm,
190
+ title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
191
+ author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
192
+ year={2020},
193
+ eprint={2002.10957},
194
+ archivePrefix={arXiv},
195
+ primaryClass={cs.CL}
196
+ }
197
+ ```
198
+
199
+
200
+ # Glossary [optional]
201
+
202
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
203
+
204
+ More information needed
205
+
206
+ # More Information [optional]
207
+
208
+ More information needed
209
+
210
+ # Model Card Authors [optional]
211
+
212
+ Wenhui Wang, Furu Wei
213
+ # Model Card Contact
214
+
215
+ For other communications related to MiniLM, please contact Wenhui Wang (`wenwan@microsoft.com`) or Furu Wei (`fuwei@microsoft.com`).
216
+
217
+
218
+ # How to Get Started with the Model
219
+
220
+ Use the code below to get started with the model.
221
+
222
+ <details>
223
+ <summary> Click to expand </summary>
224
+
225
+ ```python
226
+ >>> from transformers import AutoTokenizer, AutoModel
227
+
228
+ >>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
229
+ >>> model = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
230
+
231
+ >>> inputs = tokenizer("Hello world!", return_tensors="pt")
232
+ >>> outputs = model(**inputs)
233
+ ```
234
+
235
+ </details>
236
+