Commit 79d89b0 by Subhabrata Mukherjee
1 Parent(s): 852b1fe
Create README.md
README.md CHANGED
@@ -0,0 +1,57 @@
---
language: en
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
tags:
- text-classification
license: mit
---

# XtremeDistil-Transformers for Distilling Massive Neural Networks

XtremeDistil is a distilled, task-agnostic transformer model that leverages multi-task distillation techniques from the papers "[XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf)" and "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)". The code is available on [GitHub](https://github.com/microsoft/xtreme-distil-transformers).

This l6-h384 checkpoint has **6** layers, a hidden size of **384**, and **12** attention heads, which corresponds to **22 million** parameters and a **5.3x** speedup over BERT-base.
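
As a rough sanity check on the 22 million figure, the sketch below counts the parameters of a BERT-style encoder with the dimensions quoted above. The vocabulary size (30,522), the 512 positions, the 2 token types, the feed-forward size of 4x the hidden size, and the pooler layer are all assumptions borrowed from the standard BERT-base layout; this README does not state them, and `bert_style_param_count` is a hypothetical helper written for illustration.

```python
# Rough parameter count for a BERT-style encoder.
# ASSUMPTIONS (not stated in this README): 30,522-token vocabulary,
# 512 positions, 2 token types, feed-forward size = 4 * hidden, and a pooler.


def bert_style_param_count(layers, hidden, vocab=30522, max_positions=512,
                           token_types=2, intermediate=None):
    """Count weights and biases of a BERT-style encoder (hypothetical helper)."""
    if intermediate is None:
        intermediate = 4 * hidden
    layer_norm = 2 * hidden  # scale + bias

    # Word, position, and token-type embedding tables plus one LayerNorm.
    embeddings = (vocab + max_positions + token_types) * hidden + layer_norm

    # Query, key, value, and output projections plus one LayerNorm.
    # The number of attention heads does not change this count.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (up- and down-projection) plus one LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    # Dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_style_param_count(layers=6, hidden=384)
print(f"{total / 1e6:.1f}M parameters")  # prints "22.7M parameters"
```

Under these assumptions, a little over half of the parameters sit in the embedding tables, so the count is sensitive to the assumed vocabulary size.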

The following table shows results on the GLUE dev set and SQuAD-v2.

| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQuAD-v2 | Avg |
|-----------------------|-----|------|------|------|------|------|------|------|------|------|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
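
The README does not say how the Avg column is computed. The values are consistent with an unweighted mean of the seven task scores, rounded to one decimal place. The sketch below recomputes that mean from the numbers in the table. The row labels with parameter counts in parentheses are added here only to tell the two MiniLM rows apart.

```python
# Scores copied from the table above, in column order:
# MNLI, QNLI, QQP, RTE, SST, MRPC, SQuAD-v2, followed by the reported Avg.
rows = {
    "BERT": ([84.5, 91.7, 91.3, 68.6, 93.2, 87.3, 76.8], 84.8),
    "DistilBERT": ([82.2, 89.2, 88.5, 59.9, 91.3, 87.5, 70.7], 81.3),
    "TinyBERT": ([83.5, 90.5, 90.6, 72.2, 91.6, 88.4, 73.1], 84.3),
    "MiniLM (66M)": ([84.0, 91.0, 91.0, 71.5, 92.0, 88.4, 76.4], 84.9),
    "MiniLM (22M)": ([82.8, 90.3, 90.6, 68.9, 91.3, 86.6, 72.9], 83.3),
    "XtremeDistil-l6-h256": ([83.9, 89.5, 90.6, 80.1, 91.2, 90.0, 74.1], 85.6),
    "XtremeDistil-l6-h384": ([85.4, 90.3, 91.0, 80.9, 92.3, 90.0, 76.6], 86.6),
    "XtremeDistil-l12-h384": ([87.2, 91.9, 91.3, 85.6, 93.1, 90.4, 80.2], 88.5),
}


def mean_score(scores):
    """Unweighted mean of the task scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


for name, (scores, reported) in rows.items():
    computed = mean_score(scores)
    print(f"{name:<22} computed={computed:.1f} reported={reported:.1f}")
```

Because the mean is unweighted, a low-resource task such as RTE counts as much as MNLI. RTE is also where the XtremeDistil rows differ most from the other distilled models in the table.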

Tested with `tensorflow 2.3.1`, `transformers 4.1.1`, and `torch 1.6.0`.
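
A minimal sketch of loading the checkpoint with the `transformers` Auto classes and PyTorch follows. The Hub identifier in `MODEL_ID` is an assumption, since this README does not name it, and `encode` is a hypothetical helper. Substitute the identifier of the repository that actually hosts this checkpoint.

```python
# ASSUMPTION: this Hub identifier is not given in the README.
MODEL_ID = "microsoft/xtremedistil-l6-h384-uncased"


def encode(texts, model_id=MODEL_ID):
    """Return the first-token vector for each input text (hypothetical helper).

    Downloads the tokenizer and weights on the first call.
    """
    # Imported lazily so that defining the helper does not require the libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, hidden_size);
    # index 0 along the sequence axis selects the first ([CLS]) token.
    return outputs.last_hidden_state[:, 0, :]


# Example call (requires network access and the libraries listed above):
# vectors = encode(["XtremeDistil is a compact encoder."])
```

If the identifier resolves to this checkpoint, each returned vector should have 384 entries, matching the hidden size given above. The model is task-agnostic, so it still needs a task head and fine-tuning before it can be used for classification.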

If you use this checkpoint in your work, please cite:

``` latex
@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
    title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
    author = "Mukherjee, Subhabrata and
      Hassan Awadallah, Ahmed",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.202",
    doi = "10.18653/v1/2020.acl-main.202",
    pages = "2221--2234",
}
```

``` latex
@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```