Subhabrata Mukherjee committed on
Commit
468d2cd
1 Parent(s): 9ae157e

Update README.md

Files changed (1)
  1. README.md +45 -3
README.md CHANGED
@@ -1,9 +1,51 @@
  ---
  language: en
  thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
- tags:
- - text-classification
  license: mit
  ---
 
- # XtremeDistilTransformers for Distilling Massive Neural Networks
+ # XtremeDistil-Transformers for Distilling Massive Neural Networks
+
+ XtremeDistil is a distilled, task-agnostic transformer model that leverages the multi-task distillation techniques from the papers "[XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf)" and "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)". The code is available on [GitHub](https://github.com/microsoft/xtreme-distil-transformers).
+
+ This l6-h384 checkpoint has **6** layers, a hidden size of **384**, and **12** attention heads, which corresponds to **22 million** parameters and a **5.3x** speedup over BERT-base.
+
+ The following table shows results on the GLUE dev set and SQuAD-v2.
+
+ | Models       | #Params (M) | Speedup | MNLI | QNLI | QQP  | RTE  | SST  | MRPC | SQuAD-v2 | Avg  |
+ |--------------|-------------|---------|------|------|------|------|------|------|----------|------|
+ | BERT         | 109         | 1x      | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8     | 84.8 |
+ | DistilBERT   | 66          | 2x      | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7     | 81.3 |
+ | TinyBERT     | 66          | 2x      | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1     | 84.3 |
+ | MiniLM       | 66          | 2x      | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4     | 84.9 |
+ | MiniLM       | 22          | 5.3x    | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9     | 83.3 |
+ | XtremeDistil | 22          | 5.3x    | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6     | 86.6 |
+
+ If you use this checkpoint in your work, please cite:
+
+ ``` latex
+ @inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
+ title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
+ author = "Mukherjee, Subhabrata and
+ Hassan Awadallah, Ahmed",
+ booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
+ month = jul,
+ year = "2020",
+ address = "Online",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/2020.acl-main.202",
+ doi = "10.18653/v1/2020.acl-main.202",
+ pages = "2221--2234",
+ }
+ ```
+
+ ``` latex
+ @misc{wang2020minilm,
+ title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
+ author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
+ year={2020},
+ eprint={2002.10957},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
+ ```
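
As a rough sanity check on the parameter figures quoted in the README above (22 million for the l6-h384 checkpoint, 109 million for BERT), the sketch below counts the weights and biases of a BERT-style encoder. The README does not spell out the full configuration, so the following are assumptions made for illustration only: a 30,522-token vocabulary, 512 positions, 2 token types, a feed-forward size of 4x the hidden size, and a pooler layer. The function name `bert_param_count` is hypothetical and is not part of the repository.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512,
                     type_vocab=2, ffn_mult=4, with_pooler=True):
    """Count weights and biases of a BERT-style encoder.

    All defaults are assumptions (standard BERT uncased layout), not values
    stated in the README. The number of attention heads does not appear
    because it only splits the hidden size and adds no parameters.
    """
    ffn = ffn_mult * hidden
    layer_norm = 2 * hidden  # scale + shift

    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (hidden -> ffn -> hidden), followed by one LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden if with_pooler else 0

    return embeddings + layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    for name, n_layers, n_hidden in [("l6-h384", 6, 384), ("BERT-base", 12, 768)]:
        total = bert_param_count(layers=n_layers, hidden=n_hidden)
        print(f"{name}: {total / 1e6:.1f}M parameters")
```

Under these assumptions the count comes out at about 22.7M for 6 layers with hidden size 384 and about 109.5M for 12 layers with hidden size 768, in line with the #Params column. Roughly half of the smaller model's parameters sit in the embedding table, which is why the parameter count shrinks less than the layer and width reduction alone would suggest.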