metadata

language: en
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
tags:
  - text-classification
license: mit

XtremeDistil-Transformers for Distilling Massive Neural Networks

XtremeDistil is a distilled, task-agnostic transformer model that leverages multi-task distillation techniques from the papers "XtremeDistil: Multi-stage Distillation for Massive Multilingual Models" and "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers". The accompanying code is available on GitHub.

This l6-h384 checkpoint has 6 layers, a hidden size of 384, and 12 attention heads. It has 22 million parameters and runs 5.3x faster than BERT-base.
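As a rough sanity check on the 22 million figure, the sketch below counts the weights of a BERT-style encoder with the stated layer count and hidden size. The vocabulary size (30,522), maximum sequence length (512), two segment types, and feed-forward width (4 x hidden size) are assumptions borrowed from the standard BERT-base uncased setup, not values stated in this card, and the function name is illustrative. The number of attention heads does not change the count, since the heads only partition the hidden dimension.

```python
from typing import Optional


def bert_style_param_count(
    num_layers: int,
    hidden_size: int,
    vocab_size: int = 30522,        # assumption: BERT uncased WordPiece vocabulary
    max_positions: int = 512,       # assumption: standard BERT sequence limit
    type_vocab_size: int = 2,       # assumption: two segment types
    intermediate_size: Optional[int] = None,
) -> int:
    """Approximate parameter count of a BERT-style encoder with a pooler."""
    if intermediate_size is None:
        intermediate_size = 4 * hidden_size  # assumption: usual 4x feed-forward width

    layer_norm = 2 * hidden_size  # scale and bias vectors

    # Word, position, and segment embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm

    # Query, key, value, and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Two dense layers (weights + biases) plus LayerNorm.
    feed_forward = (
        (hidden_size * intermediate_size + intermediate_size)
        + (intermediate_size * hidden_size + hidden_size)
        + layer_norm
    )

    # Dense pooler applied to the first token.
    pooler = hidden_size * hidden_size + hidden_size

    return embeddings + num_layers * (attention + feed_forward) + pooler


total = bert_style_param_count(num_layers=6, hidden_size=384)
print(f"{total / 1e6:.1f}M parameters")  # prints "22.7M parameters"
```

Under these assumptions roughly half of the weights sit in the embedding matrix, which is why halving the depth of a small model does not halve its size. The same function with 12 layers gives about 33.4M.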

The following table shows results on the GLUE dev set and SQuAD-v2.

Models	#Params (M)	Speedup	MNLI	QNLI	QQP	RTE	SST	MRPC	SQUAD2	Avg
BERT	109	1x	84.5	91.7	91.3	68.6	93.2	87.3	76.8	84.8
DistilBERT	66	2x	82.2	89.2	88.5	59.9	91.3	87.5	70.7	81.3
TinyBERT	66	2x	83.5	90.5	90.6	72.2	91.6	88.4	73.1	84.3
MiniLM	66	2x	84.0	91.0	91.0	71.5	92.0	88.4	76.4	84.9
MiniLM	22	5.3x	82.8	90.3	90.6	68.9	91.3	86.6	72.9	83.3
XtremeDistil-l6-h256	13	8.7x	83.9	89.5	90.6	80.1	91.2	90.0	74.1	85.6
XtremeDistil-l6-h384	22	5.3x	85.4	90.3	91.0	80.9	92.3	90.0	76.6	86.6
XtremeDistil-l12-h384	33	2.7x	87.2	91.9	91.3	85.6	93.1	90.4	80.2	88.5
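The Avg column appears to be the unweighted mean of the seven task scores in each row. The sketch below recomputes it for three rows copied from the table; the variable and function names are illustrative.

```python
# Task scores copied from the table, in column order:
# MNLI, QNLI, QQP, RTE, SST, MRPC, SQUAD2
scores = {
    "BERT": [84.5, 91.7, 91.3, 68.6, 93.2, 87.3, 76.8],
    "MiniLM (22M)": [82.8, 90.3, 90.6, 68.9, 91.3, 86.6, 72.9],
    "XtremeDistil-l6-h384": [85.4, 90.3, 91.0, 80.9, 92.3, 90.0, 76.6],
}


def average_score(task_scores):
    """Unweighted mean over tasks, rounded to one decimal like the table."""
    return round(sum(task_scores) / len(task_scores), 1)


averages = {name: average_score(vals) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

The recomputed values (84.8, 83.3, and 86.6) match the Avg column for these rows.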

Tested with tensorflow 2.3.1, transformers 4.1.1, and torch 1.6.0.
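To compare a local environment against the versions listed above, here is a minimal sketch that uses only the standard library (Python 3.8 or later). The `TESTED_VERSIONS` mapping and the `installed_versions` helper are illustrative names, and a version mismatch does not by itself mean the checkpoint will fail to load.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions this checkpoint was reported as tested with.
TESTED_VERSIONS = {
    "tensorflow": "2.3.1",
    "transformers": "4.1.1",
    "torch": "1.6.0",
}


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(TESTED_VERSIONS)
for name, tested in TESTED_VERSIONS.items():
    installed = report[name] or "not installed"
    print(f"{name}: tested with {tested}, installed: {installed}")
```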

If you use this checkpoint in your work, please cite:

@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
    title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
    author = "Mukherjee, Subhabrata  and
      Hassan Awadallah, Ahmed",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.202",
    doi = "10.18653/v1/2020.acl-main.202",
    pages = "2221--2234",
}

@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}