EdgeFormer

EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation. Tao Ge and Furu Wei

  • March 2022: released code and pretrained checkpoints.

Pretrained Models

Downstream seq2seq tasks

We evaluate EdgeFormer on three popular seq2seq benchmarks: CoNLL-14 for grammatical error correction (GEC), XSUM for abstractive summarization, and SQuAD-NQG for question generation.

CoNLL-14

| Model | #Params | #FLOPs | F0.5 |
| --- | --- | --- | --- |
| Transformer-base | 44M | 1.8G | 50.1 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 51.3 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 51.7 |
| Pretrained EdgeFormer | 9.4M | 1.3G | 52.7 |

XSUM

| Model | #Params | #FLOPs | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- |
| Transformer-base | 44M | 1.8G | 31.2 | 10.7 | 24.9 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 34.4 | 13.4 | 27.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 35.1 | 14.0 | 28.6 |
| Pretrained EdgeFormer | 9.4M | 1.3G | 36.3 | 14.8 | 29.5 |

SQuAD-NQG

| Model | #Params | #FLOPs | BLEU-4 | METEOR | ROUGE-L |
| --- | --- | --- | --- | --- | --- |
| Transformer-base | 44M | 1.8G | 2.6 | 9.0 | 26.0 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 18.3 | 21.0 | 45.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 18.7 | 21.3 | 46.1 |
| Pretrained EdgeFormer | 9.4M | 1.3G | 19.0 | 21.7 | 46.3 |

Setup

```bash
pip install --editable ./
```


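The fine-tuning command below expects data that has already been binarized with fairseq-preprocess. A minimal sketch, assuming tokenized parallel source/target files; the paths, language suffixes, and shared-dictionary choice are placeholders rather than part of the official instructions:

```bash
# Binarize a parallel dataset for fairseq (paths and language codes are placeholders).
fairseq-preprocess \
    --source-lang src --target-lang tgt \
    --trainpref /path/to/data/train \
    --validpref /path/to/data/valid \
    --destdir /path/to/binarized/data \
    --joined-dictionary \
    --workers 8
```

The resulting directory is what fairseq-train takes as its positional data argument below.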
## Fine-tuning
```bash
PRETRAINED_MODEL=/path/to/checkpoint/model.pt
fairseq-train /path/to/binarized/data \
        --restore-file $PRETRAINED_MODEL  --reset-lr-scheduler --reset-optimizer --reset-dataloader \
        --task translation \
        --criterion label_smoothed_cross_entropy \
        --arch transformer_edge --encoder-layers 12 --decoder-ffn-embed-dim 128 --lora-r 32 --lora-r-shape 0 \
        --share-all-embeddings \
        --required-batch-size-multiple 8 \
        --optimizer adam \
        --adam-betas '(0.9,0.98)' \
        --adam-eps 1e-6 \
        --clip-norm 1.0 \
        --lr-scheduler polynomial_decay \
        --lr 0.00015 \
        --warmup-updates 8000 \
        --total-num-update 100000 \
        --max-update 100000 --max-epoch 1000 \
        --max-tokens 20000 \
        --update-freq 1 \
        --log-format simple \
        --log-interval 1000 \
        --save-interval-updates 5000 \
        --fp16 \
        --fp16-init-scale 4 \
        --fp16-scale-window 256 \
        --min-loss-scale 0.0001 \
        --seed 1 \
        --save-dir /path/to/save/checkpoints \
        --ddp-backend legacy_ddp
```

**Notes:**

  • Please adjust hyperparameters such as lr and warmup-updates to the dataset and task.
  • Please adjust max-tokens and update-freq to fit your hardware and experimental setup (see the sketch after this list).
  • Use --fp16 for more efficient training on devices with Tensor Cores.
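
The second note can be made concrete: the effective batch size is roughly max-tokens × update-freq × the number of GPUs, so the two flags can be traded off against each other on devices with less memory. A minimal sketch of the arithmetic (the values are illustrative, not recommended settings):

```bash
# Effective tokens per optimizer step = max-tokens x update-freq x num_GPUs.
# E.g. 20000 x 1 on one GPU (as above) is roughly matched by 10000 x 2 on one GPU.
MAX_TOKENS=10000
UPDATE_FREQ=2
NUM_GPUS=1
echo "effective tokens per update: $(( MAX_TOKENS * UPDATE_FREQ * NUM_GPUS ))"
```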

Evaluation

```bash
fairseq-generate $data_bin \
    --path $save_dir/checkpoint_best.pt \
    --batch-size 64 --beam 5 --remove-bpe=sentencepiece
```
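
If the generation output above is redirected to a file (e.g. > gen.out), the system hypotheses can be pulled out for scoring. A minimal sketch, assuming fairseq's default log format in which hypothesis lines are prefixed with H-<sentence-id>:

```bash
# Keep only hypothesis lines, restore corpus order by sentence id,
# and keep the text column (gen.out is the redirected fairseq-generate output).
grep ^H gen.out | sort -n -k 2 -t '-' | cut -f 3 > gen.out.hyp
```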

Citation

If you find this repository useful, please consider citing our work:

```bibtex
@article{ge2022edgeformer,
  title={EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation},
  author={Ge, Tao and Wei, Furu},
  journal={arXiv preprint arXiv:2202.07959},
  year={2022}
}
```

Acknowledgement

This repository is built on top of the Fairseq codebase.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using EdgeFormer models, please submit a GitHub issue.

For other communications related to EdgeFormer, please contact Tao Ge (tage@microsoft.com) or Furu Wei (fuwei@microsoft.com).