
Training a pointer-generator model on the Extreme Summarization dataset

1. Download the Extreme Summarization data and preprocess it

Follow the instructions here to obtain the original Extreme Summarization dataset. You should have six files, {train,validation,test}.{document,summary}.

2. Create a vocabulary and extend it with source position markers

vocab_size=10000
position_markers=1000
export LC_ALL=C
cat train.document train.summary |
  tr -s '[:space:]' '\n' |
  sort |
  uniq -c |
  sort -k1,1bnr -k2 |
  head -n "$((vocab_size - 4))" |
  awk '{ print $2 " " $1 }' >dict.pg.txt
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt

This creates the file dict.pg.txt, which contains the 9,996 most frequent words (fairseq adds four special symbols of its own, bringing the vocabulary to 10k entries), followed by 1k source position markers:

the 4954867
. 4157552
, 3439668
to 2212159
a 1916857
of 1916820
and 1823350
...
<unk-0> 0
<unk-1> 0
<unk-2> 0
<unk-3> 0
<unk-4> 0
...
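For readers who prefer to follow the logic outside the shell, here is a minimal Python sketch of the same idea. The function name `build_vocab` and the `reserved` parameter are illustrative and not part of fairseq. The alphabetical tie-breaking and whitespace splitting only approximate the `sort` and `tr` behavior under `LC_ALL=C`.

```python
from collections import Counter


def build_vocab(lines, vocab_size=10000, position_markers=1000, reserved=4):
    """Return (token, count) pairs in the order they would be written to dict.pg.txt."""
    counts = Counter(token for line in lines for token in line.split())
    # Most frequent first; ties broken alphabetically (an approximation of
    # `sort -k1,1bnr -k2` under LC_ALL=C).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep room for the special symbols, as `head -n $((vocab_size - 4))` does.
    entries = ranked[: vocab_size - reserved]
    # Append the source position markers with a count of zero.
    entries += [("<unk-{}>".format(n), 0) for n in range(position_markers)]
    return entries


# Toy corpus, for illustration only.
toy = ["the cat sat .", "the dog sat ."]
for token, count in build_vocab(toy, vocab_size=7, position_markers=2):
    print(token, count)
```

With the toy corpus, only the three most frequent tokens survive the cut, followed by the two markers.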

3. Preprocess the text data

./preprocess.py --source train.document --target train.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out train.pg.src --target-out train.pg.tgt
./preprocess.py --source validation.document --target validation.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out valid.pg.src --target-out valid.pg.tgt
./preprocess.py --source test.document --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out test.pg.src

The data should now contain <unk-N> tokens in place of out-of-vocabulary words.
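The replacement can be pictured with the following sketch. It is consistent with the example at the end of this README, where N in `<unk-N>` is the zero-based position of the word in the source sequence, but it is not the code of preprocess.py itself. The function name `replace_oovs`, and the choice to reuse the first position for a repeated out-of-vocabulary word, are assumptions.

```python
def replace_oovs(source_line, target_line, vocabulary):
    """Replace OOV words with <unk-N>, where N is the word's source position."""
    word_to_pos = {}
    source_out = []
    for position, token in enumerate(source_line.split()):
        if token not in vocabulary:
            # Assumption: a repeated OOV word reuses the position of its
            # first occurrence in the source.
            oov_pos = word_to_pos.setdefault(token, position)
            token = "<unk-{}>".format(oov_pos)
        source_out.append(token)

    target_out = None
    if target_line is not None:
        # A target word gets a marker only if it is an OOV word that also
        # occurs in the source; everything else is left untouched.
        target_out = " ".join(
            "<unk-{}>".format(word_to_pos[t]) if t in word_to_pos else t
            for t in target_line.split()
        )
    return " ".join(source_out), target_out


# Toy vocabulary and sentences, for illustration only.
vocabulary = {"de", "moved", "to", "in", "june", "has"}
src, tgt = replace_oovs(
    "de roon moved to teesside in june", "de roon has moved", vocabulary
)
print(src)
print(tgt)
```

Passing `None` as the target mirrors the test set call above, which has no `--target` argument.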

4. Binarize the dataset

fairseq-preprocess \
  --source-lang src \
  --target-lang tgt \
  --trainpref train.pg \
  --validpref valid.pg \
  --destdir bin \
  --workers 60 \
  --srcdict dict.pg.txt \
  --joined-dictionary

5. Train a model

total_updates=20000
warmup_updates=500
lr=0.001
max_tokens=4096
update_freq=4
pointer_layer=-2

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --max-tokens "$max_tokens" \
    --task translation \
    --source-lang src --target-lang tgt \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --required-batch-size-multiple 1 \
    --arch transformer_pointer_generator \
    --alignment-layer "$pointer_layer" \
    --alignment-heads 1 \
    --source-position-markers 1000 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler inverse_sqrt --lr "$lr" --max-update "$total_updates" --warmup-updates "$warmup_updates" \
    --update-freq "$update_freq" \
    --skip-invalid-size-inputs-valid-test

Above we specify that our dictionary contains 1000 source position markers, and that we want to use one attention head from the penultimate decoder layer for pointing. Training should take about 5.5 hours on one node with eight 32GB V100 GPUs. The logged messages confirm that dictionary indices from 10000 to 10999, which belong to the position markers, will be mapped to the <unk> embedding:

2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [src] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [tgt] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.src
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.tgt
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | bin valid src-tgt 11332 examples
2020-09-24 20:43:53 | INFO | fairseq.models.transformer_pg | dictionary indices from 10000 to 10999 will be mapped to 3
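The index arithmetic behind that last log line can be checked with a small sketch. It assumes that fairseq reserves dictionary indices 0 to 3 for its special symbols, so that the entry on zero-based line k of dict.pg.txt receives index 4 + k, and that the training corpus has at least 9,996 distinct words. The helper names are illustrative and not part of the model code.

```python
NUM_SPECIAL = 4          # assumed: special symbols occupy indices 0..3
UNK_INDEX = 3            # from the log line "... will be mapped to 3"
VOCAB_SIZE = 10000       # vocab_size used when creating dict.pg.txt
POSITION_MARKERS = 1000  # position_markers used when creating dict.pg.txt


def dict_line_to_index(line_number):
    """Index assigned to the entry on the given 0-based line of dict.pg.txt."""
    return NUM_SPECIAL + line_number


def embedding_index(dictionary_index):
    """Map position-marker indices onto the <unk> embedding; keep the rest."""
    if dictionary_index >= VOCAB_SIZE:
        return UNK_INDEX
    return dictionary_index


# 9996 word lines precede the line holding <unk-0>.
first_marker_line = VOCAB_SIZE - NUM_SPECIAL
last_marker_line = first_marker_line + POSITION_MARKERS - 1
print(dict_line_to_index(first_marker_line))  # 10000
print(dict_line_to_index(last_marker_line))   # 10999
print(embedding_index(10000), embedding_index(42))  # 3 42
```

The total of 4 + 9,996 + 1,000 entries matches the 11000 types reported in the log.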

6. Summarize the test sequences

batch_size=32
beam_size=6
max_length=60
length_penalty=1.0

fairseq-interactive bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --batch-size "$batch_size" \
    --task translation \
    --source-lang src --target-lang tgt \
    --path checkpoints/checkpoint_last.pt \
    --input test.pg.src \
    --buffer-size 200 \
    --max-len-a 0 \
    --max-len-b "$max_length" \
    --lenpen "$length_penalty" \
    --beam "$beam_size" \
    --skip-invalid-size-inputs-valid-test |
    tee generate.out
grep ^H generate.out | cut -f 3- >generate.hyp

Now you should have the generated sequences in generate.hyp. They contain <unk-N> tokens that the model has copied from the source sequence. In order to retrieve the original words, we need the unprocessed source sequences from test.document.
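The `grep ^H generate.out | cut -f 3-` step can also be written in Python, which may be convenient if further filtering is needed. The sketch below assumes that hypothesis lines are tab-separated, with an `H-<id>` label, a score, and then the hypothesis text. The function name and the sample lines, including the score, are made up for illustration.

```python
def extract_hypotheses(lines):
    """Return the text of every line starting with 'H', dropping the first two tab fields."""
    hypotheses = []
    for line in lines:
        if line.startswith("H"):
            fields = line.rstrip("\n").split("\t")
            # Equivalent of `cut -f 3-`: keep the third field onwards.
            hypotheses.append("\t".join(fields[2:]))
    return hypotheses


# Illustrative sample of generate.out; the values are invented.
sample = [
    "S-0\tde <unk-1> moved to <unk-4>\n",
    "H-0\t-0.5\tde <unk-1> has joined\n",
    "P-0\t-0.1 -0.2\n",
]
print(extract_hypotheses(sample))
```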

7. Process the generated output

Because inputs that were too long were skipped when producing generate.hyp, we must skip the same sequences when reading test.document.

./postprocess.py \
    --source <(awk 'NF<1024' test.document) \
    --target generate.hyp \
    --target-out generate.hyp.processed

You will now find the final sequences in generate.hyp.processed, with each <unk-N> replaced by the original word from the source sequence.
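The postprocessing step can be sketched as follows. This is not the code of postprocess.py: the function names are hypothetical, and leaving a marker unchanged when its position is out of range is a choice made for the sketch. The `keep_short` helper approximates `awk 'NF<1024'` with Python's whitespace splitting.

```python
import re

# Matches position markers such as <unk-12>, but not the plain <unk> token.
OOV_RE = re.compile(r"^<unk-([0-9]+)>$")


def keep_short(source_lines, max_tokens=1024):
    """Keep only lines with fewer than max_tokens whitespace-separated fields."""
    return [line for line in source_lines if len(line.split()) < max_tokens]


def restore_oovs(source_line, hyp_line):
    """Replace each <unk-N> in the hypothesis with the N-th source token."""
    source_tokens = source_line.split()
    out = []
    for token in hyp_line.split():
        match = OOV_RE.match(token)
        # Sketch choice: an out-of-range marker is left as it is.
        if match and int(match.group(1)) < len(source_tokens):
            token = source_tokens[int(match.group(1))]
        out.append(token)
    return " ".join(out)


source = "de roon moved to teesside in june 2016 for an initial # 8.8 m fee"
hyp = "middlesbrough striker <unk> de <unk-1> has joined spanish side <unk> on a season-long loan ."
print(restore_oovs(source, hyp))
```

Plain `<unk>` tokens carry no position, so they cannot be recovered and stay in the output.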

An example of a summarized sequence

The original source document in test.document:

de roon moved to teesside in june 2016 for an initial # 8.8 m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .

The preprocessed source document in test.pg.src:

de <unk-1> moved to <unk-4> in june 2016 for an initial # <unk-12> m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .

The generated summary in generate.hyp:

middlesbrough striker <unk> de <unk-1> has joined spanish side <unk> on a season-long loan .

The generated summary after postprocessing in generate.hyp.processed:

middlesbrough striker <unk> de roon has joined spanish side <unk> on a season-long loan .