Transformer with Pointer-Generator Network

This page describes the transformer_pointer_generator model, which adds a pointing mechanism to the Transformer model to facilitate copying input words to the output. The architecture is described in Enarvi et al. (2020).

Background

The pointer-generator network was introduced in See et al. (2017) for RNN encoder-decoder attention models. A similar mechanism can be incorporated in a Transformer model by reusing one of the many attention distributions for pointing. The attention distribution over the input words is interpolated with the normal output distribution over the vocabulary words. This allows the model to generate words that appear in the input, even if they don't appear in the vocabulary, helping especially with small vocabularies.
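The interpolation described above can be illustrated with a minimal sketch in plain Python. It is not the Fairseq code: the function name and the gate `p_gen` (the weight given to the vocabulary distribution, following the notation of See et al.) are assumptions made for illustration, and the model itself operates on batched tensors.

```python
def mix_distributions(p_gen, vocab_probs, attention, source_ids):
    """Interpolate a vocabulary distribution with a copy distribution.

    p_gen       -- assumed gate in [0, 1]; weight of the vocabulary distribution
    vocab_probs -- probabilities over the vocabulary (sums to 1)
    attention   -- attention weights over the source positions (sums to 1)
    source_ids  -- vocabulary index of the token at each source position
    """
    # Start from the scaled generation distribution.
    mixed = [p_gen * p for p in vocab_probs]
    # Add the copy probability of each source position to the entry of
    # the token found at that position.
    for weight, token_id in zip(attention, source_ids):
        mixed[token_id] += (1.0 - p_gen) * weight
    return mixed


# Toy example: 3-word vocabulary, 2 source positions that both hold token 2.
vocab_probs = [0.1, 0.6, 0.3]
attention = [0.5, 0.5]
source_ids = [2, 2]
mixed = mix_distributions(0.8, vocab_probs, attention, source_ids)
```

Because both inputs are probability distributions and the weights `p_gen` and `1 - p_gen` sum to one, the result is again a distribution over the vocabulary.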

Implementation

The mechanism for copying out-of-vocabulary words from the input has been implemented differently from See et al. (2017). Their implementation conveys the word identities through the model in order to be able to produce words that appear in the input sequence but not in the vocabulary. A different approach was taken in the Fairseq implementation to keep it self-contained in the model file, avoiding any changes to the rest of the code base. Copying out-of-vocabulary words is made possible by pre-processing the input and post-processing the output. This is described in detail in the next section.

Usage

The training and evaluation procedure is outlined below. You can also find a more detailed example for the XSum dataset on this page.

1. Create a vocabulary and extend it with source position markers

The pointing mechanism is especially helpful with small vocabularies, provided that we can recover the identities of any out-of-vocabulary words that are copied from the input. For this purpose, the model allows extending the vocabulary with special tokens that can be used in place of <unk> tokens to identify different input positions. For example, the user may add <unk-0>, <unk-1>, <unk-2>, etc. to the end of the vocabulary, after the normal words. Below is an example of how to create a vocabulary of the 10000 most common words and add 1000 input position markers.

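# Size of the regular vocabulary; 4 entries are left free for special symbols.
vocab_size=10000
# Number of <unk-N> source position markers to append.
position_markers=1000
export LC_ALL=C
# Count the words in the training data, keep the most frequent ones, and
# write them to dict.pg.txt as "word count" lines.
cat train.src train.tgt |
  tr -s '[:space:]' '\n' |
  sort |
  uniq -c |
  sort -k1,1bnr -k2 |
  head -n "$((vocab_size - 4))" |
  awk '{ print $2 " " $1 }' >dict.pg.txt
# Append the position markers <unk-0>, <unk-1>, ... with a count of 0.
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt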

2. Preprocess the text data

The idea is that any <unk> token in the text is replaced with <unk-0> if it appears in the first input position, <unk-1> if it appears in the second input position, and so on. This can be achieved using the preprocess.py script that is provided in this directory.
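The replacement rule can be sketched for a single source sentence as follows. This is only an illustration of the idea, not the contents of preprocess.py: the function name, the set-based vocabulary, and the fallback to a plain <unk> when a position has no marker are all assumptions.

```python
def mark_oov_positions(tokens, vocab, num_markers):
    """Replace out-of-vocabulary tokens with position markers.

    tokens      -- list of words in one source sentence
    vocab       -- set of in-vocabulary words (illustrative representation)
    num_markers -- number of <unk-N> markers that were added to the dictionary
    """
    marked = []
    for position, token in enumerate(tokens):
        if token in vocab:
            marked.append(token)
        elif position < num_markers:
            # Encode the (0-based) input position in the marker.
            marked.append("<unk-{}>".format(position))
        else:
            # Assumption: positions without a marker fall back to <unk>.
            marked.append("<unk>")
    return marked


vocab = {"the", "cat", "sat"}
marked = mark_oov_positions(["the", "ocelot", "sat"], vocab, 1000)
```

Here the out-of-vocabulary word in the second input position becomes <unk-1>, while the in-vocabulary words are left untouched.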

3. Train a model

The number of source position markers (the <unk-N> tokens added to the vocabulary) is given to the model with the --source-position-markers argument. The model simply maps all of these tokens to the same word embedding as <unk>.
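One way to picture this mapping is as an index rewrite that happens before the embedding lookup. The sketch below is a simplified illustration, not the model code: it assumes that the markers occupy the last indices of the dictionary (as when they are appended after the normal words), and the names `num_embeddings` and `unk_index` are hypothetical.

```python
def map_markers_to_unk(token_ids, num_embeddings, unk_index):
    """Map every index outside the embedding table to the <unk> index.

    token_ids      -- list of dictionary indices for one sentence
    num_embeddings -- assumed size of the embedding table, i.e. the
                      dictionary size minus the number of position markers
    unk_index      -- dictionary index of the <unk> token
    """
    return [unk_index if i >= num_embeddings else i for i in token_ids]


# Toy dictionary: indices 0-9 are regular entries, 10 and above are markers.
mapped = map_markers_to_unk([4, 10, 7, 12], num_embeddings=10, unk_index=3)
```

With this arrangement the embedding table does not grow with the number of markers. The markers only carry position information through the input and output text.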

The attention distribution that is used for pointing is selected using the --alignment-heads and --alignment-layer command-line arguments in the same way as with the transformer_align model.

4. Generate text and postprocess it

When generating text, preprocess the input the same way as the training data, replacing out-of-vocabulary words with <unk-N> tokens. If any of these tokens are copied to the output, the actual words can be retrieved from the unprocessed input text: each <unk-N> token is replaced with the word at position N in the original input sequence. This can be achieved using the postprocess.py script.
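The postprocessing rule can be sketched for one sentence pair as shown below. This is an illustration of the idea rather than the contents of postprocess.py: the function name is an assumption, and so is the choice to leave a marker in place when its position lies outside the source sentence.

```python
import re

# Matches a position marker such as <unk-12> and captures the position.
MARKER = re.compile(r"<unk-(\d+)>")


def restore_copied_words(output_tokens, source_tokens):
    """Replace <unk-N> in the output with word N of the original source.

    output_tokens -- tokens generated by the model
    source_tokens -- tokens of the unprocessed input sentence
    """
    restored = []
    for token in output_tokens:
        match = MARKER.fullmatch(token)
        if match and int(match.group(1)) < len(source_tokens):
            restored.append(source_tokens[int(match.group(1))])
        else:
            # Assumption: anything that cannot be resolved is kept as-is.
            restored.append(token)
    return restored


source = ["the", "ocelot", "sat"]
restored = restore_copied_words(["<unk-1>", "sat", "down"], source)
```

The source tokens must come from the original, unprocessed input, since the preprocessed input contains the markers rather than the actual words.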