Pretraining RoBERTa using your own data
This tutorial will walk you through pretraining RoBERTa over your own data.
1) Preprocess the data
Data should be preprocessed following the language modeling format, i.e. documents should be separated from each other by an empty line (the separator only has an effect with --sample-break-mode complete_doc). Lines will be concatenated as a 1D text stream during training.
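As an illustration of the format described above, here is a minimal sketch that joins a list of documents into a single text with one empty line between documents. The helper name format_lm_corpus is hypothetical and not part of fairseq; it also drops blank lines inside a document, which is an assumption made here so that they are not mistaken for document boundaries.

```python
def format_lm_corpus(documents):
    """Join documents into language-modeling format (hypothetical helper).

    Each document is written as consecutive non-empty lines, and documents
    are separated from each other by exactly one empty line.
    """
    blocks = []
    for doc in documents:
        # Assumption: blank lines inside a document are dropped so they
        # cannot be confused with document boundaries.
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        if lines:
            blocks.append("\n".join(lines))
    # One empty line between documents, single trailing newline at the end.
    return "\n\n".join(blocks) + "\n" if blocks else ""


corpus = format_lm_corpus(["First document.\nSecond line.", "Another document."])
```

The resulting string can be written to a .raw file and passed to the BPE encoding step in place of the WikiText-103 files.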
We'll use the WikiText-103 dataset to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course this dataset is quite small, so the resulting pretrained model will perform poorly, but it gives the general idea.
First download the dataset:
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
Next encode it with the GPT-2 BPE:
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json gpt2_bpe/encoder.json \
--vocab-bpe gpt2_bpe/vocab.bpe \
--inputs wikitext-103-raw/wiki.${SPLIT}.raw \
--outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
--keep-empty \
--workers 60; \
done
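A quick sanity check after encoding can catch silent problems. The sketch below assumes that, with --keep-empty, the encoder writes one output line per input line, so the .raw and .bpe files should have the same number of lines. Both helper names are hypothetical and not part of fairseq.

```python
def count_lines(lines):
    """Count the items of any iterable of lines (an open file or a list)."""
    return sum(1 for _ in lines)


def check_line_alignment(raw_path, bpe_path):
    """Raise if the raw and BPE-encoded files differ in line count.

    Assumption: with --keep-empty the encoder emits one output line per
    input line, so equal counts are expected.
    """
    with open(raw_path, encoding="utf-8") as raw, \
            open(bpe_path, encoding="utf-8") as bpe:
        n_raw, n_bpe = count_lines(raw), count_lines(bpe)
    if n_raw != n_bpe:
        raise ValueError(f"{raw_path}: {n_raw} lines, {bpe_path}: {n_bpe} lines")
    return n_raw
```

For example, check_line_alignment("wikitext-103-raw/wiki.valid.raw", "wikitext-103-raw/wiki.valid.bpe") should return the shared line count if the assumption holds.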
Finally preprocess/binarize the data using the GPT-2 fairseq dictionary:
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
--only-source \
--srcdict gpt2_bpe/dict.txt \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60
2) Train RoBERTa base
DATA_DIR=data-bin/wikitext-103
fairseq-hydra-train -m --config-dir examples/roberta/config/pretraining \
--config-name base task.data=$DATA_DIR
Note: You can optionally resume training the released RoBERTa base model by adding checkpoint.restore_file=/path/to/roberta.base/model.pt to the command.
Note: The above command assumes training on 8x32GB V100 GPUs. Each GPU uses a batch size of 16 sequences (dataset.batch_size) and accumulates gradients to further increase the batch size by 16x (optimization.update_freq), for a total batch size of 2048 sequences (8 GPUs x 16 sequences x 16 accumulation steps). If you have fewer GPUs or GPUs with less memory, you may need to reduce dataset.batch_size and increase optimization.update_freq to compensate. Alternatively, if you have more GPUs, you can decrease optimization.update_freq accordingly to increase training speed.
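The batch size arithmetic in the note above can be sketched as follows. The function names are hypothetical; only the three quantities (number of GPUs, dataset.batch_size, optimization.update_freq) come from the tutorial.

```python
def total_batch_size(num_gpus, batch_size, update_freq):
    """Sequences per optimizer step: GPUs x per-GPU batch x accumulation steps."""
    return num_gpus * batch_size * update_freq


def update_freq_for(target_total, num_gpus, batch_size):
    """Accumulation steps needed to reach target_total sequences per update."""
    per_step = num_gpus * batch_size
    if target_total % per_step != 0:
        raise ValueError(
            f"{target_total} is not a multiple of {num_gpus} x {batch_size}"
        )
    return target_total // per_step


# The default setup from the note: 8 GPUs, 16 sequences each, 16 accumulation steps.
default_total = total_batch_size(num_gpus=8, batch_size=16, update_freq=16)

# Fewer, smaller GPUs: 4 GPUs that only fit 8 sequences each.
small_setup_freq = update_freq_for(target_total=2048, num_gpus=4, batch_size=8)
```

The returned value is what you would pass as optimization.update_freq to keep the total batch size unchanged on different hardware.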
Note: The learning rate and batch size are tightly connected and need to be adjusted together. We generally recommend increasing the learning rate as you increase the batch size according to the following table (although it's also dataset dependent, so don't rely on the following values too closely):
| batch size (sequences) | peak learning rate |
|---|---|
| 256 | 0.0001 |
| 2048 | 0.0005 |
| 8192 | 0.0007 |
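For batch sizes between the rows of the table, one possible starting point is to interpolate between neighboring rows. The sketch below interpolates linearly in log2 of the batch size and clamps outside the table's range. This interpolation rule is an assumption made for illustration, not a recommendation from the tutorial, and the function name is hypothetical. Treat the result as a first guess to tune from.

```python
import math

# (batch size in sequences, peak learning rate), copied from the table above.
RECOMMENDED = [(256, 1e-4), (2048, 5e-4), (8192, 7e-4)]


def suggest_peak_lr(batch_size):
    """Rough starting learning rate for a given total batch size.

    Assumption (not from the tutorial): interpolate linearly in
    log2(batch size) between table rows and clamp outside the table.
    """
    if batch_size <= RECOMMENDED[0][0]:
        return RECOMMENDED[0][1]
    if batch_size >= RECOMMENDED[-1][0]:
        return RECOMMENDED[-1][1]
    for (b0, lr0), (b1, lr1) in zip(RECOMMENDED, RECOMMENDED[1:]):
        if b0 <= batch_size <= b1:
            frac = (math.log2(batch_size) - math.log2(b0)) / (
                math.log2(b1) - math.log2(b0)
            )
            return lr0 + frac * (lr1 - lr0)


lr_for_1024 = suggest_peak_lr(1024)
```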
3) Load your pretrained model
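import torch
from fairseq.models.roberta import RobertaModel
# Arguments: checkpoint directory, checkpoint file name, binarized data directory (it holds dict.txt)
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data')
assert isinstance(roberta.model, torch.nn.Module)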