# Pretraining RoBERTa using your own data

This tutorial will walk you through pretraining RoBERTa over your own data.

### 1) Preprocess the data

Data should be preprocessed following the [language modeling format](/examples/language_model), i.e. each document should be separated by an empty line (the document boundaries only matter when training with `--sample-break-mode complete_doc`). Lines will be concatenated as a 1D text stream during training.
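
For illustration, a minimal raw-text file in this layout might look like the
following sketch (the `corpus/` directory and its contents are made up for this
example; the WikiText-103 files used below already come as plain text):

```bash
mkdir -p corpus

# Two toy documents: a document may span several lines, and consecutive
# documents are separated by a single empty line.
cat > corpus/train.raw << 'EOF'
This is the first document. It spans
two lines of plain text.

This is the second document, starting after an empty line.
EOF
```
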
We'll use the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course
this dataset is quite small, so the resulting pretrained model will perform
poorly, but it gives the general idea.

First download the dataset:

```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
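
As a quick, optional sanity check, peek at the extracted files; they should be
plain text, one `.raw` file per split (the exact line counts do not matter):

```bash
# The archive extracts to wikitext-103-raw/ with one .raw file per split
head -n 5 wikitext-103-raw/wiki.valid.raw
wc -l wikitext-103-raw/wiki.train.raw wikitext-103-raw/wiki.valid.raw wikitext-103-raw/wiki.test.raw
```
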
Next encode it with the GPT-2 BPE:

```bash
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done
```
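
After encoding, each `.bpe` file should contain space-separated GPT-2 BPE token
ids rather than raw text; a quick look confirms the encoding ran:

```bash
# Each non-empty line is now a sequence of integer BPE token ids
head -n 3 wikitext-103-raw/wiki.valid.bpe
```
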
Finally preprocess/binarize the data using the GPT-2 fairseq dictionary:

```bash
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
```
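
The binarized dataset lands in `data-bin/wikitext-103`. As a rough expectation
(exact file names may vary slightly across fairseq versions), it should contain
the copied dictionary plus an index/data pair per split:

```bash
# Typically: dict.txt, preprocess.log and train/valid/test .bin/.idx files
ls data-bin/wikitext-103
```
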
### 2) Train RoBERTa base

```bash
DATA_DIR=data-bin/wikitext-103

fairseq-hydra-train -m --config-dir examples/roberta/config/pretraining \
--config-name base task.data=$DATA_DIR
```

**Note:** You can optionally resume training the released RoBERTa base model by
adding `checkpoint.restore_file=/path/to/roberta.base/model.pt`.
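
For example, reusing the training command from above (the checkpoint path is a
placeholder; point it at wherever you downloaded the released model):

```bash
# Sketch: continue pretraining from the released RoBERTa base checkpoint
fairseq-hydra-train -m --config-dir examples/roberta/config/pretraining \
--config-name base task.data=$DATA_DIR \
checkpoint.restore_file=/path/to/roberta.base/model.pt
```
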
**Note:** The above command assumes training on 8x32GB V100 GPUs. Each GPU uses
a batch size of 16 sequences (`dataset.batch_size`) and accumulates gradients to
further increase the batch size by 16x (`optimization.update_freq`), for a total
batch size of 2048 sequences. If you have fewer GPUs or GPUs with less memory, you
may need to reduce `dataset.batch_size` and increase `optimization.update_freq` to
compensate. Alternatively, if you have more GPUs you can decrease
`optimization.update_freq` accordingly to increase training speed.
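
For example, here is a sketch of such an override for 4 GPUs that only fit 8
sequences each, keeping the total batch size at 2048 sequences. The keys are the
ones named above from the base pretraining config; `update_freq` is list-valued,
hence the bracket syntax (quoted so the shell does not expand it):

```bash
# 4 GPUs x 8 sequences per GPU x update_freq 64 = 2048 sequences per update
fairseq-hydra-train -m --config-dir examples/roberta/config/pretraining \
--config-name base task.data=$DATA_DIR \
dataset.batch_size=8 'optimization.update_freq=[64]'
```
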
**Note:** The learning rate and batch size are tightly connected and need to be
adjusted together. We generally recommend increasing the learning rate as you
increase the batch size according to the following table (although it's also
dataset dependent, so don't rely on the following values too closely):

batch size | peak learning rate
---|---
256 | 0.0001
2048 | 0.0005
8192 | 0.0007
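
For instance, dropping to a total batch size of 256 sequences (8 GPUs x
`dataset.batch_size=8` x `optimization.update_freq=4`) would pair with a peak
learning rate of 0.0001 per the table. Here is a sketch of the corresponding
overrides, using the same hydra syntax as above and assuming the peak learning
rate is exposed as the list-valued `optimization.lr`:

```bash
# Total batch size 8 x 8 x 4 = 256 sequences => peak learning rate 0.0001
fairseq-hydra-train -m --config-dir examples/roberta/config/pretraining \
--config-name base task.data=$DATA_DIR \
dataset.batch_size=8 'optimization.update_freq=[4]' 'optimization.lr=[0.0001]'
```
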
### 3) Load your pretrained model

```python
import torch
from fairseq.models.roberta import RobertaModel

# Load the best checkpoint written during step 2; the last argument must point
# to the directory containing the binarized data and its dict.txt.
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data')
assert isinstance(roberta.model, torch.nn.Module)
```