
# RoBERTa Pretrained on Smaller Datasets

We pretrain RoBERTa on smaller datasets (1M, 10M, 100M, and 1B tokens). For each pretraining data size, we release the 3 models with the lowest validation perplexities out of 25 runs (10 runs in the case of 1B tokens). The pretraining data reproduces that of BERT: we combine English Wikipedia and a reproduction of BookCorpus built from Smashwords texts, in a ratio of approximately 3:1.
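A released checkpoint can be loaded with the `transformers` library. The snippet below is a minimal sketch that assumes the checkpoints are hosted under the `nyu-mll` organization on the Hub; substitute the repository name of the model you want from the table below.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed repository name; swap in any model from the table below,
# e.g. "nyu-mll/roberta-med-small-1M-1".
repo_id = "nyu-mll/roberta-base-1B-1"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Quick sanity check: run a short masked sentence through the model.
inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, vocab_size)
```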

## Hyperparameters and Validation Perplexity

The hyperparameters and validation perplexities corresponding to each model are as follows:

| Model Name | Training Size | Model Size | Max Steps | Batch Size | Validation Perplexity |
|---|---|---|---|---|---|
| roberta-base-1B-1 | 1B | BASE | 100K | 512 | 3.93 |
| roberta-base-1B-2 | 1B | BASE | 31K | 1024 | 4.25 |
| roberta-base-1B-3 | 1B | BASE | 31K | 4096 | 3.84 |
| roberta-base-100M-1 | 100M | BASE | 100K | 512 | 4.99 |
| roberta-base-100M-2 | 100M | BASE | 31K | 1024 | 4.61 |
| roberta-base-100M-3 | 100M | BASE | 31K | 512 | 5.02 |
| roberta-base-10M-1 | 10M | BASE | 10K | 1024 | 11.31 |
| roberta-base-10M-2 | 10M | BASE | 10K | 512 | 10.78 |
| roberta-base-10M-3 | 10M | BASE | 31K | 512 | 11.58 |
| roberta-med-small-1M-1 | 1M | MED-SMALL | 100K | 512 | 153.38 |
| roberta-med-small-1M-2 | 1M | MED-SMALL | 10K | 512 | 134.18 |
| roberta-med-small-1M-3 | 1M | MED-SMALL | 31K | 512 | 139.39 |

The hyperparameters corresponding to the model sizes mentioned above are as follows:

| Model Size | L | AH | HS | FFN | P |
|---|---|---|---|---|---|
| BASE | 12 | 12 | 768 | 3072 | 125M |
| MED-SMALL | 6 | 8 | 512 | 2048 | 45M |

(L = number of layers; AH = number of attention heads; HS = hidden size; FFN = feedforward network dimension; P = number of parameters.)
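As an illustration of how these sizes map onto a model configuration, the sketch below builds an untrained model with the MED-SMALL dimensions using `transformers.RobertaConfig`. The vocabulary and position-embedding settings are left at the standard RoBERTa defaults and are assumptions; the released checkpoints may differ, so treat this as a rough correspondence rather than the exact configuration.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# MED-SMALL dimensions from the table above; other fields (vocab_size,
# max_position_embeddings, etc.) use RobertaConfig defaults and are assumptions.
config = RobertaConfig(
    num_hidden_layers=6,     # L
    num_attention_heads=8,   # AH
    hidden_size=512,         # HS
    intermediate_size=2048,  # FFN
)

model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # roughly 45M
```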

For the other hyperparameters, we select the following (a schedule sketch follows the list):

- Peak learning rate: 5e-4
- Warmup steps: 6% of max steps
- Dropout: 0.1
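The warmup fraction translates into an optimizer schedule along these lines. This is only a hedged sketch using PyTorch's AdamW and the `transformers` linear-warmup helper, assuming a linear decay to zero after warmup; it is not the exact pretraining code.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Illustrative values matching the roberta-base-1B-1 row in the table above.
max_steps = 100_000
warmup_steps = int(0.06 * max_steps)  # 6% of max steps
peak_lr = 5e-4

# `model` stands in for any torch.nn.Module, e.g. one loaded as shown earlier.
model = torch.nn.Linear(8, 8)  # placeholder module for the sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=max_steps,
)
```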