metadata
license: apache-2.0
base_model: bert-base-multilingual-uncased
tags:
  - generated_from_trainer
model-index:
  - name: ODABert
    results: []
datasets:
  - alex-miller/oecd-dac-crs
widget:
  - text: Official Development [MASK].
    example_title: ODA
  - text: Climate adaptation and climate [MASK].
    example_title: Climate

ODABert

This model is a fine-tuned version of bert-base-multilingual-uncased on the OECD DAC CRS project titles and descriptions dataset. It achieves the following results on the evaluation set:

  • Loss: 0.9961
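For a rough sense of scale, the evaluation loss can be converted to a perplexity. The sketch below assumes the reported loss is the mean cross-entropy over masked tokens, measured in nats, which is the usual convention for masked language modeling with the Transformers Trainer. The function name is illustrative and not part of this model's code.

```python
import math


def mlm_perplexity(eval_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to a perplexity."""
    return math.exp(eval_loss)


# Evaluation loss reported above; the result is roughly 2.71.
perplexity = mlm_perplexity(0.9961)
print(round(perplexity, 3))
```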

Model description

A three-epoch fine-tune of BERT base multilingual uncased on development and humanitarian finance project titles and descriptions from the OECD DAC Creditor Reporting System (CRS). The base model's vocabulary was expanded by 1,059 tokens (about a 1% increase), chosen as the most prevalent tokens in the CRS texts that were missing from the base vocabulary.
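The vocabulary expansion described above can be sketched as follows. This is an illustration under assumptions, not the procedure actually used for ODABert: the card does not say how candidate tokens were split or counted, so the lowercase whitespace tokenization here is a stand-in, and both helper names are hypothetical. The only library calls are `add_tokens` on the tokenizer and `resize_token_embeddings` on the model, both from Transformers.

```python
from collections import Counter


def select_new_tokens(texts, base_vocab, max_new_tokens=1059):
    """Return the most frequent lowercase whitespace tokens absent from base_vocab.

    Assumption: whitespace splitting stands in for whatever tokenization
    the original selection used.
    """
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in base_vocab
    )
    return [word for word, _ in counts.most_common(max_new_tokens)]


def expand_vocabulary(tokenizer, model, new_tokens):
    """Add tokens to a Transformers tokenizer and resize the model embeddings."""
    num_added = tokenizer.add_tokens(new_tokens)
    # New embedding rows are freshly initialized and learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    return num_added


# Toy corpus and vocabulary, invented purely for illustration.
corpus = ["WASH programme in rural districts", "WASH and nutrition programme"]
vocab = {"in", "and", "rural", "districts", "nutrition"}
new_tokens = select_new_tokens(corpus, vocab, max_new_tokens=2)
print(new_tokens)
```

In practice `base_vocab` would come from `tokenizer.get_vocab()`, and `expand_vocabulary` would be called before fine-tuning so the new embeddings are trained on the CRS texts.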

Intended uses & limitations

Developed as an experiment to test whether fine-tuning on the CRS improves classifier models built on top of this masked language model (MLM). Although it is built on a multilingual model and the fine-tuning texts include other languages, English is the most prevalent language in the training data.
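A minimal sketch of building a classifier on top of this model is shown below. The Hub id `alex-miller/ODABert` is an assumption inferred from the model name and the dataset owner, and the label names are hypothetical. Loading through `AutoModelForSequenceClassification` adds a randomly initialized classification head that still has to be trained on labeled data.

```python
MODEL_ID = "alex-miller/ODABert"  # assumed Hub id, not confirmed by this card


def make_label_maps(labels):
    """Build the id2label / label2id mappings expected by Transformers configs."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_classifier(labels, model_id=MODEL_ID):
    """Load the tokenizer and the encoder with an untrained classification head."""
    # Imported lazily so the label helper above works without Transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = make_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical label set for illustration only.
example_labels = ["other", "climate adaptation"]
id2label, label2id = make_label_maps(example_labels)
print(id2label, label2id)
```

Calling `load_classifier(example_labels)` downloads the weights, so it is left out of the example run above.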

Training and evaluation data

See the OECD DAC CRS project titles and descriptions dataset.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3.0
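The hyperparameters above map onto Transformers `TrainingArguments` roughly as sketched below. Two assumptions are made: training ran on a single device, so the listed batch sizes are per-device values, and no warmup was used, since none is listed. The `linear_lr` helper is illustrative; it mirrors the shape of the linear schedule so the learning rate at any step can be inspected without running training.

```python
# Keyword arguments for transformers.TrainingArguments, taken from the list above.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumes a single device
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}


def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 174357 is the final step count from the Training results table.
TOTAL_STEPS = 174357
for step in (0, 58119, 116238, TOTAL_STEPS):
    print(step, linear_lr(step, TOTAL_STEPS))
```

With Transformers installed, the dictionary can be passed as `TrainingArguments(output_dir="odabert-mlm", **TRAINING_KWARGS)`, where the output directory name is a placeholder.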

Training results

| Training Loss | Epoch | Step   | Validation Loss |
|---------------|-------|--------|-----------------|
| 1.2133        | 1.0   | 58119  | 1.1296          |
| 1.098         | 2.0   | 116238 | 1.0336          |
| 1.0441        | 3.0   | 174357 | 0.9958          |
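The step counts give a rough bound on the size of the training split. The sketch below assumes one optimizer step per batch of 16 (single device, no gradient accumulation) and that a final partial batch is kept rather than dropped; none of these details are stated in the card, and the function name is illustrative.

```python
def examples_per_epoch_bounds(steps_per_epoch, batch_size):
    """Bounds on dataset size when a final partial batch still counts as a step."""
    upper = steps_per_epoch * batch_size
    lower = (steps_per_epoch - 1) * batch_size + 1
    return lower, upper


STEPS_PER_EPOCH = 58119
FINAL_STEP = 174357

# Three epochs of equal length account for the final step count.
epochs = FINAL_STEP / STEPS_PER_EPOCH
bounds = examples_per_epoch_bounds(STEPS_PER_EPOCH, 16)
print(epochs, bounds)
```

Under these assumptions the training split holds between 929,889 and 929,904 examples.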

Framework versions

  • Transformers 4.38.2
  • PyTorch 2.0.1
  • Datasets 2.18.0
  • Tokenizers 0.15.2