# Fine-tuning ModernBERT on a Large Dataset with Masked Language Modelling


This guide demonstrates how to fine-tune the ModernBERT-base model on a Dutch dataset using the code from the s-smits/modernbert-finetune repository and the Hugging Face Transformers library. We'll walk through the steps of setting up your environment, preparing the dataset, configuring the training process, and running the fine-tuning script.

## Prerequisites

  • Hugging Face Account: You'll need a Hugging Face account to access models and datasets and to push your fine-tuned model to the Hub. Sign up at huggingface.co if you don't have one yet.
  • Hugging Face API Token: Generate a User Access Token (with "write" access) from your Hugging Face profile settings. This token will be used to authenticate your interactions with the Hugging Face Hub.
  • WandB Account (Optional but Recommended): Weights & Biases (WandB) is a great tool for tracking and visualizing your training runs. Create a free account at wandb.ai.
  • WandB API Key: If you're using WandB, get your API key from your WandB settings.
  • Environment: A GPU environment is strongly recommended. We suggest using the latest PyTorch release.
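
Before installing anything, it's worth confirming that a CUDA-capable GPU is actually visible to PyTorch. A minimal check, assuming torch is already installed in your environment:

```python
import torch

# Confirm a CUDA GPU is visible; fine-tuning ModernBERT on CPU is impractical.
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected; training will be extremely slow.")
```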

## Installation

  1. Clone the Repository:

    git clone https://github.com/s-smits/modernbert-finetune.git
    cd modernbert-finetune
    
  2. Install Dependencies:

    pip install -r requirements.txt
    

    This command installs all the necessary packages listed in the requirements.txt file, including torch, datasets, huggingface-hub, transformers, and wandb. It also installs transformers from the main branch so you get the latest ModernBERT support.
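
After installation, you can verify that the installed transformers build actually recognizes ModernBERT (support only landed in recent versions). A small sanity check, assuming you have network access to the Hub:

```python
import transformers
from transformers import AutoConfig

print("transformers version:", transformers.__version__)

# This raises an error if the installed transformers build does not yet know
# the "modernbert" model type.
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
print("Loaded config for model type:", config.model_type)
```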

## Configuration

  1. Environment Variables:

    • Set the following environment variables:

      export HUGGINGFACE_TOKEN="your_huggingface_token"
      export WANDB_API_KEY="your_wandb_api_key" # Optional
      

      Replace "your_huggingface_token" with your actual Hugging Face token and "your_wandb_api_key" with your WandB API key.

  2. Script Parameters:

    • The train.py script defines several configurable parameters. Here are some of the most important ones:

      • model_checkpoint: "answerdotai/ModernBERT-base" (default, the base ModernBERT model).
      • dataset_name: "ssmits/fineweb-2-dutch" (default, a Dutch dataset). You can change this to any other dataset on the Hugging Face Hub.
      • num_train_epochs: 1 (default). Increase for longer training, but be mindful of overfitting.
      • chunk_size: 8192 (default). Adjust based on your GPU memory.
      • gradient_accumulation_steps: 32 (default). Modify based on your desired effective batch size and GPU memory.
      • per_device_train_batch_size: 1 (default). Adjust based on your GPU memory.
      • eval_size_ratio: 0.05 (default). The proportion of the dataset used for evaluation.
      • masking_probabilities: [0.3, 0.2, 0.18, 0.16, 0.14] (default). The curriculum learning masking probabilities.
    • You can modify these parameters directly in the train.py file or override them with environment variables (see the sketch below).
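
As a rough illustration of the environment-variable route, a pattern like the following could sit near the top of train.py. The variable names here are hypothetical; check train.py for the ones it actually reads:

```python
import os

# Hypothetical override pattern: read settings from the environment and fall
# back to the defaults listed above. Check train.py for the real names.
model_checkpoint = os.environ.get("MODEL_CHECKPOINT", "answerdotai/ModernBERT-base")
dataset_name = os.environ.get("DATASET_NAME", "ssmits/fineweb-2-dutch")
num_train_epochs = int(os.environ.get("NUM_TRAIN_EPOCHS", "1"))
chunk_size = int(os.environ.get("CHUNK_SIZE", "8192"))
gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", "32"))
per_device_train_batch_size = int(os.environ.get("PER_DEVICE_TRAIN_BATCH_SIZE", "1"))
```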

## Running the Fine-tuning Script

  1. Login to Hugging Face Hub:

    huggingface-cli login --token $HUGGINGFACE_TOKEN
    
  2. Login to WandB (Optional):

    wandb login --relogin
    
  3. Run the Script:

    python train.py
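
If you prefer to authenticate from Python instead of the CLI commands in steps 1 and 2, something along these lines works as well. This is a sketch that assumes the environment variables from the Configuration section are set:

```python
import os

import wandb
from huggingface_hub import login

# Authenticate with the Hugging Face Hub using the token set earlier.
login(token=os.environ["HUGGINGFACE_TOKEN"])

# Optional: log in to Weights & Biases if an API key is available.
if os.environ.get("WANDB_API_KEY"):
    wandb.login(key=os.environ["WANDB_API_KEY"])
```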
    

## Monitoring and Evaluation

  • WandB Dashboard: If you're using WandB, monitor your training progress in real-time on your WandB project dashboard.
  • Hugging Face Hub: Once the training is complete, your fine-tuned model will be automatically pushed to your Hugging Face Hub profile under the repository name specified in the script (repo_name).

## Using Your Fine-tuned Model

You can then use your fine-tuned model for various downstream tasks using the Hugging Face Transformers library:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "your_username/modernbert-base-dutch"  # Replace with your username and repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Use the model for inference, e.g., filling in masked tokens
inputs = tokenizer("Het weer is vandaag [MASK].", return_tensors="pt")
outputs = model(**inputs)

# Decode the most likely token at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```
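
For quick experiments, the fill-mask pipeline wraps the same steps (tokenization, forward pass, decoding) in a single call. A short example, assuming the same model name as above:

```python
from transformers import pipeline

# The fill-mask pipeline returns the top candidate tokens for the [MASK] position.
fill_mask = pipeline("fill-mask", model="your_username/modernbert-base-dutch")
for prediction in fill_mask("Het weer is vandaag [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```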

## Tips and Considerations

  • GPU Memory: Fine-tuning with long sequences (chunk_size up to 8192) is memory-intensive. Adjust chunk_size, per_device_train_batch_size, and gradient_accumulation_steps to fit your GPU's memory; see the sketch after this list for the resulting effective batch size.
  • Dataset Size: The script is designed for large, streaming datasets. Adjust estimated_dataset_size_in_rows if you're using a smaller dataset.
  • Hyperparameter Tuning: Experiment with different hyperparameters (learning rate, masking probabilities, etc.) to find the best settings for your task.
  • Evaluation: The script performs periodic evaluations. You can customize the evaluation frequency using eval_interval.
  • Saving: The script automatically saves intermediate and final models to the Hugging Face Hub. You can adjust the saving frequency using save_interval.
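
When tuning the memory-related parameters, it helps to compute the effective batch size explicitly. With the defaults above on a single GPU, a quick back-of-the-envelope calculation:

```python
# Effective batch size per optimizer step, using the default settings.
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_gpus = 1
chunk_size = 8192

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = effective_batch_size * chunk_size

print(effective_batch_size)  # 32 sequences per optimizer step
print(tokens_per_step)       # 262144 tokens per optimizer step (upper bound)
```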

## Troubleshooting

  • CUDA Errors: If you hit out-of-memory or other CUDA errors, reduce per_device_train_batch_size or chunk_size; you can raise gradient_accumulation_steps to keep the effective batch size constant.
  • Shape Errors: The StableDataCollator is designed to handle most shape-related issues. If you encounter any, ensure your dataset is properly formatted and that you're using the latest version of the transformers library.
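
If you suspect a formatting problem, a quick way to inspect a streaming dataset is to look at the first example and confirm it exposes the expected text field. This assumes the dataset has a default configuration and stores its documents under a "text" column, as FineWeb-style datasets do:

```python
from datasets import load_dataset

# Stream the first example without downloading the whole dataset.
stream = load_dataset("ssmits/fineweb-2-dutch", split="train", streaming=True)
first_example = next(iter(stream))
print(first_example.keys())
print(first_example.get("text", "")[:200])
```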

This guide provides a comprehensive overview of how to use the provided code to fine-tune ModernBERT. Remember to adapt the instructions and parameters to your specific needs and dataset. Good luck!
