
NusaBERT Base

NusaBERT Base is a multilingual encoder-based language model built on the BERT architecture. We performed continued pre-training on the open-source corpora sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX. On a held-out subset of the corpus, our model achieved:

  • eval_accuracy: 0.6866
  • eval_loss: 1.4876
  • perplexity: 4.4266
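As a sanity check, the reported perplexity is simply the exponential of the masked-language-modeling eval loss; the snippet below (a minimal sketch, not part of the training code) reproduces the figure:

```python
import math

# Perplexity is exp(cross-entropy loss); the reported numbers agree.
eval_loss = 1.4876
perplexity = math.exp(eval_loss)
print(round(perplexity, 4))  # 4.4266
```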

This model was trained using the 🤗 Transformers framework with PyTorch. All training was done on an NVIDIA H100 GPU. LazarusNLP/NusaBERT-base is released under the Apache 2.0 license.

Model Details

  • Developed by: LazarusNLP
  • Finetuned from: IndoBERT base p1
  • Model type: Encoder-based BERT language model
  • Language(s): Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
  • License: Apache 2.0
  • Contact: LazarusNLP

Use in 🤗Transformers

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "LazarusNLP/NusaBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
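For quick experimentation, the model can also be used through the fill-mask pipeline. The example sentence below is an illustrative assumption (Indonesian for "The capital of Indonesia is [MASK]."); `[MASK]` is BERT's standard mask token:

```python
from transformers import pipeline

# Load the checkpoint into a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")

# Hypothetical example sentence; any text containing [MASK] works.
for prediction in fill_mask("Ibu kota Indonesia adalah [MASK]."):
    print(prediction["token_str"], prediction["score"])
```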

Training Datasets

Around 16B tokens from the corpora listed above (sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX) were used during pre-training.

Training Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0003
  • train_batch_size: 256
  • eval_batch_size: 256
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 24000
  • training_steps: 500000
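The `linear` scheduler with warmup ramps the learning rate from 0 to its peak over the warmup steps, then decays it linearly to 0 by the final step. A minimal sketch of that schedule, using the hyperparameters above (this mirrors the behavior of Transformers' linear scheduler; it is not the training code itself):

```python
def linear_schedule_lr(step, base_lr=3e-4, warmup_steps=24_000, total_steps=500_000):
    """Linear warmup to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(linear_schedule_lr(12_000))   # halfway through warmup: 1.5e-4
print(linear_schedule_lr(24_000))   # peak learning rate: 3e-4
print(linear_schedule_lr(500_000))  # end of training: 0.0
```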

Framework versions

  • Transformers 4.37.2
  • Pytorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.1

Credits

NusaBERT Base is developed with love by Wilson Wongso, David Samuel Setiawan, Steven Limcorn, and Ananto Joyoadikusumo.

Citation

@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural}, 
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}