language:
- id
inference: false
tags:
- BERT
- HPLT
- encoder
license: apache-2.0
datasets:
- HPLT/hplt_monolingual_v1_2
HPLT Bert for Indonesian
This is one of the encoder-only monolingual language models trained as a first release by the HPLT project. It is a so-called masked language model. In particular, we used a modification of the classic BERT model named LTG-BERT.
A monolingual LTG-BERT model is trained for every major language in the HPLT 1.2 data release (75 models total).
All the HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup:
- hidden size: 768
- attention heads: 12
- layers: 12
- vocabulary size: 32768
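To give a feel for what these numbers imply, here is a small arithmetic sketch. It assumes the hidden size is split evenly across attention heads, as in the standard BERT-base setup, and that the token embedding is a single matrix with one row per vocabulary entry. Neither assumption is stated explicitly in this card for LTG-BERT, and the result is not a full parameter count.

```python
# Hyper-parameters as listed in the model card.
HIDDEN_SIZE = 768
ATTENTION_HEADS = 12
LAYERS = 12
VOCAB_SIZE = 32768

# Assumption: the hidden size is divided evenly among the attention heads.
head_dim = HIDDEN_SIZE // ATTENTION_HEADS

# Assumption: one token embedding matrix of shape (vocabulary, hidden size).
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

print(f"per-head dimension: {head_dim}")
print(f"token embedding parameters: {embedding_params:,}")
```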
Every model uses its own tokenizer trained on language-specific HPLT data. See sizes of the training corpora, evaluation results and more in our language model training report.
The training statistics of all 75 runs
Example usage
This model currently needs a custom wrapper from modeling_ltgbert.py, so you should load the model with trust_remote_code=True.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
# trust_remote_code=True is required to load the custom LTG-BERT wrapper
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", trust_remote_code=True)
# Look up the id of the mask token, then tokenize a sentence containing one mask
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
# Replace only the masked positions with the most likely predicted token
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)
# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))
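The torch.where step in the example keeps every original token and substitutes the model's prediction only where the input holds the mask token. The sketch below reproduces that selection logic with plain Python lists so it can be followed without downloading the model. The token ids and the fill_masks helper are hypothetical and chosen purely for illustration.

```python
def fill_masks(input_ids, predicted_ids, mask_id):
    """Keep original tokens; use the prediction only at mask positions."""
    return [
        pred if tok == mask_id else tok
        for tok, pred in zip(input_ids, predicted_ids)
    ]

# Hypothetical ids: 4 stands in for the mask token.
MASK_ID = 4
input_ids = [1, 17, 23, 4, 9, 2]
# Stand-in for logits.argmax(-1): one predicted id per position.
# The prediction at position 1 (99) differs from the input but is ignored,
# because that position is not masked.
predicted_ids = [1, 99, 23, 42, 9, 2]

filled = fill_masks(input_ids, predicted_ids, MASK_ID)
print(filled)  # [1, 17, 23, 42, 9, 2]
```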
The following classes are currently implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering and AutoModelForMultipleChoice.
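As a rough sketch of how one of the other classes might be loaded, the snippet below wraps the sequence classification variant in a helper function. The function name load_sequence_classifier and the default num_labels value are hypothetical. The model id is the one used in the usage example, and trust_remote_code=True is passed for the same reason as there.

```python
MODEL_ID = "HPLT/hplt_bert_base_en"  # id taken from the usage example

def load_sequence_classifier(model_id=MODEL_ID, num_labels=2):
    """Hypothetical helper: load the tokenizer and a classification model."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        trust_remote_code=True,  # needed for the custom LTG-BERT wrapper
        num_labels=num_labels,   # size of the classification head
    )
    return tokenizer, model
```

Calling load_sequence_classifier(num_labels=3) would download the weights and return a tokenizer together with a model that has a three-way classification head. Since the released checkpoint was trained for masked language modelling, that head would presumably need fine-tuning on labelled data before its predictions are meaningful.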
Cite us
@misc{degibert2024new,
title={A New Massive Multilingual Dataset for High-Performance Language Technologies},
author={Ona de Gibert and Graeme Nail and Nikolay Arefyev and Marta Bañón and Jelmer van der Linde and Shaoxiong Ji and Jaume Zaragoza-Bernabeu and Mikko Aulamo and Gema Ramírez-Sánchez and Andrey Kutuzov and Sampo Pyysalo and Stephan Oepen and Jörg Tiedemann},
year={2024},
eprint={2403.14009},
archivePrefix={arXiv},
primaryClass={cs.CL}
}