IndoBERT (Indonesian BERT Model)
Model description
IndoBERT is a pre-trained language model based on BERT architecture for the Indonesian Language.
This model is base-uncased version which use bert-base config.
Intended uses & limitations
How to use
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
tokenizer.encode("hai aku mau makan.")
[2, 8078, 1785, 2318, 1946, 18, 4]
Training data
This model was pre-trained on 16 GB of raw text ~2 B words from Oscar Corpus (https://oscar-corpus.com/).
This model is equal to bert-base model which has 32,000 vocabulary size.
Training procedure
The training of the model has been performed using Google’s original Tensorflow code on eight core Google Cloud TPU v2. We used a Google Cloud Storage bucket, for persistent storage of training data and models.
Eval results
We evaluate this model on three Indonesian NLP downstream task:
- some extractive summarization model
- sentiment analysis
- Part-of-Speech Tagger it was proven that this model outperforms multilingual BERT for all downstream tasks.
- Downloads last month
- 82