---
language: id
datasets:
- oscar
---
# IndoBERT (Indonesian BERT Model)
## Model description
IndoBERT is a pre-trained language model for Indonesian based on the BERT architecture.
This is the base-uncased variant, which uses the bert-base configuration.
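As a quick reference, the main architecture hyperparameters can be inspected from the model configuration. This is a minimal sketch; the values noted in the comment assume standard bert-base defaults.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sarahlintang/IndoBERT")
# bert-base defaults: 12 layers, hidden size 768, 12 attention heads
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```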
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

# "hai aku mau makan." ~ "hi, I want to eat."
tokenizer.encode("hai aku mau makan.")
# [2, 8078, 1785, 2318, 1946, 18, 4]
```
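Beyond encoding token IDs, the model can produce contextual embeddings for a sentence. The following is a minimal sketch using the standard transformers API (recent versions); the expected output shape assumes the bert-base hidden size of 768.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

inputs = tokenizer("hai aku mau makan.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 7, 768])
```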
## Training data
This model was pre-trained on 16 GB of raw text (~2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).
The model follows the bert-base architecture and has a vocabulary size of 32,000.
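As a sanity check, the vocabulary size can be read directly off the tokenizer; this short sketch assumes the tokenizer loads as shown earlier.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
print(tokenizer.vocab_size)  # expected: 32000
```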
## Training procedure
The model was trained using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
## Eval results
We evaluated this model on three Indonesian NLP downstream tasks:
- extractive summarization
- sentiment analysis
- part-of-speech (POS) tagging
On these benchmarks, IndoBERT outperformed multilingual BERT on all three tasks.
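For readers who want to reproduce a downstream setup, the following is a minimal sketch of how the sentiment task could be framed as sequence classification. The two-class label scheme, the example sentence, and its label are assumptions for illustration, not details from the original evaluation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
# A fresh classification head is added on top of the pre-trained encoder;
# the two-class (negative/positive) scheme here is an assumption.
model = AutoModelForSequenceClassification.from_pretrained(
    "sarahlintang/IndoBERT", num_labels=2
)

# "makanan ini enak sekali." ~ "this food is very tasty." (made-up example)
inputs = tokenizer("makanan ini enak sekali.", return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive (hypothetical label mapping)

outputs = model(**inputs, labels=labels)
print(outputs.loss)  # cross-entropy loss to minimize during fine-tuning
```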