roberta-base-bahasa-cased

Pretrained RoBERTa base language model for Malay.

Pretraining Corpus

The roberta-base-bahasa-cased model was pretrained on ~400 million words. The training data is listed below:

  1. IIUM confession, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  2. local Instagram, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  3. local news, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  4. local parliament hansards, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  5. local research papers related to kebudayaan (culture), keagamaan (religion) and etnik (ethnicity), https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  6. local Twitter, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  7. Malay Wattpad, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean
  8. Malay Wikipedia, https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean

Pretraining details

Example using AutoModelForMaskedLM

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the pretrained masked language model and its tokenizer from the Hub.
model = AutoModelForMaskedLM.from_pretrained('mesolitica/roberta-base-bahasa-cased')
tokenizer = AutoTokenizer.from_pretrained(
    'mesolitica/roberta-base-bahasa-cased',
    do_lower_case=False,  # the model is cased, so keep the original casing
)

# Build a fill-mask pipeline and predict the token hidden behind <mask>.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill_mask('Permohonan Najib, anak untuk dengar isu perlembagaan <mask> .')

The output is:

[{'score': 0.3368818759918213,
  'token': 746,
  'token_str': ' negara',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan negara.'},
 {'score': 0.09646568447351456,
  'token': 598,
  'token_str': ' Malaysia',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan Malaysia.'},
 {'score': 0.029483484104275703,
  'token': 3265,
  'token_str': ' UMNO',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan UMNO.'},
 {'score': 0.026470622047781944,
  'token': 2562,
  'token_str': ' parti',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan parti.'},
 {'score': 0.023237623274326324,
  'token': 391,
  'token_str': ' ini',
  'sequence': 'Permohonan Najib, anak untuk dengar isu perlembagaan ini.'}]
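
The pipeline returns a list of dictionaries, one per candidate token. As a minimal sketch of how that list might be post-processed, the snippet below ranks the candidates and sums their scores. The predictions are copied from the output above (the 'sequence' field is omitted for brevity), and the helper name rank_candidates is hypothetical, not part of transformers.

```python
# Predictions copied from the fill-mask output above ('sequence' omitted).
predictions = [
    {'score': 0.3368818759918213, 'token': 746, 'token_str': ' negara'},
    {'score': 0.09646568447351456, 'token': 598, 'token_str': ' Malaysia'},
    {'score': 0.029483484104275703, 'token': 3265, 'token_str': ' UMNO'},
    {'score': 0.026470622047781944, 'token': 2562, 'token_str': ' parti'},
    {'score': 0.023237623274326324, 'token': 391, 'token_str': ' ini'},
]


def rank_candidates(preds):
    """Hypothetical helper: return (token, score) pairs, highest score first.

    RoBERTa-style tokens carry a leading space, so it is stripped here.
    """
    ranked = sorted(preds, key=lambda p: p['score'], reverse=True)
    return [(p['token_str'].strip(), p['score']) for p in ranked]


ranked = rank_candidates(predictions)
best_token, best_score = ranked[0]

# Total probability mass covered by the five returned candidates.
covered = sum(score for _, score in ranked)

print(f'best candidate: {best_token} ({best_score:.3f})')
print(f'probability mass covered by top 5: {covered:.3f}')
```

Here the five candidates together account for roughly half of the probability mass, so the model is fairly uncertain about this mask even though ' negara' is the clear top choice.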