t5-small-bahasa-cased

A T5 small model pretrained for Malay, covering both standard and local language.

Pretraining Corpus

The t5-small-bahasa-cased model was pretrained on multiple tasks. Below is the list of tasks we trained on:

  1. Language masking task on bahasa news, bahasa Wikipedia, bahasa Academia.edu, bahasa parliament and translated The Pile.
  2. News title prediction on bahasa news.
  3. Next sentence prediction on bahasa news, bahasa Wikipedia, bahasa Academia.edu, bahasa parliament and translated The Pile.
  4. Question answering on translated Natural QA.
  5. Text Similarity task on translated SNLI and translated MNLI.
  6. EN-MS translation.
  7. MS-EN translation.
  8. Abstractive Summarization.
  9. Knowledge Graph triples generation.
  10. Paraphrase.
  11. Social media normalization.
  12. Noisy EN-MS translation.
  13. Noisy MS-EN translation.

The preparation steps can be reproduced at https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare

Pretraining details

Supported prefixes

  1. soalan: {string}, trained using Natural QA.
  2. ringkasan: {string}, for abstractive summarization.
  3. tajuk: {string}, for abstractive title generation.
  4. parafrasa: {string}, for abstractive paraphrase.
  5. terjemah Inggeris ke Melayu: {string}, for EN-MS translation.
  6. terjemah Melayu ke Inggeris: {string}, for MS-EN translation.
  7. grafik pengetahuan: {string}, for MS text to EN Knowledge Graph triples format.
  8. ayat1: {string1} ayat2: {string2}, for semantic similarity.
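
The prefixes listed above can be applied with a small helper like the one below. This is a minimal sketch, not the authors' code: the task keys in PREFIXES and the function names build_input, build_similarity_input and generate are hypothetical, and loading the checkpoint through the Hugging Face transformers library under the id mesolitica/t5-small-bahasa-cased is an assumption. Only the prefix strings themselves come from the list.

```python
# Prefix templates copied from the supported prefix list.
# The dictionary keys are hypothetical labels chosen for this sketch.
PREFIXES = {
    "qa": "soalan: {text}",
    "summarization": "ringkasan: {text}",
    "title": "tajuk: {text}",
    "paraphrase": "parafrasa: {text}",
    "en_ms": "terjemah Inggeris ke Melayu: {text}",
    "ms_en": "terjemah Melayu ke Inggeris: {text}",
    "knowledge_graph": "grafik pengetahuan: {text}",
}


def build_input(task, text):
    """Return the input string for a single-text task."""
    if task not in PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return PREFIXES[task].format(text=text)


def build_similarity_input(text1, text2):
    """Return the two-sentence input used for semantic similarity."""
    return f"ayat1: {text1} ayat2: {text2}"


def generate(prompt, model_id="mesolitica/t5-small-bahasa-cased", max_length=256):
    """Run one prompt through the model.

    Assumption: the checkpoint loads with the standard transformers T5
    classes. The import is done lazily so the helpers above work without
    transformers installed.
    """
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = T5ForConditionalGeneration.from_pretrained(model_id)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, build_input("en_ms", "Good morning") produces the string "terjemah Inggeris ke Melayu: Good morning", which can then be passed to generate.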