Indonesian GPT-2 finetuned on Indonesian academic journals

This is the Indonesian gpt2-small model fine-tuned to abstracts of Indonesian academic journals. All training was done on a TPUv2-8 VM sponsored by TPU Research Cloud.

The demo can be found here.

How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='Galuh/id-journal-gpt2')
>>> set_seed(42)
>>> generator("Penelitian ini menggunakan teknik DNA barcoding untuk", max_length=30, num_return_sequences=5)

[{'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk mendeteksi perubahan genetik bakteri pada udang windu. Empat tahap telah dilakukan, meliputi preparasi media untuk larva,'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk identifikasi gen pengasil flavonoid.  Data yang diperoleh dari hasil PCR diidentifikasi dengan teknik sekuensing'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk mengekstraksi fragmen DNA dari sampel kulit buaya dan tulang anjing, di mana proses ini melibatkan karakterisasi enzim yang'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk melakukan transformasi. Tahapan transformasi meliputi seleksi sel dengan urutan (2, 8, 16,..., 18) dan'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk amplifikasi genom DNA dengan menggunakan primer TG8226 dan TG806. Metode pol'}]

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('Galuh/id-journal-gpt2')
model = GPT2Model.from_pretrained('Galuh/id-journal-gpt2')
text = "Ubah dengan teks apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('Galuh/id-journal-gpt2')
model = TFGPT2Model.from_pretrained('Galuh/id-journal-gpt2')
text = "Ubah dengan teks apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Limitations and bias

This model is originally the Indonesian gpt2-small model, thus this model is also subject to the same limitations and bias as the original model. More detailed bias and analysis on this specific model is coming soon.

Training data

The model was trained on a dataset of Indonesian journals. We only trained this model on the abstracts. We extract the abstract by writing a script to find any text that is located between the word "Abstrak" (abstract) and "Kata kunci" (keywords). The extraction script can be found here. To separate each abstract, we also add an end of text token (<|endoftext|>) between each abstract.

The information of the sub-dataset and the distribution of the training and evaluation dataset are as follows:

split	count	percentage
train	146,248	90%
validation	16,250	10%

Training procedure

The model was trained on a TPUv2-8 VM provided by TPU Research Cloud. The training duration was 2h 30m 57s.

Evaluation results

The model achieves the following results without any fine-tuning (zero-shot):

dataset	train loss	eval loss	eval perplexity
Indonesian journals dataset (abstract only)	2.913	2.855	17.37

Tracking

The training process was tracked in TensorBoard.

Galuh
/

id-journal-gpt2