gpt2-azerbaijani-smallv0 model for text generation

Introduction

gpt2-azerbaijani-smallv0 is a language model for Azerbaijani text generation, based on the GPT-2 small model.

It was trained on Azerbaijani Wikipedia using transfer learning and fine-tuning, in about 29 hours on a single GPU (1 x NVIDIA Tesla K80).

Model

Model	#params	Model file (pt)	Architecture	Training / validation data (text)
gpt2-azerbaijani-smallv0	124M	652	GPT-2 small	Azerbaijani Wikipedia (110k articles / 19k articles)
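
The 124M figure in the table can be reproduced from the architecture alone. The sketch below is a back-of-the-envelope count, not a measurement of this checkpoint: it assumes the standard GPT-2 small configuration (vocabulary of 50257 tokens, 1024 positions, hidden size 768, 12 layers) and an output layer that shares its weights with the token embedding. The function name `gpt2_param_count` is made up for illustration.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT-2 style decoder (assumed standard layout)."""
    # Token and position embedding tables.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Each LayerNorm has a scale and a bias vector.
    layer_norm = 2 * n_embd
    # Attention: fused query/key/value projection plus the output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    # Final LayerNorm; the LM head is assumed tied to the token embedding,
    # so it adds no parameters of its own.
    return embeddings + n_layer * block + layer_norm


total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

If the checkpoint uses a different vocabulary size, only the `vocab_size * n_embd` term changes.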

Epochs - 3, loss - 5.17, accuracy - 23.99%, perplexity - 95.88

How to use gpt2-azerbaijani-smallv0 with HuggingFace (PyTorch)

The following code uses PyTorch.

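import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the tokenizer and cap inputs at GPT-2's context length.
tokenizer = AutoTokenizer.from_pretrained("nijatzeynalov/gpt2-azerbaijani-small")
tokenizer.model_max_length = 1024

# Load the fine-tuned weights from a local checkpoint file and apply them
# on top of the GPT-2 small architecture.
model_state_dict = torch.load('GPT2_pt_3epoch_lr2e-3.pth', map_location=torch.device('cpu'))
model = GPT2LMHeadModel.from_pretrained('gpt2', state_dict=model_state_dict)

model.eval()  # disable dropout for inference

text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")

# Sample one continuation with top-k sampling.
sample_outputs = model.generate(inputs.input_ids,
                                attention_mask=inputs.attention_mask,
                                pad_token_id=50256,  # end-of-text id in the original GPT-2 vocabulary
                                do_sample=True,
                                max_length=20,  # total length in tokens, prompt included
                                top_k=10,
                                num_return_sequences=1)

# Print each generated sequence.
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))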
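
The `do_sample=True` and `top_k=10` arguments above ask for top-k sampling: at each step only the k most likely next tokens are kept, and one of them is drawn at random in proportion to its probability. The following is a minimal standard-library sketch of that idea on a toy vocabulary. It is not the library's implementation, and `top_k_sample` and the logit values are made up for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Draw one token id from the k highest-scoring entries of `logits`."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits; everything else is discarded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (max subtracted for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # random.Random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


# Toy "vocabulary" of 6 tokens with made-up scores.
toy_logits = [2.0, 0.5, 1.5, -1.0, 3.0, 0.0]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only ids among the three highest logits can appear
```

A smaller k makes the output more conservative (k=1 is greedy decoding), while a larger k allows more varied but less predictable text.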

Bias

The training data used for this model comes from Azerbaijani Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

Limitations

This model was developed for research into applying the GPT-2 model to the Azerbaijani language. Because of resource limitations, the results it produces are of very low quality, and the current version is not recommended for use in commercial projects.

My current resources are limited, but I plan to return to this model and improve the results:

Add more training data in Azerbaijani: I plan to find and add 500k+ articles from various sources, not just Wikipedia.
Clean the training dataset better: currently, due to lack of resources, hardly any cleaning is done.
Run different experiments on a more powerful GPU: so far, only the 1cycle policy has been tested as a fine-tuning technique.
Increase the number of epochs: with the current GPU (1 x NVIDIA Tesla K80), one epoch takes about 9 hours ($0.90/hr). Considering the goal of the project and the resources available, I found it acceptable to stop at 3 epochs.
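
For readers unfamiliar with the 1cycle policy mentioned in the list above: the learning rate rises from a small value to a peak during a warm-up phase, then decays to a much smaller value for the rest of training. The sketch below shows one common cosine-annealed form of that schedule. It is an illustration under assumptions, not the training code used for this model: the function name `one_cycle_lr`, the `pct_start`, `div` and `div_final` defaults, and the peak of 2e-3 (guessed from the checkpoint file name) are not confirmed settings.

```python
import math


def one_cycle_lr(step, total_steps, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at `step` for a cosine-annealed one-cycle schedule (illustrative)."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must be in [0, total_steps]")
    if not 0.0 < pct_start < 1.0:
        raise ValueError("pct_start must be strictly between 0 and 1")

    def cos_anneal(start, end, pct):
        # Moves smoothly from `start` (pct=0) to `end` (pct=1) along a half cosine.
        return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        # Phase 1: warm up from lr_max / div to lr_max.
        return cos_anneal(lr_max / div, lr_max, step / warmup_steps)
    # Phase 2: anneal from lr_max down to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final,
                      (step - warmup_steps) / (total_steps - warmup_steps))


# Print the schedule at a few points of a hypothetical 100-step run.
for step in (0, 10, 25, 50, 100):
    print(step, one_cycle_lr(step, total_steps=100, lr_max=2e-3))
```

Comparing this policy against alternatives, such as a constant learning rate or linear warm-up with linear decay, is the kind of experiment the list above refers to.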

Author

Azerbaijani GPT-2 small was trained and evaluated by Nijat Zeynalov.