Turkish Diacritization
This project explores Turkish language processing, particularly for social media text, by investigating diacritization and developing techniques for adding diacritical marks to text.
Path Design
.
├── docs
│   └── Project Proposal.pdf
├── ner
│   ├── new_df.csv
│   ├── process_data.ipynb
│   ├── named-entity-recognition.ipynb
│   └── getting_B.py
├── plots
├── tools
│   └── data_utils.py
├── test
│   └── test-turkish-t5.ipynb
├── train
│   ├── llm-fine-tune-t5-transformer.ipynb
│   └── llm-fine-tune.ipynb
└── README.md
Dataset
The original training dataset is available at the link.
The test dataset is available at the link.
We generated negative sentences from the original sentences by randomly mapping some letters to another letter. We used these negative sentences to build an augmented dataset, and we trained our model on that augmented dataset.
Character Mapping is as follows:
character_mapping = {
    'ı': 'i',
    'i': 'ı',
    'u': 'ü',
    'ü': 'u',
    'o': 'ö',
    'ö': 'o',
    'ç': 'c',
    'c': 'ç',
    'ğ': 'g',
    'g': 'ğ',
    's': 'ş',
    'ş': 's'
}
The augmented dataset is available at the link.
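The negative-sentence generation described above could be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `make_negative`, the `swap_prob` parameter and its value, and the `input`/`target` pair layout are all assumptions. Only the mapping itself comes from this README.

```python
import random

# Mapping from this README: each letter swaps with its diacritic counterpart.
CHARACTER_MAPPING = {
    'ı': 'i', 'i': 'ı',
    'u': 'ü', 'ü': 'u',
    'o': 'ö', 'ö': 'o',
    'ç': 'c', 'c': 'ç',
    'ğ': 'g', 'g': 'ğ',
    's': 'ş', 'ş': 's',
}


def make_negative(sentence, swap_prob=0.3, rng=None):
    """Return a corrupted copy of `sentence`.

    Each mappable character is swapped with probability `swap_prob`
    (an assumed value; the README does not state the rate that was used).
    """
    rng = rng or random.Random()
    out = []
    for ch in sentence:
        if ch in CHARACTER_MAPPING and rng.random() < swap_prob:
            out.append(CHARACTER_MAPPING[ch])
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical training pair: corrupted text as input, original text as target.
original = "çocuk okula gidiyor"
negative = make_negative(original, swap_prob=0.5, rng=random.Random(0))
pair = {"input": negative, "target": original}
```

Because every swap replaces one character with exactly one other character, the corrupted sentence keeps the same length and word boundaries as the original. The mapping only covers lowercase letters, so uppercase characters pass through unchanged in this sketch.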
NER
Why NER?
After the diacritization process, we looked at our results and saw that our model does not handle capital letters. We therefore decided to add an NER layer to our transformer model. The plan was to use a BiLSTM-CRF NER model to detect named entities and to use this information to improve our diacritization model.
Dataset
First, we downloaded two Kaggle datasets and processed them into a format suitable for a BiLSTM-CRF NER model. The processed NER dataset is available at the link.
However, the BiLSTM-CRF model that we trained did not work well, so we needed to use a pretrained BERT model instead.
NER Model
The model is a Turkish BERT classification model trained on a Turkish NER dataset. We used this model to detect named entities in our text.
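One plausible way to use the NER output for the capitalization problem is to capitalize every word that the NER model tags as part of an entity. The sketch below assumes exactly that; the README does not describe how the two models are combined. The function names and the BIO-style tag strings (`B-PER`, `B-LOC`, `O`) are hypothetical. It also handles the Turkish dotted and dotless i, which Python's default `str.upper()` does not map the Turkish way.

```python
def tr_upper(ch):
    """Uppercase one character using Turkish casing for i and ı."""
    special = {'i': 'İ', 'ı': 'I'}
    return special.get(ch, ch.upper())


def capitalize_entities(words, tags):
    """Capitalize each word whose NER tag is not 'O' (outside any entity).

    `words` and `tags` are parallel lists; the tag scheme is an assumption.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    result = []
    for word, tag in zip(words, tags):
        if tag != "O" and word:
            result.append(tr_upper(word[0]) + word[1:])
        else:
            result.append(word)
    return result


# Hypothetical diacritized output and NER tags for one sentence.
words = "ahmet dün istanbul'a gitti".split()
tags = ["B-PER", "O", "B-LOC", "O"]
restored = " ".join(capitalize_entities(words, tags))
```

Here `restored` becomes "Ahmet dün İstanbul'a gitti". Sentence-initial capitalization is not covered by this sketch and would need a separate rule.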
Model
In this project we tried two different transformer tasks: causal LM with BERT and Seq2Seq with T5. We decided to continue with the T5 model because it gave better results.
BERT Model
We used a BERT model for causal language modeling. We prepared our dataset for this task and fine-tuned a pretrained BERT model. The BERT model is available at the link.
T5 Model
We used a T5 model for the seq2seq task. We prepared our dataset for this task and fine-tuned a pretrained T5 model, which is available at the link. Our resulting T5 model is on Kaggle, and two versions of it can be downloaded from the link.
V1.0
This version of the model was half trained, on 1 million samples, without the missing tokens. It works well, but its results show some issues caused by the missing tokens.
V2.0
This version of the model was trained on 2 million samples, with the missing tokens. It works very well, but it needs the NER model to improve its performance.
Training Arguments
We used the following training arguments for our model.
import transformers

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=25,
    num_train_epochs=1,
    warmup_steps=50,
    weight_decay=0.01,
    learning_rate=2e-3,
    save_steps=10000,
    logging_steps=10,
    save_strategy='steps',
    output_dir="/kaggle/working/turkish2",
    lr_scheduler_type="cosine",
)
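To make the effect of `warmup_steps=50`, `learning_rate=2e-3`, and `lr_scheduler_type="cosine"` concrete, the sketch below computes the learning rate at a given step using linear warmup followed by a half-cosine decay to zero. This is an illustration of the schedule shape, not the library's code. The function name `lr_at_step` is hypothetical, and the total of 80,000 steps is an assumption (2 million samples at batch size 25 on a single device for one epoch).

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-3, warmup_steps=50):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup training that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Assumed total: 2,000,000 samples / batch size 25, single device, 1 epoch.
TOTAL_STEPS = 80_000
schedule = {s: lr_at_step(s, TOTAL_STEPS) for s in (0, 25, 50, 40_025, TOTAL_STEPS)}
```

Under these assumptions the rate climbs to its peak of 2e-3 at step 50, falls to half of that at the midpoint of the post-warmup phase (step 40,025), and reaches zero at the final step.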
Training Plots
The loss and learning rate plots for the T5 model are given below.
Model Evaluation
We evaluated our model on the provided test dataset. During testing, we also added the NER model to our pipeline: it detects named entities in the text, and we used this information to improve the diacritization output. The evaluation function is as follows:
def acc_overall(test_result, testgold):
    """Return the fraction of gold words that were diacritized correctly."""
    correct = 0
    total = 0
    # Count the number of correctly diacritized words.
    for i in range(len(testgold)):
        gold_words = testgold[i].split()
        pred_words = test_result[i].split()
        for m in range(len(gold_words)):
            # A missing predicted word counts as an error instead of raising IndexError.
            if m < len(pred_words) and pred_words[m] == gold_words[m]:
                correct += 1
            total += 1
    return correct / total
Our model's accuracy on the test dataset is 94.03%, so we can say that it works well.
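As a small worked example of this word-level metric, the sketch below scores two toy sentences and also collects the mismatched words, which can help when inspecting errors. The helper `word_accuracy` and the example sentences are hypothetical and are not taken from the test dataset.

```python
def word_accuracy(predictions, references):
    """Return (accuracy, mismatches) over position-aligned words.

    mismatches is a list of (gold_word, predicted_word_or_None) pairs.
    """
    correct = 0
    total = 0
    mismatches = []
    for pred, gold in zip(predictions, references):
        pred_words = pred.split()
        for idx, gold_word in enumerate(gold.split()):
            total += 1
            got = pred_words[idx] if idx < len(pred_words) else None
            if got == gold_word:
                correct += 1
            else:
                mismatches.append((gold_word, got))
    accuracy = correct / total if total else 0.0
    return accuracy, mismatches


# Toy data: one word in the second sentence is missing its diacritic.
gold = ["çocuk okula gidiyor", "bugün hava çok güzel"]
pred = ["çocuk okula gidiyor", "bugun hava çok güzel"]
accuracy, errors = word_accuracy(pred, gold)
```

Six of the seven gold words match, so `accuracy` is 6/7 (about 0.857), and `errors` holds the single pair `("bugün", "bugun")`.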