---
language:
- tr
library_name: transformers
---
# Turkish Diacritization
The goal of this project is to improve the processing of Turkish text, particularly text from social media, by investigating diacritization and developing techniques for restoring diacritical marks in text.
# Project Structure
```
.
├── docs
│   └── Project Proposal.pdf
├── ner
│   ├── new_df.csv
│   ├── process_data.ipynb
│   ├── named-entity-recognition.ipynb
│   └── getting_B.py
├── plots
├── tools
│   └── data_utils.py
├── test
│   └── test-turkish-t5.ipynb
├── train
│   ├── llm-fine-tune-t5-transformer.ipynb
│   └── llm-fine-tune.ipynb
└── README.md
```
# Dataset
You can access [the original training dataset](https://drive.google.com/file/d/1nR-HvWjrqDT2Sf6O6ScTUSldszRW0Shm/view?usp=share_link) from the link.
You can access [the test dataset](https://drive.google.com/file/d/1EK5FbVii8fYmqzY2WdQ6qo7-YfQqf1a9/view?usp=sharing) from the link.
We generated negative sentences from the original sentences by randomly mapping certain letters to their counterparts (i.e. adding or removing diacritics). These negative sentences form the augmented dataset that we used to train our model.
The character mapping is as follows:
```python
character_mapping = {
    'ı': 'i',
    'i': 'ı',
    'u': 'ü',
    'ü': 'u',
    'o': 'ö',
    'ö': 'o',
    'ç': 'c',
    'c': 'ç',
    'ğ': 'g',
    'g': 'ğ',
    's': 'ş',
    'ş': 's'
}
```
You can access [the augmented dataset](https://drive.google.com/file/d/1ndDUpLIm0G_BL-k1qsxpo21pS-EwnXLt/view?usp=sharing) from the link.
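The exact augmentation script lives in the training notebooks; a minimal sketch of how a negative sentence could be produced with the mapping above (the swap probability `p` is an assumption, not the value used in the project) might look like this:
```python
import random

def make_negative(sentence, mapping, p=0.5):
    # Randomly swap each mappable character with probability p
    # to simulate missing or wrong diacritics.
    return "".join(
        mapping[ch] if ch in mapping and random.random() < p else ch
        for ch in sentence
    )

# Example: "çalışıyorum" -> e.g. "calısiyorum"
print(make_negative("çalışıyorum", character_mapping))
```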
# NER
## Why NER?
After the diacritization process, when we inspected the results we saw that our model does not handle capital letters. So we decided to add an additional NER layer on top of our transformer model. The plan was to use a BiLSTM-CRF NER model to detect named entities and use this information to improve our diacritization output.
## Dataset
First, we downloaded [two Kaggle datasets](https://www.kaggle.com/datasets/cebeci/turkish-ner?select=Coarse_Grained_NER_No_NoiseReduction.csv) and processed them to make them suitable for the BiLSTM-CRF NER model. You can find [the processed NER dataset](https://drive.google.com/file/d/1v-Ye6aF8ruc-A7nq9BZZ2KOKE1qxqBPf/view?usp=sharing) at the link.
However, the BiLSTM-CRF model that we trained did not work well, so we switched to a pretrained BERT model.
## NER Model
The model is the [Turkish BERT Classification Model](https://huggingface.co/akdeniz27/bert-base-turkish-cased-ner), which is trained on a Turkish NER dataset. We used this model to detect named entities in our text.
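For reference, this model can be loaded with the standard `transformers` token-classification pipeline; the example sentence and aggregation strategy below are our own choices, not taken from the project notebooks:
```python
from transformers import pipeline

# Load the pretrained Turkish NER model from the Hugging Face Hub.
ner = pipeline(
    "ner",
    model="akdeniz27/bert-base-turkish-cased-ner",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

entities = ner("mustafa kemal 1919'da samsun'a çıktı.")
for ent in entities:
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))
```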
# Model
In this project we tried two different tasks for transformers: causal LM with BERT and seq2seq with T5. We decided to continue with the T5 model because of its better results.
## BERT Model
We used the BERT model for causal language modeling. We designed our dataset according to this task and fine-tuned a pretrained BERT model. You can find the model at this link: [BERT Model](https://huggingface.co/dbmdz/bert-base-turkish-cased).
## T5 Model
We used the T5 model for the seq2seq task. We designed our dataset according to this task and fine-tuned a pretrained T5 model. You can find the base model at this link: [T5 Model](https://huggingface.co/Turkish-NLP/t5-efficient-small-turkish). Our resulting T5 model is on Kaggle; you can download both versions from this link: [T5 Model](https://www.kaggle.com/models/emirhangazi/turkish-t5).
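A minimal inference sketch with the base checkpoint is shown below; to reproduce our results, swap in the fine-tuned weights downloaded from the Kaggle link above, and note that any task prefix used during fine-tuning (not shown here) would need to match the training setup:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: replace with the path to the fine-tuned weights from Kaggle.
model_name = "Turkish-NLP/t5-efficient-small-turkish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "bugun hava cok guzel"  # undiacritized input
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```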
### V1.0
This variant of the model was half trained with 1 million samples and without the missing tokens. It works reasonably well, but there are some issues in its results due to the missing tokens.
### V2.0
This variant of the model was trained with 2 million samples and with the missing tokens. It works very well, but it needs the NER model to improve its performance.
#### Training Arguments
We used the following training arguments for our model.
```python
import transformers

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=25,
    num_train_epochs=1,
    warmup_steps=50,
    weight_decay=0.01,
    learning_rate=2e-3,
    save_steps=10000,
    logging_steps=10,
    save_strategy='steps',
    output_dir="/kaggle/working/turkish2",
    lr_scheduler_type="cosine",
)
```
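These arguments are passed to a standard `transformers.Trainer`. A hedged sketch of that step is given below; `model`, `tokenizer`, and `train_dataset` are assumed to be the fine-tuned T5 model and the tokenized augmented dataset prepared in the training notebooks:
```python
# Sketch only: the exact setup lives in train/llm-fine-tune-t5-transformer.ipynb.
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```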
# Training Plots
The loss and learning rate plots for the T5 model are given below.



# Model Evaluation
We evaluated our model on the provided test dataset. During testing we also added the NER model to our pipeline: it detects named entities in the text, and we use this information to improve the diacritization output. The evaluation function is given below:
```python
def acc_overall(test_result, testgold):
    # Count the fraction of correctly diacritized words.
    correct = 0
    total = 0
    for i in range(len(testgold)):
        for m in range(len(testgold[i].split())):
            if test_result[i].split()[m] == testgold[i].split()[m]:
                correct += 1
            total += 1
    return correct / total
```
Our model's accuracy on the test dataset is <b>94.03%</b>, so we can say that the model works well.
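The exact NER post-processing heuristic is implemented in the test notebook; a simplified sketch of the idea, reusing the `ner` pipeline from the NER section to restore capitalization of named entities in the lowercase seq2seq output, might look like this:
```python
def capitalize_entities(text, ner_pipeline):
    # Upper-case the first character of each span the NER model tags
    # as a named entity (multi-word entities are handled only roughly here).
    chars = list(text)
    for ent in ner_pipeline(text):
        chars[ent["start"]] = chars[ent["start"]].upper()
    return "".join(chars)
```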