---
datasets:
- abdulhade/TextCorpusKurdish_asosoft
language:
- ku
- en
library_name: adapter-transformers
license: mit
metrics:
- accuracy
- bleu
- meteor
pipeline_tag: translation
---

# Kurdish-English Machine Translation with Transformers

This repository focuses on fine-tuning a Kurdish-English machine translation model using Hugging Face's `transformers` library with MarianMT. 
The model is trained on a custom parallel corpus with a detailed pipeline that includes data preprocessing, bidirectional training, evaluation, and inference.
This model is a product of the AI Center of Kurdistan University.

## Table of Contents

- [Introduction](#introduction)
- [Requirements](#requirements)
- [Setup](#setup)
- [Pipeline Overview](#pipeline-overview)
  - [Data Preparation](#data-preparation)
  - [Training SentencePiece Tokenizer](#training-sentencepiece-tokenizer)
  - [Model and Tokenizer Setup](#model-and-tokenizer-setup)
  - [Tokenization and Dataset Preparation](#tokenization-and-dataset-preparation)
  - [Training Configuration](#training-configuration)
  - [Evaluation and Metrics](#evaluation-and-metrics)
  - [Inference](#inference)
- [Results](#results)
- [License](#license)

## Introduction

This project fine-tunes a MarianMT model for Kurdish-English translation on a custom parallel corpus. Training is configured for bidirectional translation, enabling model use in both language directions.

## Requirements

- Python 3.8+
- Hugging Face Transformers
- Datasets library
- SentencePiece
- PyTorch 1.9+
- CUDA (for GPU support)

## Setup

1. Clone the repository and install dependencies.
2. Ensure GPU availability.
3. Prepare your Kurdish-English corpus in CSV format.
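
The dependencies above can be installed in one step. A minimal sketch based on the Requirements list (`sacrebleu` and `bert-score` are assumed here as backends for the BLEU and BERTScore evaluation steps; the repository may pin different packages or versions):

```shell
pip install transformers datasets sentencepiece torch sacrebleu bert-score
```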

## Pipeline Overview

### Data Preparation

1. **Corpus**: A Kurdish-English parallel corpus in CSV format with columns `Source` (Kurdish) and `Target` (English).
2. **Path Definition**: Specify the corpus path in the configuration.
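
A minimal sketch of loading the corpus with pandas (the in-memory CSV and its contents are illustrative; substitute the real corpus path from your configuration):

```python
import io

import pandas as pd

# Illustrative stand-in for the corpus file; the real pipeline reads a CSV
# from the configured path with the same `Source`/`Target` columns.
sample_csv = io.StringIO(
    "Source,Target\n"
    "سڵاو,Hello\n"
    "سوپاس,Thank you\n"
)

df = pd.read_csv(sample_csv)
df = df.dropna(subset=["Source", "Target"])  # drop incomplete pairs
pairs = list(zip(df["Source"], df["Target"]))
```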

### Training SentencePiece Tokenizer

- **Vocabulary Size**: 32,000
- **Source Data**: The tokenizer is trained on both the primary Kurdish corpus and the English dataset to create shared subword tokens.

### Model and Tokenizer Setup

- **Model**: `Helsinki-NLP/opus-mt-en-mul` pre-trained MarianMT model.
- **Tokenizer**: MarianMT tokenizer aligned with the model, with source and target languages set dynamically.

### Tokenization and Dataset Preparation

- **Train-Validation Split**: 90% train, 10% validation.
- **Maximum Sequence Length**: 128 tokens for both source and target sequences.
- **Bidirectional Tokenization**: Prepare tokenized sequences for both Kurdish-English and English-Kurdish translation.
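
The split and the bidirectional duplication can be sketched in plain Python (the toy pairs are illustrative; the real pipeline does this over the full corpus before tokenization):

```python
import random

# Toy parallel pairs (Source = Kurdish, Target = English).
pairs = [
    ("سڵاو", "Hello"), ("سوپاس", "Thank you"), ("بەیانی باش", "Good morning"),
    ("شەو باش", "Good night"), ("بەخێربێیت", "Welcome"),
    ("چۆنی؟", "How are you?"), ("باشم", "I am fine"),
    ("ماڵئاوا", "Goodbye"), ("بەڵێ", "Yes"), ("نەخێر", "No"),
]

# Bidirectional training: duplicate each pair in the reverse direction.
examples = pairs + [(tgt, src) for src, tgt in pairs]

# 90% train / 10% validation split.
random.seed(0)
random.shuffle(examples)
split = int(0.9 * len(examples))
train, valid = examples[:split], examples[split:]

MAX_LENGTH = 128  # both source and target are truncated to 128 tokens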

### Training Configuration

- **Learning Rate**: 2e-5
- **Batch Size**: 4 (per device, for both training and evaluation)
- **Weight Decay**: 0.01
- **Evaluation Strategy**: Per epoch
- **Epochs**: 3
- **Logging**: Logs saved every 100 steps, with TensorBoard logging enabled
- **Output Directory**: `./results`
- **Device**: GPU 1 explicitly set
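
The hyperparameters above can be collected into a dict and unpacked into `transformers`' `Seq2SeqTrainingArguments` (the keys follow the `transformers` argument names; check them against your installed version):

```python
import os

# GPU 1 is selected explicitly, matching the configuration above.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

training_config = {
    "output_dir": "./results",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "weight_decay": 0.01,
    "num_train_epochs": 3,
    "logging_steps": 100,
    "report_to": "tensorboard",
}
```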

### Evaluation and Metrics

The following metrics are computed on the validation dataset:
- **BLEU**: Measures n-gram precision against reference translations, with a brevity penalty for short outputs.
- **METEOR**: Considers synonymy and stem matches.
- **BERTScore**: Evaluates semantic similarity with BERT embeddings.

### Inference

Inference includes bidirectional translation capabilities:
- **Source to Target**: Kurdish to English translation.
- **Target to Source**: English to Kurdish translation.

## Results

The fine-tuned model and tokenizer are saved to `./fine-tuned-marianmt`, along with evaluation results for BLEU, METEOR, and BERTScore.