|
--- |
|
datasets: |
|
- abdulhade/TextCorpusKurdish_asosoft |
|
language: |
|
- ku |
|
- en |
|
library_name: transformers
|
license: mit |
|
metrics: |
|
- accuracy |
|
- bleu |
|
- meteor |
|
pipeline_tag: translation |
|
--- |
|
|
|
# Kurdish-English Machine Translation with Transformers |
|
|
|
This repository focuses on fine-tuning a Kurdish-English machine translation model using Hugging Face's `transformers` library with MarianMT. |
|
The model is trained on a custom parallel corpus using a pipeline that covers data preprocessing, bidirectional training, evaluation, and inference.
|
This model is a product of the AI Center of Kurdistan University. |
|
## Table of Contents |
|
|
|
- [Introduction](#introduction) |
|
- [Requirements](#requirements) |
|
- [Setup](#setup) |
|
- [Pipeline Overview](#pipeline-overview) |
|
- [Data Preparation](#data-preparation) |
|
- [Training SentencePiece Tokenizer](#training-sentencepiece-tokenizer) |
|
- [Model and Tokenizer Setup](#model-and-tokenizer-setup) |
|
- [Tokenization and Dataset Preparation](#tokenization-and-dataset-preparation) |
|
- [Training Configuration](#training-configuration) |
|
- [Evaluation and Metrics](#evaluation-and-metrics) |
|
- [Inference](#inference) |
|
- [Results](#results) |
|
- [License](#license) |
|
|
|
## Introduction |
|
|
|
This project fine-tunes a MarianMT model for Kurdish-English translation on a custom parallel corpus. Training is configured for bidirectional translation, enabling model use in both language directions. |
|
|
|
## Requirements |
|
|
|
- Python 3.8+ |
|
- Hugging Face Transformers |
|
- Datasets library |
|
- SentencePiece |
|
- PyTorch 1.9+ |
|
- CUDA (for GPU support) |
|
|
|
## Setup |
|
|
|
1. Clone the repository and install dependencies. |
|
2. Ensure GPU availability. |
|
3. Prepare your Kurdish-English corpus in CSV format. |
|
|
|
## Pipeline Overview |
|
|
|
### Data Preparation |
|
|
|
1. **Corpus**: A Kurdish-English parallel corpus in CSV format with columns `Source` (Kurdish) and `Target` (English). |
|
2. **Path Definition**: Specify the corpus path in the configuration. |
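
The card does not include a loading script, so below is a minimal sketch of how such a CSV could be read with the `datasets` library; the file name `kurdish_english_parallel.csv` is a placeholder for the actual corpus path.

```python
from datasets import load_dataset

# Placeholder path; point this at the actual parallel corpus CSV.
corpus_path = "kurdish_english_parallel.csv"

# The CSV is expected to contain `Source` (Kurdish) and `Target` (English) columns.
raw_dataset = load_dataset("csv", data_files=corpus_path, split="train")
print(raw_dataset.column_names)  # ['Source', 'Target']
```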
|
|
|
### Training SentencePiece Tokenizer |
|
|
|
- **Vocabulary Size**: 32,000 |
|
- **Source Data**: The tokenizer is trained on both the primary Kurdish corpus and the English dataset to build a shared subword vocabulary.
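
A minimal sketch of such a tokenizer run with the `sentencepiece` package; the input file names, model prefix, model type, and character coverage are assumptions, since only the vocabulary size and source data are stated above.

```python
import sentencepiece as spm

# Hypothetical plain-text files with one sentence per line for each language.
spm.SentencePieceTrainer.train(
    input=["kurdish_sentences.txt", "english_sentences.txt"],
    model_prefix="ku_en_spm",      # assumed output prefix
    vocab_size=32000,              # vocabulary size stated in this card
    model_type="unigram",          # SentencePiece default; not specified here
    character_coverage=1.0,        # assumption, to keep full Kurdish script coverage
)
```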
|
|
|
### Model and Tokenizer Setup |
|
|
|
- **Model**: `Helsinki-NLP/opus-mt-en-mul` pre-trained MarianMT model. |
|
- **Tokenizer**: MarianMT tokenizer aligned with the model, with source and target languages set dynamically. |
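
Loading the stated checkpoint could look like the sketch below. For multilingual Marian checkpoints, the translation direction is typically selected with a target-language token prefixed to the input text; the exact token used in this pipeline is not specified in the card.

```python
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-mul"

tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)
```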
|
|
|
### Tokenization and Dataset Preparation |
|
|
|
- **Train-Validation Split**: 90% train, 10% validation.
|
- **Maximum Sequence Length**: 128 tokens for both source and target sequences. |
|
- **Bidirectional Tokenization**: Prepare tokenized sequences for both Kurdish-English and English-Kurdish translation. |
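
A sketch of the split and tokenization described above, reusing `raw_dataset` and `tokenizer` from the earlier snippets; the random seed and the column handling are assumptions.

```python
max_length = 128  # maximum sequence length stated above

# 90% train / 10% validation split.
split = raw_dataset.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = split["train"], split["test"]

def preprocess(batch):
    # Kurdish -> English direction; swapping Source and Target yields the reverse direction.
    model_inputs = tokenizer(batch["Source"], max_length=max_length, truncation=True)
    labels = tokenizer(text_target=batch["Target"], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_tokenized = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
val_tokenized = val_ds.map(preprocess, batched=True, remove_columns=val_ds.column_names)
```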
|
|
|
### Training Configuration |
|
|
|
- **Learning Rate**: 2e-5 |
|
- **Batch Size**: 4 (per device, for both training and evaluation) |
|
- **Weight Decay**: 0.01 |
|
- **Evaluation Strategy**: Per epoch |
|
- **Epochs**: 3 |
|
- **Logging**: Logs saved every 100 steps, with TensorBoard logging enabled |
|
- **Output Directory**: `./results` |
|
- **Device**: GPU 1 explicitly set |
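
The hyperparameters above could be wired into `Seq2SeqTrainingArguments` roughly as follows. The data collator and `predict_with_generate` are assumptions needed for generation-based metrics, and GPU selection (e.g. pinning to GPU 1 via `CUDA_VISIBLE_DEVICES=1`) is handled outside this sketch.

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_steps=100,
    report_to="tensorboard",
    predict_with_generate=True,  # assumption: lets evaluation decode translations for BLEU/METEOR
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)

trainer.train()
```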
|
|
|
### Evaluation and Metrics |
|
|
|
The following metrics are computed on the validation dataset: |
|
- **BLEU**: Measures translation quality based on n-gram precision against the reference, with a brevity penalty.
|
- **METEOR**: Considers synonymy and stem matches. |
|
- **BERTScore**: Evaluates semantic similarity with BERT embeddings. |
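
A sketch of how these metrics could be computed with the `evaluate` library; the exact configuration (reference handling, BERTScore language setting) is not given in the card and is assumed here.

```python
import evaluate
import numpy as np

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Replace label padding (-100) so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    return {
        "bleu": bleu.compute(predictions=decoded_preds,
                             references=[[label] for label in decoded_labels])["score"],
        "meteor": meteor.compute(predictions=decoded_preds,
                                 references=decoded_labels)["meteor"],
        "bertscore_f1": float(np.mean(bertscore.compute(predictions=decoded_preds,
                                                        references=decoded_labels,
                                                        lang="en")["f1"])),
    }
```

In the trainer sketch above, this function would be passed in via the `compute_metrics` argument of `Seq2SeqTrainer`.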
|
|
|
### Inference |
|
|
|
Inference includes bidirectional translation capabilities: |
|
- **Source to Target**: Kurdish to English translation.

- **Target to Source**: English to Kurdish translation.
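
An illustrative helper for running the fine-tuned model in either direction; the example inputs are arbitrary, and handling of the multilingual target-language token is an assumption not spelled out in the card.

```python
import torch

def translate(text: str, max_length: int = 128) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        generated = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate("سڵاو"))                 # Kurdish -> English
print(translate("Hello, how are you?"))  # English -> Kurdish
```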
|
|
|
## Results |
|
|
|
The fine-tuned model and tokenizer are saved to `./fine-tuned-marianmt`, along with evaluation results for BLEU, METEOR, and BERTScore on the validation set.
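
A minimal sketch of how those artifacts could be written out from the trainer above:

```python
output_dir = "./fine-tuned-marianmt"

trainer.save_model(output_dir)         # saves the fine-tuned MarianMT weights and config
tokenizer.save_pretrained(output_dir)  # saves the tokenizer files alongside the model
```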
|
""" |
|
|
|
# Write the content to README.md |
|
file_path = "/mnt/data/README.md" |
|
with open(file_path, "w") as readme_file: |
|
readme_file.write(readme_content) |
|
|
|
file_path |