---
datasets:
- abdulhade/TextCorpusKurdish_asosoft
language:
- ku
- en
library_name: transformers
license: mit
metrics:
- accuracy
- bleu
- meteor
pipeline_tag: translation
---
# Kurdish-English Machine Translation with Transformers
This repository fine-tunes a Kurdish-English machine translation model with MarianMT, using Hugging Face's `transformers` library.
The model is trained on a custom parallel corpus with a detailed pipeline that includes data preprocessing, bidirectional training, evaluation, and inference.
This model is a product of the AI Center of Kurdistan University.
## Table of Contents
- [Introduction](#introduction)
- [Requirements](#requirements)
- [Setup](#setup)
- [Pipeline Overview](#pipeline-overview)
- [Data Preparation](#data-preparation)
- [Training SentencePiece Tokenizer](#training-sentencepiece-tokenizer)
- [Model and Tokenizer Setup](#model-and-tokenizer-setup)
- [Tokenization and Dataset Preparation](#tokenization-and-dataset-preparation)
- [Training Configuration](#training-configuration)
- [Evaluation and Metrics](#evaluation-and-metrics)
- [Inference](#inference)
- [Results](#results)
- [License](#license)
## Introduction
This project fine-tunes a MarianMT model for Kurdish-English translation on a custom parallel corpus. Training is configured for bidirectional translation, so a single model can be used in both language directions.
## Requirements
- Python 3.8+
- Hugging Face Transformers
- Datasets library
- SentencePiece
- PyTorch 1.9+
- CUDA (for GPU support)
## Setup
1. Clone the repository and install dependencies.
2. Ensure GPU availability.
3. Prepare your Kurdish-English corpus in CSV format.
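A quick sanity check for step 2, confirming that PyTorch can see a CUDA device:

```python
import torch

# Confirm PyTorch can see at least one CUDA device before training.
assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
print(torch.cuda.get_device_name(0))
```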
## Pipeline Overview
### Data Preparation
1. **Corpus**: A Kurdish-English parallel corpus in CSV format with columns `Source` (Kurdish) and `Target` (English).
2. **Path Definition**: Specify the corpus path in the configuration.
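A minimal loading sketch; the file name `data/ku_en_parallel.csv` is a placeholder for your own corpus path:

```python
import pandas as pd
from datasets import Dataset

# Hypothetical corpus path; replace with your own CSV location.
corpus_path = "data/ku_en_parallel.csv"

# The CSV is expected to have `Source` (Kurdish) and `Target` (English) columns.
df = pd.read_csv(corpus_path)
df = df.dropna(subset=["Source", "Target"])  # drop incomplete sentence pairs

dataset = Dataset.from_pandas(df, preserve_index=False)
print(dataset)
```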
### Training SentencePiece Tokenizer
- **Vocabulary Size**: 32,000
- **Source Data**: The tokenizer is trained on both the Kurdish and English sides of the corpus to create a shared subword vocabulary.
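A sketch of the tokenizer training call, assuming the Kurdish and English sentences have been concatenated into a single plain-text file (`data/ku_en_combined.txt` is a hypothetical name, and `model_type="unigram"` is an assumption; the exact SentencePiece settings are not specified above):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/ku_en_combined.txt",  # one sentence per line, both languages
    model_prefix="ku_en_spm",
    vocab_size=32000,                 # matches the vocabulary size above
    character_coverage=1.0,           # keep full coverage for the Kurdish script
    model_type="unigram",             # SentencePiece default; an assumption here
)
```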
### Model and Tokenizer Setup
- **Model**: The pre-trained MarianMT checkpoint `Helsinki-NLP/opus-mt-en-mul`.
- **Tokenizer**: MarianMT tokenizer aligned with the model, with source and target languages set dynamically.
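Loading the base checkpoint might look like this:

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
```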
### Tokenization and Dataset Preparation
- **Train-Validation Split**: 90% train, 10% validation.
- **Maximum Sequence Length**: 128 tokens for both source and target sequences.
- **Bidirectional Tokenization**: Prepare tokenized sequences for both Kurdish-English and English-Kurdish translation.
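A sketch of the split and bidirectional tokenization, reusing `tokenizer` and `dataset` from the previous steps; concatenating both directions into one training set is one possible reading of the bidirectional setup:

```python
from datasets import concatenate_datasets

MAX_LEN = 128

def make_preprocess(src_col, tgt_col):
    def preprocess(batch):
        # Tokenize inputs and labels in one call; `text_target` fills `labels`.
        return tokenizer(
            batch[src_col],
            text_target=batch[tgt_col],
            max_length=MAX_LEN,
            truncation=True,
        )
    return preprocess

splits = dataset.train_test_split(test_size=0.1, seed=42)  # 90/10 split

def tokenize_bidirectional(split):
    # Tokenize Kurdish->English and English->Kurdish, then merge.
    ku_en = split.map(make_preprocess("Source", "Target"), batched=True,
                      remove_columns=split.column_names)
    en_ku = split.map(make_preprocess("Target", "Source"), batched=True,
                      remove_columns=split.column_names)
    return concatenate_datasets([ku_en, en_ku])

tokenized_train = tokenize_bidirectional(splits["train"])
tokenized_val = tokenize_bidirectional(splits["test"])
```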
### Training Configuration
- **Learning Rate**: 2e-5
- **Batch Size**: 4 (per device, for both training and evaluation)
- **Weight Decay**: 0.01
- **Evaluation Strategy**: Per epoch
- **Epochs**: 3
- **Logging**: Logs saved every 100 steps, with TensorBoard logging enabled
- **Output Directory**: `./results`
- **Device**: GPU 1 explicitly set
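A sketch of the trainer setup with the hyperparameters listed above. The GPU pinning via `CUDA_VISIBLE_DEVICES` and `predict_with_generate` are assumptions, not part of the listed configuration, and `compute_metrics` is defined in the next section:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # pin to GPU 1; set before CUDA initializes

from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=3,
    eval_strategy="epoch",        # `evaluation_strategy` in older transformers releases
    logging_steps=100,
    report_to="tensorboard",
    predict_with_generate=True,   # assumption: decode generated text for BLEU/METEOR
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,  # see the Evaluation section below
)
trainer.train()
```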
### Evaluation and Metrics
The following metrics are computed on the validation dataset:
- **BLEU**: Measures translation quality via modified n-gram precision with a brevity penalty.
- **METEOR**: Considers synonymy and stem matches.
- **BERTScore**: Evaluates semantic similarity with BERT embeddings.
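A possible `compute_metrics` implementation using the `evaluate` library; SacreBLEU as the BLEU implementation and `lang="en"` for BERTScore are assumptions:

```python
import evaluate
import numpy as np

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Replace label/prediction padding (-100) before decoding.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return {
        "bleu": bleu.compute(
            predictions=decoded_preds,
            references=[[ref] for ref in decoded_labels],
        )["score"],
        "meteor": meteor.compute(
            predictions=decoded_preds, references=decoded_labels
        )["meteor"],
        "bertscore_f1": float(np.mean(
            bertscore.compute(
                predictions=decoded_preds,
                references=decoded_labels,
                lang="en",
            )["f1"]
        )),
    }
```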
### Inference
Inference includes bidirectional translation capabilities:
- **Source to Target**: Kurdish to English translation.
- **Target to Source**: English to Kurdish translation.
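A minimal generation helper, assuming the model and tokenizer from the steps above. Note that the multilingual base model normally expects a target-language token (e.g. `>>eng<<`) prepended to the source text; whether the fine-tuned model still needs one depends on how the corpus was prepared:

```python
def translate(text: str, max_length: int = 128) -> str:
    # Tokenize, generate with beam search, and decode the best hypothesis.
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=max_length).to(model.device)
    output_ids = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Kurdish -> English; for the reverse direction, feed English text instead.
print(translate("ئەمە نموونەیەکی وەرگێڕانە."))
```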
## Results
The fine-tuned model and tokenizer are saved to `./fine-tuned-marianmt`, along with BLEU, METEOR, and BERTScore results on the validation set.
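Saving and final evaluation might look like this, reusing the `trainer` from above:

```python
save_dir = "./fine-tuned-marianmt"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# Final validation metrics (BLEU, METEOR, BERTScore via `compute_metrics`).
print(trainer.evaluate())
```

## License
This project is released under the MIT License (see the `license` field in the metadata above).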
"""
# Write the content to README.md
file_path = "/mnt/data/README.md"
with open(file_path, "w") as readme_file:
readme_file.write(readme_content)
file_path