---
datasets:
- abdulhade/TextCorpusKurdish_asosoft
language:
- ku
- en
library_name: adapter-transformers
license: mit
metrics:
- accuracy
- bleu
- meteor
pipeline_tag: translation
---
# Kurdish-English Machine Translation with Transformers
This repository fine-tunes a MarianMT model for Kurdish-English machine translation using Hugging Face's `transformers` library.
The model is trained on a custom parallel corpus through a pipeline covering data preprocessing, bidirectional training, evaluation, and inference.
This model is a product of the AI Center of Kurdistan University.
## Table of Contents
- [Introduction](#introduction)
- [Requirements](#requirements)
- [Setup](#setup)
- [Pipeline Overview](#pipeline-overview)
- [Data Preparation](#data-preparation)
- [Training SentencePiece Tokenizer](#training-sentencepiece-tokenizer)
- [Model and Tokenizer Setup](#model-and-tokenizer-setup)
- [Tokenization and Dataset Preparation](#tokenization-and-dataset-preparation)
- [Training Configuration](#training-configuration)
- [Evaluation and Metrics](#evaluation-and-metrics)
- [Inference](#inference)
- [Results](#results)
- [License](#license)
## Introduction
This project fine-tunes a MarianMT model for Kurdish-English translation on a custom parallel corpus. Training is configured for bidirectional translation, enabling model use in both language directions.
## Requirements
- Python 3.8+
- Hugging Face Transformers
- Datasets library
- SentencePiece
- PyTorch 1.9+
- CUDA (for GPU support)
## Setup
1. Clone the repository and install dependencies.
2. Ensure GPU availability.
3. Prepare your Kurdish-English corpus in CSV format.
## Pipeline Overview
### Data Preparation
1. **Corpus**: A Kurdish-English parallel corpus in CSV format with columns `Source` (Kurdish) and `Target` (English).
2. **Path Definition**: Specify the corpus path in the configuration.
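As a sketch, the corpus can be loaded and cleaned with pandas. The file name below is a placeholder, and the toy two-row corpus exists only to keep the example self-contained:

```python
import pandas as pd

def load_parallel_corpus(path):
    """Load a Kurdish-English parallel corpus with `Source`/`Target`
    columns, dropping rows where either side is missing."""
    df = pd.read_csv(path)
    return df.dropna(subset=["Source", "Target"]).reset_index(drop=True)

# Toy sample, written only so this sketch runs end to end.
pd.DataFrame(
    {"Source": ["سڵاو", "سوپاس"], "Target": ["Hello", "Thank you"]}
).to_csv("corpus_sample.csv", index=False)

corpus = load_parallel_corpus("corpus_sample.csv")
print(len(corpus))  # 2 rows survive cleaning
```

In the real pipeline, `load_parallel_corpus` would point at the configured corpus path instead of the sample file.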
### Training SentencePiece Tokenizer
- **Vocabulary Size**: 32,000
- **Source Data**: The tokenizer is trained on both the primary Kurdish corpus and the English dataset to create shared subword tokens.
### Model and Tokenizer Setup
- **Model**: `Helsinki-NLP/opus-mt-en-mul` pre-trained MarianMT model.
- **Tokenizer**: MarianMT tokenizer aligned with the model, with source and target languages set dynamically.
### Tokenization and Dataset Preparation
- **Train-Validation Split**: 90% train, 10% validation.
- **Maximum Sequence Length**: 128 tokens for both source and target sequences.
- **Bidirectional Tokenization**: Prepare tokenized sequences for both Kurdish-English and English-Kurdish translation.
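The bidirectional duplication and the 90/10 split are plain data plumbing and can be sketched without the tokenizer (function names here are illustrative, not from the project code):

```python
import random

MAX_LENGTH = 128  # applied to both source and target at tokenization time

def make_bidirectional(pairs):
    """Duplicate each (Kurdish, English) pair in the reverse direction
    so a single model is trained on both translation directions."""
    return pairs + [(tgt, src) for src, tgt in pairs]

def train_val_split(pairs, val_fraction=0.1, seed=42):
    """Shuffle reproducibly, then split 90% train / 10% validation."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:cut], shuffled[cut:]

pairs = make_bidirectional([("سڵاو", "Hello"), ("سوپاس", "Thanks")])
train, val = train_val_split(pairs)
print(len(pairs), len(train), len(val))  # 4 3 1
```

Each split is then tokenized with `tokenizer(..., truncation=True, max_length=MAX_LENGTH)` on both the source and target side.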
### Training Configuration
- **Learning Rate**: 2e-5
- **Batch Size**: 4 (per device, for both training and evaluation)
- **Weight Decay**: 0.01
- **Evaluation Strategy**: Per epoch
- **Epochs**: 3
- **Logging**: Logs saved every 100 steps, with TensorBoard logging enabled
- **Output Directory**: `./results`
- **Device**: GPU 1 explicitly set
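The hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as follows (a sketch; argument names can vary slightly across `transformers` versions, e.g. `evaluation_strategy` vs. `eval_strategy`):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=3,
    evaluation_strategy="epoch",   # evaluate once per epoch
    logging_steps=100,
    report_to=["tensorboard"],
)

# Pinning training to GPU 1 is typically done before launch, e.g.:
#   CUDA_VISIBLE_DEVICES=1 python train.py
```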
### Evaluation and Metrics
The following metrics are computed on the validation dataset:
- **BLEU**: Measures translation quality via modified n-gram precision with a brevity penalty.
- **METEOR**: Considers synonymy and stem matches.
- **BERTScore**: Evaluates semantic similarity with BERT embeddings.
### Inference
Inference includes bidirectional translation capabilities:
- **Source to Target**: Kurdish to English translation.
- **Target to Source**: English to Kurdish translation.
## Results
The fine-tuned model and tokenizer are saved to `./fine-tuned-marianmt`, together with evaluation results for BLEU, METEOR, and BERTScore on the validation set.