lingvanex-mt-en-ku / README.md
metadata
language:
  - en
  - ku
tags:
  - translation
  - ctranslate2
license: cc-by-nc-4.0

Introduction

This is an English-Kurdish machine translation model.

Demo

You can try this translator in the Space.

Metrics

  • Model performance measures: The English-Kurdish model was evaluated using the SacreBLEU, TER, and chrF++ metrics, which are widely adopted by the machine translation community.

Evaluation Data

  • Datasets: The Lingvanex dataset is described in Section 4.
  
  • Motivation: We used Flores-200 as it provides full evaluation coverage of the languages in NLLB-200
  • Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200.

Training Data

  • We used parallel multilingual data from a variety of sources to train the model. We provide a detailed report on the data selection and construction process in Section 5 of the paper. We also used monolingual data constructed from Common Crawl. We provide more details in Section 5.2.

Intended Use

  • Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository, along with the training code and references to evaluation and training data.
  • Primary intended users: Primary users are researchers and the machine translation research community.
  • Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general domain text data and is not intended to be used with domain-specific texts, such as medical or legal texts. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.
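The 512-token input limit mentioned above can be checked before translation. The following is a minimal sketch: the function names are hypothetical, and the whitespace counter is only a stand-in, since the real count depends on the model's SentencePiece tokenizer, which would be passed in as `count_tokens`.

```python
from typing import Callable, List, Tuple

MAX_INPUT_TOKENS = 512  # training-time input limit stated in this card


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words.

    A subword tokenizer usually yields more tokens than this.
    """
    return len(text.split())


def partition_by_length(
    sentences: List[str],
    max_tokens: int = MAX_INPUT_TOKENS,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> Tuple[List[str], List[str]]:
    """Split sentences into (within_limit, too_long) by token count."""
    within_limit: List[str] = []
    too_long: List[str] = []
    for sentence in sentences:
        if count_tokens(sentence) <= max_tokens:
            within_limit.append(sentence)
        else:
            too_long.append(sentence)
    return within_limit, too_long


if __name__ == "__main__":
    inputs = ["A short sentence.", "word " * 600]
    ok, flagged = partition_by_length(inputs)
    print(f"{len(ok)} within limit, {len(flagged)} flagged as too long")
```

Flagged inputs are reported rather than truncated, so the caller can decide how to shorten or split them.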