---
language:
- en
- cy
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
- cer
- wer
- wil
- wip
- chrf
widget:
- text: "The doctor will be late to attend to patients this morning."
  example_title: "Example 1"
license: apache-2.0
model-index:
- name: "mt-dspec-health-en-cy"
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      type: "text"
      name: "various"
    metrics:
    - name: SacreBLEU
      type: bleu
      value: 54.16
    - name: CER
      type: cer
      value: 0.31
    - name: WER
      type: wer
      value: 0.47
    - name: WIL
      type: wil
      value: 0.67
    - name: WIP
      type: wip
      value: 0.33
    - name: SacreBLEU CHRF
      type: chrf
      value: 69.03
---

# mt-dspec-health-en-cy

A language translation model for translating between English and Welsh, specialised to the specific domain of Health and care.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/). The datasets were prepared from the following sources:

- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into training, validation and test sets. The test set contains health-specific segments from TMX files selected at random from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to the specific domain. After the test set was extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.

A website demonstrating use of this model is available at http://cyfieithu.techiaith.cymru.

## Evaluation

Evaluation was done using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).

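To make the word- and character-level metrics listed in the metadata (CER, WER, WIL, WIP) more concrete, here is a minimal pure-Python sketch of how such scores can be computed for a single reference/hypothesis pair. It is not the torchmetrics implementation used for the reported figures: the function names are hypothetical, the example strings are hand-written rather than model output, and the number of matching words is approximated as the longer sequence length minus the edit distance.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wip(reference, hypothesis):
    """Word information preserved, using an approximate hit count (assumption)."""
    ref, hyp = reference.split(), hypothesis.split()
    hits = max(len(ref), len(hyp)) - edit_distance(ref, hyp)
    return (hits / len(ref)) * (hits / len(hyp))


def wil(reference, hypothesis):
    """Word information lost, defined as the complement of WIP."""
    return 1.0 - wip(reference, hypothesis)


# Illustrative strings only: one word out of five differs.
reference = "Bydd y meddyg yn hwyr"
hypothesis = "Bydd y doctor yn hwyr"
print(wer(reference, hypothesis))  # 0.2
```

Because WIL is defined as one minus WIP, the two values always sum to 1, which matches the WIL (0.67) and WIP (0.33) figures reported in the metadata. Lower CER, WER and WIL are better, while higher WIP is better.
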
## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
# The constraint imposed on the transformers version below
# is due to the following issue:
# https://github.com/huggingface/transformers/issues/26271
pip install sentencepiece "transformers>4.26.1,<=4.30.2"
```

```python
import transformers

model_id = "techiaith/mt-dspec-health-en-cy"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)

# The pipeline returns a list containing one dict per input string.
translated = translate("The doctor will be late to attend to patients this morning.")
print(translated[0]["translation_text"])
```
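
To translate several segments in one call, the pipeline can be given a list of strings, and it returns a list of dicts each holding a `translation_text` key. The helper below is a small sketch of that pattern; `translate_many` is a hypothetical name, and the pipeline is passed in as an argument so the helper does not load the model itself.

```python
from typing import Callable, Dict, List


def translate_many(
    translate: Callable[[List[str]], List[Dict[str, str]]],
    texts: List[str],
) -> List[str]:
    """Translate a batch of strings and return only the translated text.

    `translate` is assumed to behave like a transformers translation
    pipeline: called with a list of strings, it returns one dict per
    input, each with a "translation_text" entry.
    """
    if not texts:
        return []
    results = translate(texts)
    return [result["translation_text"] for result in results]
```

With a translation pipeline object named `translate`, calling `translate_many(translate, ["First sentence.", "Second sentence."])` would return the Welsh translations as a plain list of strings in the same order as the inputs.
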