---
language:
- en
- cy
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
- cer
- chrf
- wer
- wil
- wip
widget:
- text: >-
    The Curriculum and Assessment (Wales) Act 2021 (the Act) established the
    Curriculum for Wales and replaced the general curriculum used up until that
    point.
  example_title: Example 1
model-index:
- name: mt-dspec-legislation-en-cy
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: "various"
      type: "text"
    metrics:
    - type: bleu
      value: 65.51
    - type: cer
      value: 0.28
    - type: chrf
      value: 74.69
    - type: wer
      value: 0.39
    - type: wil
      value: 0.54
    - type: wip
      value: 0.46
---
# mt-dspec-legislation-en-cy

A translation model for translating between English and Welsh, specialised for the legislation domain.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The training datasets were generated from the following sources:
- [UK Government Legislation data](https://www.legislation.gov.uk)
- [OPUS-cy-en](https://opus.nlpl.eu/)
- [Cofnod Y Cynulliad](https://record.assembly.wales/)
- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets. The test set consists of legislation-specific segments selected randomly from TMX files
originating from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to the legislation domain,
and from data files scraped from the UK Government Legislation website.

After the test set was extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.
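
The splitting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the project's DVC pipeline: the function name `split_corpus`, its parameters, and the toy data are hypothetical, and the sketch assumes the 10 sets form a k-fold style partition, which the description above does not confirm.

```python
import random


def split_corpus(domain_pairs, other_pairs, test_size, n_splits=10, seed=0):
    """Hold out a random in-domain test set, then build n_splits train/validation sets.

    Assumption: each split validates on a different tenth of the remaining
    data (k-fold style). How the real 10 sets were drawn is not documented.
    """
    rng = random.Random(seed)
    domain = list(domain_pairs)
    rng.shuffle(domain)
    # The test set is drawn only from segments classified as in-domain.
    test = domain[:test_size]
    # Everything else is aggregated into one pool for training and validation.
    rest = domain[test_size:] + list(other_pairs)
    rng.shuffle(rest)
    splits = []
    for k in range(n_splits):
        validation = rest[k::n_splits]  # every n-th pair, starting at offset k
        train = [pair for i, pair in enumerate(rest) if i % n_splits != k]
        splits.append((train, validation))
    return test, splits


# Toy segment pairs standing in for the real English/Welsh corpus.
domain_pairs = [(f"en legislation {i}", f"cy legislation {i}") for i in range(50)]
other_pairs = [(f"en general {i}", f"cy general {i}") for i in range(50)]
test, splits = split_corpus(domain_pairs, other_pairs, test_size=10)
print(len(test), len(splits))
```

Each `(train, validation)` pair would then feed one training session; the test set is never part of any of them.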
## Evaluation

Evaluation scores were produced using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
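
To make the error-rate metrics concrete, here is a standard-library sketch of word error rate (WER) and character error rate (CER) in their usual definition: total edit distance divided by total reference length. It is not the torchmetrics implementation and may differ from it in details such as normalisation; the helper names and the toy sentences are illustrative only. BLEU, chrF, WIL and WIP are not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize):
    """Total edits divided by total reference length over the whole corpus."""
    edits = sum(
        edit_distance(tokenize(r), tokenize(h))
        for r, h in zip(references, hypotheses)
    )
    length = sum(len(tokenize(r)) for r in references)
    return edits / length


def wer(references, hypotheses):
    """Word error rate: edit distance over whitespace-separated tokens."""
    return error_rate(references, hypotheses, str.split)


def cer(references, hypotheses):
    """Character error rate: edit distance over characters, spaces included."""
    return error_rate(references, hypotheses, list)


# Toy reference and system output; not taken from the real test set.
refs = ["the act established the curriculum"]
hyps = ["the act establishes the curriculum"]
print(wer(refs, hyps), cer(refs, hyps))
```

Lower is better for both: a score of 0 means the output matches the reference exactly.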
## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
# The transformers version is constrained below
# because of the following issue:
# https://github.com/huggingface/transformers/issues/26271
pip install sentencepiece "transformers>4.26.1,<=4.30.2"
```
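If the environment is managed elsewhere, a small runtime guard can confirm that the installed release falls inside the range given in the install command. The sketch below uses only the standard library; the function names are hypothetical, and it handles plain numeric releases only, not pre-release strings such as `4.31.0.dev0`.

```python
def parse_version(text):
    """Turn a release string such as '4.30.2' into a tuple of ints.

    Only plain numeric releases are handled in this sketch.
    """
    return tuple(int(part) for part in text.split("."))


def transformers_version_supported(text):
    """True if 4.26.1 < version <= 4.30.2, the range used in the pip command."""
    return (4, 26, 1) < parse_version(text) <= (4, 30, 2)


print(transformers_version_supported("4.30.2"))
```

In practice the argument would be `transformers.__version__`, checked before loading the model.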
```python
import transformers

model_id = "techiaith/mt-dspec-legislation-en-cy"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
    "The Curriculum and Assessment (Wales) Act 2021 (the Act) "
    "established the Curriculum for Wales and replaced the general "
    "curriculum used up until that point."
)
# The pipeline returns a list with one dict per input text.
print(translated[0]["translation_text"])
```