---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- nllb
- commonvoice
- pytorch
- pictograms
- translation
metrics:
- bleu
inference: false
---

# t2p-nllb-200-distilled-600M-commonvoice

*t2p-nllb-200-distilled-600M-commonvoice* is a text-to-pictograms translation model built by fine-tuning the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model on a dataset of paired transcriptions and pictogram token sequences. Each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/).
The model is intended for **inference** only.

## Training details

### Datasets

The model was fine-tuned on the [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CommonVoice v.15.0 corpus.
This dataset was built with the method presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), published at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 527,390 |
| valid | 16,124 |
| test | 16,120 |

### Parameters

A full list of the parameters is available in the `config.json` file. The training pipeline used the following arguments:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_commonvoice/",
    evaluation_strategy="epoch",   # evaluate at the end of every epoch
    save_strategy="epoch",         # save a checkpoint at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,            # cap the number of checkpoints kept on disk
    num_train_epochs=40,
    predict_with_generate=True,    # run generate() during evaluation
    fp16=True,                     # mixed-precision training
    load_best_model_at_end=True
)
```
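To give a sense of the scale of this configuration, the sketch below estimates the number of optimizer steps from the training-set size and the arguments listed above. It assumes a single device and no gradient accumulation (`gradient_accumulation_steps` is not set above, and its default is 1); the function name `optimizer_steps` is illustrative and not part of the training pipeline.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                    num_devices=1, grad_accum_steps=1):
    """Estimate (steps per epoch, total steps), keeping the last partial batch."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs

# 527,390 training utterances, batch size 32, 40 epochs (values from this card).
steps_per_epoch, total_steps = optimizer_steps(527_390, 32, 40)
print(steps_per_epoch, total_steps)  # 16481 659240
```

The total is an upper bound for a run that completes all 40 epochs under these assumptions.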

### Evaluation

The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), comparing each model hypothesis against the reference pictogram translation.
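
To illustrate what this comparison measures, here is a minimal pure-Python sketch of unsmoothed corpus-level BLEU over whitespace-separated tokens with one reference per hypothesis. It is a simplification and not sacreBLEU itself: sacreBLEU also handles tokenization, smoothing, and score signatures, so the numbers may differ from those reported in this card. The function names and the example token sequences are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

# Made-up pictogram token sequences, for illustration only.
references = ["je manger un pomme", "le chat dormir sur le canapé"]
hypotheses = ["je manger un pomme", "le chat dormir sur canapé"]
print(round(simple_corpus_bleu(hypotheses, references), 1))
```

A hypothesis set identical to the references scores 100; dropped or substituted tokens lower the n-gram precisions, and shorter outputs are further reduced by the brevity penalty.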

### Results

Comparison with other translation models (BLEU scores on the validation and test sets):

| **Model** | **Validation** | **Test** |
|:-----------:|:-----------------------:|:-----------------------:|
| t2p-t5-large-commonvoice | 86.3 | 86.5 |
| t2p-nmt-commonvoice | 86.0 | 82.6 |
| t2p-mbart-large-cc25-commonvoice | 72.3 | 72.3 |
| **t2p-nllb-200-distilled-600M-commonvoice** | **87.4** | **87.6** |

### Environmental Impact

Fine-tuning ran on a single NVIDIA V100 GPU with 32 GB of memory and took around 30 hours in total.

## Using the t2p-nllb-200-distilled-600M-commonvoice model with Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Language tags and length limits (defined for reference; the calls below do not use them).
source_lang = "fr"
target_lang = "frp"
max_input_length = 128
max_target_length = 128

tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-commonvoice")
model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-commonvoice")

# The model and the inputs must be on the same device; fall back to CPU without a GPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids
# Sampling is enabled, so the output can vary between runs.
outputs = model.generate(inputs.to(device), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

```python
import pandas as pd

def process_output_trad(pred):
    # Split the predicted string into one token per pictogram.
    return pred.split()

def read_lexicon(lexicon):
    # The lexicon is a tab-separated file with at least 'lemma' and 'id_picto' columns.
    df = pd.read_csv(lexicon, sep='\t')
    # Normalize lemmas: drop anything after ' #' and replace spaces with underscores.
    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
    return df

def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
    # Return (pictogram ID, lemma); the ID is 0 when the lemma is not in the lexicon.
    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
    return (id_picto[0], lemma) if id_picto else (0, lemma)

lexicon = read_lexicon("lexicon.csv")
sentence_to_map = process_output_trad(pred)  # `pred` is the decoded model output
pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
```
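The lookup above scans the whole lexicon once per predicted token. As an alternative sketch, not the authors' implementation, the lexicon can be loaded once into a dictionary so that each token is resolved in constant time. The example assumes the same tab-separated layout with `lemma` and `id_picto` columns and applies the same normalization as `read_lexicon`; the helper names and the toy lexicon rows, including their IDs, are placeholders.

```python
import csv
import io

def build_lemma_index(tsv_file):
    """Map each normalized lemma to the first pictogram ID listed for it."""
    index = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Same normalization as read_lexicon: drop anything after ' #', use underscores.
        key = row["lemma"].split(" #")[0].strip().replace(" ", "_")
        index.setdefault(key, int(row["id_picto"]))  # keep the first match only
    return index

def map_tokens(index, tokens):
    """Return (id, token) pairs, with ID 0 for tokens missing from the lexicon."""
    return [(index.get(token, 0), token) for token in tokens]

# Toy lexicon with placeholder IDs (not real ARASAAC identifiers).
toy_lexicon = io.StringIO(
    "lemma\tid_picto\n"
    "manger #verbe\t101\n"
    "pomme de terre\t102\n"
    "manger #nom\t103\n"
)
index = build_lemma_index(toy_lexicon)
pairs = map_tokens(index, "je manger pomme_de_terre".split())
print(pairs)  # [(0, 'je'), (101, 'manger'), (102, 'pomme_de_terre')]

# Tokens without a pictogram are worth reporting rather than dropping silently.
missing = [token for picto_id, token in pairs if picto_id == 0]
print(missing)  # ['je']
```

With a real lexicon file, `build_lemma_index(open("lexicon.csv", encoding="utf-8", newline=""))` would play the role of `read_lexicon`, and the resulting pairs have the same `(id, lemma)` shape as `pictogram_ids`.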

## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

```python
from html import escape

def generate_html(ids):
    # Declare UTF-8 so that accented lemmas render correctly.
    html_content = '<html><head><meta charset="utf-8"></head><body>'
    for picto_id, lemma in ids:
        if picto_id != 0:  # ignore invalid IDs
            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
            safe_lemma = escape(lemma)  # escape characters that are special in HTML
            html_content += f'''
            <figure style="display:inline-block; margin:1px;">
                <img src="{img_url}" alt="{safe_lemma}" width="200" height="200" />
                <figcaption>{safe_lemma}</figcaption>
            </figure>
            '''
    html_content += '</body></html>'
    return html_content

html = generate_html(pictogram_ids)
with open("pictograms.html", "w", encoding="utf-8") as file:
    file.write(html)
```

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}
```