---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- t5
- orfeo
- pytorch
- pictograms
- translation
metrics:
- bleu
widget:
- text: "je mange une pomme"
  example_title: "A simple sentence"
- text: "je ne pense pas à toi"
  example_title: "Sentence with a negation"
- text: "il y a 2 jours, les gendarmes ont vérifié ma licence"
  example_title: "Sentence with a polylexical term"
---

# t2p-t5-large-orféo

*t2p-t5-large-orféo* is a text-to-pictograms translation model built by fine-tuning the [t5-large](https://huggingface.co/google-t5/t5-large) model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is intended for **inference** only.

## Training details

### Datasets

The model was fine-tuned on the [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-Orféo corpus.
This dataset was presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 231,374 |
| valid | 28,796 |
| test | 29,009 |

### Parameters

A full list of the parameters is available in the config.json file.
These are the arguments used in the training pipeline:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_orfeo/",
    evaluation_strategy="epoch",  # named `eval_strategy` in recent transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=40,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True
)
```

### Evaluation

The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), by comparing the model hypothesis with the reference pictogram translation.

### Results

Comparison with other translation models (sacreBLEU scores, best score per column in bold):

| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| **t2p-t5-large-orféo** | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** |
| t2p-mbart-large-cc25-orfeo | 75.2 | 75.6 |
| t2p-nllb-200-distilled-600M-orfeo | 86.3 | 86.9 |

### Environmental Impact

Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took 16 hours in total.
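To make the comparison described in the Evaluation section more concrete, the sketch below computes a simplified corpus-level BLEU score between hypothesis and reference pictogram token sequences. It is an illustration only, not the sacreBLEU implementation used for the reported results: it assumes whitespace tokenization, a single reference per hypothesis, and no smoothing, so its scores will not match the table above. The function names and the example sequences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: whitespace tokens, one reference per hypothesis,
    no smoothing. Illustrative only; this is not sacreBLEU.
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-grams, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection clips each count to the reference count.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, a zero precision zeroes the score

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical pictogram token sequences, for illustration only.
references = ["je manger un pomme", "je penser non à toi"]
hypotheses = ["je manger un pomme", "je penser à toi"]

print(f"BLEU = {corpus_bleu(hypotheses, references):.1f}")
```

For scores comparable to the ones reported here, the linked sacreBLEU metric should be used rather than this sketch.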

## Using the t2p-t5-large-orféo model with Hugging Face transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

source_lang = "fr"
target_lang = "frp"
max_input_length = 128
max_target_length = 128

# Use the GPU when available; the model and its inputs must be on the same device.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-t5-large-orfeo")
model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-t5-large-orfeo").to(device)

inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids
outputs = model.generate(inputs.to(device), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

```python
import pandas as pd

def process_output_trad(pred):
    # Split the predicted string into pictogram tokens (lemmas).
    return pred.split()

def read_lexicon(lexicon):
    # Load the tab-separated lexicon and normalize each lemma:
    # keep the text before ' #' and replace spaces with underscores.
    df = pd.read_csv(lexicon, sep='\t')
    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
    return df

def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
    # Return (pictogram ID, lemma); the ID is 0 when the lemma is not in the lexicon.
    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
    return (id_picto[0], lemma) if id_picto else (0, lemma)

lexicon = read_lexicon("lexicon.csv")
sentence_to_map = process_output_trad(pred)
pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
```

## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

```python
def generate_html(ids):
    html_content = '<html><body>'
    for picto_id, lemma in ids:
        if picto_id != 0:  # ignore invalid IDs
            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
            # One figure per pictogram, with the lemma as alt text and caption.
            html_content += f'''
            <figure style="display:inline-block; margin:1px;">
                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
                <figcaption>{lemma}</figcaption>
            </figure>
            '''
    html_content += '</body></html>'
    return html_content

html = generate_html(pictogram_ids)
with open("pictograms.html", "w", encoding="utf-8") as file:
    file.write(html)
```

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by**
    - GENCI-IDRIS (Grant 2023-AD011013625R1)
    - PROPICTO ANR-20-CE93-0005
- **Authors**
    - Cécile Macaire
    - Chloé Dion
    - Emmanuelle Esperança-Rodier
    - Benjamin Lecouteux
    - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}
```