	---
	license: apache-2.0
	language:
	- fr
	library_name: transformers
	tags:
	- nllb
	- commonvoice
	- pytorch
	- pictograms
	- translation
	metrics:
	- bleu
	inference: false
	---

	# t2p-nllb-200-distilled-600M-commonvoice

	t2p-nllb-200-distilled-600M-commonvoice is a text-to-pictograms translation model built by fine-tuning the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
	The model is intended for inference only.

	## Training details

	### Datasets

	The model was fine-tuned on the [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CommonVoice v.15.0 corpus.
	This dataset was built with the method presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), published at LREC-Coling 2024. The dataset was split into training, validation, and test sets.

	| Split | Number of utterances |
	|:-----------:|:-----------------------:|
	| train | 527,390 |
	| valid | 16,124 |
	| test | 16,120 |

	### Parameters

	A full list of the parameters is available in the config.json file. The training pipeline used the following arguments:

	```python
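	from transformers import Seq2SeqTrainingArguments

	training_args = Seq2SeqTrainingArguments(
	    output_dir="checkpoints_commonvoice/",
	    evaluation_strategy="epoch",   # evaluate at the end of every epoch
	    save_strategy="epoch",         # checkpoint at the end of every epoch
	    learning_rate=2e-5,
	    per_device_train_batch_size=32,
	    per_device_eval_batch_size=32,
	    weight_decay=0.01,
	    save_total_limit=3,            # keep at most three checkpoints on disk
	    num_train_epochs=40,
	    predict_with_generate=True,    # run generate() during evaluation
	    fp16=True,
	    load_best_model_at_end=True,
	)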
	```
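For a rough sense of the training length, the batch size and epoch count above can be combined with the training-set size from the Datasets section. The sketch below assumes a single GPU (as reported under Environmental Impact) and no gradient accumulation (the argument is not set above, so the library default of 1 is assumed); the variable names are illustrative only.

```python
import math

train_utterances = 527_390  # size of the train split (Datasets table)
per_device_batch = 32       # per_device_train_batch_size
num_gpus = 1                # assumption: one V100, no data parallelism
grad_accumulation = 1       # assumption: library default, not set in the card
num_epochs = 40             # num_train_epochs

effective_batch = per_device_batch * num_gpus * grad_accumulation
# The last, partially filled batch still counts as one optimizer step
steps_per_epoch = math.ceil(train_utterances / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps over {num_epochs} epochs")
```

Under these assumptions, one epoch corresponds to 16,481 optimizer steps and the full run to 659,240 steps.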

	### Evaluation

	The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), where we compared the reference pictogram translation with the model hypothesis.

	### Results

	Comparison with other translation models (sacreBLEU scores on the validation and test sets):

	| Model | validation | test |
	|:-----------:|:-----------------------:|:-----------------------:|
	| t2p-t5-large-commonvoice | 86.3 | 86.5 |
	| t2p-nmt-commonvoice | 86.0 | 82.6 |
	| t2p-mbart-large-cc25-commonvoice | 72.3 | 72.3 |
	| t2p-nllb-200-distilled-600M-commonvoice | 87.4 | 87.6 |

	### Environmental Impact

	Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took around 30 hours in total.

	## Using the t2p-nllb-200-distilled-600M-commonvoice model with Hugging Face transformers

	```python
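	import torch
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	source_lang = "fr"
	target_lang = "frp"
	max_input_length = 128
	max_target_length = 128

	# Use the GPU when one is available, otherwise fall back to the CPU
	device = "cuda:0" if torch.cuda.is_available() else "cpu"

	tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-commonvoice")
	model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-commonvoice")
	model = model.to(device)  # the model and its inputs must be on the same device

	inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids.to(device)
	# do_sample=True enables sampling, so repeated calls may return different outputs
	outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
	pred = tokenizer.decode(outputs[0], skip_special_tokens=True)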
	```

	## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

	```python
	import pandas as pd

	def process_output_trad(pred):
	    # Split the decoded prediction into individual pictogram tokens
	    return pred.split()

	def read_lexicon(lexicon):
	    # The lexicon is a tab-separated file with 'lemma' and 'id_picto' columns
	    df = pd.read_csv(lexicon, sep='\t')
	    # Drop the ' #...' suffix and join multi-word lemmas with underscores
	    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
	    return df

	def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
	    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
	    # Use the first matching ID, or 0 when the lemma is not in the lexicon
	    return (id_picto[0], lemma) if id_picto else (0, lemma)

	lexicon = read_lexicon("lexicon.csv")
	sentence_to_map = process_output_trad(pred)
	pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
	```
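The lookup above filters the whole DataFrame once per predicted token. As an alternative, here is a minimal sketch that builds a dictionary once and then resolves each token with a single dictionary access. It assumes the same tab-separated layout with `lemma` and `id_picto` columns; `build_lookup` is a hypothetical helper, and the sample rows and pictogram IDs are placeholders rather than real ARASAAC entries.

```python
import csv
import io

# Hypothetical lexicon excerpt; the IDs are placeholders, not real ARASAAC IDs
SAMPLE_LEXICON = (
    "lemma\tid_picto\n"
    "manger #1\t101\n"
    "manger #2\t999\n"
    "pomme\t102\n"
    "brosse à dents\t103\n"
)

def build_lookup(tsv_file):
    """Map each normalised keyword to the first pictogram ID listed for it."""
    lookup = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Same normalisation as read_lexicon: drop the ' #...' suffix,
        # then replace spaces with underscores
        keyword = row["lemma"].split(" #")[0].strip().replace(" ", "_")
        # setdefault keeps the first ID seen, mirroring id_picto[0] above
        lookup.setdefault(keyword, int(row["id_picto"]))
    return lookup

lookup = build_lookup(io.StringIO(SAMPLE_LEXICON))
tokens = "manger pomme brosse_à_dents licorne".split()
# Unknown tokens fall back to ID 0, as in get_id_picto_from_predicted_lemma
mapped = [(lookup.get(token, 0), token) for token in tokens]
print(mapped)
```

With a real lexicon, `io.StringIO(SAMPLE_LEXICON)` would be replaced by an open file handle on the lexicon file.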

	## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

	```python
	def generate_html(ids):
	    html_content = '<html><body>'
	    for picto_id, lemma in ids:
	        if picto_id != 0:  # ignore invalid IDs
	            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
	            html_content += f'''
	            <figure style="display:inline-block; margin:1px;">
	                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
	                <figcaption>{lemma}</figcaption>
	            </figure>
	            '''
	    html_content += '</body></html>'
	    return html_content

	html = generate_html(pictogram_ids)
	# Write as UTF-8 so accented French lemmas are preserved
	with open("pictograms.html", "w", encoding="utf-8") as file:
	    file.write(html)
	```
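The function above inserts each lemma into the markup as-is. If a lemma could contain characters that are special in HTML (such as `&`, `<` or quotes), one option is to escape it first. The sketch below shows this for a single figure; `render_figure` is a hypothetical helper, and the pictogram ID and lemma are placeholders chosen for illustration.

```python
from html import escape

def render_figure(picto_id, lemma):
    """Return one <figure> element for a pictogram, with the lemma HTML-escaped."""
    img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
    # quote=True also escapes quotes, which matters inside the alt attribute
    safe_lemma = escape(lemma, quote=True)
    return (
        '<figure style="display:inline-block; margin:1px;">'
        f'<img src="{img_url}" alt="{safe_lemma}" width="200" height="200" />'
        f'<figcaption>{safe_lemma}</figcaption>'
        '</figure>'
    )

# Placeholder ID and lemma, used only to show the escaping
figure = render_figure(1234, 'pomme & "poire"')
print(figure)
```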

	## Information

	- Language(s): French
	- License: Apache-2.0
	- Developed by: Cécile Macaire
	- Funded by
	  - GENCI-IDRIS (Grant 2023-AD011013625R1)
	  - PROPICTO ANR-20-CE93-0005
	- Authors
	  - Cécile Macaire
	  - Chloé Dion
	  - Emmanuelle Esperança-Rodier
	  - Benjamin Lecouteux
	  - Didier Schwab


	## Citation

	If you use this model for your own research work, please cite as follows:

	```bibtex
	@inproceedings{macaire_jeptaln2024,
	  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
	  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
	  url = {https://inria.hal.science/hal-04623007},
	  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
	  address = {Toulouse, France},
	  publisher = {{ATALA \& AFPC}},
	  volume = {1 : articles longs et prises de position},
	  pages = {22-35},
	  year = {2024}
	}
	```