Over-generation issues

#5
by jurgiraud - opened

Hello,

I am trying to fine-tune the model for in-domain translation (I am working with a specialised scientific domain) with my own data.

I use the following call to format my data so that it matches the ChatML template provided in the model card:
formatted_chat = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)

My formatted data then looks like this, for example:
<|im_start|>user
Translate from English to French.
Source: Access to high-quality bioinformatics resources is essential for conducting meaningful genetic analyses.
Target: <|im_end|>
<|im_start|>assistant
L'accès à des ressources bioinformatiques de haute qualité est essentiel pour mener des analyses génétiques significatives.<|im_end|>
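For reference, here is a minimal sketch of how a conversation of this shape could be assembled before it is passed to apply_chat_template. The helper name build_conversation and its parameters are hypothetical, and the exact whitespace after "Target:" is an assumption based on the example above.

```python
def build_conversation(source, target, src_lang="English", tgt_lang="French"):
    """Return a two-turn chat as a list of role/content dicts.

    Hypothetical helper: the prompt wording mirrors the training example
    (Source:/Target: labels); adjust it if the real data differs.
    """
    user_content = (
        f"Translate from {src_lang} to {tgt_lang}.\n"
        f"Source: {source}\n"
        "Target: "
    )
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": target},
    ]


conversation = build_conversation(
    "Access to high-quality bioinformatics resources is essential "
    "for conducting meaningful genetic analyses.",
    "L'accès à des ressources bioinformatiques de haute qualité est "
    "essentiel pour mener des analyses génétiques significatives.",
)
```

Using one helper like this for both training and inference prompts would keep the instruction wording identical across the two stages.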

I then tokenize the data:
tokenizer(example['formatted_chat'], padding="max_length", truncation=True, max_length=162, add_special_tokens=False)
and return the resulting input_ids, attention_mask, and labels.
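To make the labels step concrete, here is a minimal sketch of one way labels could be derived from input_ids and attention_mask. It is an illustration under assumptions, not necessarily what my script does: build_labels is a hypothetical helper, and the token ids are made up. It masks padding via the attention mask rather than by comparing against the pad token id, so that a real end-of-turn token is still learned even when the pad token and the end-of-turn token share the same id.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(input_ids, attention_mask):
    """Copy input_ids into labels, ignoring only the padded positions."""
    return [
        token_id if mask == 1 else IGNORE_INDEX
        for token_id, mask in zip(input_ids, attention_mask)
    ]


# Hypothetical ids: 7 stands in for <|im_end|> and is also used as padding.
input_ids = [11, 12, 7, 7, 7]
attention_mask = [1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask)
```

Here the first 7 (the real end-of-turn token) stays in the labels, while the two padded positions are ignored by the loss.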

Fine-tuning goes well, with very good training and validation loss.
However, when I run inference with my fine-tuned model:
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=250, do_sample=False)
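One way to check whether the model ever emits the end-of-turn token is to look at the newly generated token ids (without the prompt) and cut them at the first occurrence of that id. The sketch below is illustrative only: truncate_at_end_of_turn is a hypothetical helper, and the ids are made up rather than taken from the real tokenizer.

```python
def truncate_at_end_of_turn(generated_ids, end_of_turn_id):
    """Return (ids before the first end-of-turn id, whether it was found)."""
    if end_of_turn_id in generated_ids:
        cut = generated_ids.index(end_of_turn_id)
        return generated_ids[:cut], True
    return list(generated_ids), False


# Hypothetical ids: 7 stands in for <|im_end|>.
trimmed, found = truncate_at_end_of_turn([21, 22, 23, 7, 31, 32], end_of_turn_id=7)
```

If found is False for most outputs, the model is probably not producing the end-of-turn token at all, which would point at the training data or labels rather than at the generation settings.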

I notice severe over-generation, e.g.:
<|im_start|>user
Translate from English to French.
English:The deletion of a gene may result in 'death' or in a block of 'cell division'.
French: <|im_end|>
<|im_start|>assistant
La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire". French: La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire".. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also looked at the effects of medication on the body’s ability to produce vitamin D. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also studied the effects of taking medications on the body.

What could be the issue?
Many thanks.
