How do you fine-tune LLaVA-NeXT?

#5
by Nishgop - opened

Is there a way to fine-tune LLaVA-NeXT?

Llava Hugging Face org

cc @lewtun: the TRL team is going to make it super easy to fine-tune models like these.

For now I'll refer you to my demo notebook, which includes a bunch of utilities from the original LLaVa repository.

Thanks Niels, this is great!
I assume the same approach also works for LLaVA-NeXT. Is that correct?

Nishant

Llava Hugging Face org

Yes, it should, although Llava-NeXT is a bit more complex than Llava in terms of image preprocessing. A PR to add batched generation (which should also solve training issues) is here: https://github.com/huggingface/transformers/pull/29850.

For now I'd recommend either Llava or Idefics2. Refer to my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_JSON_extraction_use_cases_(PyTorch_Lightning).ipynb. I have tested this with both models.

Hi @nielsr, thanks for all the work! If I understand correctly, now that the PR you mentioned above has been merged, training should work properly for the LLaVA-NeXT (LLaMA 8B + 72B and 110B) models, and it already worked for LLaVA-1.6? Do you know of any example scripts or articles?

Llava Hugging Face org

Hi @lcolonn! Yes, the PR was merged and LLaVa-NeXT is tunable now. The fine-tuning script is almost the same as for LLaVa, with a few changes in the input arguments; you can find my adaptation of Niels' notebook here.
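One step both the LLaVa and LLaVa-NeXT fine-tuning scripts share is building the labels for the causal-LM loss: padded positions are set to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so padding does not contribute to the loss. A minimal plain-Python sketch (the helper name `mask_labels` is hypothetical; in practice you would derive the mask from the processor's attention mask rather than from the raw pad id):

```python
def mask_labels(input_ids, pad_token_id, ignore_index=-100):
    """Copy input_ids into labels, replacing padding with the ignore index.

    For causal-LM fine-tuning, the labels are the input_ids (the model shifts
    them internally); padded positions must be set to -100 so the
    cross-entropy loss skips them.
    """
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in input_ids
    ]

# Example: two sequences padded to length 4 with pad_token_id = 0.
batch = [[5, 6, 7, 0], [8, 9, 0, 0]]
labels = mask_labels(batch, pad_token_id=0)
# labels == [[5, 6, 7, -100], [8, 9, -100, -100]]
```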

Hey @RaushanTurganbay, very cool! I was a little confused, because the PR also says that the model is fine-tunable, but only for cases without images. Also, if you are using llava-v1.6-mistral-7b-hf, shouldn't you be using the following prompt format: "[INST] <image>\nWhat is shown in this image? [/INST]", as described here: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next

Llava Hugging Face org

Yes, that's right. LLaVa-NeXT does not have a chat template yet, which means that for now you need to manually make sure the right format is used. Looks like @RaushanTurganbay might need to update that.
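Until a chat template ships, the Mistral-style instruction format expected by llava-v1.6-mistral-7b-hf can be built by hand. A minimal sketch (the helper name `build_mistral_prompt` is hypothetical; `<image>` is the placeholder where the processor splices in the image features, and other Llava-NeXT checkpoints, e.g. the Vicuna-based ones, expect a different template):

```python
def build_mistral_prompt(question: str) -> str:
    # Manually reproduce the instruction format documented for
    # llava-v1.6-mistral-7b-hf: [INST] <image>\n<question> [/INST]
    return f"[INST] <image>\n{question} [/INST]"

prompt = build_mistral_prompt("What is shown in this image?")
# prompt == "[INST] <image>\nWhat is shown in this image? [/INST]"
```

The resulting string is then passed as the `text` argument to the processor together with the image.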

Llava Hugging Face org

Okay, thanks for noting this. I will change it in the notebook, and I will try to add chat templates to all Llava models.

Hi @nielsr, sorry, it's still not quite clear to me whether LLaVA-NeXT supports training with batched images. This PR said that only support for training without images was added: https://github.com/huggingface/transformers/pull/29850

Llava Hugging Face org

I updated the comment in the PR to say "with and without images". The model should be tunable with images as well.
