How do you fine tune LLaVA NeXT?
Is there a way to fine tune LLaVA-NeXT?
cc @lewtun the TRL team is going to make it super easy to fine-tune models like these.
For now I'll refer you to my demo notebook, which includes a bunch of utilities from the original LLaVa repository.
Thanks Niels, this is great!
I assume the same approach also works for LLaVA-NeXT. Is that correct?
Nishant
Yes it should, although Llava-NeXT is a bit more complex compared to Llava in terms of image preprocessing. A PR to add batched generation (which should also solve training issues) is here: https://github.com/huggingface/transformers/pull/29850.
For now I'd recommend either Llava or Idefics2. Refer to my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_JSON_extraction_use_cases_(PyTorch_Lightning).ipynb (I have tested this with both models).
Hey @RaushanTurganbay, very cool! I was a little confused because the PR also says that it's fine-tunable, but only for cases without images. Also, if you are using llava-v1.6-mistral-7b-hf, shouldn't you be using the following prompt format: "[INST] <image>\nWhat is shown in this image? [/INST]", as described here: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next
Yes, that's right. LLaVa-NeXT does not have a chat template yet, which means that for now you need to manually make sure that the right format is used. Looks like @RaushanTurganbay might need to update that.
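Until a chat template is available, one way to keep the format consistent is to build the prompt string yourself. Below is a minimal sketch, assuming the llava-v1.6-mistral-7b-hf format from the linked docs with the `<image>` placeholder. The helper name `build_llava_next_mistral_prompt` is hypothetical (it is not part of transformers), and appending the target answer after `[/INST]` for training is an assumption, not something confirmed in this thread.

```python
from typing import Optional

# Placeholder the processor replaces with image features (per the linked docs).
IMAGE_TOKEN = "<image>"


def build_llava_next_mistral_prompt(question: str, answer: Optional[str] = None) -> str:
    """Hypothetical helper: format one turn for llava-v1.6-mistral-7b-hf.

    If `answer` is given, it is appended after [/INST] so the string can be
    used as a training example (assumption: plain concatenation, no EOS handling).
    """
    prompt = f"[INST] {IMAGE_TOKEN}\n{question.strip()} [/INST]"
    if answer is not None:
        prompt += f" {answer.strip()}"
    return prompt


# Inference-style prompt (no answer appended).
print(build_llava_next_mistral_prompt("What is shown in this image?"))
```

The resulting string would then be passed to the processor together with the image. Other LLaVA-NeXT checkpoints use different formats, so check the model docs before reusing this one.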
Okay, thanks for noting. I will change it in the notebook, and I will try to add chat templates to all Llava models.
Hi @nielsr, sorry, it's still not quite clear to me whether LLaVA-NeXT supports batched training with images. This PR said that support was only added for training without images: https://github.com/huggingface/transformers/pull/29850
I updated the comment in the PR to say "with and w/o images". The model should be tunable with images as well.