This model essentially explores having different experts (MoE) for image encoder part of vision language model. How? ๐ง The authors concatenate the vision encoder output tokens together, and they apply "pre-alignment" essentially fine-tune experts with frozen text encoder.
Then they freeze both experts and the decoder and just train the projection layer, and finally, they unfreeze everything for supervised fine-tuning โจ
In the paper, they explore different fusion strategies and vision encoders, extending basic CLIP encoder, and figure out simply concatenating visual tokens works well. Rest of the architecture is quite similar to LLaVA. (see below the architecture)