James Le

khanhnamle1994
AI & ML interests

Multimodal AI, Video Understanding

Organizations

Twelve Labs

khanhnamle1994's activity

reacted to merve's post with 👍 4 months ago
NVIDIA just dropped NVEagle 🦅

A super impressive vision language model that comes in three variants: 7B, 13B, and a 13B fine-tuned for chat 💬
Model repositories: merve/nveagle-66d0705108582d73bb235c26
Try it: NVEagle/Eagle-X5-13B-Chat 💬 (works very well! 🤯)

This model essentially explores using different experts (MoE) for the image encoder part of a vision language model.
How? 🧐
The authors concatenate the output tokens of the vision encoders, and they apply "pre-alignment", which essentially fine-tunes the experts with a frozen text encoder.

Then they freeze both the experts and the decoder and train only the projection layer. Finally, they unfreeze everything for supervised fine-tuning ✨
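
The staged recipe above can be sketched as a small freeze schedule. This is only an illustration, not the authors' code: the component names (`vision_experts`, `projection`, `language_model`) and stage names are hypothetical labels, and it assumes that only the experts are trainable during pre-alignment, since the post does not say whether anything else is updated in that stage.

```python
# Hypothetical component names; the post only mentions "experts",
# a "decoder"/text side, and a "projection layer".
COMPONENTS = ("vision_experts", "projection", "language_model")

# Which components receive gradient updates in each stage (assumed from the post).
STAGES = {
    "pre_alignment": {"vision_experts"},       # text side stays frozen
    "projector_training": {"projection"},      # experts and decoder frozen
    "supervised_fine_tuning": set(COMPONENTS), # everything unfrozen
}


def trainable_flags(stage):
    """Return {component: is_trainable} for the given training stage."""
    trainable = STAGES[stage]
    return {name: name in trainable for name in COMPONENTS}


for stage in STAGES:
    print(stage, trainable_flags(stage))
```

In a real training loop, flags like these would typically be applied by toggling gradient tracking on each component's parameters before the stage starts.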

In the paper, they explore different fusion strategies and vision encoders beyond the basic CLIP encoder, and find that simply concatenating the visual tokens works well.
The rest of the architecture is quite similar to LLaVA (see the architecture below).
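
To make the "just concatenate the visual tokens" idea concrete, here is a minimal pure-Python sketch. It is a toy under stated assumptions: the post does not say along which axis the tokens are joined, so this assumes every expert emits the same number of tokens and that their features are joined per token along the feature axis. The function name `concat_expert_tokens` and the sample values are made up for illustration.

```python
def concat_expert_tokens(expert_outputs):
    """Fuse per-expert token features by joining them along the feature axis.

    expert_outputs: list of experts, each a list of tokens,
    each token a list of floats (feature widths may differ per expert).
    """
    num_tokens = len(expert_outputs[0])
    if any(len(tokens) != num_tokens for tokens in expert_outputs):
        raise ValueError("all experts must produce the same number of tokens")

    fused = []
    for i in range(num_tokens):
        features = []
        for tokens in expert_outputs:
            features.extend(tokens[i])  # append this expert's features for token i
        fused.append(features)
    return fused


# Two toy "experts": 2 tokens each, with feature widths 2 and 3.
expert_a = [[0.1, 0.2], [0.3, 0.4]]
expert_b = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

fused = concat_expert_tokens([expert_a, expert_b])
print(len(fused), len(fused[0]))  # token count is unchanged, widths add up
```

The fused features would then go through a projection layer into the language model's embedding space, as in LLaVA-style models.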