Model Type,Model,Language Model,Model Size,Evaluation Method,Avg. All,Avg. Img,Avg. Video,Scene Understanding,Instance Identity,Instance Attribute,Instance Location,Instance Counting,Spatial Relation,Instance Interaction,Visual Reasoning,Text Recognition,Action Recognition,Action Prediction,Procedure Understanding
LLM,[Flan-T5](https://huggingface.co/google/flan-t5-xl),Flan-T5-XL,3B,PPL,26.9,26.6,27.8,23,29,32.8,31.8,20.5,31.8,33,18.2,19.4,23.2,34.9,25.4
LLM,[Vicuna](https://huggingface.co/lmsys/vicuna-7b-v1.3),Vicuna-7B,7B,PPL,26.8,26.2,28.5,23.4,30.7,29.7,30.9,30.8,28.6,29.8,18.5,13.4,27.3,34.5,23.8
LLM,[LLaMA](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/),LLaMA-7B,7B,PPL,25.8,25.3,27.4,26.3,27.4,26.2,28.3,25.1,28.8,19.2,37,9,33,23.1,26.2
ImageLLM,[BLIP-2](https://github.com/salesforce/LAVIS),Flan-T5-XL,3B,PPL,43.0,45.7,34.7,59.1,53.9,49.2,42.3,43.2,36.7,55.7,45.6,25.9,32.6,47.5,24
ImageLLM,[InstructBLIP](https://github.com/salesforce/LAVIS),Flan-T5-XL,3B,PPL,46.1,49.3,36.4,60.3,58.5,63.4,40.6,58.4,38.7,51.6,45.9,25.9,33.1,49.1,27.1
ImageLLM,[InstructBLIP-Vicuna](https://github.com/salesforce/LAVIS),Vicuna-7B,7B,PPL,48.1,52.2,35.7,60.2,58.9,65.6,43.6,57.2,40.3,52.6,47.7,43.5,34.5,49.6,23.1
ImageLLM,[LLaVA-1.5](https://github.com/haotian-liu/LLaVA),Vicuna-13B,13B,Generate,60.7,66.9,42.2,74.9,71.3,68.9,63.5,61.3,51.4,73.2,77,60.5,48.9,41.1,36.6
ImageLLM,[LLaVA-v1.5-13B-LoRA](https://llava-vl.github.io),Vicuna-13B-v1.5,13B,PPL,57.4,63.3,39.7,74.9,70.9,70.1,62.5,60.6,52.4,74.2,77.3,26.7,47.5,36,35.7
ImageLLM,[LLaVA-v1.5-LoRA](https://llava-vl.github.io),Vicuna-13B-v1.5,13B,PPL for A/B/C/D,58.3,64.8,38.6,75.2,71.4,72,62.7,59.8,51.1,71.1,80.7,39.5,46.1,37.1,32.6
ImageLLM,[MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4),Vicuna-7B,7B,PPL,39.4,42.6,29.9,56.3,49.2,45.8,37.9,45.3,32.6,47.4,57.1,11.8,38.2,24.5,27.1
ImageLLM,[VPGTrans](https://github.com/VPGTrans/VPGTrans),LLaMA-7B,7B,PPL,37.8,39.8,31.9,51.9,44.1,39.9,36.1,33.7,36.4,32,53.2,30.6,39.5,24.3,31.9
ImageLLM,[MultiModal-GPT](https://github.com/open-mmlab/Multimodal-GPT),LLaMA-7B,7B,PPL,32.3,33.5,28.9,43.6,37.9,31.5,30.8,27.3,30.1,29.9,51.4,18.8,36.9,25.8,24
ImageLLM,[Otter](https://github.com/Luodian/Otter),LLaMA-7B,7B,PPL,34.2,35.5,30.0,44.9,38.6,32.2,30.9,26.3,31.8,32,51.4,31.8,37.9,27.2,24.8
ImageLLM,[Otter](https://github.com/Luodian/Otter),MPT-7B,7B,PPL,37.6,40.1,29.9,51.3,43.5,42.3,34.2,38.4,30.9,40.2,55.3,24.7,36.8,29.2,23.8
ImageLLM,[OpenFlamingo](https://github.com/mlfoundations/open_flamingo),LLaMA-7B,7B,PPL,32.4,33.5,28.9,43.9,38.1,31.3,30.1,27.3,30.6,29.9,50.2,20,37.2,25.4,24.2
ImageLLM,[OpenFlamingo](https://github.com/mlfoundations/open_flamingo),MPT-7B,7B,PPL,38.3,39.4,34.8,53.2,45.3,40,31.2,39.3,32.6,36.1,51.4,25.9,42.9,34.7,26.9
ImageLLM,[LLaMA-AdapterV2](https://github.com/OpenGVLab/LLaMA-Adapter),LLaMA-7B,7B,PPL,33.7,36.3,25.6,45.2,38.5,29.3,33,29.7,35.5,39.2,52,24.7,38.6,18.5,19.6
ImageLLM,[GVT](https://github.com/TencentARC/GVT),Vicuna-7B,7B,PPL,33.3,35.2,27.4,41.7,35.5,31.8,29.5,36.2,32,32,51.1,27.1,33.9,25.4,23
ImageLLM,[mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl),LLaMA-7B,7B,PPL,35.3,39.1,23.7,49.7,45.3,32.5,36.7,27.3,32.7,44.3,54.7,28.8,26.7,17.9,26.5
ImageLLM,[Kosmos-2](https://github.com/microsoft/unilm/tree/master/kosmos-2),Decoder Only 1.3B,1.3B,PPL,46.1,49.4,36.2,63.4,57.1,58.5,44,41.4,37.9,55.7,60.7,25.9,41.3,40.4,27
ImageLLM,[Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat),Qwen-7B,7B,PPL for A/B/C/D,55.6,61.9,36.6,73.3,67.3,69.6,57.7,52.9,48.2,59.8,74.6,53.5,43.9,39.2,26.7
ImageLLM,[Qwen-VL](https://huggingface.co/Qwen/Qwen-VL),Qwen-7B,7B,PPL for A/B/C/D,54.3,59.6,38.4,71.2,66.4,67.7,53.5,44.8,43.8,62.9,74.9,51.2,44.7,38.5,32
ImageLLM,[Qwen-VL-plus](https://github.com/QwenLM/Qwen-VL/tree/master?tab=readme-ov-file#qwen-vl-plus),Qwen-LM,-,PPL for A/B/C/D,62.6,68.8,43.9,76.5,77.6,75.3,64.9,66.3,56.8,69.1,78.2,54.7,51.6,39.2,41
ImageLLM,[IDEFICS-9b-instruct](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct),LLaMA-7B,7B,NG,32.3,43.0,0.0,55.8,45.3,42.3,40.2,36.8,34.9,37.1,55.9,38.8,0,0,0
ImageLLM,[IDEFICS-80b-instruct](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct),LLaMA-65B,65B,NG,40.8,54.4,0.0,64,52.6,50.8,48.3,46.1,45.5,62.9,68,51.8,0,0,0
ImageLLM,[InternLM-XComposer-VL](https://github.com/InternLM/InternLM-XComposer),InternLM-7B,7B,PPL,48.9,65.2,0.0,75,71.7,67.6,60.8,56.2,55.3,74.4,77,48.5,0,0,0
ImageLLM,[InternLM-XComposer2-VL-7B](https://github.com/InternLM/InternLM-XComposer),InternLM2,7B,Generate,0,74.4,0,79.3,78.4,77.8,72.7,70,63.5,79.4,80.1,68.6,0,0,0
ImageLLM,[InternVL-Chat-V1.2-Plus](https://github.com/OpenGVLab/InternVL),Nous-Hermes-2-Yi-34B,34B,Generate,66.3,72.4,47.8,80.2,80,77.8,71.3,72.3,63.3,77.3,79.8,50,49.4,41.8,52.2
ImageLLM,[SEED-LLaMA](https://github.com/AILab-CVC/SEED),LLaMA2-Chat-13B,13B,PPL,46.9,51.0,34.4,64.1,54.2,54.1,46.5,45.3,38.2,51.6,60.7,44.7,37.8,45.3,20
ImageLLM,[mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl),LLaMA-7B,7B,NG,55.1,60.4,39.2,72.7,67.6,63.6,53.6,58.5,50.8,70.1,76.4,30.2,46,38.7,32.9
ImageLLM,[LLaMA-VID-7B](https://github.com/dvlab-research/LLaMA-VID),LLaMA-7B,7B,Generate,58.5,65.4,37.9,75.4,71.2,68.9,62.9,58.4,50.7,70.1,76.1,54.7,42.8,35.2,35.6
ImageLLM,[Pink-LLaMA2](https://github.com/SY-Xuan/Pink/stargazers),LLaMA2-7B,7B,NG,48.0,64.0,0.0,75.2,70.1,70.1,63.3,53.8,50.2,69.1,74.3,50,0,0,0
ImageLLM,[InfMLLM-13B](https://github.com/mightyzau/InfMLLM),Vicuna-13B,13B,Generate,59.4,65.5,40.8,75.5,73,70.4,66.2,63.3,54.2,72.2,77.9,37.2,49.5,39,33.9
ImageLLM,[ShareGPT4V-7B](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V),Vicuna-7B,7B,Generate,50.2,67.0,0.0,75.3,71.4,72.3,63.1,62,53.9,70.1,79.8,54.7,0,0,0
ImageLLM,[ShareGPT4V-13B](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V),Vicuna-13B,13B,Generate,50.6,67.4,0.0,75.9,74.1,73.5,66.8,62.4,54.8,75.3,77.3,46.5,0,0,0
ImageLLM,[Honeybee-13B](https://github.com/kakaobrain/honeybee),Vicuna-13B,13B,Generate,49.1,65.5,0.0,75.4,72.8,69,64.5,60.6,55.1,72.2,77.9,41.9,0,0,0
ImageLLM,[SPHINXv1-1k](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX),LLaMA-2-13B,13B,Generate,59.4,67.9,33.7,75.4,72.2,75.1,64.2,68.2,49.3,66,78.6,62.4,41.2,33.9,26.1
ImageLLM,[SPHINXv2-1k](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX),LLaMA-2-13B,13B,Generate,64.2,72.7,38.6,77.7,77.4,76.8,69.4,71.2,59.4,70.1,78.3,74.1,48.1,37.9,29.9
ImageLLM,[GPT-4V](https://openai.com/research/gpt-4v-system-card),\-,-,Generate,65.7,67.5,60.3,77.5,73.9,70.6,61.8,56.8,56.9,74.2,78.5,57.6,65.7,51.7,63.4
VideoLLM,[VideoChat](https://github.com/OpenGVLab/Ask-Anything),Vicuna-7B,7B,PPL,36.9,38.2,32.9,47.1,43.8,34.9,40,32.8,34.6,42.3,50.5,17.7,34.9,36.4,27.3
VideoLLM,[Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT),LLaMA-7B,7B,PPL,29.8,31.9,23.3,37.2,31.4,33.2,28.4,35.5,29.5,23.7,42.3,25.9,27.6,21.3,21.1
VideoLLM,[Valley](https://github.com/RupertLuo/Valley),LLaMA-13B,13B,PPL,28.7,29.9,25.1,39.3,32.9,31.6,27.9,24.2,30.1,27.8,43.8,11.8,31.3,23.2,20.7
Other,[Unified-IO-2 7B (2.5M)](https://unified-io-2.allenai.org),from scratch,7B,PPL,57.6,61.8,44.9,70.7,69,67.4,55.4,62.6,45.5,60.8,67.1,58.1,57.5,43.2,34
Other,[Unified-IO-2 7B](https://unified-io-2.allenai.org),from scratch,7B,PPL,57.8,62.0,44.9,71.3,68.8,67.5,55.5,61.2,45.4,62.9,66.5,59.3,58,42.7,34
Other,[Unified-IO-2 3B (3M)](https://unified-io-2.allenai.org),from scratch,3B,PPL,54.5,57.9,44.4,69,66.6,66.5,54.3,62,42.3,50.5,65.3,44.2,57.5,36.2,39.4
Other,[Unified-IO-2 3B](https://unified-io-2.allenai.org),from scratch,3B,PPL,54.4,57.8,44.2,68.8,65.8,67.2,52.9,60.4,43.1,55.7,64,41.9,57.5,36,39
Other,[Unified-IO-2 1B](https://unified-io-2.allenai.org),from scratch,1B,PPL,46.8,51.4,33.0,63.8,57.7,54.6,41.9,53.7,33.3,51.5,58.3,47.7,39.8,34.5,24.6