Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? Paper • 2406.07546 • Published Jun 11, 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos Paper • 2406.08407 • Published Jun 12, 2024
Learning Concise and Descriptive Attributes for Visual Recognition Paper • 2308.03685 • Published Aug 7, 2023
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction Paper • 2305.13903 • Published May 23, 2023
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings Paper • 2305.02317 • Published May 3, 2023
Multimodal Procedural Planning via Dual Text-Image Prompting Paper • 2305.01795 • Published May 2, 2023
WikiWhy: Answering and Explaining Cause-and-Effect Questions Paper • 2210.12152 • Published Oct 21, 2022
ImagenHub: Standardizing the evaluation of conditional image generation models Paper • 2310.01596 • Published Oct 2, 2023
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis Paper • 2210.05035 • Published Oct 10, 2022
VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following Paper • 2311.17647 • Published Nov 29, 2023
Language Control Diffusion: Efficiently Scaling through Space, Time, and Tasks Paper • 2210.15629 • Published Oct 27, 2022
Neuro-Symbolic Procedural Planning with Commonsense Prompting Paper • 2206.02928 • Published Jun 6, 2022