ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions Paper • 2303.06594 • Published Mar 12, 2023
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Paper • 2304.10592 • Published Apr 20, 2023
StoryGPT-V: Large Language Models as Consistent Story Visualizers Paper • 2312.02252 • Published Dec 4, 2023 • 1
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations Paper • 2308.16349 • Published Aug 30, 2023
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4 • 25
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos Paper • 2407.12679 • Published Jul 17 • 7
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published Oct 22 • 24
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published Oct 22 • 24
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published Oct 22 • 24 • 2
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling Paper • 2408.03695 • Published Aug 7 • 12
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning Paper • 2310.09478 • Published Oct 14, 2023 • 19
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning Paper • 2307.04725 • Published Jul 10, 2023 • 64
TokenFlow: Consistent Diffusion Features for Consistent Video Editing Paper • 2307.10373 • Published Jul 19, 2023 • 56