flow2023's Collections
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (Paper • 2312.16862 • Published • 28)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (Paper • 2312.17172 • Published • 25)
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers (Paper • 2401.01974 • Published • 4)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations (Paper • 2401.01885 • Published • 26)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (Paper • 2401.01335 • Published • 61)
- Improving Text Embeddings with Large Language Models (Paper • 2401.00368 • Published • 77)
- Distilling Vision-Language Models on Millions of Videos (Paper • 2401.06129 • Published • 13)
- Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk (Paper • 2401.05033 • Published • 14)
- LEGO: Language Enhanced Multi-modal Grounding Model (Paper • 2401.06071 • Published • 10)
- Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding (Paper • 2401.04575 • Published • 14)
- Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers (Paper • 2401.04695 • Published • 8)
- Mixtral of Experts (Paper • 2401.04088 • Published • 154)
- Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively (Paper • 2401.02955 • Published • 16)
- Understanding LLMs: A Comprehensive Overview from Training to Inference (Paper • 2401.02038 • Published • 60)
- Can Large Language Models Understand Context? (Paper • 2402.00858 • Published • 20)
- StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis (Paper • 2401.17093 • Published • 18)
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Paper • 2401.16420 • Published • 54)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (Paper • 2401.15947 • Published • 47)
- Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization (Paper • 2401.15914 • Published • 7)
- MM-LLMs: Recent Advances in MultiModal Large Language Models (Paper • 2401.13601 • Published • 41)
- Small Language Model Meets with Reinforced Vision Vocabulary (Paper • 2401.12503 • Published • 31)
- Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment (Paper • 2401.12474 • Published • 33)
- Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text (Paper • 2401.12070 • Published • 42)
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities (Paper • 2401.12168 • Published • 22)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Paper • 2403.09611 • Published • 123)
- DeepSeek-VL: Towards Real-World Vision-Language Understanding (Paper • 2403.05525 • Published • 39)
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (Paper • 2403.04132 • Published • 38)
- FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models (Paper • 2402.10986 • Published • 74)
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models (Paper • 2402.10644 • Published • 74)
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents (Paper • 2402.01622 • Published • 31)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (Paper • 2403.15042 • Published • 24)
- When Do We Not Need Larger Vision Models? (Paper • 2403.13043 • Published • 24)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Paper • 2404.07972 • Published • 41)
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Paper • 2404.07973 • Published • 28)
- BRAVE: Broadening the visual encoding of vision-language models (Paper • 2404.07204 • Published • 14)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (Paper • 2404.14396 • Published • 17)
- PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation (Paper • 2404.13026 • Published • 21)
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (Paper • 2404.12387 • Published • 36)
- BLINK: Multimodal Large Language Models Can See but Not Perceive (Paper • 2404.12390 • Published • 23)
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation (Paper • 2404.19752 • Published • 20)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension (Paper • 2404.16790 • Published • 7)
- Many-Shot In-Context Learning in Multimodal Foundation Models (Paper • 2405.09798 • Published • 25)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (Paper • 2406.04325 • Published • 69)
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (Paper • 2406.09403 • Published • 17)
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning (Paper • 2406.06469 • Published • 22)
- Mixture-of-Agents Enhances Large Language Model Capabilities (Paper • 2406.04692 • Published • 49)
- GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities (Paper • 2406.11768 • Published • 19)