arxiv:2411.10867

ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Published on Nov 16 · Submitted by amanchadha on Nov 21

Authors:

Aman Chadha ,

Abstract

Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.

Community

amanchadha

Paper author Paper submitter about 10 hours ago

The paper introduces ViBe, a comprehensive benchmark and dataset for analyzing and categorizing hallucinations in text-to-video (T2V) generation models, aiming to enhance their reliability and alignment with input prompts.
Novel Dataset and Benchmark: ViBe is a large-scale dataset featuring 3,782 human-annotated videos from T2V models, categorized into five types of hallucinations, including physical incongruities and temporal inconsistencies.
Evaluation Framework: The paper establishes classification as a baseline for detecting hallucinations and evaluates several ensemble classifier configurations, with the TimeSFormer + CNN combination achieving the best accuracy (0.345) and F1 score (0.342).
Future Directions: ViBe provides a foundation for improving hallucination detection in T2V models, highlighting areas such as multi-hallucination detection and mitigating annotation subjectivity.
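
To make the two reported metrics concrete, here is a minimal sketch of how accuracy and an F1 score could be computed for a five-way hallucination classifier. It is not the authors' evaluation code. The category names come from the abstract; the function names, the toy labels, and the choice of macro averaging for F1 are assumptions, since the page does not say which averaging the paper uses.

```python
from collections import Counter

# The five hallucination categories named in the abstract.
CATEGORIES = [
    "Vanishing Subject",
    "Numeric Variability",
    "Temporal Dysmorphia",
    "Omission Error",
    "Physical Incongruity",
]


def accuracy(y_true, y_pred):
    """Fraction of videos whose predicted category matches the human label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=CATEGORIES):
    """Unweighted mean of per-category F1 (macro averaging is an assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the human label was something else
            fn[t] += 1  # human label t was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A category with no true or predicted samples contributes 0.0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels for six videos, for illustration only.
    truth = [CATEGORIES[0], CATEGORIES[1], CATEGORIES[2],
             CATEGORIES[3], CATEGORIES[4], CATEGORIES[0]]
    preds = [CATEGORIES[0], CATEGORIES[2], CATEGORIES[2],
             CATEGORIES[4], CATEGORIES[4], CATEGORIES[1]]
    print("accuracy:", accuracy(truth, preds))
    print("macro F1:", macro_f1(truth, preds))
```

With one label per video, accuracy counts exact category matches, while macro F1 weights each of the five categories equally regardless of how many videos fall into it.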
