The LVDR Benchmark (Long Video Description Ranking)

This benchmark was proposed in VideoCLIP-XL. For each video and its ground-truth description, we run a synthesis process for p − 1 iterations, altering q words per iteration to introduce hallucinations. This yields p descriptions in total, with gradually increasing degrees of hallucination. We denote such a subset as p × q and construct five subsets: {4 × 1, 4 × 2, 4 × 3, 4 × 4, 4 × 5}. Given the video, a video CLIP model should rank these descriptions correctly in descending order of similarity.
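The ranking requirement can be illustrated with a minimal sketch. It assumes the p similarity scores are listed from the ground-truth description to the most hallucinated one; that ordering is an assumption, not something this card states. The function names and the pairwise fraction are hypothetical illustrations and are not necessarily the metric reported in the paper.

```python
from itertools import combinations


def is_correctly_ranked(similarities):
    """Return True if the scores strictly decrease from the first
    (assumed ground-truth) description to the last (assumed most
    hallucinated) one."""
    return all(a > b for a, b in zip(similarities, similarities[1:]))


def pairwise_order_fraction(similarities):
    """Fraction of description pairs (i, j) with i < j for which the
    less hallucinated description i scores higher than description j.
    Illustrative only; not claimed to be the paper's official metric."""
    pairs = list(combinations(range(len(similarities)), 2))
    correct = sum(similarities[i] > similarities[j] for i, j in pairs)
    return correct / len(pairs)


# Hypothetical video-text similarity scores for one video with p = 4.
scores = [0.9, 0.5, 0.7, 0.2]
print(is_correctly_ranked(scores))      # the 2nd and 3rd scores are swapped
print(pairwise_order_fraction(scores))  # 5 of the 6 pairs are ordered correctly
```

The strict check is all-or-nothing per video, while the pairwise fraction gives partial credit when only some descriptions are out of order.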

Format

{
  "long_captions": [
    "...",
  ],
  "video_id": "..."
}
{
  .....
},
.....
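A single record in the format above could be parsed and checked roughly as follows. This is a sketch under stated assumptions: it handles one JSON object at a time and makes no claim about how the records are laid out in the released files. The helper name `parse_record` and the sample values are hypothetical.

```python
import json


def parse_record(text):
    """Parse one record and return (video_id, long_captions).

    Only the two keys shown in the format above are assumed to exist:
    "long_captions" (a list of strings) and "video_id" (a string).
    """
    record = json.loads(text)
    captions = record["long_captions"]
    video_id = record["video_id"]
    if not isinstance(captions, list) or not all(isinstance(c, str) for c in captions):
        raise ValueError("long_captions must be a list of strings")
    if not isinstance(video_id, str):
        raise ValueError("video_id must be a string")
    return video_id, captions


# Hypothetical record; the real captions and IDs come from the dataset files.
sample = '{"long_captions": ["caption a", "caption b"], "video_id": "example_id"}'
video_id, captions = parse_record(sample)
print(video_id, len(captions))
```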

Source

@misc{wang2024videoclipxladvancinglongdescription,
      title={VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models}, 
      author={Jiapeng Wang and Chengyu Wang and Kunzhe Huang and Jun Huang and Lianwen Jin},
      year={2024},
      eprint={2410.00741},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00741}, 
}