mfarre's picture
mfarre HF staff
fix task tag (#2)
39e81ba verified
---
datasets:
- tomg-group-umd/cinepile
language:
- en
license: apache-2.0
library_name: peft
pipeline_tag: video-text-to-text
---
# Model Card for Video-LLaVA - CinePile fine tune
![CinePile Benchmark - Open vs Closed models](https://huggingface.co/mfarre/Video-LLaVA-7B-hf-CinePile/resolve/main/benchmark.png)
Video multimodal research often emphasizes activity recognition and object-centered tasks, such as determining "what is the person wearing a red hat doing?" However, this focus overlooks areas like theme exploration, narrative and plot analysis, and character and relationship dynamics.
CinePile addresses these areas in their benchmark and they find that Large Language Models significantly lag behind human performance in these tasks. Additionally, there is a notable disparity in performance between open and closed models.
In our initial fine-tuning, our goal was to assess how well open models can approach the performance of closed models. By fine-tuning Video LlaVa, we achieved performance levels comparable to those of Claude 3 (Opus).
## Results
| Model | Average | Character and relationship dynamics | Narrative and Plot Analysis | Setting and Technical Analysis | Temporal | Theme Exploration |
|--------------------------------|---------|-------------------------------------|-----------------------------|--------------------------------|----------|-------------------|
| Human | 73.21 | 82.92 | 75 | 73 | 75.52 | 64.93 |
| Human (authors) | 86 | 92 | 87.5 | 71.2 | 100 | 75 |
| GPT-4o | 59.72 | 64.36 | 74.08 | 54.77 | 44.91 | 67.89 |
| GPT-4 Vision | 58.81 | 63.73 | 73.43 | 52.55 | 46.22 | 65.79 |
| Gemini 1.5 Pro | 61.36 | 65.17 | 71.01 | 59.57 | 46.75 | 63.27 |
| Gemini 1.5 Flash | 57.52 | 61.91 | 69.15 | 54.86 | 41.34 | 61.22 |
| Gemini Pro Vision | 50.64 | 54.16 | 65.5 | 46.97 | 35.8 | 58.82 |
| Claude 3 (Opus) | 45.6 | 48.89 | 57.88 | 40.73 | 37.65 | 47.89 |
| **Video LlaVa - this fine-tune** | **44.16** | **45.26** | **45.14** | **46.93** | **32.55** | **49.47** |
| Video LLaVa | 22.51 | 23.11 | 25.92 | 20.69 | 22.38 | 22.63 |
| mPLUG-Owl | 10.57 | 10.65 | 11.04 | 9.18 | 11.89 | 15.05 |
| Video-ChatGPT | 14.55 | 16.02 | 14.83 | 15.54 | 6.88 | 18.86 |
| MovieChat | 4.61 | 4.95 | 4.29 | 5.23 | 2.48 | 4.21 |
Fine-tuned model taking as starting point [Video-LlaVA](https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf) to evaluate its performance on CinePile.
## Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [Github](https://github.com/mfarre/Video-LLaVA-7B-hf-CinePile) with fine-tunning and inference notebook.
## Uses
Although the model can answer questions based on the content, it is specifically optimized for addressing CinePile-related queries.
When the questions do not follow a CinePile-specific prompt, the inference section of the notebook is designed to refine and clean up the text produced by the model.