Files changed (1)
  1. README.md +46 -20
README.md CHANGED
@@ -3,15 +3,18 @@ library_name: transformers
3
  tags: []
4
  ---
5
 
6
- [📃Paper](https://arxiv.org/abs/2406.15252) | [🌐Website](https://tiger-ai-lab.github.io/VideoScore/) | [💻Github](https://github.com/TIGER-AI-Lab/VideoScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/VideoScore) | [🤗Demo](https://huggingface.co/spaces/TIGER-Lab/VideoScore) | [📉Wandb](https://api.wandb.ai/links/xuanhe/ptohlfcx)
7
 
8
 
9
  ![VideoScore](https://tiger-ai-lab.github.io/VideoScore/static/images/teaser.png)
10
 
11
  ## Introduction
12
- - VideoScore is a video quality evaluation model, taking [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as base-model
 
 
 
13
  and trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),
14
- a large video evaluation dataset with multi-aspect human scores.
15
 
16
  - VideoScore can reach 75+ Spearman correlation with humans on VideoEval-test, surpassing all the MLLM-prompting methods and feature-based metrics.
17
 
@@ -21,19 +24,31 @@ a large video evaluation dataset with multi-aspect human scores.
21
 
22
  ## Evaluation Results
23
 
24
- We test our video evaluation model VideoScore on VideoEval-test, EvalCrafter, GenAI-Bench and VBench.
25
- For the first two benchmarks, we take Spearman correlation between model's output and human ratings
26
  averaged among all the evaluation aspects as indicator.
27
- For GenAI-Bench and VBench, which include human preference data among two or more videos,
28
- we employ the model's output to predict preferences and use pairwise accuracy as the performance indicator.
29
-
30
- - We use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) trained on the entire VideoFeedback dataset
31
- for VideoFeedback-test set, while for other three benchmarks.
32
 
33
- - We use [VideoScore-anno-only](https://huggingface.co/TIGER-Lab/VideoScore-anno-only) trained on VideoFeedback dataset
34
- excluding the real videos.
35
-
36
- The evaluation results are coming soon
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
37
 
38
  ## Usage
39
  ### Installation
@@ -57,7 +72,6 @@ from mantis.models.qwen2_vl import Qwen2VLForSequenceClassification
57
  from transformers import Qwen2VLProcessor
58
  from qwen_vl_utils import process_vision_info
59
 
60
- MAX_NUM_FRAMES=16
61
  ROUND_DIGIT=3
62
  REGRESSION_QUERY_PROMPT = """
63
  Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
@@ -81,9 +95,9 @@ factual consistency: 1.8
81
 
82
  For this video, the text prompt is "{text_prompt}",
83
  all the frames of video are as follows:
84
- """
85
 
86
- model_name="Mantis-VL/qwen2-vl-video-eval-debug_12288_regression"
87
  video_path="video1.mp4"
88
  video_prompt="Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."
89
 
@@ -96,6 +110,10 @@ model = Qwen2VLForSequenceClassification.from_pretrained(
96
  processor = Qwen2VLProcessor.from_pretrained(model_name)
97
 
98
  # Messages containing a list of images (treated as a video) and a text query
 
 
 
 
99
  messages = [
100
  {
101
  "role": "user",
@@ -107,12 +125,18 @@ messages = [
107
  },
108
  {"type": "text", "text": REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)},
109
  ],
 
 
 
 
 
 
110
  }
111
  ]
112
 
113
  # Preparation for inference
114
  text = processor.apply_chat_template(
115
- messages, tokenize=False, add_generation_prompt=True
116
  )
117
  image_inputs, video_inputs = process_vision_info(messages)
118
  inputs = processor(
@@ -123,7 +147,6 @@ inputs = processor(
123
  return_tensors="pt",
124
  )
125
  inputs = inputs.to("cuda")
126
- print(inputs['input_ids'].shape)
127
 
128
  # Inference
129
  with torch.no_grad():
@@ -140,8 +163,11 @@ print(aspect_scores)
140
  """
141
  model output on visual quality, temporal consistency, dynamic degree,
142
  text-to-video alignment, factual consistency, respectively
 
 
143
 
144
- [3.578, 3.594, 3.703, 3.156, 3.688]
 
145
  """
146
  ```
147
 
 
3
  tags: []
4
  ---
5
 
6
+ [📃Paper](https://arxiv.org/abs/2406.15252) | [🌐Website](https://tiger-ai-lab.github.io/VideoScore/) | [💻Github](https://github.com/TIGER-AI-Lab/VideoScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model (VideoScore)](https://huggingface.co/TIGER-Lab/VideoScore) | [🤗Demo](https://huggingface.co/spaces/TIGER-Lab/VideoScore)
7
 
8
 
9
  ![VideoScore](https://tiger-ai-lab.github.io/VideoScore/static/images/teaser.png)
10
 
11
  ## Introduction
12
+ - 🧐🧐[VideoScore-Qwen2-VL](https://huggingface.co/TIGER-Lab/VideoScore-Qwen2-VL) is a variant of [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore),
13
+ taking [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) as the base model and trained on the [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) dataset.
14
+
15
+ - The [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) series is a family of video quality evaluation models, taking [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) or [Qwen/Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) as the base model
16
  and trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),
17
+ a large video evaluation dataset with multi-aspect human scores.
18
 
19
  - VideoScore can reach 75+ Spearman correlation with humans on VideoEval-test, surpassing all the MLLM-prompting methods and feature-based metrics.
20
 
 
24
 
25
  ## Evaluation Results
26
 
27
+ We test VideoScore-Qwen2-VL on VideoFeedback-test and take the Spearman correlation between the model's output and human ratings,

28
  averaged over all evaluation aspects, as the indicator.
 
 
 
 
 
29
 
30
+ The evaluation results are shown below:
31
+
32
+ | Method | VideoFeedback-test |
33
+ |:-----------------:|:------------------:|
34
+ | VideoScore-Qwen2-VL | **74.9** |
35
+ | Gemini-1.5-Pro | 22.1 |
36
+ | Gemini-1.5-Flash | 20.8 |
37
+ | GPT-4o | <u>23.1</u> |
38
+ | CLIP-sim | 8.9 |
39
+ | DINO-sim | 7.5 |
40
+ | SSIM-sim | 13.4 |
41
+ | CLIP-Score | -7.2 |
42
+ | LLaVA-1.5-7B | 8.5 |
43
+ | LLaVA-1.6-7B | -3.1 |
44
+ | X-CLIP-Score | -1.9 |
45
+ | PIQE | -10.1 |
46
+ | BRISQUE | -20.3 |
47
+ | Idefics2 | 6.5 |
48
+ | MSE-dyn | -5.5 |
49
+ | SSIM-dyn | -12.9 |
50
+
51
+ The best result in the VideoScore series is in bold, and the best baseline result is underlined.
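
The indicator described above (Spearman correlation per aspect, then averaged over aspects) can be sketched as follows. This is a minimal, standard-library-only illustration and not the authors' evaluation script: the helper names `average_ranks`, `spearman` and `mean_spearman` and the toy scores are hypothetical. The table values appear to be correlations scaled by 100, which is an assumption here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def mean_spearman(pred, human):
    """Average the per-aspect Spearman correlation over all aspects."""
    return mean(spearman(pred[aspect], human[aspect]) for aspect in pred)


# Toy (made-up) scores for four videos on two aspects.
pred = {
    "visual quality": [2.3, 3.1, 1.8, 3.6],
    "temporal consistency": [2.5, 2.9, 2.0, 3.4],
}
human = {
    "visual quality": [2, 3, 1, 4],
    "temporal consistency": [3, 2, 1, 4],
}
print(round(100 * mean_spearman(pred, human), 1))
```

Averaging ranks for ties matters here because human ratings are typically discrete, so many videos share the same score.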
52
 
53
  ## Usage
54
  ### Installation
 
72
  from transformers import Qwen2VLProcessor
73
  from qwen_vl_utils import process_vision_info
74
 
 
75
  ROUND_DIGIT=3
76
  REGRESSION_QUERY_PROMPT = """
77
  Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
 
95
 
96
  For this video, the text prompt is "{text_prompt}",
97
  all the frames of video are as follows:
98
+ """
99
 
100
+ model_name="TIGER-Lab/VideoScore-Qwen2-VL"
101
  video_path="video1.mp4"
102
  video_prompt="Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."
103
 
 
110
  processor = Qwen2VLProcessor.from_pretrained(model_name)
111
 
112
  # Messages containing a list of images (treated as a video) and a text query
113
+ response = ""
114
+ label_names = ["visual quality", "temporal consistency", "dynamic degree", "text-to-video alignment", "factual consistency"]
115
+ for i in range(len(label_names)):
116
+ response += f"The score for {label_names[i]} is {model.config.label_special_tokens[i]}. "
117
  messages = [
118
  {
119
  "role": "user",
 
125
  },
126
  {"type": "text", "text": REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)},
127
  ],
128
+ },
129
+ {
130
+ "role": "assistant",
131
+ "content": [
132
+ {"type": "text", "text": response},
133
+ ],
134
  }
135
  ]
136
 
137
  # Preparation for inference
138
  text = processor.apply_chat_template(
139
+ messages, tokenize=False, add_generation_prompt=False
140
  )
141
  image_inputs, video_inputs = process_vision_info(messages)
142
  inputs = processor(
 
147
  return_tensors="pt",
148
  )
149
  inputs = inputs.to("cuda")
 
150
 
151
  # Inference
152
  with torch.no_grad():
 
163
  """
164
  model output on visual quality, temporal consistency, dynamic degree,
165
  text-to-video alignment, factual consistency, respectively
166
+ VideoScore:
167
+ [2.297, 2.469, 2.906, 2.766, 2.516]
168
 
169
+ VideoScore-Qwen2-VL:
170
+ [2.297, 2.531, 2.766, 2.312, 2.547]
171
  """
172
  ```
173
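
The usage example prints a bare list of five numbers, in the aspect order given in its closing comment. A small sketch of pairing those numbers with the aspect names may make the output easier to read. The helper `name_scores` is hypothetical and not part of the model's API; `label_names` and `ROUND_DIGIT` are copied from the snippet above, and the scores are the VideoScore-Qwen2-VL sample output shown there.

```python
ROUND_DIGIT = 3
label_names = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]


def name_scores(scores, names=label_names, digits=ROUND_DIGIT):
    """Pair each aspect name with its score, rounded to `digits` decimals."""
    if len(scores) != len(names):
        raise ValueError(f"expected {len(names)} scores, got {len(scores)}")
    return {name: round(float(score), digits) for name, score in zip(names, scores)}


# Sample output copied from the README comment for VideoScore-Qwen2-VL.
aspect_scores = [2.297, 2.531, 2.766, 2.312, 2.547]
named = name_scores(aspect_scores)
for name, score in named.items():
    print(f"{name}: {score}")
```

The length check guards against silently misaligned names if the number of regression outputs ever differs from the five aspects listed.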