xiuyul committed
Commit e033be6
1 Parent(s): 2fcf5a9
Files changed (2):
  1. app.py +11 -10
  2. result.py +5 -5
app.py CHANGED
@@ -4,7 +4,7 @@ import pandas as pd
  
  
  BASELINE = f'<a target="_blank" href=https://github.com/showlab/loveu-tgve-2023 style="color: blue; text-decoration: underline;text-decoration-style: dotted;">Tune-A-Video (Baseline)</a>'
- COLS = ["Method", "Human Eval (Aesthetic) ⬆️", "Human Eval (Structure) ⬆️", "Human Eval (Text Alignment) ⬆️", "Human Eval (Avg.) ⬆️",
+ COLS = ["Method", "Human Eval (Text Alignment) ⬆️", "Human Eval (Structure) ⬆️", "Human Eval (Quality) ⬆️", "Human Eval (Avg.) ⬆️",
          "CLIPScore (Frame Consistency) ⬆️", "CLIPScore (Text Alignment) ⬆️", "PickScore ⬆️",
          "References"]
  TYPES = ["markdown", "number", "number", "number", "number", "number", "number", "number", "markdown"]
@@ -63,18 +63,19 @@ with block:
  Leveraging AI for video editing has the potential to unleash creativity for artists across all skill levels. The rapidly-advancing field of Text-Guided Video Editing (TGVE) is here to address this challenge. Recent works in this field include <a href="https://tuneavideo.github.io/" target="_blank">Tune-A-Video</a>, <a href="https://research.runwayml.com/gen2" target="_blank">Gen-2</a>, and <a href="https://dreamix-video-editing.github.io/" target="_blank">Dreamix</a>.
  In this competition track, we provide a standard set of videos and prompts. As a researcher, you will develop a model that takes a video and a prompt for how to edit it, and your model will produce an edited video. For instance, you might be given a video of “a man is surfing inside the barrel of a wave,” and your model will edit the video to “a man is surfing on a wave made of aurora borealis.”
  
- During the competition, evaluation results performed against the following 3 automatic metrics will be displayed on the leaderboard:
- - <a href="https://arxiv.org/abs/2103.00020" target="_blank">CLIPScore</a> (Frame Consistency) - the average cosine similarity between all pairs of CLIP image embeddings computed on all frames of output videos.
- - <a href="https://arxiv.org/abs/2103.00020" target="_blank">CLIPScore</a> (Text Alignment) - the average CLIP score between all frames of output videos and corresponding edited prompts.
- - <a href="https://arxiv.org/abs/2305.01569" target="_blank">PickScore</a> - the average PickScore between all frames of output videos.
+ To participate in the contest, you will submit the videos generated by your model. As you develop your model, you may want to visually evaluate your results and use automated metrics such as the <a href="https://arxiv.org/abs/2104.08718" target="_blank">CLIPScore</a> and <a href="https://arxiv.org/abs/2305.01569" target="_blank">PickScore</a> to track your progress:
+ 
+ - CLIPScore (Frame Consistency) - the average cosine similarity between all pairs of CLIP image embeddings computed on all frames of output videos.
+ - CLIPScore (Text Alignment) - the average CLIP score between all frames of output videos and corresponding edited prompts.
+ - PickScore - the average PickScore between all frames of output videos.
  
- After all submissions are uploaded, we will run a human-evaluation of all submitted videos. Specifically, we will have human labelers compare all submitted videos. Labelers will evaluate videos on the following criteria:
+ After all submissions are uploaded, we will run a human-evaluation of all submitted videos. Specifically, we will have human labelers compare all submitted videos to the baseline videos that were edited with the Tune-A-Video model. Labelers will evaluate videos on the following criteria:
  
- - Text alignment: How well does the generated video match the caption?
- - Structure: How well does the generated video preserve the structure of the original video?
- - Quality: Aesthetically, how good is this video?
+ - Text alignment: Which video better matches the caption?
+ - Structure: Which video better preserves the structure of the input video?
+ - Quality: Aesthetically, which video is better?
  
- We will choose a winner and a runner-up based on the human evaluation results.
+ We will choose a winner and a runner-up based on the human evaluation results.
  </font>
  
  The **bold** method name indicates that the implementation is **official** (by the author / developer of the original method).""")
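For reference, the frame-consistency metric described in the new leaderboard text (the average cosine similarity over all pairs of per-frame CLIP image embeddings) can be sketched as below. This is a minimal illustration, not the Space's actual evaluation code: the `embeddings` array stands in for whatever per-frame CLIP image embeddings your own encoder produces.

```python
import numpy as np

def frame_consistency(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all unordered pairs of frame embeddings.

    embeddings: (num_frames, dim) array of per-frame CLIP image embeddings
    (a placeholder here; produce these with any CLIP image encoder).
    """
    # L2-normalize each frame embedding so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T  # (num_frames, num_frames) cosine-similarity matrix
    # Average over the strict upper triangle: every unordered pair of frames once.
    rows, cols = np.triu_indices(len(embeddings), k=1)
    return float(sim[rows, cols].mean())

# Identical frames give a consistency of exactly 1.0.
print(frame_consistency(np.ones((4, 8))))  # 1.0
```

A temporally stable edit yields near-identical frame embeddings and a score near 1.0, while flicker between frames pushes the score down.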
result.py CHANGED
@@ -4,7 +4,7 @@ submission_results = [
  "CLIPScore (Frame Consistency) ⬆️":91.25,
  "CLIPScore (Text Alignment) ⬆️":27.21,
  "PickScore ⬆️":20.72,
- "Human Eval (Aesthetic) ⬆️":0.465,
+ "Human Eval (Quality) ⬆️":0.465,
  "Human Eval (Structure) ⬆️":0.348,
  "Human Eval (Text Alignment) ⬆️":0.538,
  "Human Eval (Avg.) ⬆️":0.450,
@@ -15,7 +15,7 @@ submission_results = [
  "CLIPScore (Frame Consistency) ⬆️":92.27,
  "CLIPScore (Text Alignment) ⬆️":25.57,
  "PickScore ⬆️":20.22,
- "Human Eval (Aesthetic) ⬆️":0.564,
+ "Human Eval (Quality) ⬆️":0.564,
  "Human Eval (Structure) ⬆️":0.601,
  "Human Eval (Text Alignment) ⬆️":0.531,
  "Human Eval (Avg.) ⬆️":0.565,
@@ -26,7 +26,7 @@ submission_results = [
  "CLIPScore (Frame Consistency) ⬆️":92.47,
  "CLIPScore (Text Alignment) ⬆️":25.53,
  "PickScore ⬆️":19.79,
- "Human Eval (Aesthetic) ⬆️":0.387,
+ "Human Eval (Quality) ⬆️":0.387,
  "Human Eval (Structure) ⬆️":0.402,
  "Human Eval (Text Alignment) ⬆️":0.399,
  "Human Eval (Avg.) ⬆️":0.396,
@@ -37,7 +37,7 @@ submission_results = [
  "CLIPScore (Frame Consistency) ⬆️":92.17,
  "CLIPScore (Text Alignment) ⬆️":27.55,
  "PickScore ⬆️":20.55,
- "Human Eval (Aesthetic) ⬆️":0.438,
+ "Human Eval (Quality) ⬆️":0.438,
  "Human Eval (Structure) ⬆️":0.446,
  "Human Eval (Text Alignment) ⬆️":0.451,
  "Human Eval (Avg.) ⬆️":0.445,
@@ -48,7 +48,7 @@ submission_results = [
  "CLIPScore (Frame Consistency) ⬆️":89.90,
  "CLIPScore (Text Alignment) ⬆️":26.89,
  "PickScore ⬆️":20.71,
- "Human Eval (Aesthetic) ⬆️":0.599,
+ "Human Eval (Quality) ⬆️":0.599,
  "Human Eval (Structure) ⬆️":0.486,
  "Human Eval (Text Alignment) ⬆️":0.689,
  "Human Eval (Avg.) ⬆️":0.591,