zeinshaheen committed
Commit 243506f
1 Parent(s): f278da2

Update README.md

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -7,7 +7,7 @@ This repository is the official implementation of Kandinsky Video 1.1 model.
 [![Hugging Face Spaces](https://img.shields.io/badge/🤗-Huggingface-yello.svg)](https://huggingface.co/ai-forever/KandinskyVideo) | [Telegram-bot](https://t.me/video_kandinsky_bot) | [Habr post](https://habr.com/ru/companies/sberbank/articles/775554/) | [Our text-to-image model](https://github.com/ai-forever/Kandinsky-3/tree/main)
 
 <p>
-<!-- <img src="__assets__/title.jpg" width="800px"/> -->
+<!-- <img src="_assets__/title.jpg" width="800px"/> -->
 <!-- <br> -->
 Our <B>previous</B> model <a href="https://ai-forever.github.io/Kandinsky-3/">Kandinsky Video 1.0</a>, divides the video generation process into two stages: initially generating keyframes at a low FPS and then creating interpolated frames between these keyframes to increase the FPS. In <B>Kandinsky Video 1.1</B>, we further break down the keyframe generation into two extra steps: first, generating the initial frame of the video from the textual prompt using Text to Image <a href="https://github.com/ai-forever/Kandinsky-3">Kandinsky 3.0</a>, and then generating the subsequent keyframes based on the textual prompt and the previously generated first frame. This approach ensures more consistent content across the frames and significantly enhances the overall video quality. Furthermore, the approach allows animating any input image as an additional feature.
 </p>
@@ -17,7 +17,7 @@ Our <B>previous</B> model <a href="https://ai-forever.github.io/Kandinsky-3/">Ka
 ## Pipeline
 
 <p align="center">
-<img src="__assets__/pipeline.png" width="800px"/>
+<img src="_assets__/pipeline.png" width="800px"/>
 <br>
 <em>In the <a href="https://ai-forever.github.io/Kandinsky-3/">Kandinsky Video 1.0</a>, the encoded text prompt enters the text-to-video U-Net3D keyframe generation model with temporal layers or blocks, and then the sampled latent keyframes are sent to the latent interpolation model to predict three interpolation frames between
 two keyframes. An image MoVQ-GAN decoder is used to obtain the final video result. In <B>Kandinsky Video 1.1</B>, text-to-video U-Net3D is also conditioned on text-to-image U-Net2D, which helps to improve the content quality. A temporal MoVQ-GAN decoder is used to decode the final video.</em>
@@ -59,7 +59,7 @@ video = t2v_pipe(
     guidance_weight_image=3.0,
 )
 
-path_to_save = f'./__assets__/video.gif'
+path_to_save = f'./_assets__/video.gif'
 video[0].save(
     path_to_save,
     save_all=True, append_images=video[1:], duration=int(5500/len(video)), loop=0
@@ -67,7 +67,7 @@ video[0].save(
 ```
 
 <p align="center">
-<img src="__assets__/video.gif" raw=true>
+<img src="_assets__/video.gif" raw=true>
 <br><em>Generated video</em>
 </p>
 
@@ -104,7 +104,7 @@ video = t2v_pipe(
     guidance_weight_image=3.0,
 )
 
-path_to_save = f'./__assets__/video2.gif'
+path_to_save = f'./_assets__/video2.gif'
 video[0].save(
     path_to_save,
     save_all=True, append_images=video[1:], duration=int(5500/len(video)), loop=0
@@ -117,7 +117,7 @@ video[0].save(
 </p>
 
 <p align="center">
-<img src="__assets__/video2.gif"><br>
+<img src="_assets__/video2.gif"><br>
 <em>Generated Video.</em>
 </p>
 
@@ -125,21 +125,21 @@ video[0].save(
 ## Results
 
 <p align="center">
-<img src="__assets__/eval crafter.png" align="center" width="50%">
+<img src="_assets__/eval crafter.png" align="center" width="50%">
 <br>
 <em> Kandinsky Video 1.1 achieves second place overall and best open source model on <a href="https://evalcrafter.github.io/">EvalCrafter</a> text to video benchmark. VQ: visual quality, TVA: text-video alignment, MQ: motion quality, TC: temporal consistency and FAS: final average score.
 </em>
 </p>
 
 <p align="center">
-<img src="__assets__/polygon.png" raw=true align="center" width="50%">
+<img src="_assets__/polygon.png" raw=true align="center" width="50%">
 <br>
 <em> Polygon-radar chart representing the performance of Kandinsky Video 1.1 on <a href="https://evalcrafter.github.io/">EvalCrafter</a> benchmark.
 </em>
 </p>
 
 <p align="center">
-<img src="__assets__/human eval.png" raw=true align="center" width="50%">
+<img src="_assets__/human eval.png" raw=true align="center" width="50%">
 <br>
 <em> Human evaluation study results. The bars in the plot correspond to the percentage of “wins” in the side-by-side comparison of model generations. We compare our model with <a href="https://arxiv.org/abs/2304.08818">Video LDM</a>.
 </em>
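
For reference, the GIF-saving snippet touched by this commit relies only on Pillow's `Image.save` with `save_all`/`append_images`. Below is a minimal, self-contained sketch of that pattern; the placeholder frames and the output filename are assumptions for illustration and do not come from the Kandinsky Video pipeline.

```python
# Sketch of the README's GIF-saving pattern (Pillow only).
# Assumption: `video` is a list of PIL.Image.Image frames; here we build
# solid-color placeholder frames instead of running Kandinsky Video.
from PIL import Image

video = [Image.new("RGB", (256, 256), ((i * 8) % 256, 64, 128)) for i in range(32)]

path_to_save = "video.gif"  # the README writes into its assets directory instead
video[0].save(
    path_to_save,
    save_all=True,                     # write a multi-frame GIF
    append_images=video[1:],           # remaining frames after the first
    duration=int(5500 / len(video)),   # ms per frame, ~5.5 s total clip
    loop=0,                            # 0 = loop forever
)
```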