Add files
- .gitattributes +1 -0
- README.md +166 -0
- assets/images/Fig_1.png +3 -0
- configuration.json +24 -0
- non_ema_0035000.pth +3 -0
- open_clip_pytorch_model.bin +3 -0
- v2-1_512-ema-pruned.ckpt +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,166 @@
---
backbone:
- diffusion
domain:
- multi-modal
frameworks:
- pytorch
license: cc-by-nc-nd-4.0
metrics:
- realism
- video-video similarity
studios:
- damo/Video-to-Video
tags:
- video2video generation
- diffusion model
- 视频到视频
- 视频超分辨率
- 视频生成视频
- 生成
tasks:
- video-to-video
widgets:
- examples:
  - inputs:
    - data: A panda eating bamboo on a rock.
      name: text
    - data: XXX/test.mpt
      name: video_path
    name: 2
    title: 示例1
  inferencespec:
    cpu: 4
    gpu: 1
    gpu_memory: 28000
    memory: 32000
  inputs:
  - name: text, video_path
    title: 输入英文prompt, 视频路径
    type: str, str
    validator:
      max_words: 75, /
  task: video-to-video
---

# Video-to-Video

**MS-Vid2Vid** is developed and trained by DAMO Academy. It is primarily used to enhance the resolution and spatiotemporal continuity of videos generated from text or images. Its training data consists of a large curated collection of high-definition videos and images (shorter side > 720), and it can upscale low-resolution (16:9) videos to a higher resolution (1280 * 720). It can also be applied to super-resolution of videos at arbitrary low resolutions. On this page we refer to it as **MS-Vid2Vid-XL**.

<center>
<p align="center">
<img src="assets/images/Fig_1.png"/>
<br/>
Fig.1 Video-to-Video-XL
</p></center>

## Introduction

**MS-Vid2Vid-XL** is designed based on Stable Diffusion, with design details inherited from our in-house [VideoComposer](https://videocomposer.github.io); please refer to its technical report for specifics. In the examples below, the left side is the low-resolution input (448 * 256), where details jitter and temporal consistency is poor; the right side is the high-resolution output (1280 * 720), which is much smoother overall and in many cases shows a strong ability to correct such artifacts.

<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424496410559.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424814395007.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424166441720.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424151609672.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741042.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741043.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424160549937.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423819156083.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423826392315.mp4"></video>
</center>

### Code example

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

VID_PATH = './input.mp4'                    # path to your (low-resolution) input video
TEXT = 'A panda eating bamboo on a rock.'   # your English text description

pipe = pipeline(task="video-to-video", model='damo/Video-to-Video')
p_input = {
    'video_path': VID_PATH,
    'text': TEXT
}

output_video_path = pipe(p_input, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
```
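
Since MS-Vid2Vid-XL is intended to enhance the output of text-to-video and image-to-video models, a natural way to use it is as the second stage of a two-stage workflow. The sketch below is illustrative only: it assumes the ModelScope text-to-video model `damo/text-to-video-synthesis` is available and follows the standard ModelScope pipeline interface, and it reuses the example prompt from the widget configuration above (which also caps the prompt at 75 words).

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

prompt = 'A panda eating bamboo on a rock.'  # example prompt from the widget config above

# Stage 1 (assumption): generate a low-resolution clip with a ModelScope text-to-video model.
t2v = pipeline(task='text-to-video-synthesis', model='damo/text-to-video-synthesis')
low_res_path = t2v({'text': prompt})[OutputKeys.OUTPUT_VIDEO]

# Stage 2: upscale the clip to 1280 * 720 with MS-Vid2Vid-XL, conditioned on the same prompt.
vid2vid = pipeline(task='video-to-video', model='damo/Video-to-Video')
high_res_path = vid2vid({'video_path': low_res_path, 'text': prompt},
                        output_video='./output_720p.mp4')[OutputKeys.OUTPUT_VIDEO]
print(high_res_path)
```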

### Limitations

**MS-Vid2Vid-XL** may have the following limitations:

- There may be some blurriness when the target is far away; this can be addressed or mitigated through the input text.
- Computation cost is high: generating 720P video means a latent size of (160 * 90), and a single video takes more than 2 minutes to process (see the quick check below).
- Only English input is currently supported, a restriction that stems from the training data.

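The (160 * 90) latent size quoted above is consistent with 1280 * 720 output under a factor-8 spatial downsampling in the autoencoder; the factor of 8 is our assumption, inferred from the Stable Diffusion backbone rather than stated in this card.

```python
# Sanity check (assumption: factor-8 spatial downsampling, as in the Stable Diffusion VAE).
out_w, out_h = 1280, 720
vae_factor = 8
latent_wid, latent_hei = out_w // vae_factor, out_h // vae_factor
print(latent_wid, latent_hei)  # 160 90 -- matches latent_wid / latent_hei in configuration.json
```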

## Reference

```
@article{videocomposer2023,
  title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
  author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
  journal={arXiv preprint arXiv:2306.02018},
  year={2023}
}

@inproceedings{videofusion2023,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
```

## License Agreement

Our code and model weights are available for personal/academic research use only; commercial use is not currently supported.

assets/images/Fig_1.png
ADDED
Git LFS Details
configuration.json
ADDED
@@ -0,0 +1,24 @@
{
    "framework": "pytorch",
    "task": "video-to-video",
    "model": {
        "type": "video-to-video-model",
        "model_args": {
            "ckpt_clip": "open_clip_pytorch_model.bin",
            "ckpt_unet": "non_ema_0035000.pth",
            "ckpt_autoencoder": "v2-1_512-ema-pruned.ckpt",
            "seed": 666,
            "solver_mode": "fast"
        },
        "model_cfg": {
            "batch_size": 1,
            "target_fps": 8,
            "max_frames": 32,
            "latent_hei": 90,
            "latent_wid": 160
        }
    },
    "pipeline": {
        "type": "video-to-video-pipeline"
    }
}
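
In this configuration, `model_args` points the pipeline at the three checkpoints shipped in this repository and fixes the seed and solver mode, while `model_cfg` sets the sampling setup (batches of 1, 32 frames at 8 fps, 160 * 90 latents). A minimal sketch for inspecting or adjusting these values in a local copy of the repo follows; the local path is a placeholder, and it is an assumption that the pipeline picks up edits to this file when it is constructed.

```python
import json

cfg_path = './Video-to-Video/configuration.json'  # placeholder path to a local clone of this repo

# Read the shipped configuration.
with open(cfg_path, 'r', encoding='utf-8') as f:
    cfg = json.load(f)

print(cfg['model']['model_args'])  # checkpoint files, seed, solver_mode
print(cfg['model']['model_cfg'])   # batch_size, target_fps, max_frames, latent size

# Example tweak: change the sampling seed, then write the file back.
cfg['model']['model_args']['seed'] = 1234
with open(cfg_path, 'w', encoding='utf-8') as f:
    json.dump(cfg, f, indent=4, ensure_ascii=False)
```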
non_ema_0035000.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d146dd22a8158896c882dd96e6b14d2962a63398a3f2ac37611dcadcdab3a15d
size 5645549113
open_clip_pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a78ef8e8c73fd0df621682e7a8e8eb36c6916cb3c16b291a082ecd52ab79cc4
size 3944692325
v2-1_512-ema-pruned.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36
size 5214865159
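
The three checkpoints above are distributed as Git LFS pointers, so the sha256 `oid` and `size` fields can double as a download check. A small sketch, assuming the weights have been pulled into a local clone of this repository (the path in the usage comment is a placeholder):

```python
import hashlib
import os

# Expected sha256 / size values copied from the LFS pointer files above.
EXPECTED = {
    'non_ema_0035000.pth': ('d146dd22a8158896c882dd96e6b14d2962a63398a3f2ac37611dcadcdab3a15d', 5645549113),
    'open_clip_pytorch_model.bin': ('9a78ef8e8c73fd0df621682e7a8e8eb36c6916cb3c16b291a082ecd52ab79cc4', 3944692325),
    'v2-1_512-ema-pruned.ckpt': ('88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36', 5214865159),
}

def check_lfs_file(path):
    """Compare a downloaded checkpoint against its Git LFS pointer (sha256 and byte size)."""
    name = os.path.basename(path)
    want_oid, want_size = EXPECTED[name]
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha.update(chunk)
    ok = sha.hexdigest() == want_oid and os.path.getsize(path) == want_size
    print(f'{name}: {"OK" if ok else "MISMATCH"}')

# Usage (placeholder path to a local clone):
# check_lfs_file('./Video-to-Video/non_ema_0035000.pth')
```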