hysts (HF staff) committed on
Commit
583aa15
1 Parent(s): e5b6fb9
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,166 @@
+ ---
+ backbone:
+ - diffusion
+ domain:
+ - multi-modal
+ frameworks:
+ - pytorch
+ license: cc-by-nc-nd-4.0
+ metrics:
+ - realism
+ - video-video similarity
+ studios:
+ - damo/Video-to-Video
+ tags:
+ - video2video generation
+ - diffusion model
+ - 视频到视频
+ - 视频超分辨率
+ - 视频生成视频
+ - 生成
+ tasks:
+ - video-to-video
+ widgets:
+ - examples:
+   - inputs:
+     - data: A panda eating bamboo on a rock.
+       name: text
+     - data: XXX/test.mp4
+       name: video_path
+     name: 2
+     title: Example 1
+   inferencespec:
+     cpu: 4
+     gpu: 1
+     gpu_memory: 28000
+     memory: 32000
+   inputs:
+   - name: text, video_path
+     title: Input English prompt, video path
+     type: str, str
+     validator:
+       max_words: 75, /
+   task: video-to-video
+ ---
+ 
+ # Video-to-Video
+ 
+ The **MS-Vid2Vid** project was developed and trained by DAMO Academy. It is primarily used to improve the resolution and spatiotemporal continuity of text-to-video and image-to-video generations. Its training data comprises a large curated collection of high-definition videos and images (shorter side > 720). The model upscales low-resolution 16:9 videos to a higher resolution (1280 * 720) and can be applied to super-resolution of arbitrary low-resolution inputs. On this page we refer to it as **MS-Vid2Vid-XL**.
+ 
+ <center>
+ <p align="center">
+ <img src="assets/images/Fig_1.png"/>
+ <br/>
+ Fig.1 Video-to-Video-XL
+ </p>
+ </center>
+ 
+ 
+ 
+ ## Introduction
+ 
+ **MS-Vid2Vid-XL** is designed based on Stable Diffusion, with design details inherited from our in-house [VideoComposer](https://videocomposer.github.io); for specifics, please refer to its technical report. In the examples below, the left side is the low-resolution input (448 * 256), where details jitter and temporal consistency is poor; the right side is the high-resolution output (1280 * 720), which is much smoother overall and in many cases shows a strong corrective ability.
+ 
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424496410559.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424814395007.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424166441720.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424151609672.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741042.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741043.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424160549937.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423819156083.mp4"></video>
+ </center>
+ <br />
+ <center>
+ <video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423826392315.mp4"></video>
+ </center>
+ 
+ 
+ ### Code example
+ 
+ ```python
+ from modelscope.pipelines import pipeline
+ from modelscope.outputs import OutputKeys
+ 
+ VID_PATH = 'test.mp4'                      # path to your input video
+ TEXT = 'A panda eating bamboo on a rock.'  # your English text description
+ 
+ pipe = pipeline(task='video-to-video', model='damo/Video-to-Video')
+ p_input = {
+     'video_path': VID_PATH,
+     'text': TEXT
+ }
+ 
+ output_video_path = pipe(p_input, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
+ ```
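+ 
+ Since the model targets 16:9 sources upscaled to 1280 * 720, it can be worth sanity-checking the input clip before spending GPU time. Below is a minimal pre-flight sketch using OpenCV; the check itself and the 'test.mp4' path are illustrative assumptions, not part of the official pipeline:
+ 
+ ```python
+ import cv2
+ 
+ # Probe the input clip's resolution (hypothetical helper check).
+ cap = cv2.VideoCapture('test.mp4')
+ if not cap.isOpened():
+     raise IOError('could not open test.mp4')
+ width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+ height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+ cap.release()
+ 
+ # MS-Vid2Vid-XL targets 16:9 sources upscaled to 1280 x 720.
+ if abs(width / height - 16 / 9) > 0.01:
+     print(f"Warning: {width}x{height} is not 16:9; results may degrade.")
+ ```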
+ 
+ 
+ ### Limitations
+ 
+ **MS-Vid2Vid-XL** may have the following limitations:
+ 
+ - Distant objects may appear somewhat blurry; this can be resolved or mitigated through the input text prompt.
+ - Computation is expensive: generating 720P video means working in a (160 * 90) latent space (see the arithmetic sketch below), and a single video takes more than 2 minutes.
+ - Only English input is currently supported, a constraint imposed by the training data.
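+ 
+ For intuition on the cost: the (160 * 90) latent size is consistent with the 8x spatial downsampling of a Stable Diffusion style VAE. Treating that factor as an assumption, a quick arithmetic check:
+ 
+ ```python
+ # An SD-style VAE downsamples each spatial dimension by 8 (assumed here),
+ # so a 1280 x 720 output frame corresponds to a 160 x 90 latent.
+ VAE_FACTOR = 8
+ latent_w, latent_h = 1280 // VAE_FACTOR, 720 // VAE_FACTOR
+ assert (latent_w, latent_h) == (160, 90)
+ ```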
+ 
+ 
+ 
+ ## References
+ 
+ ```
+ @article{videocomposer2023,
+   title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
+   author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
+   journal={arXiv preprint arXiv:2306.02018},
+   year={2023}
+ }
+ 
+ @inproceedings{videofusion2023,
+   title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
+   author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
+   booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+   year={2023}
+ }
+ ```
+ 
+ 
+ 
+ ## License Agreement
+ 
+ Our code and model weights are available for personal and academic research use only; commercial use is currently not supported.
+ 
assets/images/Fig_1.png ADDED

Git LFS Details

  • SHA256: 5d1323224e3678b89064c78db5d4096dd729603258396868734b15a90fb0da6d
  • Pointer size: 131 Bytes
  • Size of remote file: 733 kB
configuration.json ADDED
@@ -0,0 +1,24 @@
+ {
+     "framework": "pytorch",
+     "task": "video-to-video",
+     "model": {
+         "type": "video-to-video-model",
+         "model_args": {
+             "ckpt_clip": "open_clip_pytorch_model.bin",
+             "ckpt_unet": "non_ema_0035000.pth",
+             "ckpt_autoencoder": "v2-1_512-ema-pruned.ckpt",
+             "seed": 666,
+             "solver_mode": "fast"
+         },
+         "model_cfg": {
+             "batch_size": 1,
+             "target_fps": 8,
+             "max_frames": 32,
+             "latent_hei": 90,
+             "latent_wid": 160
+         }
+     },
+     "pipeline": {
+         "type": "video-to-video-pipeline"
+     }
+ }
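
These settings can be read programmatically to confirm the generation parameters before running the pipeline. A minimal sketch, assuming configuration.json has been downloaded alongside the weights; the local file path is the only assumption:

```python
import json

# Inspect the pipeline configuration shipped with the model.
with open('configuration.json') as f:
    cfg = json.load(f)

model_cfg = cfg['model']['model_cfg']
print(model_cfg['target_fps'], model_cfg['max_frames'])  # 8 32
print(model_cfg['latent_wid'], model_cfg['latent_hei'])  # 160 90
```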
non_ema_0035000.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d146dd22a8158896c882dd96e6b14d2962a63398a3f2ac37611dcadcdab3a15d
+ size 5645549113
open_clip_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a78ef8e8c73fd0df621682e7a8e8eb36c6916cb3c16b291a082ecd52ab79cc4
+ size 3944692325
v2-1_512-ema-pruned.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36
+ size 5214865159