Add files
- .gitattributes +1 -0
- README.md +166 -0
- assets/images/Fig_1.png +3 -0
- configuration.json +24 -0
- non_ema_0035000.pth +3 -0
- open_clip_pytorch_model.bin +3 -0
- v2-1_512-ema-pruned.ckpt +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,166 @@
---
backbone:
- diffusion
domain:
- multi-modal
frameworks:
- pytorch
license: cc-by-nc-nd-4.0
metrics:
- realism
- video-video similarity
studios:
- damo/Video-to-Video
tags:
- video2video generation
- diffusion model
- 视频到视频
- 视频超分辨率
- 视频生成视频
- 生成
tasks:
- video-to-video
widgets:
- examples:
  - inputs:
    - data: A panda eating bamboo on a rock.
      name: text
    - data: XXX/test.mpt
      name: video_path
    name: 2
    title: 示例1
  inferencespec:
    cpu: 4
    gpu: 1
    gpu_memory: 28000
    memory: 32000
  inputs:
  - name: text, video_path
    title: 输入英文prompt, 视频路径
    type: str, str
    validator:
      max_words: 75, /
  task: video-to-video
---

# Video-to-Video

**MS-Vid2Vid** is developed and trained by DAMO Academy. It is primarily used to enhance the resolution and spatiotemporal continuity of videos generated from text or images. Its training data consists of a large curated collection of high-definition videos and images (shorter side > 720), and it can upscale low-resolution (16:9) videos to a higher resolution (1280 * 720). It can also be applied to super-resolution of videos at arbitrary low resolutions. On this page we refer to it as **MS-Vid2Vid-XL**.

<center>
<p align="center">
<img src="assets/images/Fig_1.png"/>
<br/>
Fig.1 Video-to-Video-XL
</p></center>

## Introduction

**MS-Vid2Vid-XL** is designed based on Stable Diffusion, with design details inherited from our in-house [VideoComposer](https://videocomposer.github.io); please refer to its technical report for specifics. In the examples below, the left side is the low-resolution input (448 * 256), where details jitter and temporal consistency is poor; the right side is the high-resolution output (1280 * 720), which is much smoother overall and in many cases shows a strong ability to correct such artifacts.

<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424496410559.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424814395007.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424166441720.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424151609672.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741042.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424162741043.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/424160549937.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423819156083.mp4"></video>
</center>
<br />
<center>
<video muted="true" autoplay="true" loop="true" height="288" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/423826392315.mp4"></video>
</center>

### Code example

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

VID_PATH = './input.mp4'                    # path to your (low-resolution) input video
TEXT = 'A panda eating bamboo on a rock.'   # your English text description

pipe = pipeline(task="video-to-video", model='damo/Video-to-Video')
p_input = {
    'video_path': VID_PATH,
    'text': TEXT
}

output_video_path = pipe(p_input, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
```
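
Since MS-Vid2Vid-XL is intended to enhance the output of text-to-video and image-to-video models, a natural way to use it is as the second stage of a two-stage workflow. The sketch below is illustrative only: it assumes the ModelScope text-to-video model `damo/text-to-video-synthesis` is available and follows the standard ModelScope pipeline interface, and it reuses the example prompt from the widget configuration above (which also caps the prompt at 75 words).

```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

prompt = 'A panda eating bamboo on a rock.'  # example prompt from the widget config above

# Stage 1 (assumption): generate a low-resolution clip with a ModelScope text-to-video model.
t2v = pipeline(task='text-to-video-synthesis', model='damo/text-to-video-synthesis')
low_res_path = t2v({'text': prompt})[OutputKeys.OUTPUT_VIDEO]

# Stage 2: upscale the clip to 1280 * 720 with MS-Vid2Vid-XL, conditioned on the same prompt.
vid2vid = pipeline(task='video-to-video', model='damo/Video-to-Video')
high_res_path = vid2vid({'video_path': low_res_path, 'text': prompt},
                        output_video='./output_720p.mp4')[OutputKeys.OUTPUT_VIDEO]
print(high_res_path)
```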

### Limitations

**MS-Vid2Vid-XL** may have the following limitations:

- There may be some blurriness when the target is far away; this can be addressed or mitigated through the input text.
- Computation cost is high: generating 720P video means a latent size of (160 * 90), and a single video takes more than 2 minutes to process (see the quick check below).
- Only English input is currently supported, a restriction that stems from the training data.

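The (160 * 90) latent size quoted above is consistent with 1280 * 720 output under a factor-8 spatial downsampling in the autoencoder; the factor of 8 is our assumption, inferred from the Stable Diffusion backbone rather than stated in this card.

```python
# Sanity check (assumption: factor-8 spatial downsampling, as in the Stable Diffusion VAE).
out_w, out_h = 1280, 720
vae_factor = 8
latent_wid, latent_hei = out_w // vae_factor, out_h // vae_factor
print(latent_wid, latent_hei)  # 160 90 -- matches latent_wid / latent_hei in configuration.json
```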

## Reference

```
@article{videocomposer2023,
  title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
  author={Wang, Xiang* and Yuan, Hangjie* and Zhang, Shiwei* and Chen, Dayou* and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
  journal={arXiv preprint arXiv:2306.02018},
  year={2023}
}

@inproceedings{videofusion2023,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
```

## License Agreement

Our code and model weights are available for personal/academic research use only; commercial use is not currently supported.

assets/images/Fig_1.png
ADDED
Git LFS Details
configuration.json
ADDED
@@ -0,0 +1,24 @@
{
    "framework": "pytorch",
    "task": "video-to-video",
    "model": {
        "type": "video-to-video-model",
        "model_args": {
            "ckpt_clip": "open_clip_pytorch_model.bin",
            "ckpt_unet": "non_ema_0035000.pth",
            "ckpt_autoencoder": "v2-1_512-ema-pruned.ckpt",
            "seed": 666,
            "solver_mode": "fast"
        },
        "model_cfg": {
            "batch_size": 1,
            "target_fps": 8,
            "max_frames": 32,
            "latent_hei": 90,
            "latent_wid": 160
        }
    },
    "pipeline": {
        "type": "video-to-video-pipeline"
    }
}
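
In this configuration, `model_args` points the pipeline at the three checkpoints shipped in this repository and fixes the seed and solver mode, while `model_cfg` sets the sampling setup (batches of 1, 32 frames at 8 fps, 160 * 90 latents). A minimal sketch for inspecting or adjusting these values in a local copy of the repo follows; the local path is a placeholder, and it is an assumption that the pipeline picks up edits to this file when it is constructed.

```python
import json

cfg_path = './Video-to-Video/configuration.json'  # placeholder path to a local clone of this repo

# Read the shipped configuration.
with open(cfg_path, 'r', encoding='utf-8') as f:
    cfg = json.load(f)

print(cfg['model']['model_args'])  # checkpoint files, seed, solver_mode
print(cfg['model']['model_cfg'])   # batch_size, target_fps, max_frames, latent size

# Example tweak: change the sampling seed, then write the file back.
cfg['model']['model_args']['seed'] = 1234
with open(cfg_path, 'w', encoding='utf-8') as f:
    json.dump(cfg, f, indent=4, ensure_ascii=False)
```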
non_ema_0035000.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d146dd22a8158896c882dd96e6b14d2962a63398a3f2ac37611dcadcdab3a15d
size 5645549113
open_clip_pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a78ef8e8c73fd0df621682e7a8e8eb36c6916cb3c16b291a082ecd52ab79cc4
size 3944692325
v2-1_512-ema-pruned.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36
size 5214865159
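
The three checkpoints above are distributed as Git LFS pointers, so the sha256 `oid` and `size` fields can double as a download check. A small sketch, assuming the weights have been pulled into a local clone of this repository (the path in the usage comment is a placeholder):

```python
import hashlib
import os

# Expected sha256 / size values copied from the LFS pointer files above.
EXPECTED = {
    'non_ema_0035000.pth': ('d146dd22a8158896c882dd96e6b14d2962a63398a3f2ac37611dcadcdab3a15d', 5645549113),
    'open_clip_pytorch_model.bin': ('9a78ef8e8c73fd0df621682e7a8e8eb36c6916cb3c16b291a082ecd52ab79cc4', 3944692325),
    'v2-1_512-ema-pruned.ckpt': ('88ecb782561455673c4b78d05093494b9c539fc6bfc08f3a9a4a0dd7b0b10f36', 5214865159),
}

def check_lfs_file(path):
    """Compare a downloaded checkpoint against its Git LFS pointer (sha256 and byte size)."""
    name = os.path.basename(path)
    want_oid, want_size = EXPECTED[name]
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha.update(chunk)
    ok = sha.hexdigest() == want_oid and os.path.getsize(path) == want_size
    print(f'{name}: {"OK" if ok else "MISMATCH"}')

# Usage (placeholder path to a local clone):
# check_lfs_file('./Video-to-Video/non_ema_0035000.pth')
```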