---
license: cc-by-nc-4.0
pipeline_tag: text-to-video
---
The original repo is [here](https://modelscope.cn/models/damo/text-to-video-synthesis/summary).
We are hiring! (Based in Beijing / Hangzhou, China.)
If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send your CV to us.
EMAIL: wangjiuniu.wjn@alibaba-inc.com
This model is based on a multi-stage text-to-video generation diffusion model, which takes a text description as input and returns a video that matches it. Only English input is supported.
## Model Description
The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction network, a diffusion model that maps text features to the video latent space, and a network that maps the video latent space to the video visual space. The model has about 1.7 billion parameters in total. Only English input is supported. The diffusion model adopts the Unet3D structure and generates video through an iterative denoising process that starts from pure Gaussian noise video.
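The iterative denoising process can be illustrated with a toy sampling loop. This is a minimal sketch of a generic DDPM-style ancestral sampler, not the sampler this model actually uses (the card does not specify one). The latent shape `(frames, channels, height, width)`, the linear noise schedule, and `toy_noise_predictor` (a stand-in for the Unet3D, with no text conditioning) are all assumptions made for illustration.

```python
import numpy as np


def toy_noise_predictor(x, t):
    """Hypothetical stand-in for the Unet3D noise predictor.

    A real model would predict the noise in `x` at step `t`, conditioned
    on text features. Here we simply return `x` itself.
    """
    return x


def ddpm_sample(shape, num_steps=50, seed=0):
    """Run a DDPM-style reverse process starting from pure Gaussian noise."""
    rng = np.random.default_rng(seed)

    # Assumed linear beta schedule; the real schedule is not given in the card.
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from a pure Gaussian noise "video" in latent space.
    x = rng.standard_normal(shape)

    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, t)
        # Posterior mean of the previous, less noisy latent.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the last one.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise

    return x


# Assumed latent layout: (frames, channels, height, width).
latent_video = ddpm_sample((4, 4, 8, 8), num_steps=10)
print(latent_video.shape)
```

In the actual model, the resulting latent would then be passed to the third sub-network, which maps the video latent space to the video visual space.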
**This model is meant for research purposes. Please look at the [model limitations and biases](#model-limitations-and-biases) and [misuse, malicious use and excessive use](#misuse-malicious-use-and-excessive-use) sections.**
**How the model is expected to be used and where it is applicable**
This model has a wide range of applications and can reason and generate videos based on arbitrary English text descriptions.
## How to use
Under the ModelScope framework, the model can be used by calling a simple Pipeline. The input must be a dictionary, the only legal key is 'text', and the value is a short text. This model currently supports inference only on GPU. The environment setup and a code example are given below.
For Colab usage, you can view [this webpage](https://colab.research.google.com/drive/1uW1ZqswkQ9Z9bp5Nbo5z59cAn7I0hE6R?usp=sharing).
### Operating environment (Python Package)
```
pip install git+https://github.com/modelscope/modelscope.git
pip install open_clip_torch
pip install pytorch-lightning
```
### Code example (Demo Code)
```python
import pathlib

from huggingface_hub import snapshot_download
from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline

# Download the model weights from the Hugging Face Hub into ./weights
model_dir = pathlib.Path('weights')
snapshot_download('damo-vilab/modelscope-damo-text-to-video-synthesis',
                  repo_type='model', local_dir=model_dir)

# Build the text-to-video pipeline from the local weights
pipe = pipeline('text-to-video-synthesis', model_dir.as_posix())

# The input must be a dict whose only legal key is 'text'
test_text = {
    'text': 'A panda eating bamboo on a rock.',
}
output_video_path = pipe(test_text)[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)
```
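Because the pipeline accepts only a dictionary with the key 'text', it can help to build and check that input before calling the pipeline. The following is a minimal sketch; `make_pipeline_input` is a hypothetical helper written for this example and is not part of ModelScope.

```python
def make_pipeline_input(prompt):
    """Build the input dict expected by the text-to-video pipeline.

    Hypothetical helper: it only enforces the documented contract
    (a dict with the single key 'text' holding a non-empty string).
    """
    if not isinstance(prompt, str):
        raise TypeError('prompt must be a string')
    text = prompt.strip()
    if not text:
        raise ValueError('prompt must not be empty')
    return {'text': text}


test_text = make_pipeline_input('  A panda eating bamboo on a rock. ')
print(test_text)
```

The returned dictionary can be passed to the pipeline in place of a hand-written one.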
### View results
The above code prints the save path of the output video. The output mp4 file can be played with [VLC media player](https://www.videolan.org/vlc/); some other media players may not play it correctly.
## Model limitations and biases
* The model is trained on public datasets such as Webvid, and the generated results may show biases related to the distribution of the training data.
* This model cannot generate video of film or television quality.
* The model cannot generate clear text.
* The model is mainly trained on an English corpus and does not support other languages at the moment.
* The performance of this model needs to be improved on complex compositional generation tasks.
## Misuse, Malicious Use and Excessive Use
* The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
* It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
* It is prohibited to generate pornographic, violent, or gory content.
* It is prohibited to generate erroneous or false information.
## Training data
The training data includes [LAION5B](https://huggingface.co/datasets/laion/laion2B-en), [ImageNet](https://www.image-net.org/), [Webvid](https://m-bain.github.io/webvid-dataset/) and other public datasets. Image and video filtering, such as by aesthetic score, watermark score, and deduplication, is performed after pre-training.
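As a rough illustration of the kind of filtering named above, the sketch below keeps samples that pass an aesthetic-score threshold and a watermark-score threshold, then drops exact duplicates. The `Sample` fields, the threshold values, the score ranges, and the use of exact SHA-256 hashing for deduplication are all assumptions; the card does not describe how the actual filtering was implemented.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record for one image or video clip."""
    content: bytes            # raw file bytes
    aesthetic_score: float    # assumed: higher means more pleasing
    watermark_score: float    # assumed: higher means more likely watermarked


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score thresholds and are not duplicates."""
    seen_digests = set()
    kept = []
    for sample in samples:
        if sample.aesthetic_score < min_aesthetic:
            continue
        if sample.watermark_score > max_watermark:
            continue
        # Exact-content deduplication; near-duplicate detection would need
        # a perceptual hash or embedding similarity instead.
        digest = hashlib.sha256(sample.content).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        kept.append(sample)
    return kept


samples = [
    Sample(b'clip-a', aesthetic_score=6.1, watermark_score=0.1),
    Sample(b'clip-a', aesthetic_score=6.1, watermark_score=0.1),  # duplicate
    Sample(b'clip-b', aesthetic_score=3.2, watermark_score=0.0),  # low aesthetic
    Sample(b'clip-c', aesthetic_score=7.0, watermark_score=0.9),  # watermarked
    Sample(b'clip-d', aesthetic_score=5.5, watermark_score=0.2),
]
kept = filter_samples(samples)
print([s.content for s in kept])
```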