ved-genmo committed
Commit 3c268ef
1 Parent(s): db56120

Update README.md

Files changed (1): README.md (+98 −27)
README.md CHANGED
@@ -9,10 +9,13 @@ pipeline_tag: text-to-video
library_name: diffusers
---

- # Mochi 1 Preview
- A state-of-the-art video generation model by [Genmo](https://genmo.ai).

- ![Grid](assets/grid.gif)

## Overview

@@ -20,60 +23,94 @@ Mochi 1 preview is an open state-of-the-art video generation model with high-fid

## Installation

- Clone the repository and install it in editable mode:
-
Install using [uv](https://github.com/astral-sh/uv):

```bash
- git clone https://github.com/genmoai/mochi
- cd mochi
pip install uv
uv venv .venv
source .venv/bin/activate
- uv pip install -e .
```

## Download Weights

- Download the weights from [Hugging Face](https://huggingface.co/genmo/mochi-1-preview/tree/main) or via `magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce`.

## Running

Start the gradio UI with

```bash
- python3 -m mochi_preview.gradio_ui --model_dir "<path_to_model_directory>"
```

Or generate videos directly from the CLI with

```bash
- python3 -m mochi_preview.infer --prompt "A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and sprigs of mint against a peach-colored background. The hand gently tosses the lemon up and catches it, showcasing its smooth texture. A beige string bag sits beside the bowl, adding a rustic touch to the scene. Additional lemons, one halved, are scattered around the base of the bowl. The even lighting enhances the vibrant colors and creates a fresh, inviting atmosphere." --seed 1710977262 --cfg_scale 4.5 --model_dir "<path_to_model_directory>"
```

- Replace `<path_to_model_directory>` with the path to your model directory.
-
- ## Model Architecture

- Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture.

- Alongside Mochi, we are open-sourcing our video VAE. Our VAE causally compresses videos to a 96x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.

- An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
- Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
-
- ## Hardware Requirements
-
- Mochi 1 supports a variety of hardware platforms depending on quantization level, ranging from a single 3090 GPU up to multiple H100 GPUs.

- ## Safety
- Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and careful consideration before deploying these model weights in any commercial services or products.
-
- ## Limitations
- Under the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p today. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles so does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.

## Running with Diffusers

Install the latest version of Diffusers

```shell
@@ -124,6 +161,40 @@ export_to_video(frames, "mochi.mp4", fps=30)

To learn more check out the [Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi) documentation

## BibTeX
```
@misc{genmo2024mochi,
 
library_name: diffusers
---

+ # Mochi 1
+ [Blog](https://www.genmo.ai/blog) | [Hugging Face](https://huggingface.co/genmo/mochi-1-preview) | [Playground](https://www.genmo.ai/play) | [Careers](https://jobs.ashbyhq.com/genmo)
+
+ A state-of-the-art video generation model by [Genmo](https://genmo.ai).
+
+ https://github.com/user-attachments/assets/4d268d02-906d-4cb0-87cc-f467f1497108

## Overview

## Installation

Install using [uv](https://github.com/astral-sh/uv):

```bash
+ git clone https://github.com/genmoai/models
+ cd models
pip install uv
uv venv .venv
source .venv/bin/activate
+ uv pip install setuptools
+ uv pip install -e . --no-build-isolation
+ ```
+
+ If you want to install Flash Attention, you can use:
+ ```
+ uv pip install -e .[flash] --no-build-isolation
```

+ You will also need to install [FFmpeg](https://www.ffmpeg.org/) to turn your outputs into videos.
+
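If FFmpeg is not already on your system, it is typically available from your package manager. For example (assuming Debian/Ubuntu or macOS with Homebrew):

```bash
# Debian/Ubuntu
sudo apt-get install ffmpeg

# macOS (Homebrew)
brew install ffmpeg
```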
## Download Weights

+ Use [download_weights.py](scripts/download_weights.py) to download the model and decoder to a local directory:
+ ```
+ python3 ./scripts/download_weights.py <path_to_downloaded_directory>
+ ```
+
+ Or download the weights directly from [Hugging Face](https://huggingface.co/genmo/mochi-1-preview/tree/main) or via `magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce` to a folder on your computer.
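Alternatively, the weights can be fetched with the Hugging Face CLI; this is a sketch assuming `huggingface_hub` is installed, with the repo id taken from the Hugging Face link above:

```bash
pip install "huggingface_hub[cli]"
huggingface-cli download genmo/mochi-1-preview --local-dir <path_to_downloaded_directory>
```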
 
## Running

Start the gradio UI with

```bash
+ python3 ./demos/gradio_ui.py --model_dir "<path_to_downloaded_directory>"
```

Or generate videos directly from the CLI with

```bash
+ python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>"
```

+ Replace `<path_to_downloaded_directory>` with the path to your model directory.
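The previous release's CLI accepted `--prompt`, `--seed`, and `--cfg_scale` flags; assuming `demos/cli.py` exposes similar options (check `python3 ./demos/cli.py --help` for the exact names), a full invocation might look like:

```bash
python3 ./demos/cli.py \
  --model_dir "<path_to_downloaded_directory>" \
  --prompt "A hand picks up a bright yellow lemon from a wooden bowl filled with lemons and sprigs of mint." \
  --seed 1710977262 \
  --cfg_scale 4.5
```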
 
+ ## API

+ This repository comes with a simple, composable API, so you can programmatically call the model. **This API gives the highest quality results.** You can find a full example [here](demos/api_example.py). Roughly, it looks like this:

+ ```python
+ from genmo.mochi_preview.pipelines import (
+     DecoderModelFactory,
+     DitModelFactory,
+     MochiSingleGPUPipeline,
+     T5ModelFactory,
+     linear_quadratic_schedule,
+ )
+
+ pipeline = MochiSingleGPUPipeline(
+     text_encoder_factory=T5ModelFactory(),
+     dit_factory=DitModelFactory(
+         model_path=f"{MOCHI_DIR}/dit.safetensors", model_dtype="bf16"
+     ),
+     decoder_factory=DecoderModelFactory(
+         model_path=f"{MOCHI_DIR}/vae.safetensors",
+     ),
+     cpu_offload=True,
+     decode_type="tiled_full",
+ )
+
+ video = pipeline(
+     height=480,
+     width=848,
+     num_frames=31,
+     num_inference_steps=64,
+     sigma_schedule=linear_quadratic_schedule(64, 0.025),
+     cfg_schedule=[4.5] * 64,
+     batch_cfg=False,
+     prompt="your favorite prompt here ...",
+     negative_prompt="",
+     seed=12345,
+ )
+ ```
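The exact return type of `pipeline(...)` is defined by the repository (see [api_example.py](demos/api_example.py) for the canonical save step). As a rough sketch, if `video` behaves like an array of RGB frames, it can be written to disk with imageio, which drives FFmpeg under the hood:

```python
import numpy as np
import imageio  # requires the imageio-ffmpeg backend for .mp4 output

# Assumption: `video` from the snippet above is (or contains) uint8 RGB frames
# shaped (num_frames, height, width, 3); adjust if the pipeline returns a batch
# dimension or a different container type.
frames = np.asarray(video)
if frames.ndim == 5:  # (batch, frames, H, W, C) -> take the first clip
    frames = frames[0]
imageio.mimsave("mochi.mp4", list(frames), fps=30)
```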
 
 
 
 
 
 
## Running with Diffusers

+ You can also run Mochi 1 with the Hugging Face Diffusers library.
+
Install the latest version of Diffusers

```shell

To learn more, check out the [Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi) documentation.
+
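For reference, a minimal sketch of the Diffusers path. The model id, `bf16` variant, and memory-saving calls below follow the linked Diffusers documentation rather than this repository, so verify them against those docs:

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
)
# Reduce memory usage so the pipeline can run on a single GPU.
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "A hand picks up a bright yellow lemon from a wooden bowl."
frames = pipe(prompt, num_frames=84).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```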
+ ## Model Architecture
+
+ Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture. Additionally, we are releasing an inference harness that includes an efficient context-parallel implementation.
+
+ Alongside Mochi, we are open-sourcing our video AsymmVAE. We use an asymmetric encoder-decoder structure to build an efficient, high-quality compression model. Our AsymmVAE causally compresses videos to a 128x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.
+
+ ### AsymmVAE Model Specs
+ | Params <br> Count | Enc Base <br> Channels | Dec Base <br> Channels | Latent <br> Dim | Spatial <br> Compression | Temporal <br> Compression |
+ |:--:|:--:|:--:|:--:|:--:|:--:|
+ | 362M | 64 | 128 | 12 | 8x8 | 6x |
+
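As a concrete example of what the 8x8 spatial and 6x temporal compression mean for tensor shapes, here is a quick sketch; the causal "first frame kept, then groups of six" convention and the channel-first layout are assumptions, not something this README states:

```python
# Latent shape for a 480x848 clip with num_frames=31 (as in the API example below).
num_frames, height, width = 31, 480, 848
latent_frames = (num_frames - 1) // 6 + 1               # 6, assuming causal 6x temporal compression
latent_height, latent_width = height // 8, width // 8   # 60, 106
latent_shape = (12, latent_frames, latent_height, latent_width)
print(latent_shape)  # (12, 6, 60, 106)
```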
+ An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
+ Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
+
+ ### AsymmDiT Model Specs
+ | Params <br> Count | Num <br> Layers | Num <br> Heads | Visual <br> Dim | Text <br> Dim | Visual <br> Tokens | Text <br> Tokens |
+ |:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+ | 10B | 48 | 24 | 3072 | 1536 | 44520 | 256 |
+
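The visual token count in the table is consistent with these numbers if one assumes a 2x2 patchify step on the latent grid (the patch size is an assumption; everything else follows from the compression factors above):

```python
# 480x848 frames give 60x106 latents; 2x2 patches then yield 30 * 53 tokens per latent frame.
tokens_per_frame = (480 // 8 // 2) * (848 // 8 // 2)  # 1590
latent_frames = 44520 // tokens_per_frame             # 28
pixel_frames = (latent_frames - 1) * 6 + 1            # 163 frames, about 5.4 s at 30 fps
print(tokens_per_frame, latent_frames, pixel_frames)  # 1590 28 163
```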
+ ## Hardware Requirements
+ The repository supports both multi-GPU operation (splitting the model across multiple graphics cards) and single-GPU operation, though it requires approximately 60GB VRAM when running on a single GPU. While ComfyUI can optimize Mochi to run on less than 20GB VRAM, this implementation prioritizes flexibility over memory efficiency. When using this repository, we recommend using at least one H100 GPU.
+
+ ## Safety
+ Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and careful consideration before deploying these model weights in any commercial services or products.
+
+ ## Limitations
+ Under the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p today. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles and does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.
+
+ ## Related Work
+ - [ComfyUI-MochiWrapper](https://github.com/kijai/ComfyUI-MochiWrapper) adds ComfyUI support for Mochi. The integration of PyTorch's SDPA attention was taken from their repository.
+ - [mochi-xdit](https://github.com/xdit-project/mochi-xdit) is a fork of this repository that improves parallel inference speed with [xDiT](https://github.com/xdit-project/xdit).
+
+
## BibTeX
```
@misc{genmo2024mochi,