Update README.md

README.md
@@ -63,7 +63,7 @@ Under the NVIDIA Open Model License, NVIDIA confirms:

## Model Architecture:

We designed Cosmos Tokenizer using a lightweight and computationally efficient architecture with a temporally causal design. Specifically, we employ causal temporal convolution and causal temporal attention layers to preserve the natural temporal order of video frames, enabling seamless tokenization of images and videos with a single unified network architecture. The encoder and decoder form a symmetric pair, each mirroring the other. The encoder starts with a 2-level [Haar wavelet](https://link.springer.com/book/10.1007/978-3-319-04295-4) transform layer, which down-samples inputs by a factor of 4 in both the spatial and temporal dimensions; the decoder ends with the corresponding inverse wavelet transform. For continuous tokenizers, we employ the vanilla autoencoder (AE) formulation to model the latent space. For discrete tokenizers, we adopt [Finite-Scalar-Quantization](https://openreview.net/forum?id=8ishA3LxN8) (FSQ) as the latent-space quantizer.
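
The two mechanisms named above are compact enough to sketch. Below is a minimal, hypothetical PyTorch illustration (not the actual modules from this repository; `CausalTemporalConv3d`, `fsq_quantize`, and the `(7, 5, 5, 5)` level choice are our assumptions) of a causal temporal convolution, which pads only the past side of the time axis so that each output frame depends only on current and earlier frames, and of FSQ, which bounds each latent channel and rounds it to a small fixed set of levels with a straight-through gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTemporalConv3d(nn.Module):
    """Hypothetical causal 3D convolution: all temporal padding is applied
    on the past side, so the output at frame t never sees frames after t.
    Spatial padding remains symmetric, as in an ordinary Conv3d."""

    def __init__(self, in_ch: int, out_ch: int, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1  # pad only toward the past
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kh // 2, kw // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width).
        # F.pad orders the last three dims as (W_l, W_r, H_l, H_r, T_l, T_r),
        # so this pads the time axis on the left (past) only.
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)


def fsq_quantize(z: torch.Tensor, levels=(7, 5, 5, 5)) -> torch.Tensor:
    """FSQ sketch: bound each latent channel, round it to a fixed grid of
    `levels[i]` values, and pass gradients straight through the rounding.
    Expects the channel dimension last, with z.shape[-1] == len(levels).
    Odd level counts keep the grid symmetric about zero; the published FSQ
    adds a half-step shift to support even counts as well."""
    half = (torch.tensor(levels, dtype=z.dtype, device=z.device) - 1) / 2
    bounded = torch.tanh(z) * half                    # per-channel range (-half, half)
    rounded = bounded + (bounded.round() - bounded).detach()  # straight-through round
    return rounded / half                             # rescale back to [-1, 1]
```

Note that a `(7, 5, 5, 5)` configuration yields an implicit codebook of 7 × 5 × 5 × 5 = 875 codes with no learned embedding table, which is the main appeal of FSQ over conventional vector quantization.
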
@@ -321,4 +321,4 @@ We value you, the datasets, the diversity they represent, and what we have been

# Core Contributors

Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu