NeMo
Haoxiang-Wang committed on
Commit 4773c4f · verified · 1 Parent(s): a8dd797

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -63,7 +63,7 @@ Under the NVIDIA Open Model License, NVIDIA confirms:
 
 ## Model Architecture:
 
-We designed Cosmos Tokenizer using a lightweight and computationally efficient architecture, featuring a temporally causal design. Specifically, we employ causal temporal convolution and causal temporal attention layers to preserve the natural temporal order of video frames, ensuring seamless tokenization of images and videos using a single unified network architecture. The encoder and decoder form a symmetrical pair, which are mirrors of each other. The encoder starts with a 2-level [Haar wavelet](https://link.springer.com/book/10.1007/978-3-319-04295-4) transform layer, which down-samples inputs by a factor of 4 in both spatial and temporal dimensions. Likewise, the decoder ends with an inverse wavelet transform. We employ the vanilla autoencoder (AE) formulation to model the latent space for continuous tokenizers. For discrete tokenizers, we adopt the [Finite-Scalar-Quantization](https://arxiv.org/abs/2309.15505) (FSQ) as the latent space quantizer.
+We designed Cosmos Tokenizer using a lightweight and computationally efficient architecture, featuring a temporally causal design. Specifically, we employ causal temporal convolution and causal temporal attention layers to preserve the natural temporal order of video frames, ensuring seamless tokenization of images and videos using a single unified network architecture. The encoder and decoder form a symmetrical pair, which are mirrors of each other. The encoder starts with a 2-level [Haar wavelet](https://link.springer.com/book/10.1007/978-3-319-04295-4) transform layer, which down-samples inputs by a factor of 4 in both spatial and temporal dimensions. Likewise, the decoder ends with an inverse wavelet transform. We employ the vanilla autoencoder (AE) formulation to model the latent space for continuous tokenizers. For discrete tokenizers, we adopt the [Finite-Scalar-Quantization](https://openreview.net/forum?id=8ishA3LxN8) (FSQ) as the latent space quantizer.
 
 ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/638fb8cf2380ffd99caf8c2a/gQH5n9iCEtqZc7uutUwdL.jpeg)
 
@@ -321,4 +321,4 @@ We value you, the datasets, the diversity they represent, and what we have been
 
 
 # Core Contributors
-Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu
+Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu
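
The architecture paragraph in the hunk above says the tokenizer uses causal temporal convolutions so that a frame's tokens never depend on future frames. The sketch below is only an illustration of that general idea, not the Cosmos Tokenizer source: the class name `CausalConv3d`, its kernel defaults, and the shapes are made up for the example.

```python
# Minimal sketch (assumed, not the Cosmos Tokenizer code): a temporally causal
# 3D convolution pads only the past side of the time axis, so the output at
# frame t cannot see frames t+1, t+2, ...
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv3d(nn.Module):
    """3D convolution that is causal along the temporal (depth) dimension."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.pad_t = kt - 1          # pad only the past frames
        self.pad_h = kh // 2         # symmetric spatial padding
        self.pad_w = kw // 2
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):            # x: (B, C, T, H, W)
        # F.pad order for 5D input: (w_left, w_right, h_left, h_right, t_left, t_right)
        x = F.pad(x, (self.pad_w, self.pad_w, self.pad_h, self.pad_h, self.pad_t, 0))
        return self.conv(x)


if __name__ == "__main__":
    video = torch.randn(1, 3, 9, 64, 64)       # 9 frames of 64x64 RGB
    out = CausalConv3d(3, 16)(video)
    print(out.shape)                            # torch.Size([1, 16, 9, 64, 64])
```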
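The only content change in this commit is the FSQ reference link, so a short sketch of Finite-Scalar-Quantization may help readers of the diff. This is a simplified reading of the cited paper, not the tokenizer's implementation; the function name `fsq_quantize` and the `levels=(8, 5, 5, 5)` configuration are assumptions chosen purely for illustration, and the implicit codebook size is the product of the per-channel levels.

```python
# Illustrative FSQ sketch (assumed, not the Cosmos Tokenizer code): bound each
# latent channel, round it to a small fixed set of integer levels, and pass
# gradients through the rounding with a straight-through estimator (STE).
import torch


def fsq_quantize(z: torch.Tensor, levels=(8, 5, 5, 5)) -> torch.Tensor:
    """Quantize the last dimension of z; len(levels) must equal z.shape[-1]."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)  # levels per channel
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half        # squash each channel into [-half, half]
    quantized = torch.round(bounded)      # snap to the nearest integer level
    # STE: forward pass returns `quantized`, backward pass sees `bounded`.
    return bounded + (quantized - bounded).detach()


if __name__ == "__main__":
    z = torch.randn(2, 7, 4, requires_grad=True)   # (batch, tokens, channels=len(levels))
    codes = fsq_quantize(z)
    codes.sum().backward()                          # gradients reach z via the STE
    print(codes.unique().numel(), "distinct scalar values used")
```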