WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
Paper • 2411.17459 • Published • 10
Note
1. Video energy is mainly concentrated in the low-frequency subband (see the sketch below).
2. We establish an energy flow pathway outside the backbone so that low-frequency information can flow smoothly from the video to the latent representation during encoding.
3. This allows the model to attend more to low-frequency information and apply higher compression rates to high-frequency information.
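A minimal sketch of point 1, under stated assumptions: a 1-D Haar decomposition of a random-walk signal stands in for the paper's 3-D video wavelet transform, since neighboring video pixels are similarly correlated. None of this code is from the paper.

```python
import numpy as np

# Stand-in for a line of video pixels: a smooth random walk
# (neighboring pixels in real video are likewise highly correlated).
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=1024))

# One level of Haar DWT: low-frequency (average) and high-frequency
# (difference) subbands, both orthonormal.
low = (x[0::2] + x[1::2]) / np.sqrt(2)
high = (x[0::2] - x[1::2]) / np.sqrt(2)

share = np.sum(low**2) / (np.sum(low**2) + np.sum(high**2))
print(f"low-frequency energy share: {share:.4f}")  # ~0.999 for smooth signals
```

The heavily skewed energy split is what motivates giving the low-frequency subband its own pathway while compressing high-frequency subbands harder.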
MAGVIT: Masked Generative Video Transformer
Paper • 2212.05199 • Published
Note
1. Inflation
1.1 Use a central inflation method for the convolution layers, where the corresponding 2D kernel fills in the temporally central slice of a zero-filled 3D kernel (sketched below).
1.2 Replace the same (zero) padding in the convolution layers with reflect padding.
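A minimal PyTorch sketch of 1.1 and 1.2; `centrally_inflate` and its defaults are my own naming, not the paper's code. At initialization the inflated 3D conv reproduces the 2D conv on every frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def centrally_inflate(conv2d: nn.Conv2d, kt: int = 3) -> nn.Conv3d:
    """Inflate a 2D conv into a 3D conv by copying the 2D kernel into the
    temporally central slice of a zero-filled 3D kernel."""
    assert kt % 2 == 1, "temporal kernel size must be odd to have a center"
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(kt, kh, kw),
        stride=(1, *conv2d.stride),
        padding=0,  # padding applied manually with reflect mode, per 1.2
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.zero_()                         # zero-filled 3D kernel
        conv3d.weight[:, :, kt // 2] = conv2d.weight  # central temporal slice
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: reflect padding instead of zero padding (note 1.2), then convolve.
x = torch.randn(1, 3, 8, 32, 32)  # (N, C, T, H, W)
conv3d = centrally_inflate(nn.Conv2d(3, 16, kernel_size=3, padding=0))
x = F.pad(x, (1, 1, 1, 1, 1, 1), mode="reflect")  # pad W, H, T by 1 each side
y = conv3d(x)                     # -> (1, 16, 8, 32, 32)
```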
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper • 2310.05737 • Published • 4
Note
1. Known as MAGVIT-v2. Growing the vocabulary size can benefit the generation quality.
2. Both reconstruction and generation consistently improve as the vocabulary size increases. The latent space is decomposed into single-dimensional variables, each quantized to one bit, and the token index is the binary number formed by those bits (lookup-free quantization; see the sketch below). For example, for a latent feature z ∈ R^4:
[-1, 1, -2, 3] → [0, 1, 0, 1] → sum([0, 2^1, 0, 2^3]) → 10
[ 1, 1, 1, 3] → [1, 1, 1, 1] → sum([2^0, 2^1, 2^2, 2^3]) → 15
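A minimal numpy sketch of the index computation in the worked example above; function name is mine, and MAGVIT-v2's {-1, 1} code representation and entropy loss are omitted.

```python
import numpy as np

def lfq_index(z: np.ndarray) -> int:
    """Lookup-free quantization token index: bit i is 1 iff z[i] > 0,
    and the index is sum_i bit_i * 2^i, giving a vocab of size 2^len(z)."""
    bits = (z > 0).astype(int)
    return int(np.sum(bits * 2 ** np.arange(len(z))))

print(lfq_index(np.array([-1, 1, -2, 3])))  # [0,1,0,1] -> 0+2+0+8 = 10
print(lfq_index(np.array([ 1, 1, 1, 3])))   # [1,1,1,1] -> 1+2+4+8 = 15
```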
Finite Scalar Quantization: VQ-VAE Made Simple
Paper • 2309.15505 • Published • 21
Note
1. Known as FSQ.
2.1 Achieves high codebook utilization by design (almost 100%).
2.2 Before FSQ, most of the literature used unbounded scalar quantization, in which the range of integers is not limited by the encoder but only by constraining the representation's entropy.
2.3 Vocab size: |C| = L^d when all dimensions use the same number of levels L.
2.4 A simple heuristic that performs well in all considered tasks: use L_i ≥ 5 ∀i (see the sketch below).
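A minimal numpy sketch of the FSQ forward pass, assuming odd level counts so that (L-1)/2 matches the paper's ⌊L/2⌋ scaling; the straight-through gradient used in training is omitted.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: list[int]) -> np.ndarray:
    """Finite Scalar Quantization: bound each dimension with tanh, scale to
    the available levels, and round. The codebook is the implicit product
    grid, so |C| = prod(levels) and every code is reachable by design."""
    half = (np.asarray(levels) - 1) / 2      # e.g. L=5 -> integers in [-2, 2]
    return np.round(half * np.tanh(z))

levels = [5, 5, 5, 5]                        # heuristic L_i >= 5; |C| = 5^4 = 625
z = np.array([-1.0, 1.0, -2.0, 3.0])
print(fsq_quantize(z, levels))               # [-2.  2. -2.  2.]
```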
Large Motion Video Autoencoding with Cross-modal Video VAE
Paper • 2412.17805 • Published • 21