WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
Paper • 2411.17459 • Published • 10
Note
1. Video energy is mainly concentrated in the low-frequency subband (see the sketch below).
2. We establish an energy flow pathway outside the backbone so that low-frequency information can flow smoothly from the video to the latent representation during encoding.
3. This allows the model to attend more to low-frequency information and apply higher compression rates to high-frequency information.
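A minimal sketch of point 1, under stated assumptions: a 1-D Haar decomposition of a random-walk signal stands in for the paper's 3-D video wavelet transform, since neighboring video pixels are similarly correlated. None of this code is from the paper.

```python
import numpy as np

# Stand-in for a line of video pixels: a smooth random walk
# (neighboring pixels in real video are likewise highly correlated).
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=1024))

# One level of Haar DWT: low-frequency (average) and high-frequency
# (difference) subbands, both orthonormal.
low = (x[0::2] + x[1::2]) / np.sqrt(2)
high = (x[0::2] - x[1::2]) / np.sqrt(2)

share = np.sum(low**2) / (np.sum(low**2) + np.sum(high**2))
print(f"low-frequency energy share: {share:.4f}")  # ~0.999 for smooth signals
```

The heavily skewed energy split is what motivates giving the low-frequency subband its own pathway while compressing high-frequency subbands harder.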
MAGVIT: Masked Generative Video Transformer
Paper • 2212.05199 • Published
Note
1. Inflation
1.1 Use a central inflation method for the convolution layers, where the corresponding 2D kernel fills in the temporally central slice of a zero-filled 3D kernel (sketched below).
1.2 Replace the same (zero) padding in the convolution layers with reflect padding.
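A minimal PyTorch sketch of 1.1 and 1.2; `centrally_inflate` and its defaults are my own naming, not the paper's code. At initialization the inflated 3D conv reproduces the 2D conv on every frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def centrally_inflate(conv2d: nn.Conv2d, kt: int = 3) -> nn.Conv3d:
    """Inflate a 2D conv into a 3D conv by copying the 2D kernel into the
    temporally central slice of a zero-filled 3D kernel."""
    assert kt % 2 == 1, "temporal kernel size must be odd to have a center"
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(kt, kh, kw),
        stride=(1, *conv2d.stride),
        padding=0,  # padding applied manually with reflect mode, per 1.2
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.zero_()                         # zero-filled 3D kernel
        conv3d.weight[:, :, kt // 2] = conv2d.weight  # central temporal slice
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: reflect padding instead of zero padding (note 1.2), then convolve.
x = torch.randn(1, 3, 8, 32, 32)  # (N, C, T, H, W)
conv3d = centrally_inflate(nn.Conv2d(3, 16, kernel_size=3, padding=0))
x = F.pad(x, (1, 1, 1, 1, 1, 1), mode="reflect")  # pad W, H, T by 1 each side
y = conv3d(x)                     # -> (1, 16, 8, 32, 32)
```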
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper • 2310.05737 • Published • 4
Note
1. Known as MAGVIT-v2. Growing the vocabulary size can benefit the generation quality.
2. Both reconstruction and generation consistently improve as the vocabulary size increases. The latent space is decomposed into single-dimensional variables, each quantized to one bit, and the token index is the binary number formed by those bits (lookup-free quantization; see the sketch below). For example, for a latent feature z ∈ R^4:
[-1, 1, -2, 3] → [0, 1, 0, 1] → sum([0, 2^1, 0, 2^3]) → 10
[ 1, 1, 1, 3] → [1, 1, 1, 1] → sum([2^0, 2^1, 2^2, 2^3]) → 15
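A minimal numpy sketch of the index computation in the worked example above; function name is mine, and MAGVIT-v2's {-1, 1} code representation and entropy loss are omitted.

```python
import numpy as np

def lfq_index(z: np.ndarray) -> int:
    """Lookup-free quantization token index: bit i is 1 iff z[i] > 0,
    and the index is sum_i bit_i * 2^i, giving a vocab of size 2^len(z)."""
    bits = (z > 0).astype(int)
    return int(np.sum(bits * 2 ** np.arange(len(z))))

print(lfq_index(np.array([-1, 1, -2, 3])))  # [0,1,0,1] -> 0+2+0+8 = 10
print(lfq_index(np.array([ 1, 1, 1, 3])))   # [1,1,1,1] -> 1+2+4+8 = 15
```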
Finite Scalar Quantization: VQ-VAE Made Simple
Paper • 2309.15505 • Published • 21
Note
1. Known as FSQ.
2.1 Achieves high codebook utilization by design (almost 100%).
2.2 Before FSQ, most of the literature used unbounded scalar quantization, in which the range of integers is not limited by the encoder but only by constraining the representation's entropy.
2.3 Vocab size: |C| = L^d when all dimensions use the same number of levels L.
2.4 A simple heuristic that performs well in all considered tasks: use L_i ≥ 5 ∀i (see the sketch below).
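A minimal numpy sketch of the FSQ forward pass, assuming odd level counts so that (L-1)/2 matches the paper's ⌊L/2⌋ scaling; the straight-through gradient used in training is omitted.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: list[int]) -> np.ndarray:
    """Finite Scalar Quantization: bound each dimension with tanh, scale to
    the available levels, and round. The codebook is the implicit product
    grid, so |C| = prod(levels) and every code is reachable by design."""
    half = (np.asarray(levels) - 1) / 2      # e.g. L=5 -> integers in [-2, 2]
    return np.round(half * np.tanh(z))

levels = [5, 5, 5, 5]                        # heuristic L_i >= 5; |C| = 5^4 = 625
z = np.array([-1.0, 1.0, -2.0, 3.0])
print(fsq_quantize(z, levels))               # [-2.  2. -2.  2.]
```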
Large Motion Video Autoencoding with Cross-modal Video VAE
Paper • 2412.17805 • Published • 21