TencentARC/Divot

tttoaster committed on Dec 6, 2024

Commit eec132e (verified) · 1 Parent(s): 4894f0e

Update README.md

Files changed (1)

README.md CHANGED

@@ -3,11 +3,11 @@ license: apache-2.0
 ---
 # Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
+[![arXiv](https://img.shields.io/badge/arXiv-2404.14396-b31b1b.svg)](https://arxiv.org/abs/2412.04432)
 [![Static Badge](https://img.shields.io/badge/Github-black)](https://github.com/TencentARC/Divot)
->We introduce Divot, a **Di**ffusion-Powered **V**ide**o** **T**okenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations.
+>We introduce [Divot](https://arxiv.org/abs/2412.04432), a **Di**ffusion-Powered **V**ide**o** **T**okenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations.
 Building upon the Divot tokenizer, we present **Divot-LLM** through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model.
 All models, training code and inference code are released!