---
library_name: tf-keras
license: mit
metrics:
- accuracy
pipeline_tag: video-classification
tags:
- pretraining
- finetuning
- vision
- videomae
---
# VideoMAE
Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Drawing inspiration from the recent ImageMAE, VideoMAE uses customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, encouraging the model to extract more effective video representations during pre-training (a minimal sketch of the masking follows the list below). Some highlights of VideoMAE:
- Masked Video Modeling for Video Pre-Training
- A Simple, Efficient and Strong Baseline in SSVP
- High performance, but NO extra data required
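
To make the tube-masking idea concrete, here is a minimal NumPy sketch. This is not the repository's API; the `tube_mask` helper, the 14x14 patch grid, and the 0.9 mask ratio are illustrative (a 224x224 input with 16x16 patches yields a 14x14 grid, and the paper uses masking ratios around 90%).

```python
import numpy as np

def tube_mask(num_frames=16, num_patches=14 * 14, mask_ratio=0.9):
    """Sample a tube mask: the same spatial patches are hidden in every frame."""
    num_masked = int(mask_ratio * num_patches)
    masked = np.zeros(num_patches, dtype=bool)
    # The spatial choice is shared across time, so each masked patch forms a
    # "tube" through the whole clip and cannot be recovered from nearby frames.
    masked[np.random.choice(num_patches, num_masked, replace=False)] = True
    return np.broadcast_to(masked, (num_frames, num_patches))

mask = tube_mask()
print(mask.shape, mask.mean())  # (16, 196), ~0.9 of all tokens masked
```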
This is an unofficial Keras reimplementation of the VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training model. The official PyTorch implementation can be found at https://github.com/MCG-NJU/VideoMAE.
## Model Zoo
The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. Some highlights are given below.
### Kinetics-400
For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow SavedModel and H5 formats.
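
A fine-tuned checkpoint can be restored with the standard Keras loading API, which handles both formats. This is a minimal sketch: the checkpoint names below are hypothetical (use the actual paths from MODEL_ZOO.md), and the 224x224 input resolution is an assumption based on standard ViT settings.

```python
import numpy as np
from tensorflow import keras

# Hypothetical checkpoint names; substitute the real paths from MODEL_ZOO.md.
model = keras.models.load_model("TFVideoMAE_B_K400_FT")       # SavedModel directory
# model = keras.models.load_model("TFVideoMAE_B_K400_FT.h5")  # ...or the H5 file

# A Kinetics-400 fine-tuned checkpoint classifies a 16-frame RGB clip.
video = np.random.rand(1, 16, 224, 224, 3).astype("float32")
scores = model.predict(video)  # (1, 400) scores over the Kinetics-400 classes
```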
| Backbone | #Frames | Top-1 | Top-5 | Params [FT] (M) | Params [PT] (M) | FLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-S | 16x5x3 | 79.0 | 93.8 | 22 | 24 | 57G |
| ViT-B | 16x5x3 | 81.5 | 95.1 | 87 | 94 | 181G |
| ViT-L | 16x5x3 | 85.2 | 96.8 | 304 | 343 | - |
| ViT-H | 16x5x3 | 86.6 | 97.1 | 632 | ?* | - |
?* The official pre-trained ViT-H checkpoint of VideoMAE has a weight issue; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.

Only the FLOPs of the fine-tuned (FT) encoder models are reported.
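
The #Frames column follows the common frames x clips x crops convention: 16x5x3 means each test video is split into 5 temporal clips of 16 frames, each clip is evaluated with 3 spatial crops, and the 15 view-level predictions are averaged. Below is a minimal sketch of that protocol, assuming a `(frames, H, W, 3)` video array and a Keras-style `model.predict`; the helper functions are simplified illustrations (e.g. the temporal stride used in practice is ignored), not the repository's API.

```python
import numpy as np

def sample_clip(video, index, num_clips, num_frames=16):
    """Place `num_clips` evenly spaced windows of `num_frames` frames along the video."""
    start = int(index * (len(video) - num_frames) / max(num_clips - 1, 1))
    return video[start:start + num_frames]

def spatial_crop(clip, index, size=224):
    """Take one of three square crops along the longer spatial side."""
    h, w = clip.shape[1:3]
    off = int(index * (max(h, w) - size) / 2)
    if w >= h:  # landscape: left / centre / right crop
        top = (h - size) // 2
        return clip[:, top:top + size, off:off + size]
    left = (w - size) // 2  # portrait: top / middle / bottom crop
    return clip[:, off:off + size, left:left + size]

def evaluate_multi_view(model, video, num_clips=5, num_crops=3):
    """Average model scores over num_clips x num_crops views (e.g. 16x5x3)."""
    scores = [
        model.predict(spatial_crop(sample_clip(video, c, num_clips), k)[None])
        for c in range(num_clips)
        for k in range(num_crops)
    ]
    return np.mean(scores, axis=0)  # final video-level prediction
```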
### Something-Something V2
For SSv2, VideoMAE is trained for around 2400 epochs without any extra data.
| Backbone | #Frames | Top-1 | Top-5 | Params [FT] (M) | Params [PT] (M) | FLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-S | 16x2x3 | 66.8 | 90.3 | 22 | 24 | 57G |
| ViT-B | 16x2x3 | 70.8 | 92.4 | 86 | 94 | 181G |
### UCF101
For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.
| Backbone | #Frames | Top-1 | Top-5 | Params [FT] (M) | Params [PT] (M) | FLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-B | 16x5x3 | 91.3 | 98.5 | 86 | 94 | 181G |