---
library_name: tf-keras
license: mit
metrics:
- accuracy
pipeline_tag: video-classification
tags:
- pretraining
- finetuning
- vision
- videomae
---
# VideoMAE
Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Drawing inspiration from the recent ImageMAE, VideoMAE uses customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, encouraging the model to extract more effective video representations during pre-training (a minimal sketch of the masking follows the list below). Some highlights of VideoMAE:
- Masked Video Modeling for Video Pre-Training
- A Simple, Efficient and Strong Baseline in SSVP
- High performance, but NO extra data required
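
To make the tube-masking idea concrete, here is a minimal NumPy sketch. This is not the repository's API; the `tube_mask` helper, the 14x14 patch grid, and the 0.9 mask ratio are illustrative (a 224x224 input with 16x16 patches yields a 14x14 grid, and the paper uses masking ratios around 90%).

```python
import numpy as np

def tube_mask(num_frames=16, num_patches=14 * 14, mask_ratio=0.9):
    """Sample a tube mask: the same spatial patches are hidden in every frame."""
    num_masked = int(mask_ratio * num_patches)
    masked = np.zeros(num_patches, dtype=bool)
    # The spatial choice is shared across time, so each masked patch forms a
    # "tube" through the whole clip and cannot be recovered from nearby frames.
    masked[np.random.choice(num_patches, num_masked, replace=False)] = True
    return np.broadcast_to(masked, (num_frames, num_patches))

mask = tube_mask()
print(mask.shape, mask.mean())  # (16, 196), ~0.9 of all tokens masked
```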
This is an unofficial Keras reimplementation of the VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training model. The official PyTorch implementation can be found at https://github.com/MCG-NJU/VideoMAE.
## Model Zoo
The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. Some highlights are given below.
### Kinetics-400
For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow SavedModel and H5 formats.
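
A fine-tuned checkpoint can be restored with the standard Keras loading API, which handles both formats. This is a minimal sketch: the checkpoint names below are hypothetical (use the actual paths from MODEL_ZOO.md), and the 224x224 input resolution is an assumption based on standard ViT settings.

```python
import numpy as np
from tensorflow import keras

# Hypothetical checkpoint names; substitute the real paths from MODEL_ZOO.md.
model = keras.models.load_model("TFVideoMAE_B_K400_FT")       # SavedModel directory
# model = keras.models.load_model("TFVideoMAE_B_K400_FT.h5")  # ...or the H5 file

# A Kinetics-400 fine-tuned checkpoint classifies a 16-frame RGB clip.
video = np.random.rand(1, 16, 224, 224, 3).astype("float32")
scores = model.predict(video)  # (1, 400) scores over the Kinetics-400 classes
```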
| Backbone | #Frames | Top-1 | Top-5 | Params [FT] (M) | Params [PT] (M) | FLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-S | 16x5x3 | 79.0 | 93.8 | 22 | 24 | 57G |
| ViT-B | 16x5x3 | 81.5 | 95.1 | 87 | 94 | 181G |
| ViT-L | 16x5x3 | 85.2 | 96.8 | 304 | 343 | - |
| ViT-H | 16x5x3 | 86.6 | 97.1 | 632 | ?* | - |
?* The official pre-trained ViT-H checkpoint of VideoMAE has a weight issue; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.

Only the FLOPs of the fine-tuned (FT) encoder models are reported.
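
The #Frames column follows the common frames x clips x crops convention: 16x5x3 means each test video is split into 5 temporal clips of 16 frames, each clip is evaluated with 3 spatial crops, and the 15 view-level predictions are averaged. Below is a minimal sketch of that protocol, assuming a `(frames, H, W, 3)` video array and a Keras-style `model.predict`; the helper functions are simplified illustrations (e.g. the temporal stride used in practice is ignored), not the repository's API.

```python
import numpy as np

def sample_clip(video, index, num_clips, num_frames=16):
    """Place `num_clips` evenly spaced windows of `num_frames` frames along the video."""
    start = int(index * (len(video) - num_frames) / max(num_clips - 1, 1))
    return video[start:start + num_frames]

def spatial_crop(clip, index, size=224):
    """Take one of three square crops along the longer spatial side."""
    h, w = clip.shape[1:3]
    off = int(index * (max(h, w) - size) / 2)
    if w >= h:  # landscape: left / centre / right crop
        top = (h - size) // 2
        return clip[:, top:top + size, off:off + size]
    left = (w - size) // 2  # portrait: top / middle / bottom crop
    return clip[:, off:off + size, left:left + size]

def evaluate_multi_view(model, video, num_clips=5, num_crops=3):
    """Average model scores over num_clips x num_crops views (e.g. 16x5x3)."""
    scores = [
        model.predict(spatial_crop(sample_clip(video, c, num_clips), k)[None])
        for c in range(num_clips)
        for k in range(num_crops)
    ]
    return np.mean(scores, axis=0)  # final video-level prediction
```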
### Something-Something V2
For SSv2, VideoMAE is trained for around 2400 epochs without any extra data.
| Backbone | #Frames | Top-1 | Top-5 | Params [FT] (M) | Params [PT] (M) | FLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-S | 16x2x3 | 66.8 | 90.3 | 22 | 24 | 57G |
| ViT-B | 16x2x3 | 70.8 | 92.4 | 86 | 94 | 181G |
### UCF101
For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.
| Backbone | #Frames | Top-1 | Top-5 | Params [FT] (M) | Params [PT] (M) | FLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-B | 16x5x3 | 91.3 | 98.5 | 86 | 94 | 181G |