[CLS] Token

#1
by insaf-im - opened

from transformers import VideoMAEModel

model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
outputs = model(pixel_values)  # pixel_values: (1, 16, 3, 224, 224)
last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)
[1, 1568, 768]

The output of the VideoMAE encoder is a sequence of 1568 tokens, each with 768 features (1 is the batch size).
Could you tell me which one is the [CLS] token?

Hi,

VideoMAE does not use a CLS token. The sequence length is equal to (num_frames // tubelet_size) * num_patches_per_frame, with num_patches_per_frame = (image_size // patch_size) ** 2.

Hence, in this case: (16//2) * (224 // 16)**2 = 1568.
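The arithmetic can be checked directly; the numbers below are the defaults used by the "MCG-NJU/videomae-base" checkpoint (16 frames, tubelet size 2, 224x224 images, 16x16 patches), matching the formula above:

```python
num_frames, tubelet_size = 16, 2
image_size, patch_size = 224, 16

# Each frame is split into (224 // 16) ** 2 = 196 patches,
# and pairs of frames are grouped into tubelets along time.
num_patches_per_frame = (image_size // patch_size) ** 2
seq_len = (num_frames // tubelet_size) * num_patches_per_frame
print(seq_len)  # 1568
```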

To get a representation of an entire video, you can simply average pool the last hidden states along the sequence dimension:

import torch

# last_hidden_state: (batch_size, seq_len, hidden_size) -> (batch_size, hidden_size)
video_features = torch.mean(last_hidden_state, dim=1)
