[CLS] Token

#1
by insaf-im - opened

from transformers import VideoMAEModel

model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
outputs = model(pixel_values)  # pixel_values: (1, 16, 3, 224, 224)
last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)
[1, 1568, 768]

The output of the VideoMAE encoder is a sequence of 1568 tokens, each with 768 features (1 is the batch size).
Could you tell me which one is the [CLS] token?

Hi,

VideoMAE does not use a CLS token. The sequence length is equal to (num_frames // tubelet_size) * num_patches_per_frame, with num_patches_per_frame = (image_size // patch_size) ** 2.

Hence, in this case: (16//2) * (224 // 16)**2 = 1568.
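The arithmetic can be checked directly; the numbers below are the defaults used by the "MCG-NJU/videomae-base" checkpoint (16 frames, tubelet size 2, 224x224 images, 16x16 patches), matching the formula above:

```python
num_frames, tubelet_size = 16, 2
image_size, patch_size = 224, 16

# Each frame is split into (224 // 16) ** 2 = 196 patches,
# and pairs of frames are grouped into tubelets along time.
num_patches_per_frame = (image_size // patch_size) ** 2
seq_len = (num_frames // tubelet_size) * num_patches_per_frame
print(seq_len)  # 1568
```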

To get a representation of an entire video, you can simply average pool the last hidden states along the sequence dimension:

import torch

# last_hidden_state: (batch_size, seq_len, hidden_size) -> (batch_size, hidden_size)
video_features = torch.mean(last_hidden_state, dim=1)
