Problem with the `model_max_length` attribute
#3 · by h4c5 · opened
The `model_max_length` attribute of the `camembert/camembert-base` tokenizer is set to `VERY_LARGE_INTEGER`:
```python
from transformers import CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
print(tokenizer.model_max_length)
# 1000000000000000019884624838656
```
This is probably because the model name in `max_model_input_sizes` is `camembert-base` instead of `camembert/camembert-base` (see the pretrained tokenizer initialization):
```python
print(tokenizer.max_model_input_sizes)
# {'camembert-base': 512}
```
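The failure mode can be sketched in plain Python: the size table is keyed by the canonical short name, so a lookup with the namespaced repo id misses and the sentinel wins. The table and function below are illustrative stand-ins, not the actual `transformers` internals:

```python
VERY_LARGE_INTEGER = int(1e30)  # sentinel transformers uses for "no known limit"

# Stand-in for the tokenizer's size table, keyed by the short model name only
max_model_input_sizes = {"camembert-base": 512}

def resolve_model_max_length(pretrained_name):
    # Falls back to the sentinel when the requested name is not in the table
    return max_model_input_sizes.get(pretrained_name, VERY_LARGE_INTEGER)

print(resolve_model_max_length("camembert-base"))            # 512
print(resolve_model_max_length("camembert/camembert-base"))  # 1000000000000000019884624838656
```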
As a result, the example given in the model card does not work with long sequences:
```python
import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
camembert = CamembertModel.from_pretrained("camembert/camembert-base")

tokenized_sentence = tokenizer.tokenize("J'aime le camembert !" * 100)
encoded_sentence = tokenizer.encode(tokenized_sentence)  # nothing is truncated: model_max_length is huge
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# RuntimeError: The expanded size of the tensor (802) must match the existing size (514)
# at non-singleton dimension 1. Target sizes: [1, 802]. Tensor sizes: [1, 514]
```
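Until the size table is fixed, one workaround is to clamp the encoded ids to the model's real positional limit (512 for CamemBERT) before building the tensor. A minimal sketch in plain Python; the `eos_id` value is an assumption for illustration, not taken from the tokenizer:

```python
MODEL_MAX_LENGTH = 512  # CamemBERT's actual input limit, per max_model_input_sizes

def clamp(ids, limit=MODEL_MAX_LENGTH, eos_id=6):
    # Truncate an over-long id sequence and re-append the end token
    # (eos_id=6 is a hypothetical value used here for illustration).
    if len(ids) <= limit:
        return ids
    return ids[: limit - 1] + [eos_id]

encoded = list(range(802))  # stand-in for the 802-id encoded sentence above
print(len(clamp(encoded)))  # 512
```

Alternatively, setting `tokenizer.model_max_length = 512` after loading lets the tokenizer's own truncation options take effect.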