Problem with the `model_max_length` attribute
#3 · by h4c5 · opened
The `model_max_length` attribute of the `camembert/camembert-base` tokenizer is set to `VERY_LARGE_INTEGER`:
```python
from transformers import CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
print(tokenizer.model_max_length)
# 1000000000000000019884624838656
```
This is probably because the model name in `max_model_input_sizes` is `camembert-base` instead of `camembert/camembert-base` (see the pretrained tokenizer initialization):
```python
print(tokenizer.max_model_input_sizes)
# {'camembert-base': 512}
```
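The failure mode can be sketched in plain Python: the size table is keyed by the canonical short name, so a lookup with the namespaced repo id misses and the sentinel wins. The table and function below are illustrative stand-ins, not the actual `transformers` internals:

```python
VERY_LARGE_INTEGER = int(1e30)  # sentinel transformers uses for "no known limit"

# Stand-in for the tokenizer's size table, keyed by the short model name only
max_model_input_sizes = {"camembert-base": 512}

def resolve_model_max_length(pretrained_name):
    # Falls back to the sentinel when the requested name is not in the table
    return max_model_input_sizes.get(pretrained_name, VERY_LARGE_INTEGER)

print(resolve_model_max_length("camembert-base"))            # 512
print(resolve_model_max_length("camembert/camembert-base"))  # 1000000000000000019884624838656
```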
As a result, the example given in the model card does not work with long sequences:
```python
import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
camembert = CamembertModel.from_pretrained("camembert/camembert-base")

tokenized_sentence = tokenizer.tokenize("J'aime le camembert !" * 100)
encoded_sentence = tokenizer.encode(tokenized_sentence)  # nothing is truncated: model_max_length is huge
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# RuntimeError: The expanded size of the tensor (802) must match the existing size (514)
# at non-singleton dimension 1. Target sizes: [1, 802]. Tensor sizes: [1, 514]
```
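Until the size table is fixed, one workaround is to clamp the encoded ids to the model's real positional limit (512 for CamemBERT) before building the tensor. A minimal sketch in plain Python; the `eos_id` value is an assumption for illustration, not taken from the tokenizer:

```python
MODEL_MAX_LENGTH = 512  # CamemBERT's actual input limit, per max_model_input_sizes

def clamp(ids, limit=MODEL_MAX_LENGTH, eos_id=6):
    # Truncate an over-long id sequence and re-append the end token
    # (eos_id=6 is a hypothetical value used here for illustration).
    if len(ids) <= limit:
        return ids
    return ids[: limit - 1] + [eos_id]

encoded = list(range(802))  # stand-in for the 802-id encoded sentence above
print(len(clamp(encoded)))  # 512
```

Alternatively, setting `tokenizer.model_max_length = 512` after loading lets the tokenizer's own truncation options take effect.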