Tokenizer doesn't load with transformers 4.34.4
As mentioned in the model card, transformers==4.43.4 doesn't seem to work when loading the tokenizer; it works fine on transformers==4.45.0. The underlying tokenizers versions are 0.19.1 and 0.20.3 respectively. For reference, the failing call is just the stock AutoTokenizer load; a minimal sketch (the local path is specific to my environment):
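```python
import tokenizers
import transformers
from transformers import AutoTokenizer

# Versions observed: the load fails with transformers==4.43.4 /
# tokenizers==0.19.1 and succeeds with transformers==4.45.0 /
# tokenizers==0.20.3.
print(transformers.__version__, tokenizers.__version__)

# The local path is from my environment; the Hub ID
# "meta-llama/Llama-3.3-70B-Instruct" resolves to the same tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/model_pvc/models/Llama-3.3-70B-Instruct"
)
```
On transformers==4.43.4 / tokenizers==0.19.1 this throws: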
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[1], line 2
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained("/mnt/model_pvc/models/Llama-3.3-70B-Instruct")

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:896, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    892 if tokenizer_class is None:
    893     raise ValueError(
    894         f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    895     )
--> 896 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    898 # Otherwise we have to be creative.
    899 # if model is an encoder decoder, the encoder tokenizer class is used by default
    900 if isinstance(config, EncoderDecoderConfig):

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2291, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2288 else:
   2289     logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2291 return cls._from_pretrained(
   2292     resolved_vocab_files,
   2293     pretrained_model_name_or_path,
   2294     init_configuration,
   2295     *init_inputs,
   2296     token=token,
   2297     cache_dir=cache_dir,
   2298     local_files_only=local_files_only,
   2299     _commit_hash=commit_hash,
   2300     _is_local=is_local,
   2301     trust_remote_code=trust_remote_code,
   2302     **kwargs,
   2303 )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2525, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2523 # Instantiate the tokenizer.
   2524 try:
-> 2525     tokenizer = cls(*init_inputs, **init_kwargs)
   2526 except OSError:
   2527     raise OSError(
   2528         "Unable to load vocabulary from file. "
   2529         "Please check that the provided vocabulary is accessible and not corrupted."
   2530     )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py:115, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    112     fast_tokenizer = copy.deepcopy(tokenizer_object)
    113 elif fast_tokenizer_file is not None and not from_slow:
    114     # We have a serialization from tokenizers which let us directly build the backend
--> 115     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    116 elif slow_tokenizer is not None:
    117     # We need to convert a slow tokenizer to build the backend
    118     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
There seem to be similar issues reported on other models across repos (e.g. 1, 2).
Is it because some config is missing, or is it a version mismatch? If it is the latter, could you please mention in the model card that 4.45.0 is necessary?
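FWIW, the traceback bottoms out in TokenizerFast.from_file, so the tokenizers layer can be checked in isolation; a sketch (path from my environment):

```python
from tokenizers import Tokenizer

# Loading tokenizer.json directly through tokenizers raises the same
# untagged-enum ModelWrapper error on 0.19.1 and succeeds on 0.20.3,
# which points at a serialization-version mismatch rather than a
# missing config entry.
tok = Tokenizer.from_file(
    "/mnt/model_pvc/models/Llama-3.3-70B-Instruct/tokenizer.json"
)
```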
Hi @imdatta0 - Thanks for opening the issue. This is expected on older versions of transformers, due to a recent update. That's why we mention using 4.43 and above: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/README.md?code=true#L85
Where did you come across 4.34.4? I'd be happy to fix that.
Documentation has been updated; it now calls for 4.45.0 and later only.
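For downstream code, a small guard can surface the requirement more clearly than the untagged-enum error; a minimal sketch (the guard is not part of transformers, and the 4.45.0 floor is taken from the updated documentation):

```python
from packaging import version  # packaging is already a transformers dependency

import transformers

MIN_TRANSFORMERS = "4.45.0"  # floor per the updated model card

if version.parse(transformers.__version__) < version.parse(MIN_TRANSFORMERS):
    raise RuntimeError(
        f"transformers>={MIN_TRANSFORMERS} is required to load the "
        f"Llama-3.3-70B-Instruct tokenizer; found {transformers.__version__}."
    )
```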