Tokenizer doesn't load with transformers 4.34.4

#21
by imdatta0 - opened

The version mentioned in the model card, transformers==4.34.4 (edit: transformers==4.43.4), doesn't seem to work when loading the tokenizer; transformers==4.45.0 works fine. The underlying tokenizers versions are 0.19.1 and 0.20.3 respectively. Loading the tokenizer throws the following error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[1], line 2
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained("/mnt/model_pvc/models/Llama-3.3-70B-Instruct")

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:896, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    892     if tokenizer_class is None:
    893         raise ValueError(
    894             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    895         )
--> 896     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    898 # Otherwise we have to be creative.
    899 # if model is an encoder decoder, the encoder tokenizer class is used by default
    900 if isinstance(config, EncoderDecoderConfig):

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2291, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2288     else:
   2289         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2291 return cls._from_pretrained(
   2292     resolved_vocab_files,
   2293     pretrained_model_name_or_path,
   2294     init_configuration,
   2295     *init_inputs,
   2296     token=token,
   2297     cache_dir=cache_dir,
   2298     local_files_only=local_files_only,
   2299     _commit_hash=commit_hash,
   2300     _is_local=is_local,
   2301     trust_remote_code=trust_remote_code,
   2302     **kwargs,
   2303 )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2525, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2523 # Instantiate the tokenizer.
   2524 try:
-> 2525     tokenizer = cls(*init_inputs, **init_kwargs)
   2526 except OSError:
   2527     raise OSError(
   2528         "Unable to load vocabulary from file. "
   2529         "Please check that the provided vocabulary is accessible and not corrupted."
   2530     )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py:115, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    112     fast_tokenizer = copy.deepcopy(tokenizer_object)
    113 elif fast_tokenizer_file is not None and not from_slow:
    114     # We have a serialization from tokenizers which let us directly build the backend
--> 115     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    116 elif slow_tokenizer is not None:
    117     # We need to convert a slow tokenizer to build the backend
    118     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
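For reference, a minimal sketch for confirming which versions are installed (both packages expose __version__):

import transformers
import tokenizers

# Versions observed in this thread: transformers 4.43.4 / tokenizers 0.19.1 fail,
# while transformers 4.45.0 / tokenizers 0.20.3 work.
print(transformers.__version__)
print(tokenizers.__version__)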

There seem to be similar issues with other models across repos, e.g. 1, 2.
Is it because some config is missing, or is it a version mismatch? If it's the latter, could you please mention in the model card that 4.45.0 is required?

Meta Llama org

Hi @imdatta0 - Thanks for opening the issue. This is expected on older versions of transformers, due to an update in the library. That's why we mention using 4.43 and above: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/README.md?code=true#L85

Where did you come across 4.34.4? I'd be happy to fix that.

Hey @reach-vb, sorry, I made a couple of typos in the issue. The error happens on 4.43.4 (not 4.34.4 as written above), and it does not happen on 4.45.0. I have tried a few versions in between and they don't work either.

Meta Llama org

Thanks for the info @imdatta0 - let me patch that

Meta Llama org

The documentation has been updated; only transformers 4.45.0 and later are supported.
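For anyone hitting this on an older environment, here is a minimal sketch of a guard before loading (assuming, per this thread, that 4.45.0 is the minimum transformers version able to parse this tokenizer.json; packaging is used only for the version comparison):

from packaging import version
import transformers

# Assumption from this thread: transformers < 4.45.0 cannot parse this tokenizer.json.
if version.parse(transformers.__version__) < version.parse("4.45.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for this tokenizer; "
        "run: pip install -U 'transformers>=4.45.0'"
    )

from transformers import AutoTokenizer

# Gated repo: requires an authenticated token with access to meta-llama models.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")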

vontimitta changed discussion status to closed
