Problems with the OpenGPT-X tokenizer
Hello, and congratulations on the release! 🤗
I have tried to run some evaluations on Teuken-7B-instruct-research-v0.4, but I keep encountering errors that I believe come from the tokenizer.
Code to reproduce the issue (Transformers version 4.42.3):
```python
from transformers import GenerationConfig, TextGenerationPipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Load the custom tokenizer and the model (both require trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained("openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True)

generation_config = GenerationConfig(max_new_tokens=100, do_sample=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generator = TextGenerationPipeline(model=model, task="text-generation",
                                   tokenizer=tokenizer, device=device)
completion = generator("My name is", generation_config=generation_config)
```
Using the model via the TextGenerationPipeline has (oddly enough) given me two distinct errors:
```
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
This error comes from SentencePiece (version 0.2.0); I sketch a possible workaround below. The second error is:
```
TypeError: HFGPTXTokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'
```
This one comes from the custom tokenizer itself. Could you please suggest workarounds or push some patches? The `skip_special_tokens` error occurs when I run this model through the Language Model Evaluation Harness, which works for most mainstream models on the Hub.
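In the meantime, here is the interim workaround I am using for the first error. This is only a sketch under assumptions: I suspect the RuntimeError is triggered by a torch.Tensor reaching SentencePiece's Python bindings (which expect plain ints or lists of ints), and I assume the custom tokenizer supports the standard `__call__`/`decode` interface, so I bypass the pipeline and decode a plain Python list instead:

```python
# Hedged workaround sketch for the RuntimeError, assuming the failure is a
# torch.Tensor reaching SentencePiece: bypass the pipeline, generate directly,
# and hand decode() a plain Python list of token ids.
inputs = tokenizer("My name is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
completion = tokenizer.decode(output_ids[0].tolist())  # .tolist() avoids passing a Tensor
print(completion)
```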
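For the `skip_special_tokens` TypeError, a minimal monkey-patch gets the Evaluation Harness past the crash. `patched_decode` is a hypothetical helper of my own; it simply drops the keyword arguments that the custom `HFGPTXTokenizer.decode()` does not accept, so flags like `skip_special_tokens` are silently ignored rather than honored:

```python
import functools

# Hypothetical monkey-patch: swallow keyword arguments (e.g. skip_special_tokens)
# that pipelines and lm-eval pass but HFGPTXTokenizer.decode() does not accept.
_original_decode = tokenizer.decode

@functools.wraps(_original_decode)
def patched_decode(token_ids, *args, **kwargs):
    # Forward only the token ids; extra flags are dropped, not honored.
    return _original_decode(token_ids)

tokenizer.decode = patched_decode
```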
PS: Are there prospects for releasing a fast tokenizer?
Cheers!
forgot the code block specifier for Python ._.
@danielsteinigen can you help?
Hi @nicholasKluge, thanks for pointing this out. It should be fixed now.
We will also look into fast tokenizers soon.