Missing Merge Pair (based on the original BPE from whisper)
When you run the original tokenizer from the whisper package:
whisper_tokenizer = tokenizer.get_tokenizer(multilingual=True, language="en", task="transcribe")
whisper_tokenizer.encoding.encode(" the")
You get
[264]
But the same code gives a different result with the tokenizer.json included here:
hf_tok = tokenizers.Tokenizer.from_pretrained("openai/whisper-large")
hf_tok.encode(" the", add_special_tokens=False).ids
You get
[220, 3322]
I think this is because of a missing merge (one that is present in the original BPE).
Using the function here, I detected only one missing merge.
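For reference, a minimal sketch of such a check (paths are placeholders, and it assumes the tokenizer.json layout used here, where "merges" is a list of strings):

```python
import json

def find_missing_merges(merges_txt_path, tokenizer_json_path):
    # Merges from the original whisper BPE (skip the "#version" header line).
    with open(merges_txt_path, encoding="utf-8") as f:
        original = [line.rstrip("\n") for line in f if line.strip() and not line.startswith("#")]
    # Merges stored in the Hub tokenizer.json.
    with open(tokenizer_json_path, encoding="utf-8") as f:
        converted = set(json.load(f)["model"]["merges"])
    return [m for m in original if m not in converted]

# e.g. find_missing_merges("whisper/assets/gpt2/merges.txt", "tokenizer.json")
# only reports "Ġ t" in this case
```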
The correct tokenizer.json should be as follows:
"merges": [
"Ġ t",
"Ġ a",
"Ġt h",
whereas the current one has:
"merges": [
"Ġ a",
"Ġt h",
Accordingly, the other repos and merges.txt should be modified as well.
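As a quick local sanity check of that fix (a sketch, assuming a local copy of tokenizer.json with the string-based "merges" layout shown above):

```python
import json
import tokenizers

# Insert the missing merge at the top of the merge list, matching the original BPE order.
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)
tok["model"]["merges"].insert(0, "Ġ t")
with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)

fixed = tokenizers.Tokenizer.from_file("tokenizer.json")
print(fixed.encode(" the", add_special_tokens=False).ids)  # expected: [264]
```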
cc @ArthurZ when you get the chance 🙌
I just tripped over this too. It would be good to see it fixed.
Oh wow, sorry. Just saw that as well, will have a look.
I'll check the conversion script, there might be a bug in there.
#28 addresses this
What's weird is that after running pip install openai-whisper==20230117
and doing:
>>> from whisper.tokenizer import get_tokenizer
>>> tokenizer = get_tokenizer(True)
>>> tokenizer.encode(" the")
[220, 3322]
also gives the wrong result.
However the merge was there since day 1: https://github.com/openai/whisper/blame/ad3250a846fe7553a25064a2dc593e492dadf040/whisper/assets/gpt2/merges.txt.
The conversion from slow to fast is probably removing it.
See: run
tokenizer.tokenizer.save_pretrained("local_path/whisper-old")
then open the saved merges.txt: the first merge is no longer there.
If we remove the tokenizer.json
and add the correct merge to merges.txt,
the conversion works well with tokenizer.from_pretrained("local_path/whisper-old"),
meaning it's probably the layer on top of transformers that was causing this.
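Put together, those steps look roughly like this (a sketch, not a tested script; it assumes the old openai-whisper==20230117 install, and the class used in the final from_pretrained call isn't specified above, so AutoTokenizer is an assumption):

```python
import os
from whisper.tokenizer import get_tokenizer
from transformers import AutoTokenizer

# Save the slow tokenizer files from the whisper-package tokenizer.
tokenizer = get_tokenizer(True)
tokenizer.tokenizer.save_pretrained("local_path/whisper-old")

# Drop the pre-built tokenizer.json so the fast tokenizer is rebuilt from the slow files.
os.remove("local_path/whisper-old/tokenizer.json")

# Re-add the missing "Ġ t" merge as the first merge in merges.txt.
merges_path = "local_path/whisper-old/merges.txt"
with open(merges_path, encoding="utf-8") as f:
    header, *merges = f.read().splitlines()
with open(merges_path, "w", encoding="utf-8") as f:
    f.write("\n".join([header, "Ġ t", *merges]) + "\n")

# The conversion from the fixed slow files should now encode " the" as a single token.
fixed = AutoTokenizer.from_pretrained("local_path/whisper-old")
print(fixed.encode(" the", add_special_tokens=False))  # expected: [264]
```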