Missing Merge Pair (based on the original BPE from whisper)
When you run the original tokenizer from the whisper package:
whisper_tokenizer = tokenizer.get_tokenizer(multilingual=True, language="en", task="transcribe")
whisper_tokenizer.encoding.encode(" the")
You get
[264]
But the same code gives a different result with the tokenizer.json included here:
hf_tok = tokenizers.Tokenizer.from_pretrained("openai/whisper-large")
hf_tok.encode(" the", add_special_tokens=False).ids
You get
[220, 3322]
I think this is because of a missing merge (one that is present in the original BPE).
Using the function here, I detected only one missing merge.
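For reference, a minimal sketch of such a check (paths are placeholders, and it assumes the tokenizer.json layout used here, where "merges" is a list of strings):

```python
import json

def find_missing_merges(merges_txt_path, tokenizer_json_path):
    # Merges from the original whisper BPE (skip the "#version" header line).
    with open(merges_txt_path, encoding="utf-8") as f:
        original = [line.rstrip("\n") for line in f if line.strip() and not line.startswith("#")]
    # Merges stored in the Hub tokenizer.json.
    with open(tokenizer_json_path, encoding="utf-8") as f:
        converted = set(json.load(f)["model"]["merges"])
    return [m for m in original if m not in converted]

# e.g. find_missing_merges("whisper/assets/gpt2/merges.txt", "tokenizer.json")
# only reports "Ġ t" in this case
```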
The correct tokenizer.json should be as follows:
"merges": [
"Ġ t",
"Ġ a",
"Ġt h",
whereas the current one has:
"merges": [
"Ġ a",
"Ġt h",
Accordingly, the other repos and merges.txt should be modified as well.
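As a quick local sanity check of that fix (a sketch, assuming a local copy of tokenizer.json with the string-based "merges" layout shown above):

```python
import json
import tokenizers

# Insert the missing merge at the top of the merge list, matching the original BPE order.
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)
tok["model"]["merges"].insert(0, "Ġ t")
with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)

fixed = tokenizers.Tokenizer.from_file("tokenizer.json")
print(fixed.encode(" the", add_special_tokens=False).ids)  # expected: [264]
```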
cc @ArthurZ when you get the chance 🙌
I just tripped over this too. It would be good to see it fixed.
Oh wow, sorry. Just saw that as well, will have a look.
I'll check the conversion script, there might be a bug in there.
#28 addresses this
What's weird is that after running pip install openai-whisper==20230117
and doing:
>>> from whisper.tokenizer import get_tokenizer
>>> tokenizer = get_tokenizer(True)
>>> tokenizer.encode(" the")
[220, 3322]
also gives the wrong result.
However the merge was there since day 1: https://github.com/openai/whisper/blame/ad3250a846fe7553a25064a2dc593e492dadf040/whisper/assets/gpt2/merges.txt.
The conversion from slow to fast is probably removing it.
See: run
tokenizer.tokenizer.save_pretrained("local_path/whisper-old")
then open the saved merges.txt: the first merge is no longer there.
If we remove the tokenizer.json
and add the correct merge to merges.txt,
the conversion works well with tokenizer.from_pretrained("local_path/whisper-old"),
meaning it's probably the layer on top of transformers that was causing this.
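Put together, those steps look roughly like this (a sketch, not a tested script; it assumes the old openai-whisper==20230117 install, and the class used in the final from_pretrained call isn't specified above, so AutoTokenizer is an assumption):

```python
import os
from whisper.tokenizer import get_tokenizer
from transformers import AutoTokenizer

# Save the slow tokenizer files from the whisper-package tokenizer.
tokenizer = get_tokenizer(True)
tokenizer.tokenizer.save_pretrained("local_path/whisper-old")

# Drop the pre-built tokenizer.json so the fast tokenizer is rebuilt from the slow files.
os.remove("local_path/whisper-old/tokenizer.json")

# Re-add the missing "Ġ t" merge as the first merge in merges.txt.
merges_path = "local_path/whisper-old/merges.txt"
with open(merges_path, encoding="utf-8") as f:
    header, *merges = f.read().splitlines()
with open(merges_path, "w", encoding="utf-8") as f:
    f.write("\n".join([header, "Ġ t", *merges]) + "\n")

# The conversion from the fixed slow files should now encode " the" as a single token.
fixed = AutoTokenizer.from_pretrained("local_path/whisper-old")
print(fixed.encode(" the", add_special_tokens=False))  # expected: [264]
```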