Correct Transformers Pad Token

#7
import open_clip

tokenizer = open_clip.get_tokenizer('ViT-bigG-14')

# tokenize a short prompt; open_clip pads out to its 77-token context length
print(tokenizer("hello"))

gives:

tensor([[49406,  3306, 49407,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])

which means the padding token should be 0, not 49407.
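For context, id 0 in the CLIP vocabulary is the "!" token, while 49407 is "<|endoftext|>" (and 49406 is "<|startoftext|>"). A quick way to confirm the ids (a minimal sketch, not part of this PR's diff):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")

# id 0 is the "!" token; 49407 is the "<|endoftext|>" token
print(tokenizer.convert_tokens_to_ids("!"))              # 0
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # 49407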

This PR corrects the Hugging Face Transformers tokenizer config so that it matches the open_clip tokenizer:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")

# pad out to the same 77-token context length as open_clip
print(tokenizer("hello", max_length=77, padding="max_length", truncation=True))
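With the corrected pad token, the output should line up with the open_clip tensor above. A minimal sanity check (a sketch, assuming the fixed config is the one being loaded):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
enc = tokenizer("hello", max_length=77, padding="max_length", truncation=True)

# BOS (49406), "hello" (3306), EOS (49407), then pad id 0 out to 77 tokens
assert enc["input_ids"][:3] == [49406, 3306, 49407]
assert all(i == 0 for i in enc["input_ids"][3:])
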
patrickvonplaten changed pull request status to open
patrickvonplaten changed pull request title from Correct pad token tokenizer to Correct Transformers Pad Token
LAION eV org

@patrickvonplaten @julien-c it is indeed wrong, but as mentioned in Slack, this probably means all HF Transformers-based tokenizers for OpenCLIP, and probably the OpenAI originals too, are wrong, since the OpenCLIP Transformers tokenizer config was just copied from the openai/ ones on the Hub. I can't merge as I'm not the owner; that's @mitchellw
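Since the configs were copied around, one quick way to survey other repos (a sketch; the repo list is illustrative, and the output depends on each repo's config at the time it is checked):

from transformers import CLIPTokenizer

# spot-check the pad token across a few CLIP/OpenCLIP repos on the Hub
for repo in [
    "openai/clip-vit-base-patch32",
    "openai/clip-vit-large-patch14",
    "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
]:
    tok = CLIPTokenizer.from_pretrained(repo)
    print(repo, tok.pad_token, tok.pad_token_id)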

LAION eV org

@patrickvonplaten so I have write access and can merge this now. Is this still a desired change to make it match the original tokenizer, or do you think people are relying on the current behaviour?

