Correct Transformers Pad Token #7
by patrickvonplaten

No description provided.

import open_clip

# open_clip tokenizer for the bigG model; pads to the 77-token context length
tokenizer = open_clip.get_tokenizer('ViT-bigG-14')
print(tokenizer("hello"))
gives:
tensor([[49406,  3306, 49407,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])
which means the padding token should be 0, not 49407.
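The mismatch is visible directly from the tokenizer attributes. A minimal check (the output comment reflects the pre-fix config, where the pad token was set to the end-of-text token):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
# pre-fix this prints: <|endoftext|> 49407 (the eos token, not id 0, which is "!")
print(tokenizer.pad_token, tokenizer.pad_token_id)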
This PR corrects the Hugging Face Transformers version so that it matches the open_clip tokenizer:
from transformers import CLIPTokenizer
tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
print(tokenizer("hello", max_length=77, padding="max_length", truncation=True))
patrickvonplaten changed pull request status to open
patrickvonplaten changed pull request title from "Correct pad token tokenizer" to "Correct Transformers Pad Token"
@patrickvonplaten @julien-c it is indeed wrong, but as mentioned in Slack, this probably means that all HF Transformers-based tokenizers for OpenCLIP, and probably the OpenAI originals as well, are wrong, since the OpenCLIP Transformers tokenizer configs were just copied from the openai/ ones on the Hub. I can't merge as I'm not the owner; that's @mitchellw.
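A quick way to test that claim against the originals (a sketch; openai/clip-vit-base-patch32 is assumed here as a representative OpenAI repo):

from transformers import CLIPTokenizer

# the OpenCLIP tokenizer configs were reportedly copied from the openai/ repos
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
# if the claim holds, this also prints <|endoftext|> 49407
print(tokenizer.pad_token, tokenizer.pad_token_id)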
@patrickvonplaten so I have write access and can merge this now. Is this still a desired change to make it match the original tokenizer, or do you think people are relying on the current behaviour?