Difference in Behavior Between the Mistral Tokenizer and the Hugging Face Tokenizer
#58 · by magic282 · opened
Test case:
messages = [
    {"role": "system", "content": "You are helpful assistant."},
    {"role": "user", "content": "Hello."},
    {"role": "assistant", "content": "Hello there!"},
    {"role": "user", "content": "Who is Trump?"},
]
Mistral Tokenizer:
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    SystemMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Convert the plain dict messages into mistral_common message objects
m_messages = []
for m in messages:
    if m['role'] == 'user':
        m_messages.append(UserMessage(content=m['content']))
    elif m['role'] == 'assistant':
        m_messages.append(AssistantMessage(content=m['content']))
    elif m['role'] == 'system':
        m_messages.append(SystemMessage(content=m['content']))
completion_request = ChatCompletionRequest(messages=m_messages)
tokens = tokenizer.encode_chat_completion(completion_request).tokens
output:
[1, 3, 22177, 1046, 4, 22177, 2156, 1033, 2, 3, 4568, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]
Hugging Face Tokenizer:
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
output:
tensor([[ 1, 3, 45383, 1046, 4, 45383, 2156, 1033, 2, 3,
3213, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]],
device='cuda:0')
The difference is that the HF tokenizer's output uses tokens with a leading space after token id 3 (e.g. 45383 instead of 22177). Is this expected?
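To pinpoint exactly where the two sequences diverge, a quick diff of the two token lists (copied verbatim from the outputs above) can be run; this is just an illustrative check, not part of either tokenizer's API:

```python
# Token ids produced by the Mistral tokenizer (from the output above).
mistral_tokens = [1, 3, 22177, 1046, 4, 22177, 2156, 1033, 2, 3,
                  4568, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]
# Token ids produced by the (older) HF tokenizer (from the output above).
hf_tokens = [1, 3, 45383, 1046, 4, 45383, 2156, 1033, 2, 3,
             3213, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]

# Collect (position, mistral_id, hf_id) for every position that differs.
diff = [(i, a, b) for i, (a, b) in enumerate(zip(mistral_tokens, hf_tokens)) if a != b]
print(diff)  # [(2, 22177, 45383), (5, 22177, 45383), (10, 4568, 3213)]
```

Only three positions differ, each immediately after a control token (id 3 or 4), which is consistent with the HF template having inserted a leading space at the start of each message body.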
@pandora-s
I was using an older version. I just pulled the latest, and the output now matches the Mistral Tokenizer:
tensor([[ 1, 3, 22177, 1046, 4, 22177, 2156, 1033, 2, 3,
4568, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]],
device='cuda:0')
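With the updated tokenizer, the two sequences (copied from the outputs above) can be checked for exact equality; again, this is just a sanity check on the posted numbers:

```python
# Mistral tokenizer output (from earlier in the thread).
mistral_tokens = [1, 3, 22177, 1046, 4, 22177, 2156, 1033, 2, 3,
                  4568, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]
# HF tokenizer output after updating (from the tensor above).
hf_tokens_updated = [1, 3, 22177, 1046, 4, 22177, 2156, 1033, 2, 3,
                     4568, 1584, 20351, 27089, 1338, 31500, 1395, 22279, 1063, 4]

print(mistral_tokens == hf_tokens_updated)  # True
```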
magic282 changed discussion status to closed