Tokenizer issues with "Ő" and "Ű" characters
We're having a problem with Hungarian texts that contain the uppercase "Ő" and "Ű" characters. The issue is caused by the tokenizer, as the following tests demonstrate. The lowercase "ő" and "ű" characters are not affected.
Test:
./llama-tokenize -m ../models/mistral-nemo-instruct-2407-q8_0.gguf -p "VEVŐ"
Results:
1 -> '<s>'
16578 -> 'VE'
1086 -> 'V'
1197 -> '?'
1176 -> '?'
Test:
./llama-tokenize -m ../models/mistral-nemo-instruct-2407-q8_0.gguf -p "HŰSÉG"
Results:
1 -> '<s>'
29537 -> 'H'
968 -> '?'
947 -> '?'
29503 -> 'S'
29669 -> 'É'
29545 -> 'G'
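The '?' pieces above are consistent with byte-level tokens: "Ő" and "Ű" each take two bytes in UTF-8, and a single byte of such a pair cannot be printed on its own. A minimal sketch using only the Python standard library follows. It does not load any tokenizer, so it only shows the byte layout of the characters, not how a particular vocabulary splits them; the helper name `utf8_hex` is made up for illustration.

```python
# Show the UTF-8 byte layout of the characters discussed in this thread.
# No tokenizer is loaded; this only illustrates why a character that is
# split into single-byte tokens cannot be displayed piece by piece.

def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

for ch in "ŐŰőűÉ":
    print(f"{ch!r} U+{ord(ch):04X} -> {utf8_hex(ch)}")
```

All five characters are two bytes long, so the byte length alone does not explain why only the uppercase "Ő" and "Ű" are affected; that depends on which multi-byte sequences the vocabulary contains as whole tokens.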
Hi, this looks like an issue with the quantized models or the script you are using. When tested with prompts such as "repeat this: HŰSÉG", the model (and of course the tokenizer) is capable of both understanding and outputting it. Did you try with the HF tokenizer and our mistral-common tokenizer?
Thanks for the quick answer!
The prompt "repeat this: HŰSÉG" returns "HŰSÉG", so that works.
Here is a more detailed test that shows the difference. Note the word "VEVŐ" (buyer) in the prompts:
Prompt with "Ő" letter:
[INST] Here is the text:
\`\`\`
ELADÓ VEVŐ
John Doe Sample Company
Budapest London
Any street 112 Sesame street 8
1185 123444
Magyarország England
\`\`\`
Who is the sender and the recipient?
[/INST]
Wrong result (same results with Le Chat):
Based on the provided text, which appears to be a header of an email or letter, here are the sender and recipient details:
Sender (From):
- Name: John Doe
- Company: Sample Company
- Address: Any street 112, 1185 Budapest, Magyarország
Recipient (To):
- Name: ELADÓ VEVŐ
- Company: Sample Company
- Address: Sesame street 8, 123444 London, England
Le Chat response:
Based on the provided text, which appears to be a mock invoice or letterhead, here are the sender and recipient details:
Sender (From):
Name: John Doe
Company: Sample Company
Address: Sesame street 8, 123444, London, England
Recipient (To):
Name: ELADÓ
Company: VEVŐ
Address: Any street 112, 1185, Budapest, Magyarország (Hungary)
Same prompt with lowercase "ő" letter:
[INST] Here is the text:
\`\`\`
eladó vevő
John Doe Sample Company
Budapest London
Any street 112 Sesame street 8
1185 123444
Magyarország England
\`\`\`
Who is the sender and the recipient?
[/INST]
Results:
Based on the provided text, which appears to be a header for a business letter or invoice, here are the sender and recipient details:
Sender (Eladó):
- Name: John Doe
- Address: Any street 112, 1185 Budapest, Magyarország
Recipient (Vevő):
- Name: Sample Company
- Address: Sesame street 8, 123444 London, England
Le Chat response:
Based on the text provided, here's who the sender and recipient are:
Sender (eladó): John Doe
Address: Any street 112, 1185 Budapest, Magyarország
Recipient (vevő): Sample Company
Address: Sesame street 8, 123444 London, England
I've compared the tokenization of the text "HŰSÉG VEVŐ" in Mistral Nemo and Mistral Small. The results:
- Mistral Nemo Instruct 2407
1 -> '<s>'
1072 -> 'H'
1197 -> '?'
1176 -> '?'
1083 -> 'S'
7904 -> 'É'
1071 -> 'G'
42981 -> ' VE'
1086 -> 'V'
1197 -> '?'
1144 -> '?'
- Mistral Small Instruct 2409
1 -> '<s>'
29537 -> 'H'
968 -> '?'
947 -> '?'
29503 -> 'S'
29669 -> 'É'
29545 -> 'G'
1318 -> ' V'
8089 -> 'EV'
31033 -> 'Ő'
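To compare listings like the two above without reading them line by line, one could parse the `id -> 'piece'` lines and collect the IDs whose piece is shown as '?'. This is only a sketch: `parse_listing` and `unprintable_ids` are hypothetical helper names, the input format is assumed to match the output pasted in this thread, and a literal question mark in the prompt would be flagged as well.

```python
import re

# Matches lines such as: 1197 -> '?'
LINE = re.compile(r"^\s*(\d+) -> '(.*)'\s*$")

def parse_listing(text):
    """Parse `id -> 'piece'` lines into a list of (id, piece) pairs."""
    pairs = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs

def unprintable_ids(pairs):
    """Return the IDs whose piece was displayed as '?'."""
    return [token_id for token_id, piece in pairs if piece == "?"]

# Listings copied from the comparison above.
nemo = """1 -> '<s>'
1072 -> 'H'
1197 -> '?'
1176 -> '?'
1083 -> 'S'
7904 -> 'É'
1071 -> 'G'
42981 -> ' VE'
1086 -> 'V'
1197 -> '?'
1144 -> '?'"""

small = """1 -> '<s>'
29537 -> 'H'
968 -> '?'
947 -> '?'
29503 -> 'S'
29669 -> 'É'
29545 -> 'G'
1318 -> ' V'
8089 -> 'EV'
31033 -> 'Ő'"""

print("Nemo :", unprintable_ids(parse_listing(nemo)))
print("Small:", unprintable_ids(parse_listing(small)))
```

On these listings, Nemo shows four '?' pieces (two for each of "Ű" and "Ő"), while Small shows two (for "Ű" only) and has a dedicated token for "Ő".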
The HF tokenizer playground results: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
- Mistral v3
repeat this: "HŰSÉG and VEVŐ"
<s> repeat this: "H��SÉG and VEVŐ"
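The two replacement characters in the playground output are what one would expect when each byte-level piece is decoded on its own. A short sketch, assuming the pieces correspond to the raw UTF-8 bytes of "Ű" (the playground's internals are not shown in this thread, so this is an assumption):

```python
# Assumption: the tokenizer emitted one token per UTF-8 byte of "Ű".
pieces = [bytes([b]) for b in "Ű".encode("utf-8")]  # two single-byte pieces

# Decoding each piece separately yields U+FFFD, the replacement character,
# because neither byte is a complete UTF-8 sequence on its own.
separate = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# Decoding the concatenated bytes restores the original character.
joined = b"".join(pieces).decode("utf-8")

print(repr(separate))
print(repr(joined))
```

If that assumption holds, the garbled display is a rendering effect of showing tokens one at a time, and the full sequence still decodes back to the original text.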
@LouiSeHU I don't understand, how is it wrong?
Based on the provided text, which appears to be a mock invoice or letterhead, here are the sender and recipient details:
Sender (From):
Name: John Doe
Company: Sample Company
Address: Sesame street 8, 123444, London, England
Recipient (To):
Name: ELADÓ
Company: VEVŐ
Address: Any street 112, 1185, Budapest, Magyarország (Hungary)
Doesn't the model properly tokenize the special characters you mentioned?
For the playground you tagged, I cannot see Mistral v3 Tekken in the options, but I will take a look and reproduce via transformers and mistral-common to try to understand 👍
@pandora-s Please take a look at the difference in the example above. If "VEVŐ" (which means "buyer" in Hungarian) is in uppercase, the response is wrong. I don't know whether this is caused by the tokenizer, but I also checked it with a smaller Mistral model (7B v0.3). In that model the tokenizer recognizes the "Ő" character, and the response is also correct.
1318 -> ' V'
8089 -> 'EV'
31033 -> 'Ő'
Based on the provided text, it appears that the sender is "John Doe" from Budapest, Hungary (Magyarország), at Any street 112, postal code 1185. The recipient is "Sample Company" in London, England, at Sesame street 8, postal code 123444.
@pandora-s Were you able to reproduce the same result in your environment?
Can we expect a new version where these tokens also work properly like in the 7B and Small models?