Fix chat template
When using the apply_chat_template function, an extra space is added after "[INST]", which the original Mistral code does not add. Removing the extra space from the Jinja template seems to solve the issue.
What's important is that the tokenizer from the HF transformers implementation matches mistral-common one-to-one, as it's the ground truth. Does the current one not match in your experiments? 🤔
To keep it simple: the encoded tokens from both tokenizers match when both are given the same text, e.g. <s>[INST]Hello[/INST], so this isn't a tokenizer issue. However, the HF apply_chat_template function adds an extra space: instead of the text in the previous example, it returns <s>[INST] Hello[/INST], with a space after the [INST] token. This is fixed by editing the Jinja template used.
mistral_common provides both a debug string and the encoded tokens, so I'm curious to know whether both still match or whether this change will make the debug string differ. Usually we run a test script that checks that both the encoded tokens and the debug strings match mistral_common one-to-one.
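A parity check of that kind could look roughly like the sketch below. This is not the actual test script: check_parity is a hypothetical helper, and the two encoders are toy adapters (character codes instead of real token IDs) standing in for wrappers around mistral_common and the HF tokenizer. Each adapter is assumed to take a user message and return a (token_ids, debug_string) pair.

```python
def check_parity(cases, candidate_encode, reference_encode):
    """Return the cases where tokens or debug strings differ (hypothetical helper)."""
    mismatches = []
    for case in cases:
        cand_ids, cand_text = candidate_encode(case)
        ref_ids, ref_text = reference_encode(case)
        if cand_ids != ref_ids or cand_text != ref_text:
            mismatches.append((case, cand_text, ref_text))
    return mismatches


def toy_encode(text):
    """Toy tokenizer: one ID per character. A real test would call the real tokenizers."""
    return [ord(ch) for ch in text], text


def reference(user_text):
    # Stand-in for the ground truth: no space after [INST].
    return toy_encode("<s>[INST]" + user_text + "[/INST]")


def hf_with_space(user_text):
    # Stand-in for the template that adds the extra space.
    return toy_encode("<s>[INST] " + user_text + "[/INST]")


def hf_fixed(user_text):
    # Stand-in for the template with the space removed.
    return toy_encode("<s>[INST]" + user_text + "[/INST]")


cases = ["Hello", "How are you?"]
print(len(check_parity(cases, hf_with_space, reference)))  # 2
print(len(check_parity(cases, hf_fixed, reference)))       # 0
```

Comparing both the IDs and the debug string catches cases where two different strings happen to produce the same tokens, or the reverse.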
If you look at the HF apply_chat_template function, it first applies the Jinja template, so the text is converted from Hello to <s>[INST] Hello[/INST], and only then sent to the tokenizer. So I think it should pass the test, given that no change is made to the actual HF tokenizer.