Tokenized Chat template seems weird

#12
by PierreLepagnol - opened

I find it odd that the role markers from the chat template are split into several tokens when tokenized.

Consider the following example:

<|user|>
Which famous math number begins with 1.6 ...?<|endoftext|>
<|assistant|>
The number you are referring to is 1.618033988749895. This is the famous value known as the golden ratio<|endoftext|>

The markers <|user|> and <|assistant|> are split into ['<', '|', 'user', '|', '>\n'] and ['<', '|', 'assistant', '|', '>\n'] respectively, rather than each being kept as a single token.
Is this expected?
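
For anyone who wants to check this on their side, here is a minimal sketch of how I would test which markers get split. The helper `split_markers` and the toy tokenizer `make_toy_tokenizer` are hypothetical names I made up for illustration; the toy tokenizer only imitates the behaviour (registered special tokens stay whole, everything else falls apart into words and punctuation) and is not this model's real tokenizer. Treating <|endoftext|> as the only registered special token is also an assumption on my part.

```python
import re
from typing import Callable, Dict, List


def split_markers(tokenize: Callable[[str], List[str]],
                  markers: List[str]) -> Dict[str, List[str]]:
    """Return every marker that tokenizes into more than one piece."""
    report = {}
    for marker in markers:
        pieces = tokenize(marker)
        if len(pieces) != 1:
            report[marker] = pieces
    return report


def make_toy_tokenizer(special_tokens: List[str]) -> Callable[[str], List[str]]:
    """Toy stand-in (assumption, not the real tokenizer): special tokens are
    matched first as whole units, then word runs, then single punctuation
    characters. Whitespace is dropped."""
    alternatives = [re.escape(t)
                    for t in sorted(special_tokens, key=len, reverse=True)]
    alternatives += [r"\w+", r"[^\w\s]"]
    pattern = re.compile("|".join(alternatives))
    return lambda text: pattern.findall(text)


# Assumption: only <|endoftext|> is registered as a special token.
toy_tokenize = make_toy_tokenizer(["<|endoftext|>"])
report = split_markers(toy_tokenize,
                       ["<|user|>", "<|assistant|>", "<|endoftext|>"])
for marker, pieces in report.items():
    print(marker, "->", pieces)
```

With a real Hugging Face tokenizer loaded through `AutoTokenizer.from_pretrained`, its `tokenize` method could be passed to `split_markers` in place of the toy function. If <|user|> and <|assistant|> show up in the report while <|endoftext|> does not, that would suggest the role markers are simply not registered as special tokens and are encoded as ordinary text.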
