Question Tokenizer
#7 opened by nebchi
Thank you for creating a great model. For the previous Llama 2, you expanded the vocabulary. Is there a reason you didn't do the same for Llama 3?
I think it's because the Llama 3 tokenizer already has a good number of Korean tokens. I don't know exactly how many Korean tokens are in the tokenizer, but if you write a system prompt telling it to reply in Korean, you can see Llama 3 responding in Korean.
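
If you're curious about the exact count, here's a minimal sketch one could run to estimate it, assuming the `meta-llama/Meta-Llama-3-8B` checkpoint (a gated repo, so an access token is needed) and using the Hangul Unicode ranges as the test for "Korean" tokens:

```python
# Rough sketch: count tokens in the Llama 3 vocabulary that decode to
# text containing Hangul. Model ID and Unicode ranges are assumptions.
from transformers import AutoTokenizer

def contains_hangul(text: str) -> bool:
    # Hangul syllables (U+AC00-U+D7A3) and Hangul Jamo (U+1100-U+11FF)
    return any(
        "\uac00" <= ch <= "\ud7a3" or "\u1100" <= ch <= "\u11ff"
        for ch in text
    )

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

korean_token_ids = [
    tok_id
    for tok_id in range(tokenizer.vocab_size)
    if contains_hangul(tokenizer.decode([tok_id]))
]
print(f"Korean tokens: {len(korean_token_ids)} / {tokenizer.vocab_size}")
```

Note this only counts tokens whose decoded surface form contains Hangul; Korean text can also be represented through byte-level fallback tokens, so the true coverage is somewhat broader than this count suggests.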
Thank you for the helpful answer. It seems that, as with Gemma, the tokenizer didn't require expansion since it already contains many Korean tokens.