Question Tokenizer
#7 opened by nebchi
Thank you for creating a great model. For the previous Llama 2, you expanded the vocabulary. Is there a reason you didn't do the same for Llama 3?
I think it's because the Llama 3 tokenizer already has a good number of Korean tokens. I don't know exactly how many Korean tokens are in the tokenizer, but if you write a system prompt telling it to reply in Korean, you can see Llama 3 responding in Korean.
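
If you're curious about the exact count, here's a minimal sketch one could run to estimate it, assuming the `meta-llama/Meta-Llama-3-8B` checkpoint (a gated repo, so an access token is needed) and using the Hangul Unicode ranges as the test for "Korean" tokens:

```python
# Rough sketch: count tokens in the Llama 3 vocabulary that decode to
# text containing Hangul. Model ID and Unicode ranges are assumptions.
from transformers import AutoTokenizer

def contains_hangul(text: str) -> bool:
    # Hangul syllables (U+AC00-U+D7A3) and Hangul Jamo (U+1100-U+11FF)
    return any(
        "\uac00" <= ch <= "\ud7a3" or "\u1100" <= ch <= "\u11ff"
        for ch in text
    )

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

korean_token_ids = [
    tok_id
    for tok_id in range(tokenizer.vocab_size)
    if contains_hangul(tokenizer.decode([tok_id]))
]
print(f"Korean tokens: {len(korean_token_ids)} / {tokenizer.vocab_size}")
```

Note this only counts tokens whose decoded surface form contains Hangul; Korean text can also be represented through byte-level fallback tokens, so the true coverage is somewhat broader than this count suggests.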
Thank you for the helpful answer. It seems that, as with Gemma, the tokenizer didn't require expansion since it already contains many Korean tokens.