Vocab extension - inquiry
Hello there, I am interested to know how you managed to extend the original model's tokenizer vocabulary to include Korean tokens as well!?
This would be really helpful for me 🤗
Tnx in advance
Hi, first of all, this model's vocabulary is not extended with Korean tokens. It uses the original Mixtral tokenizer.
But there are some things I can explain about extending the vocabulary.
Since most LLMs on Hugging Face are not optimized for Korean, many models use around 10 tokens per word. So the Korean LLM community has dug into extending the vocabulary.
First, we started by simply adding vocabulary (tokens) that represent Korean words well (e.g. https://huggingface.co/beomi/llama-2-ko-7b). As you know, after adding vocab to a model, its text generation is broken. So, in the example above, the author trained the model on over 40B tokens to fit the Korean vocabulary well. But 40B-token training is not an easy job. It takes a long time and is really expensive.
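In code, that first step is just adding tokens and resizing the embeddings. A minimal sketch (not the exact llama-2-ko-7b recipe; the model name and token list below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder, any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

new_tokens = ["안녕하세요", "감사합니다"]  # hypothetical Korean tokens/subwords
num_added = tokenizer.add_tokens(new_tokens)

# The new rows of embed_tokens / lm_head are freshly initialized, which is why
# generation breaks until the model is trained on enough Korean text.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size: {len(tokenizer)}")
```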
So the Korean LLM community started looking for a solution. While the methodology for efficient vocabulary expansion is not yet well established, it roughly follows the approach below.
(Source - https://huggingface.co/yanolja/KoSOLAR-10.7B-v0.2)
Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:
Freezing the embed_tokens layer for existing tokens is crucial to maintain overall performance.
Unfreezing the lm_head layer for existing tokens actually boosts performance.
As a result, we froze the internal layers and the first 32,000 embed_tokens, directing our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean, without compromising its original language capabilities.
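In code, that selective freeze roughly looks like this (a rough sketch, not the actual KoSOLAR training script; the model path is a placeholder and the 32,000 comes from the original vocab size mentioned above):

```python
from transformers import AutoModelForCausalLM

NUM_ORIGINAL_TOKENS = 32_000  # original vocab size before extension

model = AutoModelForCausalLM.from_pretrained("path/to/vocab-extended-model")

# Freeze everything, then re-enable only the layers we want to train.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True   # embed_tokens
model.get_output_embeddings().weight.requires_grad = True  # lm_head (fully unfrozen)

def zero_grad_for_original_rows(grad):
    # Keep the original token embeddings unchanged; only the new rows get updates.
    grad = grad.clone()
    grad[:NUM_ORIGINAL_TOKENS] = 0
    return grad

model.get_input_embeddings().weight.register_hook(zero_grad_for_original_rows)
```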
We keep searching for solutions. I hope this has been helpful to you.
https://huggingface.co/maywell/Mistral-ko-7B-v0.1
This is my attempt at extending the vocabulary; due to budget issues it is undertrained.
Tnx
@maywell
I will definitely get back to this discussion later with a well-rested head.
Anyway, what we are trying to do is extend the vocab and then train the model on ~30B tokens of Arabic text (not full training, but using LoRA), so I believe extending the vocab would be suitable for my case?
Of course it is. But I recommend doing full training with the method above.
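If you do go the LoRA route with an extended vocab, keep in mind that the new embedding and lm_head rows still need full updates. A minimal PEFT sketch, assuming the vocab has already been extended as above (the model path is a placeholder; the target module names match Mistral/Mixtral-style attention layers):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/vocab-extended-model")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Train these layers fully so the newly added Arabic tokens get real weights.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```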