Which is better for fine-tuning LLaMA1 and LLaMA2: a sequence length of 1024 or 2048? And what are the reasons behind the choice?
Hello,
I have a question about the sequence lengths for the following two models:
- upstage/Llama-2-70b-instruct-1024
- upstage/llama-30b-instruct-2048
As far as I know, for LLaMA1, the default sequence length is 1024. However, to better capture context and improve training, you fine-tuned the model with a sequence length of 2048.
Now, for LLaMA2, despite having a default sequence length of 2048, I noticed that you fine-tuned it with a sequence length of 1024. I'm curious if there's a specific reason for this choice.
I'm interested in understanding how you analyzed and compared the results of the two models.
If there's anything I'm misunderstanding, I would appreciate your guidance.
Thank you.
Hello.
As you may have experienced, instruction tuning of LLMs is largely a matter of empirical experimentation.
In the case of llama-30b, a higher score was achieved with a larger Orca-style dataset and max_seq_len: 2048.
However, for llama-2-70b, in our setting, a smaller dataset and max_seq_len: 1024 scored better.
In fact, it recorded the highest score on our internal leaderboard when only about 50k samples from a dataset other than the Orca dataset were used.
Llama-2-70b tended to overfit faster at max_seq_len: 2048, so it performed worse than llama-2-70b-hf. However, we do not plan to run additional experiments to resolve this, since the cost would outweigh the benefit.
In conclusion, in our setting, llama-2-70b performed better at max_seq_len: 1024, so we chose 1024.
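For illustration, here is a minimal sketch of how a max_seq_len cap is commonly applied when tokenizing instruction-tuning data. The prompt template and helper function below are hypothetical examples, not our actual training pipeline:

```python
from transformers import AutoTokenizer

# Illustrative only: how a max_seq_len cap is commonly applied when tokenizing
# instruction-tuning examples. The prompt template and function are hypothetical.
MAX_SEQ_LEN = 1024  # the value that worked best for llama-2-70b in this thread

tokenizer = AutoTokenizer.from_pretrained("upstage/Llama-2-70b-instruct-1024")

def encode_example(prompt: str, response: str) -> dict:
    # Concatenate prompt and response, then truncate to the chosen sequence length.
    text = f"### User:\n{prompt}\n\n### Assistant:\n{response}"
    return tokenizer(text, truncation=True, max_length=MAX_SEQ_LEN)

encoded = encode_example(
    "Explain overfitting in one sentence.",
    "Overfitting is when a model memorizes the training data instead of generalizing.",
)
print(len(encoded["input_ids"]))  # never exceeds MAX_SEQ_LEN
```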
I hope my answer was sufficient for you.
(For reference, according to each model's config.json, max_position_embeddings is 2048 for llama1 and 4096 for llama2.)
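If you want to double-check those values yourself, something like the following quick sketch with the transformers AutoConfig API should print them (assuming the models are accessible from the Hub):

```python
from transformers import AutoConfig

# Print the base context window (max_position_embeddings) stored in each model's config.json.
for name in ["upstage/llama-30b-instruct-2048", "upstage/Llama-2-70b-instruct-1024"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.max_position_embeddings)
```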
Thank you so much for your thoughtful response.