OrcaMaidXL-21.7B (Depth Up-Scaled): Orca-2-21.7B SLERP'ed with NeverSleep/Noromaid-21.7B-0.4-DPO
Madison, I had this idea, and I thought you would like to do the honors, since OrcaMaidXL-17B is your baby (beloved creation). As the title states: OrcaMaidXL-21.7B (Depth Up-Scaled). I want to know how OrcaMaid benefits from Depth Up-Scaling (DUS).
Merge Recipe ("the sauce"; rough mergekit sketches below):
Merge 1:
Orca-2-21.7B (Microsoft/Orca-2-13b, Depth Up-Scaled, m = 8)
Merge 2:
Noromaid-21.7B-0.4-DPO (NeverSleep/Noromaid-13B-0.4-DPO Depth Up-Scaled, m = 8)
Merge 3:
SLERP Orca-2-21.7B and Noromaid-21.7B-0.4-DPO together with your specific merge magic
And, serve... Bon appét-llm.
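Roughly what I have in mind, in mergekit terms. These are only minimal sketches, not the actual recipes: they assume both base models are standard 40-layer Llama-2-13B stacks, the layer ranges follow the SOLAR-style split for m = 8 (and won't necessarily land exactly on 21.7B parameters), and the output names, layer counts, t value, and dtype are placeholders for you to tune.

```yaml
# Merge 1 (and, swapping the model name, Merge 2): SOLAR-style Depth Up-Scaling
# via mergekit's passthrough method. Layer ranges are illustrative: the first
# copy drops its last m = 8 layers, the second copy drops its first m = 8.
slices:
  - sources:
      - model: microsoft/Orca-2-13b
        layer_range: [0, 32]
  - sources:
      - model: microsoft/Orca-2-13b
        layer_range: [8, 40]
merge_method: passthrough
dtype: float16
```

```yaml
# Merge 3: SLERP the two up-scaled models together. The model entries are
# assumed to be local paths to the Merge 1/Merge 2 outputs, the layer count
# (64) follows the sketch above, and t = 0.5 is a placeholder for your
# "specific merge magic".
slices:
  - sources:
      - model: Orca-2-21.7B
        layer_range: [0, 64]
      - model: Noromaid-21.7B-0.4-DPO
        layer_range: [0, 64]
merge_method: slerp
base_model: Orca-2-21.7B
parameters:
  t: 0.5
dtype: float16
```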
Cool idea, but I'm afraid it might end up being sub-par, being a merge of two self-merges, seeing as self-merges are already somewhat unstable. But I'd be willing to give it a shot, maybe sometime this week.
Have you tried Noromaid-10.7B-0.4-DPO? It's not unstable; if anything, Depth Up-Scaling has improved its capabilities and enhanced its personality (via the increase in parameter count). Also, other large language models that I Depth Up-Scaled, including Cerebrum-1.0-10.7B, Hermes-2-Pro-Mistral-10.7B, and Tess-10.7B-v2.0, don't exhibit the instability you mentioned (at least in my use of them).

I hypothesize that this is because Depth Up-Scaling, as laid out in SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling, for the most part keeps the layers in their originally trained sequences, and thus keeps them in better congruence. Other merge layering techniques that seek to increase the size of the LLM tend to interleave/interlace more layers from other areas of the model, weakening its layer congruency. This is reflected in the noticeable improvement on the OpenLLM Leaderboard benchmarks for Mistral-12.25B-v0.2 vs. Mistral-10.7B-v0.2, where Mistral-12.25B-v0.2 retains slightly more layers in congruence than Mistral-10.7B-v0.2. That, at least, has been my observation so far.

Tying into this subjectively, I also found something else interesting: BigOrca-2-XB, which uses the same layer-interleaving merge technique as AbacusAI/Bigstral-12b-32k, cannot be quantized below Q2_K_M (any more aggressive quantization, and the quantize command errors out...). These Depth Up-Scaled LLMs don't have that issue; for example, bartowski quantized Cerebrum-1.0-10.7B to every GGUF quant size possible, with seemingly no problems. This further leads me to believe that layer congruence is particularly important for the stability of a large language model (LLM).
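To put the "layer congruence" idea in concrete mergekit terms: compare the two long contiguous slices in the DUS sketch above (one seam) with an interleaved stack like the hypothetical one below, which has many more seams where adjacent layers never trained next to each other. The model name and ranges here are placeholders, not the actual Bigstral or Mistral-12.25B recipes.

```yaml
# Hypothetical interleaved stack: many short, overlapping slices of the same
# model, so the final layer order has many seams (lower "layer congruence").
slices:
  - sources:
      - model: some-13b-model   # placeholder name
        layer_range: [0, 12]
  - sources:
      - model: some-13b-model
        layer_range: [6, 18]
  - sources:
      - model: some-13b-model
        layer_range: [12, 24]
  - sources:
      - model: some-13b-model
        layer_range: [18, 30]
  - sources:
      - model: some-13b-model
        layer_range: [24, 40]
merge_method: passthrough
dtype: float16
```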
However, having said all that, I do share your concerns about model merges producing unstable, unusable large language models. So, in order to learn how Depth Up-Scaled large language models behave when SLERP'ed together, I thought of two merge projects: the first, a SLERP of the two LLMs followed by a subsequent Depth Up-Scaling; and the second, a Depth Up-Scaling of each model followed by the aforementioned SLERP. In my honest opinion, this will give us the best way to objectively judge and measure which merge recipe and merge technique(s) give us the best large language model: one that is stable, usable, and performant. Right now, I'm hoping that SLERP'ing more layers together yields the best model, but that might change; we'll just have to see. Thanks for your time and consideration. As always, it's a pleasure to discuss and speculate on ideas with you.
Hmm, I guess I'll need to try your upscaled models then. All of the models I've tried that use depth up-scaling without further fine-tuning have been prone to weird mistakes, like the repetition issue I mentioned, or adding spaces where there shouldn't be. Some more than others. I dunno. I will look more into this in the next few days :)
I test the models using LLM Studio and Faraday.dev: LLM Studio because it offers the most diverse set of supported prompt templates, and Faraday.dev because it provides the best (E)RP environment (in my opinion). Thanks, Madison. I can't wait to see how all this turns out!
what if we depth up-scaled using the two different models directly? as in, using most of the layers from Orca and then adding most of the layers from Noromaid at the end. hmmmm
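Something like this, maybe? Just a rough passthrough sketch; the 32/8 split point is arbitrary, and it assumes both models are standard 40-layer Llama-2-13B stacks (model names as written above in this thread):

```yaml
# Hypothetical two-model depth up-scale: most of Orca's layers first, then most
# of Noromaid's layers stacked on the end. The split point is only illustrative.
slices:
  - sources:
      - model: microsoft/Orca-2-13b
        layer_range: [0, 32]
  - sources:
      - model: NeverSleep/Noromaid-13B-0.4-DPO
        layer_range: [8, 40]
merge_method: passthrough
dtype: float16
```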
Hmmm... I didn't think of that! Now you have my curiosity. I need to know.