HiroseKoichi/Llama-Salad-8x8B

HiroseKoichi committed on Jun 7

Commit 3485890 • 1 Parent(s): 9a7e041

Update README.md

Files changed (1)

README.md +9 -0


@@ -12,6 +12,15 @@ tags:
 ---
 # Llama-Salad-8x8B
+This MoE merge is meant to compete with Mixtral fine-tunes, specifically [Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO), which I think is the best of them. I've done a number of side-by-side comparisons, and while I can't say Llama-Salad wins in every aspect, it comes very close. Its main shortcomings are multilingualism, storytelling, and roleplay, even though the merge uses models that are very good at those tasks.
+It won't respond in the language you prompt it in unless the model has already spoken that language earlier in the conversation, even though Suzume is designed to do exactly that. The model writes really well thanks to Soliloquy and Opus, but it doesn't quite understand the difference between roleplay and storytelling; it treats just about everything as a story and over-responds to everything you do. For a good experience, you will have to either explain what roleplay is or show it by example; once you do, it is very good.
+I have narrowed the cause of these shortcomings down to one thing: self-attention. The base model, which supplies the self-attention layers, is actually the most important part of a MoE merge; think of the merge as taking that base model and improving it rather than combining all of the models' capabilities. If the base model has a particular writing style or behavior, or lacks knowledge for a specific task, that will carry over into the MoE merge regardless of the quality of the expert weights used.
+Likewise, I have found that censorship comes not from the expert weights but from the self-attention: if you take the self-attention from an uncensored model and combine it with the expert weights from a censored model, the resulting model will be uncensored. The self-attention decides what the model should do and how to do it, and the expert weights predict tokens according to those decisions.
+I have tried using over a dozen different models as the base, and Synthia is by far the best. Aside from swapping in better models, the only way that I can see to improve from here is to merge Synthia with other models in order to reduce these shortcomings, which I will definitely be doing in the future.
 # Details
 - **License**: [llama3](https://llama.meta.com/llama3/license/)
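
The point about the base model supplying the self-attention can be illustrated with a toy sketch. This is not the actual merge procedure used for Llama-Salad; it only assumes the usual MoE-merge layout, where every non-MLP parameter is copied from the base model and each expert contributes only its feed-forward (MLP) weights. The parameter names, the `build_moe_state` and `toy_model` helpers, and the string stand-ins for tensors are all hypothetical, and router/gate weights are left out for brevity.

```python
# Toy sketch: assembling a MoE checkpoint from a base model and expert models.
# Names and values are illustrative only; strings stand in for real tensors.

def build_moe_state(base, experts):
    """Combine a base state dict with the MLP weights of several experts.

    base    : dict of parameter name -> value; supplies everything except MLPs
    experts : list of dicts with the same keys as `base`
    """
    merged = {}
    for name, value in base.items():
        if ".mlp." in name:
            # Feed-forward weights: one copy per expert, stored under a
            # hypothetical per-expert parameter name.
            for i, expert in enumerate(experts):
                merged[name.replace(".mlp.", f".experts.{i}.")] = expert[name]
        else:
            # Self-attention (and anything else that is not an MLP weight)
            # comes from the base model only.
            merged[name] = value
    return merged


def toy_model(tag):
    """Stand-in for a real state dict with one attention and one MLP weight."""
    return {
        "layers.0.self_attn.q_proj": f"{tag}-attn",
        "layers.0.mlp.up_proj": f"{tag}-mlp",
    }


base = toy_model("base")
experts = [toy_model("expert_a"), toy_model("expert_b")]
merged = build_moe_state(base, experts)

# Show where each merged parameter came from.
for name in sorted(merged):
    print(name, "<-", merged[name])
```

In this sketch the experts' own self-attention weights never reach the merged model, which is consistent with the observation above that swapping the base changes behavior even when the expert weights stay the same.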