@grimjim on Hugging Face: "Below we experiment with negative merger weighting (-1.0!) using task…"

grimjim

posted an update Jul 4

Below we experiment with negative merge weighting (-1.0!) using task arithmetic. The merge formula is on the model card and in the repo itself.

This model is steered to behave opposite to what MopeyMule demonstrated.
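
For readers unfamiliar with task arithmetic, here is a minimal sketch of what a negative weight does. It is generic task arithmetic on toy parameters, not the exact recipe from the model card; the function name `task_arithmetic` and the toy tensors are made up for illustration, and the `tuned` model only stands in for the role MopeyMule plays above.

```python
def task_arithmetic(base, tuned, weight):
    """Return base + weight * (tuned - base) for every parameter.

    `base` and `tuned` are toy state dicts mapping a parameter name to a
    flat list of floats (an assumption for illustration; real merges
    operate on full tensors).
    """
    merged = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        # The task vector is (tuned - base); a negative weight moves the
        # merged model away from the tuned behavior instead of toward it.
        merged[name] = [
            b + weight * (t - b) for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Hypothetical toy weights, not taken from any real model.
base = {"layer.weight": [1.0, 2.0, 3.0]}
tuned = {"layer.weight": [1.5, 1.0, 3.0]}

merged = task_arithmetic(base, tuned, weight=-1.0)
print(merged["layer.weight"])
```

With a weight of 1.0 the same function simply reproduces `tuned`, so -1.0 can be read as mirroring the fine-tune around the base model.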

Based on the implications of the merge technique, we also propose Orthogonalized Vector Adaptation (OVA). We also extract a LoRA of the counter-refusal abliteration steering vector.

The resulting merge is not a perfect model, but it's a behaviorally interesting one. The model name was inspired by a Philip K. Dick story.
grimjim/Llama-3-Perky-Pat-Instruct-8B

Refusal vector weights ready for use:
grimjim/Llama-3-Instruct-abliteration-OVA-8B
grimjim/Llama-3-Instruct-abliteration-LoRA-8B

anakin87

Jul 4

Nice!

Have a look at my rap model (built with the same approach as MopeyMule): https://huggingface.co/anakin87/yo-Llama-3-8B-Instruct

grimjim

Jul 6

edited Jul 6

Something odd is happening when merging with the OVA model. It will reduce refusals at medium (0.5-0.6) weight, but at full (1.0) weight against Instruct 8B, the result is incoherent. The LoRA should work, though!

It should also be possible to modify abliteration scripts to instead directly produce a LoRA as output.
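
A rough sketch of why that should work, assuming the common form of abliteration in which a weight matrix W that writes to the residual stream is replaced by W - r rᵀ W for a unit refusal direction r. The update -r rᵀ W has rank 1, so it factors exactly into LoRA-style matrices B = -r and A = rᵀ W. The helper names, the layout (rows of W index the residual stream) and the toy numbers are assumptions for illustration, not any existing script's API; LoRA scaling factors are left out.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def abliteration_as_lora(weight, direction):
    """Factor the rank-1 update -r r^T W into B (d x 1) and A (1 x k).

    Assumes `weight` is d x k with rows indexing the residual stream,
    and `direction` is a length-d refusal direction (normalized here).
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    lora_b = [[-x] for x in r]      # d x 1 column: -r
    lora_a = matmul([r], weight)    # 1 x k row: r^T W
    return lora_b, lora_a


# Hypothetical toy values, not taken from any real model.
weight = [[1.0, 2.0], [3.0, 4.0]]
direction = [1.0, 0.0]

lora_b, lora_a = abliteration_as_lora(weight, direction)
delta = matmul(lora_b, lora_a)  # equals -r r^T W
ablated = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]
print(ablated)
```

Applying the factors at a fractional scale would give a partial ablation, which is one way to picture the medium-weight behavior described above.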

wunein

Jul 7

https://github.com/cmnfriend/O-LoRA

This reminds me of O-LoRA, last year's EMNLP paper.
