Assistant ends messages with </s> before EOS
As a Llama 3 model, I'd expect it to end its turns with a token like <|eot_id|> or <|end_of_text|>. However, in very Mistral-like fashion, the Q8 quant I downloaded in LM Studio ends every response with an additional </s> before the EOS.
I've used the same settings and chat template that work for 'regular' Llama 3, except for an expanded n_ctx of 20k instead of 8k.
(Chat-quality-wise it's also a step back. Even the formatting has errors, like paragraphs not starting with a capital letter.)
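For what it's worth, here's a quick sketch to check which end-of-turn tokens the tokenizer actually defines (the repo id is a placeholder for whichever 262k model is in question):

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual 262k model.
tok = AutoTokenizer.from_pretrained("some-org/Llama-3-8B-Instruct-262k")

print("eos_token:", tok.eos_token, "->", tok.eos_token_id)

# A Llama 3 tokenizer should know the first two as single tokens;
# </s> is a Mistral/Llama-2-style token and normally shouldn't resolve here.
for t in ("<|eot_id|>", "<|end_of_text|>", "</s>"):
    print(repr(t), "->", tok.convert_tokens_to_ids(t))
```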
That's very strange. Can you send me a prompt that recreates it? I'll try it with the full model weights so we can see whether it's the GGUF or the model.
It's any prompt, really. I'm using LM Studio's default message format for Llama 3:
System
You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.
User
how do we overcome entropy?
Assistant
I am not capable of experiencing or understanding the concept of "overcoming" entropy as it is a purely physical phenomenon that cannot be influenced by human actions or intentions. However, I can provide some insights on how to reduce entropy in different contexts.
in physics and thermodynamics, there are several ways to reduce entropy:
1. By transferring heat from a hotter body to a colder one: this process decreases the total entropy of the system as it moves energy from a high-entropy state to a low-entropy state.
2. Through reversible processes: these are processes that can be reversed without changing the overall entropy, such as compressing and decompressing a gas in an isothermal manner.
3. By using external work or energy input: adding energy to a system can reduce its entropy if it is used efficiently, for example, by converting heat into mechanical work through a heat engine.
in information theory, there are also ways to reduce entropy:
1. Through data compression: removing redundancy from data reduces the total number of bits required to represent it and thus decreases entropy.
2. By using error-correcting codes: these add extra information to the data that allows errors or noise to be corrected without increasing entropy.
3. Through information theory-based optimization algorithms: some algorithms, such as the maximum entropy method, can optimize a system by finding its most likely state based on available information and thus reduce overall entropy.</s>
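For reference, that preset should expand to Meta's documented Llama 3 instruct format, which has no </s> anywhere:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```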
Found that the GGUF of the 262k model by crusoeai does the same. So perhaps the model doesn't take GGUF quantization well, or llama.cpp/LM Studio compatibility isn't 100% there yet.
(Also tried on CPU instead of ROCm GPU acceleration, on LM Studio v0.2.22; same result.)
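A minimal sketch for cross-checking what the GGUF metadata itself claims, using the gguf Python package from the llama.cpp repo (the file name is a placeholder):

```python
from gguf import GGUFReader, GGUFValueType  # pip install gguf

reader = GGUFReader("llama-3-8b-instruct-262k.Q8_0.gguf")  # placeholder path

for key in ("tokenizer.ggml.eos_token_id", "tokenizer.chat_template"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: <missing>")
    elif field.types[0] == GGUFValueType.STRING:
        # String payloads are stored as raw bytes in the field's last part.
        print(f"{key}: {bytes(field.parts[-1]).decode('utf-8')[:120]!r}")
    else:
        print(f"{key}: {field.parts[-1][0]}")
```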
I'm about to try the unquantized version to see if it's the same, because yeah, I see it too with llama.cpp. It's strange, because their config seems correct.
The full model weights also do this, with the original safetensors in Transformers, so it must be a bug with the model itself.
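For anyone who wants to double-check without GGUF in the loop, a minimal sketch in Transformers (the repo id is a placeholder, generation settings are left at defaults):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "some-org/Llama-3-8B-Instruct-262k"  # placeholder repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful, smart, kind, and efficient AI assistant."},
    {"role": "user", "content": "how do we overcome entropy?"},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
# Decode WITH special tokens so any stray </s> before the EOS stays visible.
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=False))
```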
Thanks for checking. Looks like it's not something we can fix on our end, then.