Hallucinations, misspellings etc. Something seems broken?
I'm trying to run a few benchmarks with this model and it's not behaving. I'm seeing hallucinations, misspellings, poor instruction following and bad benchmark scores. I've tried gemma-2-9b-it and it's fine.
So far I've tried running inference with:
- Full 16-bit precision (transformers)
- bitsandbytes 8-bit
- All the GGUFs available (via llama.cpp)
- The Hugging Face Pro API
I'm using the tokenizer's chat template.
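For reference, the transformers path looks roughly like this (a minimal sketch, not the exact benchmark harness; the prompt and generation settings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # dtype choice is discussed further down the thread
    device_map="auto",
)

# The tokenizer's chat template handles Gemma's <start_of_turn> formatting.
messages = [{"role": "user", "content": "Write two sentences about benchmarking."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```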
----eq-bench Benchmark Complete----
2024-06-28 10:46:02
Time taken: 10.1 mins
Prompt Format:
Model: google/gemma-2-27b-it
Score (v2): 49.16
Parseable: 129.0
! eq-bench Benchmark Failed
Yes, we are investigating what went wrong! Note that float16 should not be used for this model.
Is it okay to use int4 and bfloat16 for inference?
bfloat16 should be fine for inference, but I haven't tested int4, so I'm not sure what the quality will be like.
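For anyone who wants to try int4 alongside bfloat16, here's a minimal bitsandbytes sketch (the 4-bit settings below are common defaults, not a tested recipe for this model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-27b-it"

# 4-bit weights with bfloat16 compute, so float16 never enters the picture.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```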
I tried with float16 and it only output pad tokens. I changed it to bfloat16 and it works fine!
Thank you.
I'm trying it locally with int4 and it takes quite a quality hit, at least in Korean. (My experience with it in AI Studio was pretty good.)
It can't follow instructions very well, and the output tokens get squashed together.
Still, the experience of running the 27B model locally is great!
A single-turn conversation uses about 18GB of VRAM.
Longer context lengths can exceed the RTX 4090's 24GB VRAM.
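If anyone wants to compare VRAM numbers, here's a rough way to check PyTorch's peak allocation around a generation (this only counts PyTorch's own allocations, so actual usage will be somewhat higher):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gib:.1f} GiB")
```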
@sam-paech does 'all the GGUFs available' include the ones I posted this morning? just want to double check
Yep. It went:
transformers 16-bit > bitsandbytes 8-bit > GGUF Q8_0 (yours and others')
in rough order of quality. They all seem broken to varying degrees, though.
Interesting, I wonder if it's the lack of logit soft-capping or if something else is playing a role.
Is the transformers bf16 totally fine, or does it also experience unexpected degradation?
It wasn't fine. I'm about to test the latest patch though, will let you know.
Ok it seems fixed since the latest transformers patch.
https://github.com/huggingface/transformers/pull/31698
We will have to wait for a llama.cpp patch
So that must be it then, it's the soft-cap.
I wonder how easy that is to implement.
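For context, the soft-capping in question is just a tanh squash on the logits. A rough sketch of the op (the cap values below are what I understand the Gemma 2 config to use; treat them as assumptions):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squashes logits into (-cap, cap) instead of letting them grow unbounded.
    return cap * torch.tanh(logits / cap)

# Gemma 2 reportedly applies this twice:
# to the attention scores (cap ~ 50.0) and to the final output logits (cap ~ 30.0).
attn_scores = soft_cap(torch.randn(4, 8, 8) * 100, cap=50.0)
final_logits = soft_cap(torch.randn(4, 256000) * 100, cap=30.0)
```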
I'm getting [end of text] after a reasonable length of time, most of the time.
I've had some generations that just keep going, but they're outliers.
there is a PR for llama.cpp to fix the soft-cap and it requires a new gguf generation: https://github.com/ggerganov/llama.cpp/pull/8197
Run-on generations in HuggingChat too.
Reopening since it's not yet fixed in llama.cpp and people may be wondering
The instruction following capability of the bfloat16 version on vLLM seems poor. 9b-it doesn't have such issues. @suryabhupa
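For comparison, this is roughly the vLLM setup I'd expect people are testing with (a sketch; the sampling values are placeholders, and whether the attention backend matters for Gemma 2's soft-capping is something I haven't verified):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the prompt with the same chat template; note the template may already
# prepend <bos>, so it's worth checking you don't end up with a double BOS.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize Hamlet in two sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model_id, dtype="bfloat16")
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
```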
Update maybe? It's been 25 days and the model is still broken...
Looks like there were three PRs with Gemma fixes merged into llama.cpp three weeks ago. vLLM v0.5.1 was also released around the same time with Gemma 2 support.
I'm using this with TGI, bfloat16, and the chat template from AutoTokenizer in transformers. The output is very poor quality compared to the same settings with gemma-2-9b-it.
What's wrong with this version?
Is this issue resolved?
FYI: use eager mode for attention, i.e. attn_implementation="eager".
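That is, something like this when loading with transformers (a sketch; eager attention is slower, but it's the path where the soft-capping is reportedly applied):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # SDPA / FlashAttention-2 paths reportedly skip the soft-cap
    device_map="auto",
)
```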
Closing as the above issues seem to be resolved.