More explanation on performance?

#1
by sdalemorrey - opened

On the model card you say, "both f16.q6 and f16.q5 are smaller than q8_0 standard quantization and they perform as well as the pure f16."

A couple of things: I can't see any f16.q6 or f16.q5 files in the repo. Are those coming soon?
Also, can you explain the performance, or at least elaborate a bit more? Are we talking benchmarks, speed, etc.?

Great job by the way and thanks!

Owner

Sorry for the confusion (lol, I sounded like an LLM): the files in this repo all use f16 output and embed tensors (except the q8_p).
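
For context, mixed quants like this are typically produced with llama.cpp's `llama-quantize` tool, which can keep the output and token-embedding tensors at a higher precision than the rest of the model. A minimal sketch, assuming a llama.cpp build recent enough to support these flags (the file names below are hypothetical placeholders):

```shell
# Quantize most tensors to Q6_K while keeping the output and
# token-embedding tensors at full f16 precision.
# (model file names are hypothetical examples)
./llama-quantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  model-f16.gguf model-f16-q6_k.gguf q6_k
```

The resulting file is smaller than a plain q8_0 quant because the bulk of the weights are at Q6_K, while the two precision-sensitive tensors stay at f16.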
