More explanation on performance?
On the model card you say, "both f16.q6 and f16.q5 are smaller than q8_0 standard quantization and they perform as well as the pure f16."
A couple of things: I can't see any f16.q6 or f16.q5 files in the repo. Are those coming soon?
Also, can you explain the performance claim, or at least elaborate a bit more? Are we talking benchmarks, speed, etc.?
Great job by the way and thanks!
Sorry for the confusion (lol, I sounded like an LLM): the files in this repo all have f16 output and embed tensors (except the q8_p).