q1?

#2
by AS1200 - opened

Will there be a q1 quant? Even with all the desire, it seems q2_k won't run even with 128 GB of RAM. I understand that with 64 GB I won't be able to run even q1, but I'm going to upgrade to 96 GB.

Owner

Eventually there will be smaller ones. But even at Q2_K the model's output quality is pretty bad.
I'll work on creating a proper importance matrix for the model and use that to requantize in the future. Don't expect anything in the next couple of days though.

Owner

If you just want to test it, you can still just try it. llama.cpp can mmap the model file and doesn't need the full model in RAM. Since it's a MoE model, not all weights are needed for each token; only around 86B parameters of the weights need to be active at a time. So if you're just slightly under, there's a good chance it'll be fine.

How do I write code to use mmap?
