What is the context window size of this model? I mean, what are the input and output token limits of this model?

#1
by naveen237 - opened

Can we change the input and output tokens of the model?

The model's context size is 2048 tokens, but I don't understand what you mean by "change the input and output tokens of the model".

Sorry, that wasn't clear. What I meant is: can I change the input token and output token limits of the model?
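For what it's worth, the context window is fixed by how the model was trained, but how many tokens you feed in and how many you let it generate are just inference-time settings, as long as input plus output stays within the context window. A minimal sketch with llama-cpp-python (the model path is a placeholder, and the exact parameter names assume that library's completion API):

```python
from llama_cpp import Llama

# Load the model with an explicit context window (n_ctx).
# Raising it beyond the context length the model was trained
# with will not give you a usefully larger window.
llm = Llama(model_path="./model.gguf", n_ctx=2048)

# max_tokens caps the number of *output* tokens generated.
# Prompt tokens plus max_tokens must fit inside n_ctx.
out = llm(
    "Explain what a context window is in one sentence.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```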

If you apply 2-bit quantization to a model, the model size should theoretically decrease, since the number of bits per parameter is reduced. For example, going from a 32-bit floating-point representation to a 2-bit one should result in a smaller model. However, I saw one case where 2-bit quantization was applied to a model and the size remained 3 GB, the same as the original.

As far as I know, quantizing a model reduces the number of bits used to store each parameter, which should lead to a smaller model file. But in this case, even after using 2-bit quantization, the model size didn't decrease. Why?
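One way to diagnose this is to look at what types the tensors in the output file actually have: if the file didn't shrink, the weights may simply still be stored as F16/F32. A rough sketch using the `gguf` Python package that ships with llama.cpp (the path is a placeholder, and the reader API may differ slightly between versions):

```python
from collections import Counter

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("./model-q2_k.gguf")  # hypothetical path

# Tally how many bytes each quantization type accounts for.
bytes_per_type = Counter()
for tensor in reader.tensors:
    bytes_per_type[tensor.tensor_type.name] += int(tensor.n_bytes)

# If most of the bytes are still F16/F32, the quantization step
# did not actually touch the weights.
for qtype, n_bytes in bytes_per_type.most_common():
    print(f"{qtype:>6}: {n_bytes / 1e9:.2f} GB")
```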

If you look at the sizes of the quantizations of Llama 3.1, they keep decreasing as you use fewer bits. The Q4_1 quant is 5 GB and the Q2_K quant is 3.2 GB (a rough bits-per-weight calculation is sketched below the link).

So if you noticed no decrease in size after quantization, I wouldn't know how to explain it. Maybe you should use a different quantization library.

https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main
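As a rough sanity check on those numbers: dividing file size by parameter count gives the effective bits per weight, which is also why a "2-bit" quant of an 8B model still lands above 3 GB (block scales and some tensors kept at higher precision push the average up). A back-of-the-envelope sketch, taking the ~8B parameter count and the file sizes quoted above as given:

```python
def effective_bits_per_weight(file_size_gb: float, n_params: float) -> float:
    """Average bits stored per parameter, given the model file size."""
    return file_size_gb * 1e9 * 8 / n_params

N_PARAMS = 8.03e9  # Llama 3.1 8B

for name, size_gb in [("Q4_1", 5.0), ("Q2_K", 3.2)]:
    bpw = effective_bits_per_weight(size_gb, N_PARAMS)
    print(f"{name}: ~{bpw:.1f} bits/weight")
# Q4_1: ~5.0 bits/weight
# Q2_K: ~3.2 bits/weight -- noticeably more than a literal 2 bits
```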
