What is your environment like for quantizing models?

by disarmyouwitha

I downloaded your guanaco-13B-GPTQ model and I get nearly 80 tokens/sec using exllama:
https://github.com/turboderp/exllama

python test_benchmark_inference_log.py -d ~/llm_models/guanaco-13B-GPTQ
 -- Loading model
 -- Tokenizer: /home/nap/llm_models/guanaco-13B-GPTQ/tokenizer.model
 -- Model config: /home/nap/llm_models/guanaco-13B-GPTQ/config.json
 -- Model: /home/nap/llm_models/guanaco-13B-GPTQ/Guanaco-13B-GPTQ-4bit-128g.no-act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf', 'perplexity']
 ** Time, Load model: 1.53 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 6,683.17 MB
 -- Inference, first pass.
 ** Time, Inference: 0.75 seconds
 ** Speed: 2576.64 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 79.34 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 79.71 tokens/second
 ** VRAM, Inference: [cuda:0] 2,254.17 MB
 ** VRAM, Total: [cuda:0] 8,937.34 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: 6.3734

I also downloaded the Hugging Face repo (this one) and did the quantization myself, following the command listed in your model card:

 CUDA_VISIBLE_DEVICES=0 python llama.py /home/nap/llm_models/guanaco-13B-HF  wikitext2 --wbits 4 --true-sequential --groupsize 128 --save_safetensors /home/nap/llm_models/guanaco-13B-HF/guanaco-13B-4bit-128g-no-act-order.safetensors

However, my benchmarks are much lower (~50 tokens/sec) with my quantized version:

CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference_log.py -d ~/llm_models/guanaco-13B-HF/
 -- Loading model
 -- Tokenizer: /home/nap/llm_models/guanaco-13B-HF/tokenizer.model
 -- Model config: /home/nap/llm_models/guanaco-13B-HF/config.json
 -- Model: /home/nap/llm_models/guanaco-13B-HF/guanaco-13B-4bit-128g-no-act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf', 'perplexity']
 ** Time, Load model: 1.91 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 6,689.96 MB
 -- Inference, first pass.
 ** Time, Inference: 0.83 seconds
 ** Speed: 2319.07 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 50.98 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 50.99 tokens/second
 ** VRAM, Inference: [cuda:0] 2,254.17 MB
 ** VRAM, Total: [cuda:0] 8,944.13 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: 6.3615

I am using the latest cuda branch for quantizing -- what branch are you using? (Do you think it matters?) Thanks!

I'm currently quantising with the oobabooga GPTQ-for-LLaMa fork, to ensure compatibility for as many users as possible.

In the near future I expect to move to quantising with AutoGPTQ instead, when it is mature enough to be the default for users.

That's odd that you would get lower performance when quantising yourself versus the file I uploaded. Normally performance is affected by the code used to do inference, not by the code used to quantise in the first place.
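
One visible difference between your two logs is the inferred act-order flag (no for my file, yes for yours). If you want to check what a given checkpoint actually contains, here's a rough sketch that inspects the g_idx tensors directly - it assumes the file follows GPTQ-for-LLaMa's usual layout (qweight/qzeros/scales/g_idx per quantised layer), and the path is just an example:

 import torch
 from safetensors.torch import load_file

 # Example path - point this at whichever quantised checkpoint you want to inspect.
 state = load_file("/home/nap/llm_models/guanaco-13B-HF/guanaco-13B-4bit-128g-no-act-order.safetensors")

 group_size = 128
 act_order = False
 for name, g_idx in state.items():
     if not name.endswith(".g_idx"):
         continue
     # Without act-order, g_idx is simply [0]*128 + [1]*128 + ..., i.e. i // group_size.
     trivial = torch.arange(g_idx.numel(), dtype=g_idx.dtype) // group_size
     if not torch.equal(g_idx, trivial):
         act_order = True
         print(f"act-order detected in {name}")
         break

 if not act_order:
     print("no act-order: g_idx tensors are absent or trivially ordered")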

Which GPTQ-for-LLaMa fork did you use specifically?

I was using the master cuda branch from qwopqwop200/GPTQ-for-LLaMa, I think.

Pulling the Ooba cuda branch of GPTQ-for-LLaMa and requantizing with the same command gave results that matched yours - thank you!

I am going to pull qwopqwop200's latest cuda/triton branches and try again to see if I can replicate the slow results.

OK, yeah, the CUDA branch is known to be slow. Either try the ooba text-gen-ui CUDA fork I listed, or try the latest Triton AutoGPTQ (if you're on Linux).

Or switch to AutoGPTQ, which is the future anyway.
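
For reference, quantising with AutoGPTQ looks roughly like this - a sketch against its documented Python API, not the exact script I use, with placeholder paths and a toy calibration sample (you'd want a proper dataset like wikitext2):

 from transformers import AutoTokenizer
 from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

 model_dir = "/home/nap/llm_models/guanaco-13B-HF"        # unquantised HF model (placeholder)
 out_dir = "/home/nap/llm_models/guanaco-13B-GPTQ-auto"   # output dir for 4-bit weights (placeholder)

 quantize_config = BaseQuantizeConfig(
     bits=4,          # 4-bit weights
     group_size=128,  # same groupsize as the files above
     desc_act=False,  # no act-order
 )

 tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
 model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)

 # Toy calibration data - replace with a real set of tokenised samples.
 examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

 model.quantize(examples)
 model.save_quantized(out_dir, use_safetensors=True)

Setting desc_act=True enables act-order, which usually helps perplexity a little but is slower on the older CUDA kernels.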

What are you trying to do specifically? Like are you quantising just to test, or quantising because you want quantised models for yourself? Or because you want to quantise models for others to use?

My recommendation will vary depending on what you want to do, so let me know and I can advise more tomorrow.

Well, two reasons - I would like to be able to quantize models for myself, just in case you don't have one up.

But what has really pulled me down this rabbit hole is that I am using this code for inference:
https://github.com/turboderp/exllama

It's lightning-fast custom CUDA voodoo as far as I can tell. It is much faster than any of the other branches I have tested (though I haven't tried AutoGPTQ - do you have resources for it?).

BUT on my go-to model (TheBloke/koala-13B-GPTQ-4bit-128g) I was only getting 50 tokens/sec compared to the creator's 80-90 t/s, and he mentioned PyTorch has a CPU bottleneck, so I just assumed it was that.

UNTIL I tried your TheBloke/Project-Baize-v2-13B-GPTQ model and got ~70 tokens/sec, then tried TheBloke/guanaco-13B-GPTQ and also got ~70 tokens/sec - so I realized it had something to do with the model file itself.

So I grabbed the HF model and quantized it myself, only to get 50 tokens/sec - just like I had gotten when quantizing my own Koala.

Using the repo you listed I am getting very consistently fast speeds, so thank you!

A bit of follow up:

I requantized the weights 3 times - using Ooba_cuda, Qwop_cuda, and Qwop_triton - and Ooba_cuda was the only one to hit this speed~

[screenshot: benchmark results, 2023-05-29]

I was able to get the basic_example.py code working with AutoGPTQ, so I will check this out as well~
I like that it already supports quantization of Falcon (and so many other models!)
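
For anyone following along, the loading side is adapted loosely from the basic_example.py flow - a sketch with placeholder paths, plus a crude tokens/sec check (not comparable to exllama's internal benchmark):

 import time
 import torch
 from transformers import AutoTokenizer
 from auto_gptq import AutoGPTQForCausalLM

 model_dir = "/home/nap/llm_models/guanaco-13B-GPTQ"  # placeholder path to a quantised model

 tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
 # May need model_basename=... if the safetensors file doesn't use AutoGPTQ's default name.
 model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_safetensors=True)

 prompt = "Explain GPTQ quantization in one paragraph."
 inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

 # Time a fixed-length greedy generation and report tokens/second.
 torch.cuda.synchronize()
 start = time.time()
 output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
 torch.cuda.synchronize()
 elapsed = time.time() - start

 new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 print(f"{new_tokens / elapsed:.2f} tokens/second")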

Thanks, again!

disarmyouwitha changed discussion status to closed
