elinas/alpaca-30b-lora-int4

Apr 2, 2023

I am learning about Alpaca models. Can you please point me in the right direction on what to use to chat with the model on a GPU? Thank you.

elinas

Owner Apr 2, 2023

Please take a look at the README for two ways to run inference, including chat (option 2).

elinas changed discussion status to closed Apr 2, 2023

rkj45

Apr 3, 2023

I got this error:

CUDA SETUP: CUDA runtime path found: /opt/miniconda/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/miniconda/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading alpaca-30b-lora-int4...
Loading model ...
Traceback (most recent call last):
File "/app/text-generation-webui/server.py", line 276, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/app/text-generation-webui/modules/models.py", line 102, in load_model
model = load_quantized(model_name)
File "/app/text-generation-webui/modules/GPTQ_loader.py", line 111, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
File "/app/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 228, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "/opt/miniconda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros", "model.layers.0.mlp.down_proj.qzeros"

rkj45 changed discussion status to open Apr 3, 2023

elinas

Owner Apr 3, 2023

edited Apr 3, 2023

Please see https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#step-1-install-gptq-for-llama

There are breaking changes and you should use commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 in the cuda branch.

elinas

Owner Apr 3, 2023

I added an update here: https://huggingface.co/elinas/alpaca-30b-lora-int4#update-2023-04-03

LexSong

Apr 4, 2023

Hi, I think a6f363e is not a commit in qwopqwop200/GPTQ-for-LLaMa, right?

Currently, I can use 468c47c of qwopqwop200/GPTQ-for-LLaMa with the old alpaca-30b-4bit.pt. But which version can I use with the safetensors checkpoints? I tried the latest version of GPTQ-for-LLaMa and it didn't work.

rkj45

Apr 4, 2023

btw a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 didn't work, I got the same error

elinas

Owner Apr 4, 2023

> Hi, I think a6f363e is not a commit in qwopqwop200/GPTQ-for-LLaMa, right?

Yes it is, for the cuda branch. There is also a triton branch but I haven't messed with it.

> btw a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 didn't work, I got the same error

Run git log and make sure you're on the correct commit. It works fine for me. If you are on it and it still does not work, try reinstalling all of the requirements and running python setup_cuda.py install.

commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 (HEAD -> cuda-stable)
Author: oobabooga <[email protected]>
Date:   Fri Mar 31 00:31:06 2023 -0300

    Move model saving back to the end
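
For anyone following along, the commit check can also be scripted. This is only a sketch and not part of text-generation-webui: the helper names matches_pinned_commit and current_commit are made up for illustration, and the repository path in the usage note is assumed from the tracebacks in this thread.

```python
import subprocess

# Commit recommended earlier in this thread for the cuda branch.
PINNED_COMMIT = "a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773"


def matches_pinned_commit(head: str, pinned: str = PINNED_COMMIT) -> bool:
    """Return True if `head` is the pinned hash or an abbreviation of it (7+ chars)."""
    head = head.strip().lower()
    return len(head) >= 7 and pinned.lower().startswith(head)


def current_commit(repo_dir: str) -> str:
    """Ask git for the full hash that HEAD points to in `repo_dir`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git fails
    )
    return result.stdout.strip()
```

Run from the text-generation-webui directory, matches_pinned_commit(current_commit("repositories/GPTQ-for-LLaMa")) should be True when the checkout is on the commit above. Adjust the path if your layout differs.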

rkj45

Apr 4, 2023

Sadly no luck. I tried with Python 3.9 and 3.10, and started from scratch with the right requirements.

(textgen) root@9b843d1d1b8e:/app/text-generation-webui# python server.py --model alpaca-30b-lora-int4 --wbits 4
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading alpaca-30b-lora-int4...
Loading model ...
Traceback (most recent call last):
File "/app/text-generation-webui/server.py", line 276, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/app/text-generation-webui/modules/models.py", line 102, in load_model
model = load_quantized(model_name)
File "/app/text-generation-webui/modules/GPTQ_loader.py", line 114, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
File "/app/text-generation-webui/modules/GPTQ_loader.py", line 45, in _load_quant
model.load_state_dict(torch.load(checkpoint))
File "/opt/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros"

(textgen) root@9b843d1d1b8e:/app/text-generation-webui/repositories/GPTQ-for-LLaMa# git log
commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 (HEAD -> cuda-stable)
Author: oobabooga [email protected]
Date: Fri Mar 31 00:31:06 2023 -0300

elinas

Owner Apr 4, 2023

Are you using the old .pt model or one of the new safetensors models? The former will not work unless you're on a pretty old commit.
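
The "Missing key(s)" error in the tracebacks is raised by torch's load_state_dict when the model expects parameter names that the checkpoint file does not contain. A minimal sketch of that comparison follows, in case it helps to inspect a checkpoint before loading it. diff_keys and summarize_suffixes are hypothetical helpers written for this note, not functions from GPTQ-for-LLaMa or the web UI.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Compare the names a model expects with the names stored in a checkpoint.

    Returns (missing, unexpected), both sorted:
      missing    - expected by the model but absent from the checkpoint
      unexpected - present in the checkpoint but unknown to the model
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


def summarize_suffixes(keys):
    """Count keys by their last dotted component, e.g. 'qzeros' or 'scales'."""
    counts = {}
    for key in keys:
        suffix = key.rsplit(".", 1)[-1]
        counts[suffix] = counts.get(suffix, 0) + 1
    return counts
```

With PyTorch installed, model.state_dict().keys() gives the expected names, and torch.load(path, map_location="cpu").keys() gives the stored names, assuming the .pt file holds a plain state dict. If summarize_suffixes(missing) is dominated by qzeros, as in the errors posted here, the checkpoint file and the installed GPTQ-for-LLaMa code most likely do not match.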

rkj45

Apr 4, 2023

(textgen) root@9b843d1d1b8e:/app/text-generation-webui/models/alpaca-30b-lora-int4# ls -alh
total 49G
drwxr-xr-x 1 root root 4.0K Apr 3 18:10 .
drwxr-xr-x 1 root root 4.0K Apr 3 16:42 ..
drwxr-xr-x 1 root root 4.0K Apr 3 18:10 .git
-rw-r--r-- 1 root root 1.5K Apr 3 16:42 .gitattributes
-rw-r--r-- 1 root root 11K Apr 3 16:42 README.md
-rw-r--r-- 1 root root 17G Apr 3 18:10 alpaca-30b-4bit-128g.safetensors
-rw-r--r-- 1 root root 16G Apr 3 18:07 alpaca-30b-4bit.pt
-rw-r--r-- 1 root root 16G Apr 3 18:04 alpaca-30b-4bit.safetensors
-rw-r--r-- 1 root root 426 Apr 3 16:42 config.json
-rw-r--r-- 1 root root 124 Apr 3 16:42 generation_config.json
-rw-r--r-- 1 root root 47K Apr 3 16:42 pytorch_model.bin.index.json
-rw-r--r-- 1 root root 2 Apr 3 16:42 special_tokens_map.json
-rw-r--r-- 1 root root 489K Apr 3 16:42 tokenizer.model
-rw-r--r-- 1 root root 141 Apr 3 16:42 tokenizer_config.json

elinas

Owner Apr 4, 2023

Keep only the one checkpoint you plan to use in your model directory.
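
A quick way to spot this situation is to list the checkpoint files in the model folder and complain when there is more than one. The sketch below assumes that files ending in .pt or .safetensors are the checkpoints, which matches the listing above but is not taken from the web UI's loader. The function names are made up for illustration.

```python
from pathlib import Path

# Extensions treated as model checkpoints in this sketch (an assumption).
CHECKPOINT_SUFFIXES = (".pt", ".safetensors")


def find_checkpoints(filenames):
    """Return the sorted names that look like checkpoint files."""
    return sorted(name for name in filenames if name.endswith(CHECKPOINT_SUFFIXES))


def check_model_dir(model_dir):
    """Return the single checkpoint name in `model_dir`, or raise if there is not exactly one."""
    names = [p.name for p in Path(model_dir).iterdir() if p.is_file()]
    found = find_checkpoints(names)
    if len(found) != 1:
        raise ValueError(f"expected exactly one checkpoint, found {len(found)}: {found}")
    return found[0]
```

For the directory listed above, find_checkpoints would report three files, so two of them would need to be moved elsewhere before loading.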

LexSong

Apr 4, 2023

edited Apr 4, 2023

> Hi, I think a6f363e is not a commit in qwopqwop200/GPTQ-for-LLaMa, right?

> Yes it is, for the cuda branch. There is also a triton branch but I haven't messed with it.

I found the issue. a6f363e is on oobabooga/GPTQ-for-LLaMa, not on qwopqwop200's original repo.

rkj45

Apr 5, 2023

That worked, thank you! Now I am facing another issue: there seems to be extra text in every response. Do you know what it could be?
https://prnt.sc/a-NmywcPdKAa

disarmyouwitha

Apr 5, 2023

edited Apr 5, 2023

> That worked, thank you! Now I am facing another issue: there seems to be extra text in every response. Do you know what it could be?
> https://prnt.sc/a-NmywcPdKAa

Haha, I get this too. It seems to go away if I try the "example" character card, so I think it may be the default parameters.

The model card seems to list some preferred params ^^

elinas changed discussion status to closed Apr 12, 2023
