TheBloke committed
Commit e9fd8bf
1 Parent(s): 6170d2f

Update README.md

Files changed (1): README.md +2 -2
README.md CHANGED
@@ -181,7 +181,7 @@ print(pipe(prompt_template)[0]['generated_text'])
 
 This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not* work with ExLlama.
 
-It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to increase inference speed.
+It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to improve accuracy of responses.
 
 * `gptq_model-4bit-128g.safetensors`
   * Works with AutoGPTQ in CUDA or Triton modes.
@@ -198,7 +198,7 @@ This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not*
 
 It was created with both group_size 128g and --act-order (desc_act) for increased inference quality.
 
-**Note** Using group_size + desc_act together can significantly lower performance in AutoGPTQ CUDA. You might want to try AutoGPTQ Triton mode instead (Linux only.)
+It was created with both group_size 128g and --act-order (desc_act) for even higher inference accuracy, at the cost of increased VRAM usage. Because we already need 2 x 80GB or 3 x 48GB GPUs, I don't expect the increased VRAM usage to change the GPU requirements.
 
 * `gptq_model-4bit-128g.safetensors`
   * Works with AutoGPTQ in CUDA or Triton modes.
 
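For readers who want to try one of these quantisations, below is a minimal sketch of loading such a file with AutoGPTQ. The repository ID, model_basename and generation settings are illustrative assumptions, not values taken from this README; `from_quantized` and its parameters are standard auto-gptq API.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Hypothetical repo ID and basename, for illustration only.
model_dir = "TheBloke/example-GPTQ"
model_basename = "gptq_model-4bit-128g"  # .safetensors filename without the extension

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

# bits, group_size (128 or -1) and desc_act (act-order) are read from
# quantize_config.json in the model directory; they correspond to the
# settings the diff above describes.
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,  # True selects Triton mode (Linux only)
)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```

As the edited paragraphs note, the group_size none (-1) variant is the lower-VRAM option, while the 128g + desc_act variant targets higher inference accuracy at the cost of more VRAM.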