TheBloke committed
Commit e9fd8bf
1 Parent(s): 6170d2f

Update README.md

Files changed (1): README.md +2 -2
README.md CHANGED
@@ -181,7 +181,7 @@ print(pipe(prompt_template)[0]['generated_text'])
 
 This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not* work with ExLlama.
 
-It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to increase inference speed.
+It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to improve accuracy of responses.
 
 * `gptq_model-4bit-128g.safetensors`
   * Works with AutoGPTQ in CUDA or Triton modes.
@@ -198,7 +198,7 @@ This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not*
 
 It was created with both group_size 128g and --act-order (desc_act) for increased inference quality.
 
-**Note** Using group_size + desc_act together can significantly lower performance in AutoGPTQ CUDA. You might want to try AutoGPTQ Triton mode instead (Linux only.)
+It was created with both group_size 128g and --act-order (desc_act) for even higher inference accuracy, at the cost of increased VRAM usage. Because we already need 2 x 80GB or 3 x 48GB GPUs, I don't expect the increased VRAM usage to change the GPU requirements.
 
 * `gptq_model-4bit-128g.safetensors`
   * Works with AutoGPTQ in CUDA or Triton modes.
 
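For readers who want to try one of these quantisations, below is a minimal sketch of loading such a file with AutoGPTQ. The repository ID, model_basename and generation settings are illustrative assumptions, not values taken from this README; `from_quantized` and its parameters are standard auto-gptq API.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Hypothetical repo ID and basename, for illustration only.
model_dir = "TheBloke/example-GPTQ"
model_basename = "gptq_model-4bit-128g"  # .safetensors filename without the extension

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

# bits, group_size (128 or -1) and desc_act (act-order) are read from
# quantize_config.json in the model directory; they correspond to the
# settings the diff above describes.
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,  # True selects Triton mode (Linux only)
)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```

As the edited paragraphs note, the group_size none (-1) variant is the lower-VRAM option, while the 128g + desc_act variant targets higher inference accuracy at the cost of more VRAM.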