I'm trying to run this model using oobabooga, but I'm only getting 0.17 tokens/second.
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models\llama-13b-4bit-128g.
Modify your start-webui.bat so it uses this line:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128
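For what it's worth, the reason for that OSError is that a 4-bit GPTQ model ships as a single quantized .safetensors (or .pt) file rather than the pytorch_model.bin that transformers looks for by default, so the loader only picks it up once the --wbits / --groupsize flags are passed. You can confirm what's actually in the folder from a command prompt (path taken from the error message, adjust to your install):
dir models\llama-13b-4bit-128g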
I got this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 8.00 GiB total capacity; 7.08 GiB already allocated; 0 bytes free; 7.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 13.07 seconds (0.00 tokens/s, 0 tokens, context 43)
I have 64GB of RAM and 8GB of VRAM.
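The error text itself suggests setting max_split_size_mb. In start-webui.bat that would be something like the following (128 is just a common starting value, and I'm not sure it helps when the model simply doesn't fit in 8 GB):
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
call python server.py --auto-devices --chat --wbits 4 --groupsize 128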
Someone mentioned in oobabooga's repository issues that you also need to use the "pre_layer" flag so the model doesn't completely fill your GPU, leaving part of the VRAM free for text generation. The higher the "pre_layer" value, the faster the model will respond, but the more likely it is to run out of VRAM. I set my "pre_layer" parameter to 26, so it's a bit slow but still manageable. Depending on how long the chat history gets, VRAM may still run out; I've tried tweaking other parameters but haven't had any success so far. Anyway, the line should look like this:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --pre_layer 26
Note: if anyone reading this is running out of CPU memory (system RAM, not VRAM), try increasing your OS's virtual memory to over 100GB.
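On Windows you can do that through System Properties > Advanced > Performance settings > Virtual memory, or from an admin command prompt with something roughly like this (sizes are in MB, so 102400 is about 100GB; double-check the wmic syntax before relying on it):
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=102400,MaximumSize=102400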
Try again with this line:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --gpu-memory 7 --pre_layer 19
This limits how much VRAM gets used; it works on my RTX 3070 with 8GB of VRAM.
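If 7 still overflows once the context grows, the flag also accepts finer-grained values in MiB (per the webui README, if I recall it correctly), so you can shave it down a bit more, e.g.:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --gpu-memory 6500MiB --pre_layer 19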
I'm getting an average of 0.17 tokens/second. Is this normal?