What kind of hardware environment do you use?
#14 · by bobospace · opened
I run the grok-1-IQ3_XS-split-00001-of-00009.gguf model on my M3 Max MacBook Pro with 128 GB of RAM, using this command line:
"./server -m grok-1-IQ3_XS-split-00001-of-00009.gguf --port 8888 --host 0.0.0.0 --ctx-size 1024 --parallel 4 -ngl 999 -n 512"
but it only gives me 0.02 tokens per second.
Thanks. What's really weird is that I compiled llama.cpp with Metal support and run with -ngl 99, and it's still really slow, but RAM usage is only at 50%.
If I want to merge those split files into one GGUF file, can I use ./gguf-split --merge to do it?
Yes, gguf-split --merge should merge the files. That won't change anything about your memory issue, though.
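In case it helps, a merge invocation might look like the sketch below. You pass the first split and an output path; the output filename here is just an example, not something from this thread.

```shell
# Merge a split GGUF back into a single file.
# Pass the FIRST split; gguf-split finds the remaining parts automatically.
# "grok-1-IQ3_XS.gguf" is an example output name.
./gguf-split --merge grok-1-IQ3_XS-split-00001-of-00009.gguf grok-1-IQ3_XS.gguf
```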
Maybe look into mmap and how memory gets reported (file cache vs. process memory).
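Concretely: with mmap (the default), the weights are counted as file cache rather than process memory, so tools like Activity Monitor can make RAM usage look low even when the whole model is resident. Assuming a llama.cpp build with the --no-mmap flag, you could compare the two modes as a rough check:

```shell
# Default: weights are mmap'd from disk and show up as file cache,
# not as the server process's own memory.
./server -m grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 999 --ctx-size 1024

# --no-mmap loads the weights into the process's own allocation instead,
# so memory tools attribute them directly to the server process.
./server -m grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 999 --ctx-size 1024 --no-mmap
```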