What kind of hardware environment do you use?
#14 · by bobospace · opened
I run the grok-1-IQ3_XS-split-00001-of-00009.gguf model on my M3 Max MacBook Pro with 128 GB of RAM, using this command line:
"./server -m grok-1-IQ3_XS-split-00001-of-00009.gguf --port 8888 --host 0.0.0.0 --ctx-size 1024 --parallel 4 -ngl 999 -n 512"
but it only gives me 0.02 tokens per second.
Thanks. What's really weird is that I compiled llama.cpp with Metal support and run with -ngl 99, and it's still really slow, but RAM usage is only at 50%.
If I want to merge those split files into one GGUF file, can I use ./gguf-split --merge to do it?
Yes, gguf-split --merge should merge the files. That won't change anything about your memory issue, though.
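In case it helps, a merge invocation might look like the sketch below. You pass the first split and an output path; the output filename here is just an example, not something from this thread.

```shell
# Merge a split GGUF back into a single file.
# Pass the FIRST split; gguf-split finds the remaining parts automatically.
# "grok-1-IQ3_XS.gguf" is an example output name.
./gguf-split --merge grok-1-IQ3_XS-split-00001-of-00009.gguf grok-1-IQ3_XS.gguf
```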
Maybe look into mmap and how memory gets reported (file cache vs. process memory).
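Concretely: with mmap (the default), the weights are counted as file cache rather than process memory, so tools like Activity Monitor can make RAM usage look low even when the whole model is resident. Assuming a llama.cpp build with the --no-mmap flag, you could compare the two modes as a rough check:

```shell
# Default: weights are mmap'd from disk and show up as file cache,
# not as the server process's own memory.
./server -m grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 999 --ctx-size 1024

# --no-mmap loads the weights into the process's own allocation instead,
# so memory tools attribute them directly to the server process.
./server -m grok-1-IQ3_XS-split-00001-of-00009.gguf -ngl 999 --ctx-size 1024 --no-mmap
```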