first model quanted @ nico1

by mradermacher - opened

@nicoboss this is the first model actually quanted on your box. I'll have problems controlling upload bandwidth (it's a shell script doing it, and it is used to start multiple uploads in parallel, so simply limiting the speed of one huggingface-cli call will not help), so bear with me for a while.
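
For illustration, the upload pattern is roughly the following - a minimal sketch, not the actual script, with repo and file names made up:

    # hypothetical sketch of the parallel upload pattern: several quants are pushed at once,
    # so throttling a single huggingface-cli process would only cap one of N concurrent
    # streams, not the total upstream bandwidth
    for quant in IQ1_S IQ1_M IQ2_XXS; do
        huggingface-cli upload "mradermacher/SomeModel-i1-GGUF" \
            "SomeModel.i1-$quant.gguf" "SomeModel.i1-$quant.gguf" &
    done
    wait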

I don't intend to quant things here regularly due to the upload bandwidth, but I'll see how quanting the slower and smaller 405B quants works out. Hopefully.

mradermacher changed discussion status to closed

@mradermacher Awesome to hear. Feel free to quantize on my box as often as you want. No need to ever add any bandwidth limits on your side. I configured my OpenWrt router so it prioritizes traffic in a way that allows you to max out both download and upload traffic without impacting anyone. I could easily add an upload bandwidth limit on the router side but prefer not to, as I'm satisfied with the way it is. Last week I maxed out my upload speed for over 24 hours uploading over 1 TB, and I can confirm that the issue of my ISP's gateway crashing under such workloads seems to finally be fixed.

Sigh, and a setback - zfs does not support zero-copy operations, so I have to rewrite my tooling to do normal copies for larger files. Or maybe it's an opportunity telling me I should rsync the files out first anyway, since the chances of a successful upload to huggingface are so low and retrying would be a disaster.
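
The fallback itself is trivial - a minimal sketch, assuming GNU coreutils and made-up paths:

    # try a zero-copy clone first; on zfs (no reflink support here) fall back to a normal copy
    src=/tmp/quant/SomeModel.Q4_K_S.gguf    # hypothetical source path
    dst=/upload/SomeModel.Q4_K_S.gguf       # hypothetical staging path
    cp --reflink=always "$src" "$dst" 2>/dev/null || cp "$src" "$dst"
    # cp --reflink=auto "$src" "$dst" would do the same fallback in a single call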

I could easily migrate your LXC container to a different file system. What file system do you want? LVM, LVM-Thin, ext4, xfs or a different one? I don't think LVM makes sense, as you are using the entire SSD anyway. Migrating to a different file system will require 1-2 hours of downtime, as it requires all data of your LXC container to be copied twice (to a temporary M.2 SSD and then back to your current one).
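
On the Proxmox side such a migration would look roughly like this - a sketch only, with the container ID, storage names and paths being assumptions:

    # first copy: back up the container to the temporary M.2 SSD
    vzdump 108 --mode stop --compress zstd --dumpdir /mnt/tmp-m2
    # second copy: restore it onto the re-created storage that uses the new file system
    pct restore 108 /mnt/tmp-m2/vzdump-lxc-108-*.tar.zst --storage local-newfs --force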

I wouldn’t say chances of a successful upload to HuggingFace are low. My internet is quite stable. Just do it the way you prefer.

In the meantime, I implemented a normal copy. I will take you up on that offer if this works out to be reasonable.

The problem is not your internet connection, the problem is huggingface itself. While it has improved, I still regularly get internal s3 errors (I don't have an example at hand, but it was common for huggingface to fail because s3 complained that the uploaded chunk did not have the correct size) - I suspect if somebody has bad internet, it's huggingface, I mean aws. The problem is that this means a complete re-upload, which would be such a waste.

That was one of the reasons why I started to upload per-quant, not per-model, because the chances of a successful upload of a 1TB repo were essentially zero - it did work because you made progress over time (the files are hashed before upload, and single files that were uploaded earlier are cached for a while on the hf side, but hashing the files can be as slow as uploading).
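
The per-quant retry pattern is roughly this - a sketch only, with made-up repo and file names:

    # retry a single-quant upload a few times; a failed attempt is comparatively cheap to
    # repeat, since files are hashed first and recently uploaded files are cached on the hf side
    upload_quant() {
        local file=$1
        for attempt in 1 2 3; do
            huggingface-cli upload "mradermacher/SomeModel-GGUF" "$file" "$file" && return 0
            echo "upload of $file failed (attempt $attempt), retrying" >&2
            sleep 60
        done
        return 1
    }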

Anyway, I'll see how it works out. I am starting with IQ1 uploads of the llama 405b-instruct model, and will probably continue tomorrow. I also have no time-scheduling for quants yet, and quantising finally really taxes your cpu :)

Anyway, also, thanks for being so very helpful. My software is normally rather more portable, but in this case, I "knew" it would only ever run on my systems :) Those decisions always come back :)

OTOH, moving the quantisation to your side was rather easy - the only remaining issues are caused by me securing the network a bit better, so my vm on your side doesn't have the ability to call out or download from my side anymore. But a bit of manual work (copying back the imatrix file that is helpfully getting deleted by my other job scheduler...) for the very few really big models is fine.
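
That manual step is essentially just a push from my side - host alias and paths here are placeholders:

    # the vm can no longer pull, so push the imatrix file over from my side by hand
    rsync -av --partial /tmp/SomeModel.imatrix nico1:/tmp/imatrix/SomeModel.imatrix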

Recently, these plague me:

{"error":"Internal Error - We're working hard to fix this as soon as possible!"}

modern cloud-based websites.

Also, llama.cpp is horribly inefficient. Instead of, say, quanting one tensor per thread and parallelising that, it does a primitive read/quantize/write loop, which means it only utilizes ~50% of the cpu power; the rest of the time is spent waiting for I/O. But hey, at least the tensor quantisation itself runs in parallel and is super fast!!11 You can see it below, alternating between ~1.3GB/s reads at a near-idle cpu and 100% cpu with hardly any disk activity:

--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
  6   1  93   0   0| 148M  109M|   0     0 |   0     0 | 139k  184k
 11   2  87   0   0|1200M    0 | 264B 1114B|   0     0 |  28k   22k
 99   0   1   0   0|   0  5736k|  66B  350B|   0     0 |  61k 1780 
 76   0  23   0   0| 248M   16k| 132B  492B|   0     0 |  51k 5348 
  0   2  98   0   0|1374M  548M|  66B  350B|   0     0 |  27k   36k
  0   2  98   0   0|1414M    0 |  66B  350B|   0     0 |  20k   23k
 76   0  24   0   0| 293M    0 | 198B  674B|   0     0 |  53k 6347 
 99   0   1   0   0|  48k 5248k|  66B  456B|   0     0 |  61k 3034 
100   0   0   0   0| 256k   16k|1800B 1988B|   0     0 |  60k  525 
100   0   0   0   0| 768k 1824k|4502B 4882B|   0     0 |  61k 1630 
100   0   0   0   0|   0     0 |  66B  350B|   0     0 |  62k 1242 
 34   1  65   0   0| 810M    0 | 396B 1172B|   0     0 |  37k   16k
 73   0  26   0   0| 342M    0 | 198B  674B|   0     0 |  51k 6775 
 78   0  21   0   0| 267M 1896k| 132B  492B|   0     0 |  53k 6168 
 44   1  55   0   0| 758M    0 | 198B  674B|   0     0 |  39k   13k
 89   0  11   0   0| 128M  579M| 264B  816B|   0     0 |  62k   15k
  1   2  97   0   0|1357M    0 | 132B  492B|   0     0 |  22k   23k
  0   2  98   0   0|1379M  736k|  66B  350B|   0     0 |  20k   23k
 54   1  45   0   0| 592M 6256k| 900B 1376B|   0     0 |  45k   13k
 98   0   2   0   0|   0     0 |  66B  562B|   0     0 |  59k  796 
100   0   0   0   0| 120k  170M|  66B  342B|   0     0 |  62k 5328 
100   0   0   0   0|  80k    0 |  66B  350B|   0     0 |  62k 2589 
100   0   0   0   0|   0     0 |  66B  350B|   0     0 |  61k  594 
 41   1  58   0   0| 777M 5080k| 132B  484B|   0     0 |  37k   14k
  0   2  98   0   0|1386M    0 |  66B  350B|   0     0 |  19k   23k
 13   2  85   0   0|1165M  431M| 198B  674B|   0     0 |  32k   31k
 97   0   3   0   0|   0     0 |  66B  350B|   0     0 |  60k 1225 
100   0   0   0   0|   0     0 |  66B  342B|   0     0 |  62k 1854 
100   0   0   0   0|   0  6128k| 232B  384B|   0     0 |  61k 1369

You actually hadn't mentioned this before - you only mentioned adding another rpc server, not that this is some special case/bug inside llama.cpp. Also, it's not true - if it loaded the whole model into RAM it would either crash or swap large parts of it out, neither of which has happened.

In any case, I'll go to sleep soon, and will check when I get up and likely re-attempt the imatrix. It seems we are close.

Completely unrelated btw., the 1T has an identity crisis:

llama_model_loader: - kv 2: general.name str = BigLlama 3.1 681B Instruct
llama_model_loader: - kv 8: general.base_model.0.name str = Meta Llama 3.1 681B Instruct

I wonder how many config.json files have the wrong data (assuming that's where llama.cpp took it from)

You actually hadn't mentioned this before - you only mentioned adding another rpc server, not that this is some special case/bug inside llama.cpp. Also, it's not true - if it loaded the whole model into RAM it would either crash or swap large parts of it out, neither of which has happened.

This bug was the reason why the thing kept streaming from SSD instead of running from RAM. The host just allocated like 770 GB of virtual memory, which it then streamed from SSD for every token. Luckily, offloading all the layers to the RPC servers fixed this.

In any case, I'll go to sleep soon, and will check when I get up and likely re-attempt the imatrix. It seems we are close.

I will be on a hike tomorrow and have no other services running, so it would be a great opportunity to do BigLlama-3.1-1T-Instruct.IQ4_XS.gguf, should we not get the Q6_K version working over RPC. If you need mlock just reboot your container, but IQ4_XS should easily fit in RAM + GPU memory if you make use of all available GPUs.
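
The mlock bit refers to the container's locked-memory limit, which only takes effect on a restart - roughly along these lines, with the container ID as an example and the exact configuration possibly differing:

    # raise the locked-memory limit for the container, then reboot it so the limit applies
    echo 'lxc.prlimit.memlock: unlimited' >> /etc/pve/lxc/108.conf
    pct reboot 108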

Completely unrelated btw., the 1T has an identity crisis

Yes, I noticed this as well. Likely the model author forgot to change it when creating the 1T merge.

The host just allocated like 770 GB of virtual memory, which it then streamed from SSD for every token. Luckily, offloading all the layers to the RPC servers fixed this.

It's a fascinating bug - I wonder what would cause it to actually access all the tensor data. But no gain in worrying about it too much.

If you need mlock just reboot your container

Right, good reminder, the time for reboot is now.

do BigLlama-3.1-1T-Instruct.IQ4_XS.gguf should we not get the Q6_K version working over RPC.

I am basically waiting for you to tell me the rpc servers are up and whether you already ran inference or whether I have to do it first. My parameters will be:

     "extra_args" : "--rpc 192.168.200.139:7139,192.168.200.138:7138,192.168.200.137:7137 --no-check-tensors -ngl 999",

Anyway, signing out for good. Have fun tomorrow and don't fall down these tall mountains over there :)

Right, good reminder, the time for reboot is now.

Awesome! I can confirm that the reboot worked and the new settings got applied.

I am basically waiting for you to tell me the rpc servers are up and whether you already ran inference or whether I have to do it first.

I'm testing running imatrix on the RPC servers right now. Sorry it all takes really long: every time something goes wrong I first have to run inference, which takes half an hour to load, and then run imatrix, which takes another half hour to load. Last time I tried, I exited inference with Ctrl+C instead of specifying a low number of tokens to generate, which caused one of the RPC servers to close as well, requiring me to restart the entire process. It is now all working and currently loading imatrix, so we should soon know whether it works, if nothing goes wrong.

My parameters will be: "extra_args" : "--rpc 192.168.200.139:7139,192.168.200.138:7138,192.168.200.137:7137 --no-check-tensors -ngl 999",

This is correct. This is what I currently use:

CUDA_VISIBLE_DEVICES=1 ./llama-cli -m /mradermacher/tmp/BigLlama-3.1-1T-Instruct.Q6_K.gguf -p "Hi" --repeat-penalty 1.0 -c 512 -n 3 --rpc 192.168.200.139:7139,192.168.200.138:7138,192.168.200.137:7137 -ngl 1000
CUDA_VISIBLE_DEVICES=1 ./llama-imatrix -m /mradermacher/tmp/BigLlama-3.1-1T-Instruct.Q6_K.gguf -f calibration_datav3.txt --rpc 192.168.200.139:7139,192.168.200.138:7138,192.168.200.137:7137 -ngl 1000

I just gave you SSH access to all the RPC servers. On nico1 just execute the following commands to access them:
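
Something along these lines, with the actual account name substituted:

    # the three RPC servers, reachable from nico1 (user name here is a placeholder)
    ssh root@192.168.200.139
    ssh root@192.168.200.138
    ssh root@192.168.200.137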

The RPC server is running inside tmux. Enter tmux attach to attach to the tmux session, and press Ctrl+B then D to detach again.

imatrix computation over RPC works, but without GPU acceleration it is just too slow, as it takes 1.5 hours per pass.

llama_kv_cache_init: RPC[192.168.200.139:7139] KV buffer size =   724.00 MiB
llama_kv_cache_init: RPC[192.168.200.138:7138] KV buffer size =   384.00 MiB
llama_kv_cache_init: RPC[192.168.200.137:7137] KV buffer size =   148.00 MiB
llama_new_context_with_model: KV self size  = 1260.00 MiB, K (f16):  630.00 MiB, V (f16):  630.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model: RPC[192.168.200.139:7139] compute buffer size =   273.00 MiB
llama_new_context_with_model: RPC[192.168.200.138:7138] compute buffer size =   273.00 MiB
llama_new_context_with_model: RPC[192.168.200.137:7137] compute buffer size =   305.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   273.01 MiB
llama_new_context_with_model: graph nodes  = 10086
llama_new_context_with_model: graph splits = 4

system_info: n_threads = 32 (n_threads_batch = 32) / 62 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 40.213 ms
compute_imatrix: computing over 125 chunks with batch_size 512
compute_imatrix: 5450.44 seconds per pass - ETA 189 hours 15.07 minutes
[1]17.6198,[2]14.6297,
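
(That ETA is simply 125 chunks × ~5450 s per pass ≈ 681,000 s ≈ 189 hours, consistent with the 1.5 hours per pass mentioned above.)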

I'm on my hike now, so I recommend using this opportunity to do BigLlama-3.1-1T-Instruct.IQ4_XS.gguf on RAM + GPUs, because without GPU acceleration RPC is unfeasible. I will research the possibility of GPU acceleration over RPC tomorrow, but it is probably not possible: when I tried it, it only worked if all offloaded layers could be stored in GPU memory.

It was an exciting attempt - I am starting a new discussion topic.
