Actual max RAM usage?
There's no way that a filled 200K context maxes out at only a few gigs more than the model itself, is there? How much RAM/VRAM would it actually take to fill the whole context window?
llama.cpp uses a 16-bit KV cache by default. So for -c 200000 with the 34B you'll need 46875 MiB (~45.8 GiB) for context in addition to whatever memory is needed to load the model. For the 6B 200K, context requires 12500 MiB (~12.2 GiB).
@finilok Yeah you're right, I haven't ever set that table up to update for the max context length of extended context models. It assumes 4096 context. I have been meaning to look at that for a while.
@KerfuffleV2 awesome, thanks for the details! I thought I was going to have to do some tedious testing of models at different context lengths, but if there's a formula then I can set up some code to calculate it automatically when making the README.
Can you tell me how you got those figures and whether there's a metadata field I could read from the gguf to do that automatically? Which I hope will be straightforward now that we have your great get/set metadata script!
The context length I can already read from config.json, but I don't know which model hyperparameters I then need to use to get the figures you've shown.
@TheBloke I just loaded the model with -c set and looked at the output. For example:

6B with -c 200000:
llama_new_context_with_model: kv self size = 12500.00 MB
34B with -c 50000:
llama_new_context_with_model: kv self size = 11718.75 MB
34B with -c 10000:
llama_new_context_with_model: kv self size = 2343.75 MB
2343.75 * 5 is 11718.75 (same as -c 50000), so -c 200000 must be 11718.75 * 4 = 46875.0.
As for calculating it automatically, it's kind of a pain. I dug into the KV cache init code and came up with this (interactive Python interpreter result):
>>> n_layer=60 # llama.block_count
>>> n_head=56 # llama.attention.head_count
>>> n_head_kv=8 # llama.attention.head_count_kv
>>> n_embd=7168 # llama.embedding_length
>>> n_gqa=n_head/n_head_kv
>>> n_embd_gqa=n_embd/n_gqa
>>> n_ctx=200000
>>> n_elements=n_embd_gqa*(n_layer*n_ctx)
>>> 2 * n_elements * 2 # in bytes
49152000000.0
>>> (2 * n_elements * 2) / (1024 * 1024) # In MiB
46875.0
Also, whoops - it seems like the original values already were MiB even though the output had "MB".
The reason it's 2 * 2 * n_elements at the end is that there are both k and v parts of the KV cache, each with n_elements elements, and each element is 16-bit (2 bytes).
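Putting that together, here's a minimal sketch of the same calculation as a reusable helper (the function name and the GiB/MiB conversion are mine; the metadata field names are the ones noted in the comments above):

```python
def kv_cache_size_bytes(n_layer, n_head, n_head_kv, n_embd, n_ctx, bytes_per_element=2):
    """KV cache size in bytes for llama.cpp's default f16 cache.

    Hyperparameters come from GGUF metadata:
      n_layer    -> llama.block_count
      n_head     -> llama.attention.head_count
      n_head_kv  -> llama.attention.head_count_kv
      n_embd     -> llama.embedding_length
    """
    n_gqa = n_head / n_head_kv      # grouped-query attention factor
    n_embd_gqa = n_embd / n_gqa     # per-token K (or V) width in elements
    n_elements = n_embd_gqa * n_layer * n_ctx
    return 2 * n_elements * bytes_per_element  # K and V halves

# Yi-34B 200K numbers from this thread: should print 46875.0 (MiB)
print(kv_cache_size_bytes(60, 56, 8, 7168, 200000) / (1024 * 1024))
```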
BTW, not sure if you noticed, but gguf-dump.py supports JSON output, so extracting the metadata to JSON format should be pretty easy.
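Something like this could feed the dump into the kv_cache_size_bytes sketch above. This is only a sketch: it assumes gguf-dump.py is on PATH, that its --json output contains a "metadata" mapping, and "model.gguf" is a placeholder path, so the key lookups may need adjusting to the real dump layout:

```python
import json
import subprocess

# Run llama.cpp's gguf-dump.py in JSON mode and parse its output.
dump = json.loads(subprocess.run(
    ["gguf-dump.py", "--json", "model.gguf"],
    capture_output=True, text=True, check=True).stdout)

md = dump["metadata"]

def get(key):
    v = md[key]
    return v["value"] if isinstance(v, dict) else v  # tolerate either layout

size_bytes = kv_cache_size_bytes(
    n_layer=get("llama.block_count"),
    n_head=get("llama.attention.head_count"),
    n_head_kv=get("llama.attention.head_count_kv"),
    n_embd=get("llama.embedding_length"),
    n_ctx=get("llama.context_length"),  # or read it from config.json as you mentioned
)
print(f"{size_bytes / (1024 * 1024):.2f} MiB")
```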
Seems the context-length memory cost is going to be an issue in the near future. Just wondering if there is any effort toward reducing each coefficient in the formula
2 * n_elements * 2 # in bytes
e.g. using FP4/FP8 to compress the KV cache, or exploiting the sparsity of the KV cache. Just throwing out some ideas off the top of my head.
@Yhyu13 There's this pull to allow quantizing the KV cache to Q8_0 (8bit): https://github.com/ggerganov/llama.cpp/pull/2969
However, it was a big, complicated pull and touched a lot of other stuff. It hasn't been updated in some time, so I think the author may have given up on it. 4bit would probably have a pretty noticeable effect on quality.
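For a rough sense of what cache quantization could save, here's a back-of-the-envelope estimate using the same formula as above, with per-element sizes taken from GGML's Q8_0 and Q4_0 block layouts (34 and 18 bytes per block of 32 elements); treat it as an estimate, not a measurement:

```python
# Approximate KV cache size for the 34B at -c 200000 under different element types.
# Assumption: Q8_0 stores 34 bytes per 32 elements, Q4_0 stores 18 per 32; any
# extra padding/overhead in a real implementation is ignored here.
n_layer, n_head, n_head_kv, n_embd, n_ctx = 60, 56, 8, 7168, 200000
n_elements = (n_embd / (n_head / n_head_kv)) * n_layer * n_ctx  # per K or V

for name, bytes_per_element in [("f16", 2.0), ("q8_0", 34 / 32), ("q4_0", 18 / 32)]:
    print(f"{name}: {2 * n_elements * bytes_per_element / (1024 * 1024):.0f} MiB")
```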
Using KCPP set for a 32k context limit with the Q5_M quant, context takes about 15GB. Setting KCPP for 64k context (the max it will allow), context alone completely fills a 24GB P40. It doesn't go OOM, but it fails to fully load the model with other CUDA errors.
32k context is the most I can successfully do with a pair of P40s using KCPP. Though it nominally works, the output is pure gibberish: a string of random words in a mix of English and Chinese. Completely unusable. I don't know if this is an issue with the model itself, the quantization method, some incompatibility in KCPP, or a combination of factors. In any event, it's sadly unusable as-is.
For reference, it took a solid 15 minutes to generate a ~500 token response to a 32k prompt on a pair of P40s. Possibly still worthwhile in some situations, if the output was usable.
@KerfuffleV2 LMDeploy has actually successfully deployed a KV int8 PTQ method here: https://github.com/yhyu13/lmdeploy/blob/main/docs/en/kv_int8.md
BTW, I found two papers from Google on using multi-query attention instead of multi-head attention to reduce the memory requirement for the attention KV cache:
https://arxiv.org/pdf/2211.05102.pdf
https://ar5iv.labs.arxiv.org/html/2305.13245
Not sure if any of the open-source model serving platforms have adopted those.
Moreover, FlashAttention is a method to drop multi-head attention memory scaling from quadratic to linear (https://huggingface.co/blog/optimize-llm), but forgive my ignorance, I haven't found any indication that FlashAttention would work with attention weight quantization.
> LMDeploy has actually successfully deployed a KV int8 PTQ method here
The problem on the llama.cpp side wasn't the lack of an existing 8-bit quantization to use, or quality, or anything like that; the issue was making it fit in with the rest of the GGML/llama.cpp code. It also can't really borrow directly from projects like the one you linked: most of the other stuff is written in Python, while llama.cpp is... written in C++/C.
> I found two papers from Google on using multi-query attention instead of multi-head attention to reduce the memory requirement for the attention KV cache
llama.cpp already uses that as far as I know, at least for 70B LLaMAv2 models.
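To illustrate why that matters for memory, the head_count_kv term in the formula earlier in the thread is exactly what multi-query/grouped-query attention shrinks. A quick comparison for a model with roughly the 70B LLaMA-2 shape (80 layers, 8192 embedding width, 64 heads, an assumption for illustration) at 4k context:

```python
# 16-bit KV cache size, varying only the number of KV heads.
n_layer, n_embd, n_head, n_ctx = 80, 8192, 64, 4096

for n_head_kv in (64, 8, 1):  # MHA, GQA (8 KV heads), MQA
    n_embd_gqa = n_embd / (n_head / n_head_kv)
    size_mib = 2 * (n_embd_gqa * n_layer * n_ctx) * 2 / (1024 * 1024)
    print(f"{n_head_kv:2d} KV heads: {size_mib:.0f} MiB")
```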
> I haven't found any indication that FlashAttention would work with attention weight quantization
I think it's an approach to dealing with attention and doesn't really care about the exact format of the KV cache. So I think it could work whether the KV cache was f32, f16, or a quantized format.
https://github.com/ggerganov/llama.cpp/pull/4309 added KV cache quantization.