Deploy this model using TGI
#4
by nielsr
Hi,
I'd like to deploy this model on 2 L4 GPUs, which should be possible given that this gives you 48 GB of VRAM (2 × 24 GB): the model has 35B parameters, and at 4 bits (half a byte) per parameter the weights take 35/2 = 17.5 GB.
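Spelled out (and note this only counts the weights, not the KV cache, activations, or CUDA overhead):

$$
35 \times 10^{9}\ \text{params} \times \frac{4\ \text{bits/param}}{8\ \text{bits/byte}} \approx 17.5\ \text{GB}
$$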
I'm following this guide, except that I'm deploying this model instead of Mistral-7B, and on 2 L4 GPUs. Here's my TGI configuration:
```yaml
env:
  - name: MODEL_ID
    value: CohereForAI/c4ai-command-r-v01-4bit
  - name: PORT
    value: "8080"
  - name: QUANTIZE
    value: bitsandbytes-nf4
volumeMounts:
  - mountPath: /dev/shm
    name: dshm
  - mountPath: /data
    name: data
```
This fails with:
"Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":33,"span":{"name":"warmup"},"spans":[{"max_batch_size":"None","max_input_length":1024,"max_prefill_tokens":4096,"max_total_tokens":2048,"name":"warmup"
Shouldn't this work given the 48 GB of VRAM? Ideally I'd like to use as large a context window as possible.
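For sizing the context window, my rough mental model (a generic decoder-only estimate, not something taken from this model's config) is that the KV cache per sequence grows linearly with context length:

$$
\text{KV-cache bytes} \approx 2 \cdot n_{\text{layers}} \cdot n_{\text{kv-heads}} \cdot d_{\text{head}} \cdot b \cdot n_{\text{tokens}}
$$

where $b$ is bytes per element (2 for fp16, which as far as I know still applies to the cache even with NF4-quantized weights) and the leading 2 counts keys and values. Whatever VRAM is left after the weights and overhead would bound $n_{\text{tokens}}$.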