Running an inference server using Docker + vLLM
opened by YorelNation
Hi,
Would it be possible to deploy Vigostral the same way we can deploy Mistral via their recommended method: https://docs.mistral.ai/quickstart ?
Can I simply run:
docker run --gpus all \
-e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
ghcr.io/mistralai/mistral-src/vllm:latest \
--host 0.0.0.0 \
--model bofenghuang/vigostral-7b-chat
I don't have the hardware to try this yet, which is why I'm asking :)
Thanks
Hi,
Thanks for your message. I will look into it :)
Hi @YorelNation ,
The Mistral AI version has not yet been updated to support the prompt format of the Vigostral model.
However, I have managed to create another Docker image that also leverages vLLM for inference. You can use it as follows:
# Launch inference engine
docker run --gpus '"device=0"' \
-e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
ghcr.io/bofenghuang/vigogne/vllm:latest \
--host 0.0.0.0 \
--model bofenghuang/vigostral-7b-chat
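# (Optional, a sketch) Query the server once it's up, assuming the image exposes
# vLLM's OpenAI-compatible API on port 8000 like the Mistral image does; the
# /v1/chat/completions endpoint and payload fields are vLLM defaults, not taken
# from this image's docs, so adjust if needed
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "bofenghuang/vigostral-7b-chat",
  "messages": [{"role": "user", "content": "Parle-moi de la cuisine lyonnaise."}],
  "temperature": 0.7,
  "max_tokens": 256
}'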
# Launch inference engine on multiple GPUs (4 here)
docker run --gpus all \
-e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
ghcr.io/bofenghuang/vigogne/vllm:latest \
--host 0.0.0.0 \
--tensor-parallel-size 4 \
--model bofenghuang/vigostral-7b-chat
# Launch inference engine using the AWQ-quantized version
# Note: AWQ only supports Ampere or newer GPUs
docker run --gpus '"device=0"' \
-e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
ghcr.io/bofenghuang/vigogne/vllm:latest \
--host 0.0.0.0 \
--quantization awq \
--model TheBloke/Vigostral-7B-Chat-AWQ
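# (Optional, a sketch) Check GPU name and compute capability before using AWQ
# (it needs >= 8.0, i.e. Ampere or newer); the compute_cap query field requires
# a fairly recent NVIDIA driver
nvidia-smi --query-gpu=name,compute_cap --format=csv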
# Launch inference engine using the downloaded weights
docker run --gpus '"device=0"' \
-p 8000:8000 \
-v /path/to/model/:/mnt/model/ \
ghcr.io/bofenghuang/vigogne/vllm:latest \
--host 0.0.0.0 \
--model="/mnt/model/"
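For the last variant, the weights need to be on disk before you mount them. One way to fetch them is with huggingface-cli; this is just a sketch (it requires a recent huggingface_hub, and /path/to/model/ is the same placeholder as above):
# Download the weights locally before mounting them into the container
pip install -U "huggingface_hub[cli]"
huggingface-cli download bofenghuang/vigostral-7b-chat --local-dir /path/to/model/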
Thanks! Will try this.