How to chat with the model via API?
How do I chat with the model via the Python API?
You can't, not directly. These are delta weights, and you need to apply them to the actual LLaMA weights to get the OpenAssistant weights.
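As a rough illustration, merging delta weights usually looks something like the sketch below. This is only a sketch, assuming the delta checkpoint is a plain additive difference with the same parameter names as the base model; the repo paths are placeholders, so check the model card for the exact procedure.

```python
# Hypothetical sketch of merging additive delta weights into base LLaMA
# weights. Paths are placeholders; the actual delta format may differ.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-base", torch_dtype=torch.float16
)
delta = AutoModelForCausalLM.from_pretrained(
    "path/to/openassistant-delta", torch_dtype=torch.float16
)

delta_sd = delta.state_dict()
with torch.no_grad():
    for name, param in base.state_dict().items():
        # Add each delta tensor to the corresponding base tensor in place.
        param += delta_sd[name]

# Save the merged checkpoint for normal use with transformers.
base.save_pretrained("path/to/openassistant-merged")
```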
So once we apply them to the actual LLaMA weights, can we run the result locally to get a chat engine similar to the web UI? Or is that a completely different kind of engineering work?
I think you can run it with transformers' AutoModel, since transformers supports LLaMA inference.
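Something like this should work once the deltas are merged. A minimal sketch, assuming a merged checkpoint at a placeholder path; the `<|prompter|>`/`<|assistant|>` prompt tokens follow the usual OpenAssistant format, but verify against the model card. `device_map="auto"` needs the accelerate package installed.

```python
# Rough sketch of chatting with the merged model via transformers.
# "path/to/openassistant-merged" is a placeholder for your merged checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/openassistant-merged")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/openassistant-merged",
    device_map="auto",  # requires accelerate; spreads layers across GPUs
)

# OpenAssistant-style prompt format (check the model card to confirm).
prompt = "<|prompter|>What are delta weights?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```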
Try text-generation-webui. It can load this model quite easily if you have about 70GB of GPU memory. It can also load the weights with 8-bit quantization, which roughly halves the memory requirement.
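If you'd rather do the 8-bit loading directly in transformers instead of through the web UI, it looks roughly like this. A sketch only: it assumes the bitsandbytes package is installed, and the model path is a placeholder.

```python
# Sketch of loading the merged model in 8-bit to roughly halve VRAM use.
# Requires the bitsandbytes package; the model path is a placeholder.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "path/to/openassistant-merged",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # spreads layers across the available GPUs
)
```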
70GB of VRAM? What rig do you have?
My workstation has 3 GPUs with 24 GB of memory each.