How to get up to 4096 context length?

#3
by Harm - opened

The Mistral-7B-Instruct-v0.1-GGUF card mentions "The model will work at sequence lengths of 4096, or lower."

But when I load the model it only seems to support a maximum context length of 512:
model._llm.context_length --> 512

When I run a larger prompt I get:
WARNING:ctransformers:Number of tokens (850) exceeded maximum context length (512).

How can I utilize the longer context length for the Mistral-7B-Instruct-v0.1-GGUF model?

You have to set it manually. By default it's set to 512 for all models. Also, the model should support a context length of around 8k (slightly lower).


Ok, do you have any suggestions or pointers on how to do so?


You can use

pip install llama-cpp-python
wget https://huggingface.co/TheBloke/WizardLM-13B-V1.2-GGUF/resolve/main/wizardlm-13b-v1.2.Q5_K_M.gguf

And after this for example:

from llama_cpp import Llama

# n_ctx sets the context window; n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(model_path="wizardlm-13b-v1.2.Q5_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

prompt = "Your prompt here"
print(llm(prompt, max_tokens=1024, temperature=0))

Just change the model name and path.
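
As a quick sanity check (a minimal sketch, assuming a recent version of the llama-cpp-python bindings where the Llama object exposes n_ctx()), you can confirm the context window the model was actually loaded with:

# Should print 4096, the value passed via n_ctx above
print(llm.n_ctx())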

Thanks everyone for the suggestions. I was just pointed to the context_length parameter of ctransformers. The context length is increased to 4096 by:
from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,        # number of layers to offload to the GPU
    hf=True,              # return a transformers-compatible model
    context_length=4096)  # raise the context window from the 512 default
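
As a quick check (using the same private _llm attribute from the original question, so this may change between ctransformers versions), the loaded context length can be inspected:

# Should now report 4096 instead of 512
print(model._llm.context_length)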

Harm changed discussion status to closed
