Running on RTX 3090
I read that Nemo runs on an RTX 4090. I have a 3090, which I understand has the same amount of VRAM, but when I try the sample code I get an out-of-memory error.
What do I need to do to try out this model on a 3090?
Hi Alan, the code in the README runs at 16-bit precision, which needs around 28 GB of VRAM (and yours has 24 GB). However, this model was designed to run losslessly at 8-bit precision, meaning you can do inference at fp8 and it will fit in 16 GB of VRAM without any issue! I also invite you to look into quantization.
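For example, here is a rough sketch of one way to try fp8 with vLLM. This assumes vLLM's quantization="fp8" online quantization option; on an Ampere card like the 3090 it should fall back to weight-only fp8 rather than native fp8 compute, and the max_model_len cap is only there to keep the KV cache small. Treat it as a starting point, not a verified recipe:

from vllm import LLM, SamplingParams

# Quantize the weights to fp8 on the fly; cap the context length so the
# KV cache also fits in 24 GB of VRAM.
llm = LLM(
    model="mistralai/Mistral-Nemo-Base-2407",
    quantization="fp8",
    max_model_len=8192,
)

params = SamplingParams(max_tokens=20)
print(llm.generate(["Hello my name is"], params)[0].outputs[0].text)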
Also note that this repo is for the base model, meaning it's for raw text completion; for instruction following and chatting with the model, I recommend the Instruct version.
@pandora-s Is it possible to run this model in fp8 without doing quantization? In vLLM I tried setting the dtype and the KV cache dtype to fp8, but nothing worked.
For anyone else reading this, here is the code I used to run it on my local GPU:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Nemo-Base-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights with bitsandbytes 8-bit quantization so the 12B model
# fits comfortably in 24 GB of VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

# return_token_type_ids=False stops the tokenizer from returning
# token_type_ids, which generate() doesn't expect.
inputs = tokenizer("Hello my name is", return_tensors="pt", return_token_type_ids=False).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Took about 13GB of GPU RAM
I'm pretty sure that's 8-bit integer quantization, which is not the quantization the model was trained for: "8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16" (from https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization/bitsandbytes.md).
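If you specifically want fp8-format weights through transformers rather than LLM.int8(), one option might be the optimum-quanto integration. This is only a sketch (it assumes QuantoConfig's float8 weight option and requires pip install optimum-quanto), and on a 3090 the matmuls still run in higher precision, so it's weight-only fp8 storage rather than true fp8 inference:

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "mistralai/Mistral-Nemo-Base-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Store the weights as float8 via optimum-quanto; activations stay in
# bf16/fp32, so this is weight-only fp8 rather than native fp8 compute.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    quantization_config=QuantoConfig(weights="float8"),
    device_map="auto",
)

inputs = tokenizer("Hello my name is", return_tensors="pt", return_token_type_ids=False).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))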