JAIS model response takes a lot of time
Hi, I'm trying to run the JAIS model on Colab with 52 GB of RAM. I've tested it on several GPUs (A100, V100, and T4) as well as the TPU, but generation takes an inordinately long time with the code snippet below. Could you help me with this issue, please?
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inception-mbzuai/jais-13b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", trust_remote_code=True, offload_folder="offload"
)

def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move it to the same device as the model
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=200 - input_len,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the generated ids back into text
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response

# Arabic prompt: "The capital of the United Arab Emirates is..."
text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))
We have tested it with 60 GB of RAM and it works fine. You may try loading it in lower precision, as mentioned here.
@samta-kamboj Where can I change it to lower precision? Is there a parameter for that?
You can use the "torch_dtype" argument when loading the model:
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
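For context: in float32 a 13B-parameter model needs roughly 52 GB for the weights alone, so on a 16-40 GB GPU most layers end up offloaded to CPU RAM (or the offload folder on disk) and every generation step becomes very slow; in float16 the weights take about 26 GB and are much more likely to stay on the GPU. Below is a minimal sketch of the full load-and-generate path in half precision, reusing the model id and prompt from the snippet above; the max_new_tokens value is just an illustrative cap, not something from the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inception-mbzuai/jais-13b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# float16 roughly halves the weight memory vs. float32,
# which helps keep all layers on the GPU instead of offloading them
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

prompt = "عاصمة دولة الإمارات العربية المتحدة ه"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# max_new_tokens bounds only the generated continuation, independent of prompt length
output_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.3,
    max_new_tokens=30,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```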
I think you can load it with oobabooga even if you only have an 8 GB GPU; make sure to select the load-in-4bit option so it uses less VRAM. I have a 24 GB GPU and the model runs fine for me, but I have a problem with the token limit.
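If you prefer to stay with plain transformers instead of oobabooga, a rough equivalent of its load-in-4bit switch is bitsandbytes quantization. This is only a sketch and assumes the bitsandbytes package and a reasonably recent transformers release are installed; the generation settings are copied from the earlier snippet:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_path = "inception-mbzuai/jais-13b"

# 4-bit NF4 quantization with float16 compute, analogous to the load-in-4bit option
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

prompt = "عاصمة دولة الإمارات العربية المتحدة ه"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.3,
    max_new_tokens=30,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```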