llama-2-7b-chat-marlin
An example of converting a GPTQ model to Marlin format for fast batched decoding with the Marlin kernels.
Install Marlin
pip install torch
git clone https://github.com/IST-DASLab/marlin.git
cd marlin
pip install -e .
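Marlin's kernels target NVIDIA GPUs with compute capability 8.0 or newer (Ampere and later). A quick post-install sanity check, sketched below on the assumption that the package installs as an importable marlin module:
# Sanity check: the import fails if the extension did not build,
# and Marlin needs an Ampere-or-newer GPU (compute capability >= 8.0).
import torch
import marlin  # noqa: F401

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 0), "Marlin requires compute capability >= 8.0"
print(f"Marlin installed; GPU compute capability {major}.{minor}")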
Convert Model
Convert the model from GPTQ to Marlin format. Note that this requires:
sym=true
group_size=128
desc_activations=false
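Before converting, you can check the source checkpoint's quantization settings. The sketch below assumes the GPTQ repo ships a standard quantize_config.json (as TheBloke's repos do) and that the activation-reordering flag is stored under the usual desc_act key:
# Sketch: verify the GPTQ config is Marlin-compatible before converting.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("TheBloke/Llama-2-7B-Chat-GPTQ", "quantize_config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

assert cfg.get("sym") is True, "Marlin needs symmetric quantization"
assert cfg.get("group_size") == 128, "Marlin needs group_size=128"
assert cfg.get("desc_act") is False, "Marlin needs activation reordering disabled"
print("GPTQ config looks Marlin-compatible:", cfg)
Then install the conversion dependencies: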
pip install -U transformers accelerate auto-gptq optimum
Convert with the convert.py script in this repo:
python3 convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-model" --do-generation
Run Model
Load with the load.load_model utility from this repo and run inference as usual.
from load import load_model
from transformers import AutoTokenizer
# Load model from disk.
model_path = "./marlin-model"
model = load_model(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Generate text.
inputs = tokenizer("My favorite song is", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.batch_decode(outputs)[0])
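Marlin is built for fast batched decoding, so the speedup shows up when several prompts are generated in one call. Continuing from the snippet above (a sketch; Llama tokenizers ship without a pad token, so EOS is reused for left padding):
# Batched generation with the same model and tokenizer as above.
prompts = [
    "My favorite song is",
    "The capital of France is",
    "In one sentence, explain quantization:",
]
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation continues from each prompt

batch = tokenizer(prompts, return_tensors="pt", padding=True)
batch = {k: v.to("cuda") for k, v in batch.items()}

outputs = model.generate(**batch, max_new_tokens=50, do_sample=False)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)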