
Llama 2 7B quantized to 2-bit with GPTQ. The model was quantized with the following code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

w = 2  # quantization bit width
model_path = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# Calibrate on samples from the C4 dataset and quantize the weights to 2-bit
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen=4096)
quantized_model = quantizer.quantize_model(model, tokenizer)
```
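
To run the quantized model, it can be saved to disk and reloaded with the GPTQ kernels. A minimal sketch, assuming `optimum` and `auto-gptq` are installed and continuing from the code above; the save directory name is an example, not part of the original card:

```python
# Save the quantized model and tokenizer (directory name is illustrative)
save_dir = "Llama-2-7b-hf-gptq-2bit"
quantizer.save(quantized_model, save_dir)
tokenizer.save_pretrained(save_dir)

# Reload for inference; device_map="auto" places layers on available GPUs
model = AutoModelForCausalLM.from_pretrained(save_dir, device_map="auto")
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```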