Quantization made by Richard Erkhov.
Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 - GGUF
- Model creator: https://huggingface.co/Na0s/
- Original model: https://huggingface.co/Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0/
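The files in this repo are GGUF quantizations of the original model and can be run with any llama.cpp-compatible runtime. Below is a minimal sketch using `huggingface_hub` and `llama-cpp-python`; the repo id and `.gguf` filename are assumptions for illustration, so check this repo's file list for the actual quant names.

```python
# Minimal sketch: download one quantized file and run it locally.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="RichardErkhov/Na0s_-_Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0-gguf",  # assumed repo id
    filename="Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0.Q4_K_M.gguf",               # assumed filename
)

llm = Llama(model_path=gguf_path, n_ctx=4096)
out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])
```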
Original model description:
```yaml
library_name: transformers
datasets:
- teknium/openhermes
pipeline_tag: text-generation
license: apache-2.0
base_model: Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0
```
Model Card for Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0:
Model Details:
Model Description:
- Finetuned from model: Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0, on teknium/openhermes.
- We pruned the 4 layers of meta-llama/Meta-Llama-3.1-8B that had the least impact on model performance, following the paper "The Unreasonable Ineffectiveness of the Deeper Layers" (a minimal layer-drop sketch follows this list).
- The pruned model therefore has 1.09B fewer parameters than the foundation model, which means a smaller memory footprint, faster training, and lower latency at inference time.
- We then recovered the performance lost to pruning by fine-tuning (from 0.2642 MMLU-Pro 0-shot to 0.3120); this step is called healing the pruned model.
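For illustration, here is a minimal sketch of the layer-drop step with `transformers`. The layer indices below are hypothetical; the 4 layers actually removed were chosen by measuring per-layer impact, as in the cited paper.

```python
# Minimal sketch: drop 4 decoder layers from Llama-3.1-8B.
# The indices are hypothetical; the real ones were picked by impact analysis.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16
)

layers_to_drop = {24, 25, 26, 27}  # hypothetical indices
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in layers_to_drop]
)
model.config.num_hidden_layers = len(model.model.layers)

print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```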
Upcoming Work:
- More healing through SFT/DPO/TPO, to see if we can get closer to the performance of meta-llama/Meta-Llama-3.1-8B (MMLU-Pro 0-shot of 0.3659). (In progress)
- Evaluate on benchmarks other than MMLU-Pro 0-shot (unfortunately, lighteval is currently broken: issue #191, issue #213).
- Apply the exact same process to meta-llama/Llama-3.1-70B and compare.
Training Details:
```python
from unsloth import FastLanguageModel

# Attach LoRA adapters to the pruned base model loaded above.
model = FastLanguageModel.get_peft_model(
    model,
    r = 4,  # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 4,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-efficient checkpointing
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
```
```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# `tokenizer`, `dataset` (teknium/openhermes) and `max_seq_length` are
# defined earlier, when the model is loaded and the data is preprocessed.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "completion",  # text column of the preprocessed dataset
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 10,
        gradient_accumulation_steps = 4,  # effective batch size of 40 per device
        warmup_steps = 5,
        max_steps = 5000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs_4",
        push_to_hub = True,
        hub_always_push = True,
    ),
)
```
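Training is then launched with `trainer.train()`; with `push_to_hub=True` and `hub_always_push=True`, checkpoints are also pushed to the Hub as they are saved.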
Training Data:
- teknium/openhermes.
Memory and Latency Gains (Using Optimum-Benchmark):
Load Mode Memory Metrics:

| Model | Max Global VRAM (MB) | Max Process VRAM (MB) | Max Reserved VRAM (MB) | Max Allocated VRAM (MB) |
|---|---|---|---|---|
| Llama-3.1-8B | 18521.98 | 16630.42 | 16196.30 | 16060.54 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 16319.97 | 14428.41 | 13994.30 | 13879.42 |
Inference Mode Latency Metrics:

| Model | Latency Mean (s) | Throughput (tokens/s) |
|---|---|---|
| Llama-3.1-8B | 0.8104 | 38.2536 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.5530 | 56.0570 |
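Numbers like these can be collected with Optimum-Benchmark's Python API. The sketch below follows the library's README; the exact launcher/backend settings behind the tables above are not recorded in this card, so treat the configuration as illustrative.

```python
# Minimal sketch of a latency/memory benchmark with optimum-benchmark.
# Configuration values are illustrative, not the ones used for the tables above.
from optimum_benchmark import (
    Benchmark, BenchmarkConfig, InferenceConfig, ProcessConfig, PyTorchConfig,
)

config = BenchmarkConfig(
    name="pruned_llama_inference",
    launcher=ProcessConfig(),                             # run in an isolated process
    scenario=InferenceConfig(latency=True, memory=True),  # track both metric families
    backend=PyTorchConfig(
        model="Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0",
        device="cuda",
        device_ids="0",
    ),
)

report = Benchmark.launch(config)
report.log()  # print latency/throughput/memory metrics
```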
Evaluation:
- (Foundation model) MMLU-Pro 0-shot of meta-llama/Meta-Llama-3.1-8B: 0.3659
- (Pruned model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers: 0.2642
- (Healed model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0: 0.3120
Evaluation Data and Process:
- TIGER-AI-Lab/MMLU-Pro (a loading sketch follows this list).
- Results on more classic benchmarks, such as ARC, are incoming.
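The scores above come from the MMLU-Pro test split. Here is a minimal sketch of pulling that split with `datasets` (field names per the TIGER-AI-Lab/MMLU-Pro dataset card):

```python
# Minimal sketch: load the MMLU-Pro test split used for the 0-shot scores above.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-AI-Lab/MMLU-Pro", split="test")

sample = mmlu_pro[0]
print(sample["question"])
print(sample["options"])  # multiple-choice options; "answer" holds the gold letter
```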
Environmental Impact:
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).