metadata

base_model:
  - elinas/Llama-3-15B-Instruct-zeroed
library_name: transformers
tags:
  - mergekit
  - merge
  - finetune
datasets:
  - Chat-Error/Pure-dove-sharegpt
license: llama3

Llama-3-15B-Instruct-zeroed-ft-v2

This is a QLoRA finetune of a merge of pre-trained language models created using mergekit.

The model is based on the "zeroed" passthrough merge Llama-3-15B-Instruct-zeroed.

This was primarily an experiment to see how a passthrough merge responds to further finetuning of all LoRA modules.

The model was finetuned at a context length of 8192 tokens and is likely reliable with RoPE scaling up to 32k.
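As a rough illustration of the 32k figure: stretching an 8192-token training context to 32k with RoPE scaling corresponds to a scaling factor of 4. The sketch below only does that arithmetic and builds a `rope_scaling` dictionary in the `{"type", "factor"}` form that Transformers 4.40 accepts for Llama models. The choice of `"dynamic"` scaling and the helper name `rope_scaling_for` are assumptions; this card does not say which scaling method was used or tested.

```python
TRAINED_CONTEXT = 8192      # context length used for finetuning (from this card)
TARGET_CONTEXT = 32 * 1024  # "32k" read as 32768 tokens (assumption)


def rope_scaling_for(target_context: int, trained_context: int = TRAINED_CONTEXT) -> dict:
    """Return a rope_scaling dict that stretches trained_context to target_context."""
    factor = target_context / trained_context
    if factor <= 1.0:
        raise ValueError("target context must exceed the trained context")
    # "dynamic" is an assumed choice; "linear" is the other type accepted here.
    return {"type": "dynamic", "factor": factor}


rope_scaling = rope_scaling_for(TARGET_CONTEXT)
print(rope_scaling)  # {'type': 'dynamic', 'factor': 4.0}
```

The resulting dictionary could then be supplied as a config override when loading the model, for example `AutoModelForCausalLM.from_pretrained(model_id, rope_scaling=rope_scaling)`, where `model_id` stands for whichever checkpoint you are loading.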

Further finetuning this model or finetuning the base model on more samples is encouraged.

I plan to do this myself in the third iteration of this model. That finetune is on hold until I receive sufficient feedback on how this model compares with the 8B model.

Datasets

Chat-Error/Pure-dove-sharegpt

A small, high-quality, curated dataset was used as a proof of concept to validate that finetuning stabilizes the model after the original passthrough merge.

Finetuning details

This is a QLoRA model, and all of the LoRA modules were targeted this time to ensure sufficient training before moving on to larger datasets. The first version of this model targeted only o_proj and up_proj.

lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
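To make the difference from the first version concrete, the sketch below lists the modules that are newly targeted in this version and collects the settings under PEFT's `LoraConfig` parameter names (`target_modules`, `modules_to_save`). The variable names are illustrative, and the LoRA rank and alpha are deliberately left out because this card does not state them.

```python
# Modules targeted in this version (copied from the config above).
V2_TARGET_MODULES = [
    "gate_proj", "down_proj", "up_proj",
    "q_proj", "v_proj", "k_proj", "o_proj",
]
# Modules targeted in the first version, as stated in this card.
V1_TARGET_MODULES = ["o_proj", "up_proj"]
# Full (non-LoRA) modules that are trained and saved alongside the adapter.
MODULES_TO_SAVE = ["embed_tokens", "lm_head"]

# Which projections receive LoRA adapters now that did not in v1?
newly_targeted = [m for m in V2_TARGET_MODULES if m not in V1_TARGET_MODULES]
print(newly_targeted)
# ['gate_proj', 'down_proj', 'q_proj', 'v_proj', 'k_proj']

# Keyword arguments in the shape peft.LoraConfig expects.
# Rank (r) and lora_alpha are NOT given in this card, so they are omitted.
lora_kwargs = {
    "target_modules": V2_TARGET_MODULES,
    "modules_to_save": MODULES_TO_SAVE,
    "task_type": "CAUSAL_LM",
}
```

With PEFT installed, these could be passed as `LoraConfig(r=..., lora_alpha=..., **lora_kwargs)` with a rank and alpha of your choosing.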

The model remains coherent even when the "zeroed" layers are trained along with the additional layers, as recommended by Charles Goddard (mergekit developer). Thank you to Charles for sharing the merging method, and to Toasty Pigeon for bringing it to my attention!

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 3
- total_eval_batch_size: 3
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
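For a sense of what these scheduler settings imply, here is a minimal sketch of a linear-warmup, cosine-decay schedule using the values above and the 1157 optimizer steps reported in the W&B summary. It assumes the common formulation (the one used by `get_cosine_schedule_with_warmup` in Transformers); the function name `cosine_lr` is illustrative and the exact scheduler code used for this run is not shown in this card.

```python
import math

PEAK_LR = 1e-5      # learning_rate
WARMUP_STEPS = 25   # lr_scheduler_warmup_steps
TOTAL_STEPS = 1157  # train/global_step from the W&B summary


def cosine_lr(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given optimizer step (linear warmup, then cosine decay)."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 25, 591, 1157):
    print(step, f"{cosine_lr(step):.2e}")
# 0 0.00e+00
# 25 1.00e-05
# 591 5.00e-06
# 1157 0.00e+00
```

The final value of zero is consistent with the `train/learning_rate 0.0` entry logged at the end of training.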

The paged_adamw_8bit optimizer and DeepSpeed ZeRO 3 were used at a learning rate of 1e-5 with the cosine scheduler for 1 epoch on 3x 3090s, taking about 4.5 hours total (train_runtime of 16425 s).

Unsloth was used for speed and memory savings.

Sample packing and padding were disabled to reduce VRAM consumption significantly at the cost of speed.

W&B Run Summary

wandb:                eval/loss 0.90895
wandb:             eval/runtime 463.4688
wandb:  eval/samples_per_second 0.833
wandb:    eval/steps_per_second 0.278
wandb:               total_flos 8270790524928.0
wandb:              train/epoch 1.0
wandb:        train/global_step 1157
wandb:          train/grad_norm 7.3847
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.8702
wandb:               train_loss 0.87814
wandb:            train_runtime 16425.2713
wandb: train_samples_per_second 0.211
wandb:   train_steps_per_second 0.07
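The throughput figures in this summary can be cross-checked against the step count and the total train batch size of 3 listed in the hyperparameters. The short sketch below redoes that arithmetic; it assumes no gradient accumulation and treats every step as a full batch, so the sample count is an upper bound.

```python
TRAIN_RUNTIME_S = 16425.2713  # train_runtime
GLOBAL_STEPS = 1157           # train/global_step
TOTAL_TRAIN_BATCH_SIZE = 3    # 1 sample per device x 3 devices

hours = TRAIN_RUNTIME_S / 3600
steps_per_second = GLOBAL_STEPS / TRAIN_RUNTIME_S
samples_seen = GLOBAL_STEPS * TOTAL_TRAIN_BATCH_SIZE  # upper bound
samples_per_second = samples_seen / TRAIN_RUNTIME_S

print(f"{hours:.2f} h")               # 4.56 h
print(f"{steps_per_second:.2f}")      # 0.07  (train_steps_per_second)
print(samples_seen)                   # 3471
print(f"{samples_per_second:.3f}")    # 0.211 (train_samples_per_second)
```

Both derived rates match the logged `train_steps_per_second` and `train_samples_per_second` values after rounding.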

Framework versions

PEFT 0.10.0
Transformers 4.40.2
Pytorch 2.3.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1

Model Evaluation

TBD

If you have any questions or comments on the model, feel free to open a discussion in the community tab.