RDson/Llama-3-5B-Experimental

This is just an experiment similar to the one done on chargoddard/llama3-42b-v0. After pruning, the model was fine-tuned, or "healed", with QLoRA (ORPO training, configuration below) on the code DPO dataset AlekseyKorshuk/evol-codealpaca-v1-dpo. Due to limitations, it was only trained for 3150 of 4935 steps (~64%). I had to restart the training about halfway through, so the logs are split in two. I am still unsure whether the tokenizer is correct.
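The "~64%" figure can be checked with a little arithmetic. The sketch below also estimates how many preference pairs the partial run covered, using `per_device_train_batch_size` and `gradient_accumulation_steps` from the ORPOConfig further down. It assumes training ran on a single GPU. With more devices, the pairs-per-step figure would be multiplied by the device count.

```python
# Training-progress arithmetic for the partial run.
# Assumption: single GPU, so one optimizer step = batch size x accumulation steps.
per_device_train_batch_size = 2   # from ORPOConfig
gradient_accumulation_steps = 4   # from ORPOConfig
completed_steps = 3150
total_steps = 4935

fraction_done = completed_steps / total_steps
pairs_per_step = per_device_train_batch_size * gradient_accumulation_steps

print(f"fraction trained: {fraction_done:.1%}")
print(f"preference pairs seen: {completed_steps * pairs_per_step}")
# Upper bound only: the last step of an epoch may use a partial batch.
print(f"pairs in a full epoch (at most): {total_steps * pairs_per_step}")
```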

Loss: ~1.2

mergekit.yaml

    slices:
      - sources:
          - model: ./Meta-Llama-3-8B-Instruct/
            layer_range: [0,15]
      - sources:
          - model: ./Meta-Llama-3-8B-Instruct/
            layer_range: [29,32]

    merge_method: passthrough
    dtype: bfloat16
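The "5B" in the name follows from which layers the passthrough merge keeps. The sketch below lists the surviving decoder layers and gives a rough parameter count. It assumes that mergekit's `layer_range` is half-open (`[start, end)`), which would mean layers 15 through 28 are dropped and 18 of 32 remain. It also assumes the published Meta-Llama-3-8B shapes: hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, a vocabulary of 128256, and untied embeddings. All helper names are illustrative.

```python
# Sketch: layers kept by the passthrough merge and an approximate parameter count.
# Assumptions: half-open layer_range, published Llama-3-8B dimensions.
HIDDEN = 4096
INTERMEDIATE = 14336
N_KV_HEADS, HEAD_DIM = 8, 128
VOCAB = 128256

slices = [(0, 15), (29, 32)]  # layer_range values from mergekit.yaml
kept_layers = [i for start, end in slices for i in range(start, end)]


def params_per_layer() -> int:
    kv_dim = N_KV_HEADS * HEAD_DIM                    # grouped-query attention width
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q/o and k/v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                   # gate, up and down projections
    norms = 2 * HIDDEN                                # two RMSNorm weight vectors
    return attn + mlp + norms


def total_params(n_layers: int) -> int:
    embeddings = 2 * VOCAB * HIDDEN  # input embedding plus untied LM head
    return n_layers * params_per_layer() + embeddings + HIDDEN  # + final norm


print(len(kept_layers), kept_layers)
print(f"original: {total_params(32) / 1e9:.2f}B, "
      f"pruned: {total_params(len(kept_layers)) / 1e9:.2f}B")
```

Under these assumptions, the full model comes out at about 8.03B parameters and the pruned one at about 4.98B. Most of the remaining mass in the smaller model is the roughly 1.05B parameters of embeddings and LM head, which pruning does not touch.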

ORPOConfig

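    from trl import ORPOConfig

    # out_dir_folder is the output directory path, defined elsewhere in the training script.
    orpo_args = ORPOConfig(
        learning_rate=5e-5,
        lr_scheduler_type="cosine",
        max_length=1024,
        max_prompt_length=512,
        overwrite_output_dir=False,
        beta=0.1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        optim="paged_adamw_8bit",
        num_train_epochs=1,
        evaluation_strategy="steps",
        eval_steps=0.02,  # a value below 1 is read as a fraction of total training steps
        logging_steps=1,
        warmup_steps=50,
        report_to="wandb",
        output_dir=out_dir_folder,
        fp16=True,
        save_steps=50,
    )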
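In TRL's ORPO implementation, `beta` plays the role of λ from the ORPO paper. It weights an odds-ratio penalty that is added to the ordinary supervised loss on the chosen answer. The following is a simplified per-pair sketch, not the trainer's code. It takes length-averaged log-probabilities of the chosen and rejected completions as plain floats. It also approximates the supervised term as the negative averaged log-probability of the chosen completion. The function names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def orpo_loss(chosen_logp: float, rejected_logp: float, beta: float = 0.1) -> float:
    """Simplified ORPO loss for one preference pair.

    chosen_logp / rejected_logp: length-averaged log-probabilities (< 0) of the
    chosen and rejected completions under the model being trained.
    """
    # log odds(y) = log p - log(1 - p), where p = exp(averaged log-prob)
    log_odds_chosen = chosen_logp - math.log1p(-math.exp(chosen_logp))
    log_odds_rejected = rejected_logp - math.log1p(-math.exp(rejected_logp))
    nll = -chosen_logp  # supervised (SFT) term on the chosen answer
    odds_ratio_term = -log_sigmoid(log_odds_chosen - log_odds_rejected)
    return nll + beta * odds_ratio_term


print(orpo_loss(-1.0, -1.0))  # equal log-probs: 1 + 0.1 * ln 2, about 1.0693
print(orpo_loss(-0.5, -2.0) < orpo_loss(-2.0, -0.5))  # True: preferring chosen lowers the loss
```

With `beta=0.1`, the odds-ratio penalty contributes at most a small fraction of the total whenever the model already ranks the chosen answer above the rejected one. Most of the logged loss then comes from the supervised term.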