elinas committed on
Commit 77156bd
1 Parent(s): 23f12dd

Update README.md

Files changed (1)
  1. README.md +97 -3
README.md CHANGED
---
base_model:
- elinas/Llama-3-13B-Instruct
library_name: transformers
tags:
- mergekit
- merge
license: llama3
---
# Meta-Llama-3-13B-Instruct

This is a QLoRA **finetune** of a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).

The model is based on my passthrough merge of [Llama-3-13B-Instruct](https://huggingface.co/elinas/Llama-3-13B-Instruct).
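
For context, a passthrough merge in mergekit stacks layer ranges from the source model(s) without averaging any weights. The sketch below is only illustrative: the layer ranges, file names, and output path are assumptions, not the actual recipe behind Llama-3-13B-Instruct.

```python
# Illustrative passthrough merge config for mergekit (NOT the exact recipe used
# for elinas/Llama-3-13B-Instruct; the layer ranges below are assumptions).
from pathlib import Path

config = """\
slices:
  - sources:
      - model: meta-llama/Meta-Llama-3-8B-Instruct
        layer_range: [0, 24]
  - sources:
      - model: meta-llama/Meta-Llama-3-8B-Instruct
        layer_range: [8, 32]
merge_method: passthrough
dtype: bfloat16
"""

Path("passthrough.yml").write_text(config)
# Then run the mergekit CLI on it, e.g.:
#   mergekit-yaml passthrough.yml ./merged-13b
```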

This was primarily an experiment to see how a passthrough merge responds to further finetuning, though it was done on a small dataset.

The model was finetuned at a context length of 8192 and is likely reliable with RoPE scaling up to 32k.
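
A minimal sketch of how RoPE scaling might be applied at inference time with transformers is shown below; the repo id, scaling type, and generation settings are illustrative assumptions rather than a validated configuration.

```python
# Minimal sketch (assumptions noted inline): load the model with dynamic RoPE
# scaling for roughly 4 x 8192 ≈ 32k tokens of context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the base merge repo id; substitute this finetune's repo id.
model_id = "elinas/Llama-3-13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    rope_scaling={"type": "dynamic", "factor": 4.0},  # 8192 * 4 ≈ 32k
)

messages = [{"role": "user", "content": "Write the opening scene of a short story set on a generation ship."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```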

It still cannot do math reliably; neither can Llama-3-8B, and in my tests only Llama-3-70B can. It is, however, a better storywriter/RP model than Llama-3-8B based on some side-by-side testing.

## Dataset

* Dataset used
  * [Chat-Error/Pure-dove-sharegpt](https://huggingface.co/datasets/Chat-Error/Pure-dove-sharegpt)
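
A quick way to inspect the data (a generic `datasets` snippet, assuming the default `train` split) is:

```python
# Peek at the ShareGPT-format dataset used for the finetune.
from datasets import load_dataset

ds = load_dataset("Chat-Error/Pure-dove-sharegpt", split="train")  # assumes a "train" split
print(ds)      # row count and column names
print(ds[0])   # one ShareGPT-style conversation record
```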

A small dataset was used to see how it affects performance. Originally I planned to use a larger dataset (196k samples), but wanted to start with a smaller one first to see how much the model improved with some additional finetuning.
The next step would be finetuning on a larger dataset if further testing shows performance improvements.

## Finetuning details
This is a QLoRA model and all modules were targeted.
```yaml
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
lora_modules_to_save:
- embed_tokens
- lm_head
```
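
For readers more familiar with PEFT than Axolotl, that adapter configuration corresponds roughly to the `LoraConfig` below; the rank, alpha, and dropout values are placeholders since the card does not state them.

```python
# Rough PEFT equivalent of the adapter settings above.
# NOTE: r, lora_alpha, and lora_dropout are assumed placeholders, not the values used.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                # assumption
    lora_alpha=16,       # assumption
    lora_dropout=0.05,   # assumption
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "gate_proj", "down_proj", "up_proj",
        "q_proj", "v_proj", "k_proj", "o_proj",
    ],
    modules_to_save=["embed_tokens", "lm_head"],
)
```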

The following hyperparameters were used during training:
```yaml
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 3
- total_train_batch_size: 3
- total_eval_batch_size: 3
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 25
- num_epochs: 1
```

The `paged_adamw_8bit` optimizer and DeepSpeed ZeRO 3 were used at a LR of `1e-5` with the cosine scheduler for 1 epoch on 3x RTX 3090s, taking 4h 12m 13s in total.
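
Outside of Axolotl, that optimizer/scheduler pairing can be reproduced roughly as below (the DeepSpeed ZeRO 3 sharding is configured separately and not shown; the tiny stand-in model and the 1157-step horizon, taken from the W&B summary further down, are only for illustration).

```python
# Sketch of the paged 8-bit AdamW optimizer with a cosine schedule and 25 warmup
# steps, matching the hyperparameters listed above. The Linear layer is a stand-in
# for the PEFT-wrapped model; DeepSpeed ZeRO 3 setup is omitted.
import torch
import bitsandbytes as bnb
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the actual model

optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=1e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=25,
    num_training_steps=1157,  # this run's total optimizer steps (1 epoch)
)
```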

Sample packing and padding were disabled to reduce VRAM consumption significantly, at the cost of speed.

W&B Run Summary
```
wandb: Run summary:
wandb: eval/loss 1.00774
wandb: eval/runtime 535.3847
wandb: eval/samples_per_second 0.721
wandb: eval/steps_per_second 0.241
wandb: total_flos 4167452590080.0
wandb: train/epoch 1.0
wandb: train/global_step 1157
wandb: train/grad_norm 4.50846
wandb: train/learning_rate 0.0
wandb: train/loss 1.4115
wandb: train_loss 1.00352
wandb: train_runtime 14921.1227
wandb: train_samples_per_second 0.233
wandb: train_steps_per_second 0.078
```

### Framework versions

- PEFT 0.10.0
- Transformers 4.40.0.dev0
- Pytorch 2.3.0+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0

## Evaluations

TBD - submitted

[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)