|
--- |
|
license: llama3.1 |
|
datasets: |
|
- nothingiisreal/Reddit-Dirty-And-WritingPrompts |
|
- Nopm/Opus_WritingStruct |
|
- kalomaze/Opus_Instruct_25k |
|
- Gryphe/Sonnet3.5-SlimOrcaDedupCleaned |
|
--- |
|
|
|
Gate lifted, yay! People liked the model even though it's a test model that's underfit. Still cost us 80 USD though lmao. FP8 is [here](https://huggingface.co/nothingiisreal/L3.1-70B-Celeste-V0.1-FP8)
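
If you want to try the FP8 weights, vLLM can usually load a checkpoint like that directly. A minimal sketch, assuming the quant is in a vLLM-compatible format and you have FP8-capable GPUs; the `tensor_parallel_size` value is just an example:

```python
from vllm import LLM, SamplingParams

# Example value only: a 70B FP8 model still needs on the order of 70+ GB of VRAM in total
llm = LLM(model="nothingiisreal/L3.1-70B-Celeste-V0.1-FP8", tensor_parallel_size=2)

params = SamplingParams(temperature=1.25, max_tokens=512)  # temp recommendation from below
outputs = llm.generate(["Write the opening paragraph of a mystery story."], params)
print(outputs[0].outputs[0].text)
```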
|
|
|
Please do give the V1.9 card a read [here](https://huggingface.co/nothingiisreal/MN-12B-Celeste-V1.9) |
|
|
|
The recommended system prompt is the same as V1.9's.
|
|
|
The 70B seems to use a bit more GPT-ish terminology than the 12B, but it also slops less. The GPT-isms are still less frequent than in other 70Bs.
|
|
|
Temp 1.25 seems to improve the prose. Recommended sampler settings:
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/630cf5d14ca0a22768bbe10c/5BkFd5FromVfT8ZeTml_2.png) |
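
If you run the model behind an OpenAI-compatible server (vLLM, TabbyAPI, etc.), the temperature part of those settings looks like this. A sketch only: the endpoint and system prompt are placeholders, the model name depends on how your server registered it, and the remaining sampler values from the screenshot still have to be set in your backend:

```python
from openai import OpenAI

# Placeholder endpoint; point this at your own OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nothingiisreal/L3.1-70B-Celeste-V0.1",
    messages=[
        {"role": "system", "content": "<the V1.9 system prompt goes here>"},
        {"role": "user", "content": "Continue the scene from my last message."},
    ],
    temperature=1.25,  # the recommendation above
)
print(response.choices[0].message.content)
```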
|
|
|
It seems to be way more coherent and aware of what's going on, as well as more intelligent.
|
|
|
The model seems to give out what you give in: a sloppy card or first message leads to more of the same. It is quite good at taking a human-written card with things like conversational narration and continuing in that style.
|
|
|
It was trained on 4xH100 NVL for 6 hours using LoRA+. I still want to train it further, because it seems like the more data we put in, the better the model gets at writing and roleplaying.
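
For context, the LoRA+ trick is giving the LoRA B matrices a higher learning rate than the A matrices; Axolotl exposes this as `loraplus_lr_ratio` in the config below. A conceptual sketch only, with dummy tensors standing in for the real adapter weights:

```python
import torch

# Dummy stand-ins for the LoRA A/B weights of one adapted layer
lora_A = torch.nn.Parameter(torch.randn(16, 4096) * 0.01)
lora_B = torch.nn.Parameter(torch.zeros(4096, 16))

base_lr = 8e-6  # learning_rate from the config below
ratio = 8       # loraplus_lr_ratio from the config below

# LoRA+: B gets `ratio` times the learning rate of A
optimizer = torch.optim.AdamW(
    [
        {"params": [lora_A], "lr": base_lr},
        {"params": [lora_B], "lr": base_lr * ratio},
    ],
    weight_decay=0.01,  # weight_decay from the config below
)
```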
|
|
|
Test and see I guess. |
|
|
|
My teammate and I are sick rn xD, and I am currently working with another teammate on some good stuff: we can finally break away from AI-generated datasets, at least for the most part. Once it is done, the 8B, 12B and 70B will all be trained on that dataset. I hope we succeed at this; it will make me so, so happy.
|
|
|
We are also experimenting with RLHF, mainly KTO and PPO.
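
For reference, KTO only needs per-sample thumbs-up/down labels rather than paired preferences. A rough sketch of what a run looks like with TRL's KTOTrainer (not our actual setup; model, data, and hyperparameters are placeholders, and the keyword names follow TRL ~0.9):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder: small base for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# KTO data format: prompt, completion, and a boolean desirability label
train_dataset = Dataset.from_list([
    {"prompt": "Write a vivid opening line.", "completion": "The tide swallowed the pier at dusk.", "label": True},
    {"prompt": "Write a vivid opening line.", "completion": "It was a day.", "label": False},
])

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto-sketch", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```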
|
|
|
When we do a proper release, it will come with a detailed writeup.
|
|
|
--- |
|
|
|
Datasets used: |
|
|
|
```python
# [file name, sampling ratio, force RP format?, apply length limit?
#  (the sequence length limit is always applied to the first message),
#  unknown boolean, minimum message count, system message]

# Reddit WP
["reddit_writing_prompts.jsonl", 0.4, True, True, False, 2, "Write a story based on prompt provided by user below. Mode: SFW"],

# Instruct
["combined_25k_HOTFIX_declauded_englishonly_sysprompt_name_swap.jsonl", 0.1, False, True, False, 2, ""],
["slim-orca.json", 0.1, False, True, False, 2, ""],

# Synth story
["writing-struct-deslopped.json", 0.1, False, True, False, 2, ""],

# Claude RP: sampling ratio 0.8
```
|
|
|
Thank you to Nopm, Gryphe (double thanks), and kalomaze, and anyone else involved in making those datasets. r/DirtyWritingPrompts was dropped because it would induce undesirable features. No worries though, NSFW will be stronger than ever lmao.
|
|
|
We used 10,000 rows, so take those ratios, normalise them so they add up to 1, and that gives the split of the dataset. You can find all the datasets by googling them; they are on Hugging Face. Claude RP is the c2 logs, but we filtered them ourselves.
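
A quick sketch of that normalisation (the dataset keys are just shorthand for the files listed above):

```python
# Sampling ratios from the dataset spec above
ratios = {
    "reddit_writing_prompts": 0.4,
    "opus_instruct_25k": 0.1,
    "slim_orca": 0.1,
    "writing_struct": 0.1,
    "claude_rp": 0.8,
}

total_rows = 10_000
total = sum(ratios.values())  # 1.5

# Normalise so the ratios sum to 1, then allocate the 10k rows proportionally
rows = {name: round(r / total * total_rows) for name, r in ratios.items()}
print(rows)
# -> {'reddit_writing_prompts': 2667, 'opus_instruct_25k': 667, 'slim_orca': 667,
#     'writing_struct': 667, 'claude_rp': 5333}
```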
|
|
|
--- |
|
|
|
Axolotl Config: |
|
|
|
```yaml |
|
# Model |
|
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct |
|
model_type: LlamaForCausalLM |
|
tokenizer_type: AutoTokenizer |
|
trust_remote_code: true |
|
|
|
# Output and HuggingFace |
|
output_dir: /workspace/data/train-results/trained_model |
|
hub_model_id: |
|
hf_use_auth_token: true |
|
hub_strategy: "all_checkpoints" |
|
|
|
# WandB |
|
wandb_project: huggingface |
|
wandb_entity: |
|
|
|
# Data |
|
chat_template: llama3 |
|
train_on_inputs: false |
|
group_by_length: true |
|
datasets:
  - path:
    type: sharegpt
    roles:
      input:
        - system
        - user
      output:
        - assistant
|
## Evaluation |
|
val_set_size: 0.01 |
|
evals_per_epoch: 4 |
|
eval_table_size: |
|
eval_max_new_tokens: 128 |
|
|
|
# Technical aspects |
|
sequence_len: 8192 |
|
save_safetensors: true |
|
saves_per_epoch: 2 |
|
logging_steps: 1 |
|
special_tokens:
  pad_token: <|end_of_text|>
|
|
|
# Quantization |
|
bf16: auto |
|
fp16: |
|
tf32: false |
|
## For LoRA |
|
load_in_8bit: false |
|
load_in_4bit: true |
|
|
|
# LoRA |
|
adapter: qlora # or lora
|
lora_model_dir: |
|
lora_r: 256 |
|
lora_alpha: 256 |
|
lora_dropout: 0.1 |
|
lora_target_linear: true |
|
lora_fan_in_fan_out: |
|
lora_target_modules: |
|
|
|
loraplus_lr_ratio: 8 |
|
loraplus_lr_embedding: |
|
|
|
# Training hyperparameters |
|
# max_steps: |
|
num_epochs: 1 # LoRA+ generally only needs 1 epoch.
|
|
|
# Anti Overfit and Stability |
|
weight_decay: 0.01 |
|
max_grad_norm: 1.0 # Might increase this to 15 or something. |
|
|
|
## Learning Rate |
|
warmup_ratio: 0.05 |
|
learning_rate: 0.000008 |
|
lr_scheduler: cosine_with_min_lr |
|
lr_scheduler_kwargs:
  min_lr: 0.0000024
|
optimizer: paged_adamw_8bit # usually adamw_torch or paged_adamw_8bit |
|
|
|
## Batch Size |
|
gradient_accumulation_steps: 1 |
|
micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps |
|
eval_batch_size: 1 |
|
|
|
# Optimizations |
|
pad_to_sequence_len: true |
|
sample_packing: true |
|
eval_sample_packing: true |
|
flash_attention: true |
|
xformers_attention: |
|
gradient_checkpointing: "unsloth" |
|
gradient_checkpointing_kwargs:
  use_reentrant: true
|
local_rank: |
|
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi-GPU; there is also a _bf16_cpuoffload_all variant.
|
# Misc |
|
early_stopping_patience: |
|
debug: |
|
``` |
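
After training, the saved adapter can be loaded on top of the base model for testing, or merged for export. A minimal sketch with transformers + peft; the adapter path is a placeholder for wherever `output_dir` ended up:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",  # base_model from the config
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")

# Placeholder path: the trained LoRA adapter from output_dir
model = PeftModel.from_pretrained(base, "/workspace/data/train-results/trained_model")

# Optionally bake the adapter into the base weights for export
model = model.merge_and_unload()
```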