|
--- |
|
license: llama3.1 |
|
datasets: |
|
- nothingiisreal/Reddit-Dirty-And-WritingPrompts |
|
- Nopm/Opus_WritingStruct |
|
- kalomaze/Opus_Instruct_25k |
|
- Gryphe/Sonnet3.5-SlimOrcaDedupCleaned |
|
--- |
|
|
|
Gate lifted, yay! People liked the model even though it's a test model that's underfit. Still cost us 80 USD though lmao. FP8 is [here](https://huggingface.co/nothingiisreal/L3.1-70B-Celeste-V0.1-FP8)
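
If you want to try the FP8 weights, vLLM can usually load a checkpoint like that directly. A minimal sketch, assuming the quant is in a vLLM-compatible format and you have FP8-capable GPUs; the `tensor_parallel_size` value is just an example:

```python
from vllm import LLM, SamplingParams

# Example value only: a 70B FP8 model still needs on the order of 70+ GB of VRAM in total
llm = LLM(model="nothingiisreal/L3.1-70B-Celeste-V0.1-FP8", tensor_parallel_size=2)

params = SamplingParams(temperature=1.25, max_tokens=512)  # temp recommendation from below
outputs = llm.generate(["Write the opening paragraph of a mystery story."], params)
print(outputs[0].outputs[0].text)
```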
|
|
|
Please do give the V1.9 card a read [here](https://huggingface.co/nothingiisreal/MN-12B-Celeste-V1.9) |
|
|
|
The recommended system prompt is the same as V1.9's.
|
|
|
The 70B seems to use a bit more GPT-ish terminology than the 12B, but it also slops less. The GPT-isms are still less frequent than in other 70Bs.
|
|
|
Temp 1.25 seems to improve the prose. Recommended sampler settings:
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/630cf5d14ca0a22768bbe10c/5BkFd5FromVfT8ZeTml_2.png) |
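
If you run the model behind an OpenAI-compatible server (vLLM, TabbyAPI, etc.), the temperature part of those settings looks like this. A sketch only: the endpoint and system prompt are placeholders, the model name depends on how your server registered it, and the remaining sampler values from the screenshot still have to be set in your backend:

```python
from openai import OpenAI

# Placeholder endpoint; point this at your own OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nothingiisreal/L3.1-70B-Celeste-V0.1",
    messages=[
        {"role": "system", "content": "<the V1.9 system prompt goes here>"},
        {"role": "user", "content": "Continue the scene from my last message."},
    ],
    temperature=1.25,  # the recommendation above
)
print(response.choices[0].message.content)
```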
|
|
|
It seems to be way more coherent and aware of what's going on, as well as more intelligent.
|
|
|
The model seems to give out what you give in: a sloppy card or first message leads to more of the same. It is quite good at taking a human-written card with things like conversational narration and continuing in that style.
|
|
|
It was trained on 4xH100 NVL for 6 hours using LoRA+. I still want to train it further, because it seems like the more data we put in, the better the model gets at writing and roleplaying.
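
For context, the LoRA+ trick is giving the LoRA B matrices a higher learning rate than the A matrices; Axolotl exposes this as `loraplus_lr_ratio` in the config below. A conceptual sketch only, with dummy tensors standing in for the real adapter weights:

```python
import torch

# Dummy stand-ins for the LoRA A/B weights of one adapted layer
lora_A = torch.nn.Parameter(torch.randn(16, 4096) * 0.01)
lora_B = torch.nn.Parameter(torch.zeros(4096, 16))

base_lr = 8e-6  # learning_rate from the config below
ratio = 8       # loraplus_lr_ratio from the config below

# LoRA+: B gets `ratio` times the learning rate of A
optimizer = torch.optim.AdamW(
    [
        {"params": [lora_A], "lr": base_lr},
        {"params": [lora_B], "lr": base_lr * ratio},
    ],
    weight_decay=0.01,  # weight_decay from the config below
)
```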
|
|
|
Test and see I guess. |
|
|
|
My teammate and I are sick rn xD, and I am currently working with another teammate on some good stuff: we can finally break away from AI-generated datasets, at least for the most part. Once it is done, the 8B, 12B and 70B will all be trained on that dataset. I hope we succeed at this; it will make me so, so happy.
|
|
|
We are also experimenting with RLHF, mainly KTO and PPO.
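
For reference, KTO only needs per-sample thumbs-up/down labels rather than paired preferences. A rough sketch of what a run looks like with TRL's KTOTrainer (not our actual setup; model, data, and hyperparameters are placeholders, and the keyword names follow TRL ~0.9):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder: small base for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# KTO data format: prompt, completion, and a boolean desirability label
train_dataset = Dataset.from_list([
    {"prompt": "Write a vivid opening line.", "completion": "The tide swallowed the pier at dusk.", "label": True},
    {"prompt": "Write a vivid opening line.", "completion": "It was a day.", "label": False},
])

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto-sketch", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```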
|
|
|
When we do a proper release, it will come with a detailed writeup.
|
|
|
--- |
|
|
|
Datasets used: |
|
|
|
```python
# [file name, sampling ratio, force RP format?, apply length limit?
#  (the sequence length limit is always applied to the first message),
#  unknown boolean, minimum message count, system message]

# Reddit WP
["reddit_writing_prompts.jsonl", 0.4, True, True, False, 2, "Write a story based on prompt provided by user below. Mode: SFW"],

# Instruct
["combined_25k_HOTFIX_declauded_englishonly_sysprompt_name_swap.jsonl", 0.1, False, True, False, 2, ""],
["slim-orca.json", 0.1, False, True, False, 2, ""],

# Synth story
["writing-struct-deslopped.json", 0.1, False, True, False, 2, ""],

# Claude RP: sampling ratio 0.8
```
|
|
|
Thank you to Nopm, Gryphe (double thanks), and kalomaze, and anyone else involved in making those datasets. r/DirtyWritingPrompts was dropped because it would induce undesirable features. No worries though, NSFW will be stronger than ever lmao.
|
|
|
We used 10,000 rows, so take those ratios, normalise them so they add up to 1, and that gives the split of the dataset. You can find all the datasets by googling them; they are on Hugging Face. Claude RP is the c2 logs, but we filtered them ourselves.
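
A quick sketch of that normalisation (the dataset keys are just shorthand for the files listed above):

```python
# Sampling ratios from the dataset spec above
ratios = {
    "reddit_writing_prompts": 0.4,
    "opus_instruct_25k": 0.1,
    "slim_orca": 0.1,
    "writing_struct": 0.1,
    "claude_rp": 0.8,
}

total_rows = 10_000
total = sum(ratios.values())  # 1.5

# Normalise so the ratios sum to 1, then allocate the 10k rows proportionally
rows = {name: round(r / total * total_rows) for name, r in ratios.items()}
print(rows)
# -> {'reddit_writing_prompts': 2667, 'opus_instruct_25k': 667, 'slim_orca': 667,
#     'writing_struct': 667, 'claude_rp': 5333}
```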
|
|
|
--- |
|
|
|
Axolotl Config: |
|
|
|
```yaml |
|
# Model |
|
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct |
|
model_type: LlamaForCausalLM |
|
tokenizer_type: AutoTokenizer |
|
trust_remote_code: true |
|
|
|
# Output and HuggingFace |
|
output_dir: /workspace/data/train-results/trained_model |
|
hub_model_id: |
|
hf_use_auth_token: true |
|
hub_strategy: "all_checkpoints" |
|
|
|
# WandB |
|
wandb_project: huggingface |
|
wandb_entity: |
|
|
|
# Data |
|
chat_template: llama3 |
|
train_on_inputs: false |
|
group_by_length: true |
|
datasets:
  - path:
    type: sharegpt
    roles:
      input:
        - system
        - user
      output:
        - assistant
|
## Evaluation |
|
val_set_size: 0.01 |
|
evals_per_epoch: 4 |
|
eval_table_size: |
|
eval_max_new_tokens: 128 |
|
|
|
# Technical aspects |
|
sequence_len: 8192 |
|
save_safetensors: true |
|
saves_per_epoch: 2 |
|
logging_steps: 1 |
|
special_tokens:
  pad_token: <|end_of_text|>
|
|
|
# Quantization |
|
bf16: auto |
|
fp16: |
|
tf32: false |
|
## For LoRA |
|
load_in_8bit: false |
|
load_in_4bit: true |
|
|
|
# LoRA |
|
adapter: qlora # or lora
|
lora_model_dir: |
|
lora_r: 256 |
|
lora_alpha: 256 |
|
lora_dropout: 0.1 |
|
lora_target_linear: true |
|
lora_fan_in_fan_out: |
|
lora_target_modules: |
|
|
|
loraplus_lr_ratio: 8 |
|
loraplus_lr_embedding: |
|
|
|
# Training hyperparameters |
|
# max_steps: |
|
num_epochs: 1 # LoRA+ generally only needs 1 epoch.
|
|
|
# Anti Overfit and Stability |
|
weight_decay: 0.01 |
|
max_grad_norm: 1.0 # Might increase this to 15 or something. |
|
|
|
## Learning Rate |
|
warmup_ratio: 0.05 |
|
learning_rate: 0.000008 |
|
lr_scheduler: cosine_with_min_lr |
|
lr_scheduler_kwargs:
  min_lr: 0.0000024
|
optimizer: paged_adamw_8bit # usually adamw_torch or paged_adamw_8bit |
|
|
|
## Batch Size |
|
gradient_accumulation_steps: 1 |
|
micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps |
|
eval_batch_size: 1 |
|
|
|
# Optimizations |
|
pad_to_sequence_len: true |
|
sample_packing: true |
|
eval_sample_packing: true |
|
flash_attention: true |
|
xformers_attention: |
|
gradient_checkpointing: "unsloth" |
|
gradient_checkpointing_kwargs:
  use_reentrant: true
|
local_rank: |
|
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi-GPU; there is also a _bf16_cpuoffload_all variant.
|
# Misc |
|
early_stopping_patience: |
|
debug: |
|
``` |
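
After training, the saved adapter can be loaded on top of the base model for testing, or merged for export. A minimal sketch with transformers + peft; the adapter path is a placeholder for wherever `output_dir` ended up:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",  # base_model from the config
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")

# Placeholder path: the trained LoRA adapter from output_dir
model = PeftModel.from_pretrained(base, "/workspace/data/train-results/trained_model")

# Optionally bake the adapter into the base weights for export
model = model.merge_and_unload()
```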