brucethemoose committed
Commit 6873837
1 Parent(s): add2ac4

Update README.md

Files changed (1)
  1. README.md +62 -0
README.md CHANGED
@@ -51,6 +51,68 @@ I am a huge fan of Kalomaze's quadratic sampling (shown as "smoothing factor" wh
 
 Otherwise, I recommend a lower temperature with MinP of 0.1 or higher, a little repetition penalty, mirostat with a low tau, and no other samplers. See the explanation here: https://github.com/ggerganov/llama.cpp/pull/3841
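For reference, a settings block along those lines might look like this, written in the same format as the preset below. The exact numbers are illustrative assumptions on my part, not tested values; everything else is left at its neutral setting so no other samplers are active:

```
{
  "temp": 0.8,
  "min_p": 0.1,
  "rep_pen": 1.05,
  "mirostat_mode": 2,
  "mirostat_tau": 3,
  "mirostat_eta": 0.1,
  "top_p": 1,
  "top_k": 0,
  "typical_p": 1,
  "tfs": 1,
  "smoothing_factor": 0
}
```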
 
+ @MarinaraSpaghetti has tested the model extensively and recommends the following settings, which seem to work quite well:
+
+ ```
+ {
+ "temp": 1,
+ "temperature_last": true,
+ "top_p": 1,
+ "top_k": 0,
+ "top_a": 0,
+ "tfs": 1,
+ "epsilon_cutoff": 0,
+ "eta_cutoff": 0,
+ "typical_p": 0.9,
+ "min_p": 0,
+ "rep_pen": 1.1,
+ "rep_pen_range": 19456,
+ "no_repeat_ngram_size": 0,
+ "penalty_alpha": 0,
+ "num_beams": 1,
+ "length_penalty": 0,
+ "min_length": 0,
+ "encoder_rep_pen": 1,
+ "freq_pen": 0,
+ "presence_pen": 0,
+ "do_sample": true,
+ "early_stopping": false,
+ "dynatemp": false,
+ "min_temp": 1,
+ "max_temp": 2,
+ "dynatemp_exponent": 1,
+ "smoothing_factor": 0.33,
+ "add_bos_token": false,
+ "truncation_length": 2048,
+ "ban_eos_token": false,
+ "skip_special_tokens": true,
+ "streaming": true,
+ "mirostat_mode": 0,
+ "mirostat_tau": 5,
+ "mirostat_eta": 0.1,
+ "guidance_scale": 1,
+ "negative_prompt": "",
+ "grammar_string": "",
+ "banned_tokens": "",
+ "ignore_eos_token_aphrodite": false,
+ "spaces_between_special_tokens_aphrodite": true,
+ "sampler_order": [
+ 6,
+ 0,
+ 1,
+ 3,
+ 4,
+ 2,
+ 5
+ ],
+ "logit_bias": [],
+ "n": 1,
+ "rep_pen_size": 0,
+ "genamt": 400,
+ "max_length": 38912
+ }
+ ```
+
 24GB GPUs can efficiently run Yi-34B-200K models at **40K-90K context** with exllamav2 and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/). 16GB GPUs with nothing else loaded can still run high context with aggressive quantization.
 
 To load or train this model in full-context backends like transformers, you *must* change `max_position_embeddings` in config.json to a value lower than 200,000, or you will OOM! I do not recommend running high context without a context-efficient backend that supports flash attention + an 8-bit KV cache, like exllamav2, litellm, vllm or unsloth.
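For example, that edit to config.json is a single line; 32768 below is just an illustrative cap, and anything comfortably under 200,000 that fits your VRAM and use case works:

```
- "max_position_embeddings": 200000,
+ "max_position_embeddings": 32768,
```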