Feedback/General Discussion
Hello, I'll briefly share my experience from some minimal testing of the 6bpw quant of this model, and well... it's not particularly great. Perhaps I'm missing something, whether it's the formatting or the sampling parameters.
First, here's my setup. I use the latest SillyTavern staging on my Android, with Alpaca-Roleplay formatting and a somewhat peculiar character card system. I put the example dialogue in the description:
and use an Author's Note at depth 1 (to retain information consistency) with the info written in second person (so the model can receive it directly, without bias):
My backend is oobabooga, running on free Colab (15 GB VRAM), launched with the following flags:
```
--n-gpu-layers 128 --loader exllamav2_hf --cache_4bit --max_seq_len 16384 --compress_pos_emb 4 --no_inject_fused_attention --no_use_cuda_fp16 --disable_exllama
```
As for my actual experience, well... it's not great. I think I might be missing something, but beyond that, the model is very confused and the responses are generally wack.
Responses (Min_P: 0.08-0.1):
Overall, I'm pretty sure other 10.7Bs would have done better. The model's handling of the Author's Note at depth 1 is off, and generations turn pretty gibberish when I adjust Min_P or smoothing to higher values.
Try resetting your SillyTavern preset. I remember one time I accidentally moved something for Erosumika and all answers were dumb af until I restored the default settings and saw the light again.
Oh, I see the problem now. It's probably my character card system. It wasn't an issue when I used 13-20Bs with my custom character info system; I'll test tomorrow when I have the time. The sampler priority is probably a factor too.
Okay, so after some fiddling around, it turns out the problem was this exact flag: '--compress_pos_emb'. I was able to get decent responses after removing the flag while using the same setup as above.
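For anyone wondering why that flag matters: as far as I understand, compress_pos_emb applies linear RoPE scaling, dividing every position index by the factor before the rotary angles are computed. That's meant for stretching a model past its native training context; if the model already handles the context on its own, it just squashes the positions and degrades the output. A rough sketch of the idea (my own illustration, not ooba's actual code):

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0, compress: float = 1.0):
    # Standard rotary-embedding angles; compress_pos_emb-style linear
    # scaling divides every position index by `compress`.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / compress
    return torch.outer(positions, inv_freq)  # [seq_len, head_dim // 2]

# With compress=4, positions 0..16383 get squeezed into the 0..4095
# range, which hurts a model that didn't need the stretching.
```

Anyway, after some minimal testing, here's what I found: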
Responses end early: There are times when responses hit the 300-token cap, but other times they cut off well before it.
Responses act a bit weird: Probably the sampling (just Min_P: 0.045 and Smoothing: 0.08; see the sketch after this list), but some responses contain odd interjections, like a character expressing her frustration and then, out of the blue, a servant appearing to say 'but perhaps there is no shame, perhaps we are just children', which makes no sense in context. The model also sometimes spews out header text like INPUT, OUTPUT, or NEW RULE, probably due to the advanced formatting.
Traces of word repetition: Again, probably the sampling. I don't use repetition penalty, and responses contain stretches like 'while maintaining a beguiled expression, has a disdainful expression'.
Doesn't capture example-response nuances: When testing with the card 'Renpet', I noticed the model never picks up the nuances of how she talks, like 'shishishi', in any response at all.
Confused by a complicated Author's Note: Besides missing speech nuances, the model also gets confused by Author's Note info. For instance, I tested the 'Victoria' card, whose Author's Note contains Ali:Chat-style info, yet the model treats her as an anthropomorphic character with paws despite the note saying otherwise.
Heavily focuses on user's point of view, though it doesn't speak as user: Perhaps it's the example dialogue in the description containing user's lines; either way, the model dwells on user's perspective and acts on their behalf a lot.
Model somehow does not follow formatting?: First time I've seen this; despite 'Include Names' being checked, the character's name is not included.
Does have instruction-following capability: Victoria's card has rules about inner thoughts, and the model follows them, so there's that.
Does capture personality: Character personality comes through decently.
Interesting quirks: In some sentences I noticed alliteration, which is interesting but strange, like 'but as they bubbled and blundered, they only served to stutter and stammer'. Yeah, that looks like word repetition with extra steps.
Long Context Works: It's a given.
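Since several of the points above circle back to sampling, here's my rough mental model of the two samplers I'm running (a sketch of my understanding, not the backend's actual code):

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.045) -> torch.Tensor:
    # Min_P keeps only the tokens whose probability is at least
    # min_p times the probability of the single most likely token.
    probs = torch.softmax(logits, dim=-1)
    keep = probs >= min_p * probs.max()
    return logits.masked_fill(~keep, float("-inf"))

def smoothing_factor(logits: torch.Tensor, factor: float = 0.08) -> torch.Tensor:
    # Smoothing (quadratic sampling), as I understand it: bend the
    # logits along a parabola centred on the max logit, shrinking
    # the gap between the top candidates.
    max_logit = logits.max()
    return -factor * (logits - max_logit) ** 2 + max_logit
```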
Responses (Sampling: Min_P: 0.045, Smoothing: 0.08):
(Short responses):
(Some strange responses, some with alliteration):
(Acting as user):
(Responses with weird output headers):
Overall, the model has some good potential; it needs some ironing out in how it applies and reads the Author's Note and how it generates responses, plus slightly improved wording.
Shouldn't short answers be good ones too? I thought the main problem is when an LLM can't stop, writes a ton of actions so you can't even react, and you have to cut off its response.
INPUT, OUTPUT, NEW RULE is something I never saw while using this model; try playing with the parameters. I already sent you my SillyTavern preset (the default one), here is my Ooba preset:
If toying with the parameters changes nothing for you, then try adding them to the "Banned Tokens" section.
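As far as I know, banning a token just zeroes out its chance before sampling, something like this (a rough sketch, not SillyTavern's actual code):

```python
import torch

def ban_tokens(logits: torch.Tensor, banned_ids: list[int]) -> torch.Tensor:
    # A "Banned Tokens" list effectively sets each banned token's
    # logit to -inf, so it can never be sampled.
    logits[..., banned_ids] = float("-inf")
    return logits
```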
I forgot to clarify something about that. By short responses, I mean the model likes to end responses early despite the 'eos' token being banned. It's more of a 'model does not follow your set preferences' kind of thing: if I set response tokens to 500, I expect responses to be consistently long, so when the model outputs fewer than 200 tokens in 3/5 gens, that sounds like a problem. Also, when using the 'Victoria' card, the model would sometimes output very short responses, and in some of them it would spend the majority of the tokens on the 'inner thoughts' part. With 13Bs and above, I usually have to press continue for the 'inner thoughts' part to even appear, so the main focus of the response at least stays on the body of the text rather than the 'inner thoughts'.

As for the main problem you mentioned, I don't particularly mind it; in my experience, 13Bs and above, and some 10.7Bs, accentuate a few actions and then extend them in a way that meets the target response length. With this model, one of the main problems is that it really likes to make assumptions on user's part without waiting for user to react.

Later, when I have time, I'll try testing the model with your preset, but I'm not really a fan of it. I might be wrong, as I have limited knowledge of sampling, so correct me if I am. Your preset has Top_P: 0.9, Top_K: 20, Temp: 0.7, and Rep Pen: 1.15. With Top_K, the model chooses from at most 20 tokens; with Top_P, that pool shrinks further, probably to fewer than 12 tokens or even less; and with a temperature below 1, the surviving tokens become more deterministic. Add repetition penalty and the preset makes any model ultra-deterministic with crippled creativity. For a model meant to answer questions objectively or for general-purpose use, that's fine, but for roleplay it isn't viable for a unique RP experience.
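To put numbers on that, here's roughly how I picture your preset whittling down the candidate pool at each step (a sketch assuming the common Top_K → Top_P → temperature order; the real order depends on the sampler priority):

```python
import torch

def apply_preset(logits: torch.Tensor, top_k: int = 20, top_p: float = 0.9,
                 temperature: float = 0.7):
    # Top_K: only the 20 highest-logit tokens survive.
    top_vals, top_idx = torch.topk(logits, top_k)
    # Top_P: of those, keep the smallest prefix whose cumulative
    # probability reaches 0.9 -- often well under 20 tokens.
    probs = torch.softmax(top_vals, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    keep = (cumulative - probs) < top_p
    # Temperature < 1 then sharpens whatever is left, piling most of
    # the probability mass onto the very top tokens; repetition
    # penalty would skew it further.
    final_probs = torch.softmax(top_vals[keep] / temperature, dim=-1)
    return top_idx[keep], final_probs
```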
So I tested out the preset you gave me:
So there are still constantly short responses despite banning the 'eos' token, and one response leaked the advanced formatting when I pressed continue:
And when I do get long responses, the model is pretty much confused by the example messages in the description:
Overall, while some issues were resolved, the model still gets confused about certain things, like staying in character and reading example messages.
Thanks for your feedback. I'm sure the 'eos' token behavior isn't the model's fault, but I can improve the model's roleplay capabilities by slightly fine-tuning it on a small long-turn RP dataset; I think any merge's quality would improve if you fine-tune it as a solid model.
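By 'slightly fine-tuning' I mean a light parameter-efficient pass over the merge, roughly like this (a minimal sketch assuming a peft LoRA setup; the model path and dataset are placeholders, not my actual training code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder path for the merged 10.7B; the training data would be a
# small set of long multi-turn roleplay chats.
model = AutoModelForCausalLM.from_pretrained("path/to/merged-10.7b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("path/to/merged-10.7b")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)  # only the small adapter weights get trained
model.print_trainable_parameters()
```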
After all, I planned to release only the 2x10.7B, but this model by itself was pretty fun to toy with, so I decided to upload it.
No problem; I just saw the 2x10.7B model, then saw this one and decided to test this one first. Turns out you were right about the 'eos' part, though I don't like the model's tendency toward short responses.