Difference between this and 8k version?
Seems like the training data and params are the same; other than the difference in the config, what is different about this model?
This model is trained with a scaling factor of 0.125 using the same technique that was used on the 8K model. This model should have a max sequence length of 16384.
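For readers following along, here is a minimal sketch (in Python, assuming a Hugging Face / LLaMA-style rotary embedding) of what scaling positions by 0.125 means in practice. The class name and interface are illustrative, not the actual SuperHOT patch:

```python
import torch


class ScaledRotaryEmbedding(torch.nn.Module):
    """Rotary embedding with linearly scaled ("interpolated") positions.

    scale = 0.125 stretches the original 2048-token range to 2048 / 0.125 = 16384
    tokens; scale = 0.25 would correspond to the 8K variant.
    """

    def __init__(self, dim, base=10000, scale=0.125):
        super().__init__()
        self.scale = scale
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len):
        # Multiply positions by the scale factor before building the sin/cos
        # tables, so position 16384 "looks like" position 2048 to the
        # pretrained weights.
        t = torch.arange(seq_len, dtype=torch.float32, device=self.inv_freq.device) * self.scale
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)  # [seq_len, dim]
        return emb.cos(), emb.sin()
```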
Ah, I wonder how training on the 0.125 scaling factor affects performance at lower context lengths, i.e. how the 8K model and the 16K model perform at, let's say, a 4K context length and a 1:2 ratio.
A similar perplexity test should determine what performance difference exists; since the only difference between this and the 8K version is the scaling factor and max length, it should be easy to compare the two. I will also try training one with just a scaling of 0.5 (4096 max length).
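If it helps, here is a rough sketch of the kind of chunked perplexity comparison described here (non-overlapping windows, batch size 1; the model id argument and helper name are placeholders, and the RoPE scaling patch is assumed to already be applied to the loaded model):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def chunked_perplexity(model_id, text, ctx_len=2048, device="cuda"):
    """Approximate perplexity at a fixed context length using non-overlapping windows."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model = model.to(device).eval()

    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1), ctx_len):
        chunk = ids[:, start:start + ctx_len]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)  # loss is mean NLL over chunk.size(1) - 1 targets
        total_nll += out.loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return math.exp(total_nll / total_tokens)
```

Running the same text through both merged models at ctx_len = 2048, 8192, and 16384 (with the matching scaling factor applied) would give the kind of side-by-side numbers reported further down in the thread.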
Sounds good, I'll merge the adapter and see if I can get some numbers for you. I'm curious to see how it changes.
I have some results for you:
At 2K context: the 16K adapter merged into the model has a ppl of 7.3050, the 8K merge has 7.5387
At 8K context, scaling 4: the 16K is at 7.7976 and the 8K is at 7.7789
At 16K context, scaling 8: the 16K is at 9.4433 and the 8K is at 11.3963
Very interesting that the 16K model seems to be somewhat "better" at 2K context, and the lower ppl at the higher scaling is also proof that the 16K training works.
This sounds silly, but can we train a 32K/64K model? I wonder if this trend will continue for some reason -- we do need to test recall at larger context lengths too, but with this pattern, a 64K SuperHOT running at 8K context will probably be better than an 8K SuperHOT.
The difference at 2K context might just be because the 16K model was trained with LoRA rank 4 and the 8K model was trained with LoRA rank 2.
@flashvenom I would expect the ppl to be lower at 16K for the one trained with 16K, since it learns the proper dilated frequency. Still surprising that it has lower ppl on the short range as well.
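For context, the standard RoPE relationship (general background, not something stated in this thread) makes the "dilated frequency" point concrete: with linear position scaling s, the rotation angle for position m and dimension pair i is

```latex
\theta_{m,i} = (s \cdot m)\, b^{-2i/d}
```

So the 16K adapter, trained at s = 0.125, sees exactly the angles used at 16K inference (scaling 8), while the 8K adapter (presumably trained at s = 0.25, by the same arithmetic that maps 0.125 to 16384) only encounters s = 0.125 at evaluation time.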
can we train a 32K/64K model?
This model was a test just to see if my idea was correct that even with only 4K data, it should still work when you go to 16K. It seems to work, so I would encourage others to try going even higher to find the limit, now that you don't need 16K data to train to 16K context. I also want to investigate some other architectural changes we could make.
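Here's a hedged sketch of what such a fine-tune setup could look like, given the details mentioned in this thread (LoRA rank 4, ~4K-token training data, positions scaled by 0.125). The base model id, alpha, dropout, and target modules are illustrative assumptions, not the actual recipe:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative values only; not the actual SuperHOT training configuration.
base_id = "huggyllama/llama-13b"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

# Assumed to happen here: patch the model's rotary embeddings so positions are
# multiplied by 0.125 (e.g. with something like the ScaledRotaryEmbedding
# sketch earlier in this thread), and raise max_position_embeddings to 16384.

lora_config = LoraConfig(
    r=4,  # the 16K adapter in this thread used LoRA rank 4
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training then runs on ordinary ~4K-token sequences; the scaled positions are
# what let the merged model extrapolate to 16K (or beyond) at inference time.
```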
Training 32K/64K models might be worthwhile for experiment's sake, but not a lot of people have enough VRAM to do that.
This is a great breakthrough, but we may want to look at new ways to increase context without increasing VRAM requirements as much. Still, it's amazing we could increase context that much.
@alkeryn This method is solely an augmentation to positional encodings. It only allows the context of the pre-trained model to be increased without using much training data or compute. The issue of quadratic attention is orthogonal to this method (e.g. a fast attention method will likely not have anything to do with position encoding). Besides, mechanisms such as xformers and flash attention also exist, not to mention the recent vLLM, which can all work alongside it, since the issues of attention and the KV cache are entirely outside the domain of position information.
I am, however, currently working on a method to alleviate the memory issue.
EDIT: To clarify further, extending context here is tackled as a position encoding problem. Using that context is a separate issue.