Correct maximum positional embeddings
The model appears to have been trained with context window = 512, not 2048 as claimed here. This can be seen by looking at the average loss by sequence position on the GPT-4 TinyStories dataset, packed into inputs of length 2048.
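Roughly, the check can be reproduced with something like the sketch below; the repo id roneneldan/TinyStories-33M and the pre-packed batch are illustrative assumptions, not the exact script used here.

```python
# Minimal sketch of the per-position loss check. The repo id and the way the
# batch is built are placeholders; any TinyStories checkpoint and any corpus
# packed to length 2048 should show the same pattern.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")
model.eval()

def loss_by_position(input_ids: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy at each sequence position over a batch of packed inputs."""
    with torch.no_grad():
        logits = model(input_ids).logits              # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]                  # position i predicts token i + 1
    shift_labels = input_ids[:, 1:]
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)                        # (batch, seq_len - 1)
    return per_token.mean(dim=0)                      # per-position loss, averaged over the batch

# If training really used a 512-token context, the curve degrades sharply past
# position ~512 even though the config advertises 2048.
```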
It would be great to get this changed (for all TinyStories models), as the current config is misleading.
You are quite correct; not sure what is up with the Hugging Face models.
From the paper: Our models are available on Huggingface named TinyStories-1M/3M/9M/28M/33M/1Layer/2Layer and TinyStories-Instruct-β. We use GPT-Neo architecture with window size 256 and context length 512. We use GPT-Neo tokenizer but only keep the top 10K most common tokens.
You're right. Our paper does indicate that we use a sequence length of 512 in training, but the model's config should be updated...
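For concreteness, a rough sketch of what that update could look like, assuming the fix also trims the learned position embeddings to the 512 rows that were actually trained so the checkpoint stays loadable; the repo id and output path are placeholders, not a committed plan.

```python
# Hedged sketch only: trim the position embeddings to the trained 512 rows and
# record the real context length in the config. Repo id and output path are
# placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")

old_wpe = model.transformer.wpe                     # learned position embeddings (2048 x hidden)
new_wpe = torch.nn.Embedding(512, old_wpe.embedding_dim)
new_wpe.weight.data.copy_(old_wpe.weight.data[:512])
model.transformer.wpe = new_wpe

model.config.max_position_embeddings = 512          # context length used in training
# (the paper's window size of 256 is already GPT-Neo's default `window_size`)
model.save_pretrained("TinyStories-33M-ctx512")
```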
@roneneldan Do you plan to merge this? I was also planning to contribute a version with the 10K vocab; would you consider merging that too, or do you prefer the current format?