`max_position_embeddings=32768` with "attention span of 131K tokens"
#57 opened by Nadav-Timor
Hi,
Can you please clarify how you use the `max_position_embeddings` hyperparameter? The config.json file specifies `max_position_embeddings=32768`, while the paper claims an attention span of 131K tokens (see Section 2, "Architectural details" → "Sliding Window Attention").
Thanks!
See this GitHub Issue by @ParadoxZW from a few days ago
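For anyone who wants to inspect both values directly, here is a minimal sketch using `transformers` (assuming the checkpoint under discussion is `mistralai/Mistral-7B-v0.1`; per the paper's Section 2, the 131K figure is the theoretical attention span that accumulates across layers, roughly window size × number of layers):

```python
from transformers import AutoConfig

# Assumed checkpoint for illustration; substitute the repo this discussion is about.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Positional limit the model was configured with (32768 in config.json).
print("max_position_embeddings:", config.max_position_embeddings)

# Per-layer sliding attention window size.
print("sliding_window:", config.sliding_window)

# Theoretical attention span described in the paper: information can propagate
# by up to one window per layer, so roughly window_size * num_layers tokens.
print("theoretical span:", config.sliding_window * config.num_hidden_layers)
```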