cjdonahoe-ms committed
Commit 36fba3b · 1 Parent(s): 17c61f9
Update max_position_embeddings to 16384
I'm running this model on a vLLM server and I am receiving the following error: `<Server response: {'code': None, 'message': "This model's maximum context length is 4096 tokens. However, you requested 4348 tokens (4332 in the messages, 16 in the completion). Please reduce the length of the messages or completion.", 'object': 'error', 'param': None, 'type': 'invalid_request_error'}>`
Since this is the 16k version of Vicuna-13b-v1.5, the maximum context length should be 16384. The `max_position_embeddings` field in config.json is the only place left where 4096 is still mentioned, so I'm assuming that parameter is what is causing the server error.
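As a quick sanity check after this change, the sketch below (not part of this commit) loads the config with transformers and prints both length fields; the repo ID `lmsys/vicuna-13b-v1.5-16k` is my assumption about where this config lives. Note that vLLM also accepts a `--max-model-len` launch flag if you want to pin the served context length explicitly rather than rely on the config value.

```python
from transformers import AutoConfig

# Assumption: the model repo is lmsys/vicuna-13b-v1.5-16k; swap in your local path or repo ID.
config = AutoConfig.from_pretrained("lmsys/vicuna-13b-v1.5-16k")

# After this change both fields should report the 16k context window.
print(config.max_position_embeddings)                # expected: 16384
print(getattr(config, "max_sequence_length", None))  # already 16384 before this change
```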
config.json CHANGED (+1 -1)

@@ -9,7 +9,7 @@
   "hidden_size": 5120,
   "initializer_range": 0.02,
   "intermediate_size": 13824,
-  "max_position_embeddings": 4096,
+  "max_position_embeddings": 16384,
   "max_sequence_length": 16384,
   "model_type": "llama",
   "num_attention_heads": 40,
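For completeness, here is a hedged end-to-end check, again not part of this commit: after restarting the vLLM OpenAI-compatible server with the updated config, a prompt well over 4096 tokens should no longer trigger the error quoted above. The server URL, served model name, and prompt construction are placeholders.

```python
import requests

# Assumptions: vLLM's OpenAI-compatible server runs at localhost:8000 and serves
# this model under the name below; the prompt is roughly 5k tokens of repetition.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "lmsys/vicuna-13b-v1.5-16k",
        "messages": [{"role": "user", "content": "word " * 5000}],
        "max_tokens": 16,
    },
    timeout=300,
)

# Before this change, this request failed with the quoted "maximum context length
# is 4096 tokens" error; with a 16384-token window it should return a normal
# chat completion (HTTP 200).
print(resp.status_code)
print(resp.json())
```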