README.md · dblakely/WizardLM-13B-V1.2-fixed-tokenizer at main


license: llama2

This is a slightly modified version of the original WizardLM/WizardLM-13B-V1.2 checkpoint that fixes two bugs:

- In the original checkpoint, the BOS token is set to the EOS token (`</s>`, token ID 2). In this version, the BOS token is reverted to `<s>` (token ID 1).
- The original checkpoint has a mismatch between the tokenizer's vocabulary size and the model's embedding vocabulary size. The tokenizer includes an added `[PAD]` token, which brings its vocabulary to 32,001 tokens, one more than the model's 32,000. This discrepancy can cause index errors. This version removes the added `[PAD]` token and uses `<unk>` (token ID 0) for padding instead, so the tokenizer's vocabulary is back to 32,000 tokens and matches the model's vocabulary size.
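
The padding fix can be illustrated with a small, library-free sketch. It is not code from either checkpoint. The helper names `pad_batch` and `out_of_range_ids` and the example token IDs are hypothetical, and the ID 32000 for the added `[PAD]` token is an assumption (the first ID past a 32,000-token base vocabulary). The sketch only shows why a padding ID outside the embedding range is a problem and why padding with `<unk>` (token ID 0) avoids it.

```python
# Sizes and token IDs taken from the description above, except where noted.
MODEL_VOCAB_SIZE = 32_000   # rows in the model's embedding matrix
UNK_ID = 0                  # <unk>, reused for padding in this version
ORIGINAL_PAD_ID = 32_000    # ASSUMPTION: ID given to the added [PAD] token


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def out_of_range_ids(batch, vocab_size):
    """Return the sorted token IDs that have no row in an embedding
    matrix with `vocab_size` rows (valid IDs are 0 .. vocab_size - 1)."""
    return sorted(
        {tok for seq in batch for tok in seq if not 0 <= tok < vocab_size}
    )


# Hypothetical token IDs; each sequence starts with BOS (<s>, token ID 1).
batch = [[1, 15043, 3186], [1, 6324]]

broken = pad_batch(batch, ORIGINAL_PAD_ID)  # padding as in the original tokenizer
fixed = pad_batch(batch, UNK_ID)            # padding as in this version

# IDs listed here would index past the end of the embedding matrix.
print("original padding:", out_of_range_ids(broken, MODEL_VOCAB_SIZE))
print("this version:    ", out_of_range_ids(fixed, MODEL_VOCAB_SIZE))
```

Because `<unk>` is now doing double duty as the padding token, it is generally advisable to pass an attention mask alongside padded batches so that padded positions are ignored.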

For all other information about this model, refer to the original WizardLM/WizardLM-13B-V1.2 checkpoint.