BOS is actually the EOS token by default #30
opened by dblakely
Hi, when I load the tokenizer for this model, it appears the BOS and EOS tokens are the same (both are set to the EOS token).
Example:
>>> from transformers import AutoTokenizer
>>> model_name = "WizardLM/WizardLM-13B-V1.2"
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> tokenizer.bos_token
'</s>' # <-- this is the EOS token, not the BOS token
>>> tokenizer.bos_token_id
2
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
2
You can see this as well when you tokenize something:
>>> tokenizer.decode(tokenizer("This is an input").input_ids)
'</s> This is an input'
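Checking the raw IDs directly shows the same thing (assuming the default add_bos_token=True, so the tokenizer prepends the BOS ID to every encoding):
>>> tokenizer("This is an input").input_ids[0]
2 # <-- bos_token_id, which this config maps to '</s>'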
Looking at the special_tokens_map.json file, you can see:
{
  "bos_token": "</s>",
  "eos_token": "</s>",
  "pad_token": "<unk>",
  "unk_token": "</s>"
}
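For reference, here is what I would expect the file to contain under the standard Llama special-token conventions (<s> at ID 1 for BOS, </s> at ID 2 for EOS, <unk> at ID 0 for UNK). This is just the base Llama layout, not something taken from this repo:
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<unk>",
  "unk_token": "<unk>"
}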
Is this a typo? Shouldn't the bos_token be set to "<s>" instead?
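In the meantime, overriding the token at load time looks like a workable stopgap. A minimal sketch, assuming the bos_token kwarg is forwarded by AutoTokenizer.from_pretrained to the underlying Llama tokenizer's constructor:
>>> tokenizer = AutoTokenizer.from_pretrained(model_name, bos_token="<s>")
>>> tokenizer.bos_token_id
1 # <-- '<s>' in the Llama vocab
>>> tokenizer.decode(tokenizer("This is an input").input_ids)
'<s> This is an input'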