
BOS is actually the EOS token by default

#30 · opened by dblakely

Hi, when I load the tokenizer for this model, it appears the BOS and EOS tokens are the same (both are set to the EOS token).

Example:

>>> from transformers import AutoTokenizer
>>> model_name = "WizardLM/WizardLM-13B-V1.2"
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> tokenizer.bos_token
'</s>'      # <-- this is the EOS token, not the BOS token
>>> tokenizer.bos_token_id
2
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
2

You can see this as well when you tokenize something:

>>> tokenizer.decode(tokenizer("This is an input").input_ids)
'</s> This is an input'

Looking at the special_tokens_map.json file, you can see:

{
  "bos_token": "</s>",
  "eos_token": "</s>",
  "pad_token": "<unk>",
  "unk_token": "</s>"
}
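
For reference, the base Llama-2 tokenizer keeps the two tokens distinct, which is presumably what was intended here. A quick check (the meta-llama repo is gated and the values below are from memory, so treat them as an assumption rather than verified output):

>>> base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
>>> base.bos_token, base.bos_token_id
('<s>', 1)
>>> base.eos_token, base.eos_token_id
('</s>', 2)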

Is this a typo? The bos_token should be set to "<s>" instead, should it not?
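
In case it helps others, one possible workaround until the config is fixed is to override the token when loading the tokenizer. This is a minimal sketch: it assumes "<s>" is id 1 as in the base Llama vocabulary, the outputs shown are what I would expect rather than verified, and whether the prepended token actually changes may depend on the transformers version and the fast tokenizer's post-processor:

>>> tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardLM-13B-V1.2", bos_token="<s>")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<s>', 1)
>>> tokenizer.decode(tokenizer("This is an input").input_ids)
'<s> This is an input'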
