tokenizer.model is missing
Hi, I tried to load the tokenizer of this model but got the following error: TypeError: not a string.
I think this is because tokenizer.model is missing from this repository.
Could you please check and upload it? Thanks!
@YuxinXiao
Hi,
Could you share your transformers version? I will check tokenizer loading with that version. FYI, our transformers version is '4.31.0'.
@YuxinXiao
Hello,
I tested with the code below and it loaded the tokenizer successfully.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "upstage/llama-65b-instruct",
    force_download=True,
)
I've confirmed that it works on both transformers==4.30.0 and transformers==4.30.1.
Hi, I'm using transformers==4.31.0.
When I run
from transformers import AutoTokenizer

name = 'upstage/llama-65b-instruct'
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False, force_download=True)
I get the following error:
TypeError Traceback (most recent call last)
Cell In[5], line 2
1 name = 'upstage/llama-65b-instruct'
----> 2 tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False, force_download=True)
3 # model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True, low_cpu_mem_usage=True, torch_dtype=torch.float16, device_map='auto')
File ~/miniconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:702, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
698 if tokenizer_class is None:
699 raise ValueError(
700 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
701 )
--> 702 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
704 # Otherwise we have to be creative.
705 # if model is an encoder decoder, the encoder tokenizer class is used by default
706 if isinstance(config, EncoderDecoderConfig):
File ~/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1841, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
1838 else:
1839 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1841 return cls._from_pretrained(
1842 resolved_vocab_files,
1843 pretrained_model_name_or_path,
1844 init_configuration,
1845 *init_inputs,
1846 use_auth_token=token,
1847 cache_dir=cache_dir,
1848 local_files_only=local_files_only,
1849 _commit_hash=commit_hash,
1850 _is_local=is_local,
1851 **kwargs,
1852 )
File ~/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2004, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
2002 # Instantiate tokenizer.
2003 try:
-> 2004 tokenizer = cls(*init_inputs, **init_kwargs)
2005 except OSError:
2006 raise OSError(
2007 "Unable to load vocabulary from file. "
2008 "Please check that the provided vocabulary is accessible and not corrupted."
2009 )
File ~/miniconda3/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py:144, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, legacy, **kwargs)
142 self.add_eos_token = add_eos_token
143 self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
--> 144 self.sp_model.Load(vocab_file)
File ~/miniconda3/lib/python3.10/site-packages/sentencepiece/__init__.py:905, in SentencePieceProcessor.Load(self, model_file, model_proto)
903 if model_proto:
904 return self.LoadFromSerializedProto(model_proto)
--> 905 return self.LoadFromFile(model_file)
File ~/miniconda3/lib/python3.10/site-packages/sentencepiece/__init__.py:310, in SentencePieceProcessor.LoadFromFile(self, arg)
309 def LoadFromFile(self, arg):
--> 310 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
In fact, you can find tokenizer.model in the "Files and versions" tab of upstage/llama-30b-instruct, but you can't see it here. So I think the error is due to the missing tokenizer.model.
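For reference, here is my guess at why a missing file surfaces as the opaque "TypeError: not a string": when a tokenizer file listed for the model is absent from the repo, transformers resolves its path to None and passes that straight to SentencePiece, whose loader expects a string path. A minimal sketch of this failure mode, using a stand-in function rather than the real sentencepiece API:

```python
def load_sentencepiece_model(vocab_file):
    # Stand-in for sentencepiece's SentencePieceProcessor.Load, which
    # rejects non-string arguments with this same error message.
    if not isinstance(vocab_file, str):
        raise TypeError("not a string")
    return f"loaded {vocab_file}"

# When tokenizer.model is missing from the repo, the resolved path is None.
resolved_vocab_files = {"vocab_file": None}

try:
    load_sentencepiece_model(resolved_vocab_files["vocab_file"])
except TypeError as e:
    print(e)  # → not a string
```

So the TypeError is a symptom of the missing file, not a problem in the caller's own code.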
@YuxinXiao Thanks a lot.
With use_fast=False, the slow tokenizer does seem to require tokenizer.model, so we uploaded it.
We checked that it now loads without any problem.
Could you give it one more try?
Thanks for uploading it! It works fine now.