Error raised when `use_cache = True`
transformers version: 4.33.2
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", trust_remote_code=True, torch_dtype="auto", use_cache=True)
raises the following error:
File /usr/local/lib/python3.9/dist-packages/transformers/models/auto/auto_factory.py:558, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
556 else:
557 cls.register(config.__class__, model_class, exist_ok=True)
--> 558 return model_class.from_pretrained(
559 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
560 )
561 elif type(config) in cls._model_mapping.keys():
562 model_class = _get_model_class(config, cls._model_mapping)
File /usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:2966, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
2963 init_contexts.append(init_empty_weights())
2965 with ContextManagers(init_contexts):
-> 2966 model = cls(config, *model_args, **model_kwargs)
2968 # Check first if we are `from_pt`
2969 if use_keep_in_fp32_modules:
TypeError: __init__() got an unexpected keyword argument 'use_cache'
Hey @wjfwzzc, thanks for opening this issue!
It seems there is an issue with the propagation of unused kwargs when using remote code, cc @ArthurZ.
To do what you're trying to do, you could define a `GenerationConfig` locally with `use_cache` set to `True`:
from transformers import GenerationConfig
generation_config = GenerationConfig(use_cache=True)
You can then pass this to the `generate` method:
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
>>> inputs = tokenizer('''```python
... def print_prime(n):
...     """
...     Print all primes between 1 and n
...     """''', return_tensors="pt", return_attention_mask=False)
>>> model.generate(**inputs, max_length=200, generation_config=generation_config)
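If you want to see the generated text rather than raw token ids, you can decode the output. A small follow-up sketch (the variable name `output_ids` is just for illustration):

>>> output_ids = model.generate(**inputs, max_length=200, generation_config=generation_config)
>>> print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])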
Please let me know if that works for you!
Hi @lysandre, thanks for your help, it works for me!
Nevertheless, I'm still confused about the `attention_mask`. It seems that `return_attention_mask=True` will raise:
ValueError: The following `model_kwargs` are not used by the model: ['attention_mask'] (note: typos in the generate arguments will also show up in this list)
But how can I do batched inference with padding without an attention mask?
Hey @wjfwzzc, Phi is being contributed to transformers in this PR: https://github.com/huggingface/transformers/pull/26170
This should enable leveraging the attention mask to perform batch inference.
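Once that PR is merged, batched generation with padding should look roughly like the sketch below. This is an assumption-laden example, not the final API: the model id is kept as "microsoft/phi-1_5" for illustration, and it assumes the checkpoint loads natively (without `trust_remote_code`) after the integration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"  # assumed to be supported natively after the PR
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Decoder-only models generally need left padding for batched generation,
# and a pad token must be defined (reusing the EOS token here).
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = ["def print_prime(n):", "def fibonacci(n):"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)  # includes attention_mask

with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100, use_cache=True)

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))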