Error raised when `use_cache = True`
transformers version: 4.33.2
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", trust_remote_code=True, torch_dtype="auto", use_cache=True)
raises the following error:
File /usr/local/lib/python3.9/dist-packages/transformers/models/auto/auto_factory.py:558, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
556 else:
557 cls.register(config.__class__, model_class, exist_ok=True)
--> 558 return model_class.from_pretrained(
559 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
560 )
561 elif type(config) in cls._model_mapping.keys():
562 model_class = _get_model_class(config, cls._model_mapping)
File /usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:2966, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
2963 init_contexts.append(init_empty_weights())
2965 with ContextManagers(init_contexts):
-> 2966 model = cls(config, *model_args, **model_kwargs)
2968 # Check first if we are `from_pt`
2969 if use_keep_in_fp32_modules:
TypeError: __init__() got an unexpected keyword argument 'use_cache'
Hey @wjfwzzc, thanks for opening this issue!
It seems there is an issue with the propagation of unused kwargs when using remote code, cc @ArthurZ.
To do what you're trying to do, you could define a `GenerationConfig` locally with `use_cache` set to `True`:
from transformers import GenerationConfig
generation_config = GenerationConfig(use_cache=True)
You can then pass this to the `generate` method:
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
>>> inputs = tokenizer('''```python
... def print_prime(n):
...     """
...     Print all primes between 1 and n
...     """''', return_tensors="pt", return_attention_mask=False)
>>> model.generate(**inputs, max_length=200, generation_config=generation_config)
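If you want to see the generated text rather than raw token ids, you can decode the output. A small follow-up sketch (the variable name `output_ids` is just for illustration):

>>> output_ids = model.generate(**inputs, max_length=200, generation_config=generation_config)
>>> print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])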
Please let me know if that works for you!
Hi @lysandre, thanks for your help, it works for me!
Nevertheless, I'm still confused about the `attention_mask`. It seems that `return_attention_mask=True` will raise:
ValueError: The following `model_kwargs` are not used by the model: ['attention_mask'] (note: typos in the generate arguments will also show up in this list)
But how can I do batched inference with padding without an attention mask?
Hey @wjfwzzc, Phi is being contributed to transformers in this PR: https://github.com/huggingface/transformers/pull/26170
This should enable leveraging the attention mask to perform batch inference.
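Once that PR is merged, batched generation with padding should look roughly like the sketch below. This is an assumption-laden example, not the final API: the model id is kept as "microsoft/phi-1_5" for illustration, and it assumes the checkpoint loads natively (without `trust_remote_code`) after the integration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"  # assumed to be supported natively after the PR
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Decoder-only models generally need left padding for batched generation,
# and a pad token must be defined (reusing the EOS token here).
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = ["def print_prime(n):", "def fibonacci(n):"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)  # includes attention_mask

with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=100, use_cache=True)

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))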