Add support for AutoModelForCausalLM or LlavaForConditionalGeneration loading
Could you add support for this?
Right now the config.json file in these repos is just the Llama 3 config.
When trying to load with LlavaForConditionalGeneration.from_pretrained(), I get:
Some weights of LlavaForConditionalGeneration were not initialized from the model checkpoint at xtuner/llava-llama-3-8b-v1_1 and are newly initialized: ['model.language_model.lm_head.weight', 'model.language_model.model.embed_tokens.weight', 'model.language_model.model.layers.0.input_layernor
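For reference, the load call is roughly this sketch (fp16 and the base repo id are just what I happened to use; exact arguments may differ):

import torch
from transformers import LlavaForConditionalGeneration

# Loading the original repo directly as a LLaVA model. Since config.json only
# carries the Llama 3 config, many checkpoint keys don't match the LLaVA
# layout, so the corresponding weights end up newly initialized (the warning
# quoted above).
model = LlavaForConditionalGeneration.from_pretrained(
    "xtuner/llava-llama-3-8b-v1_1",
    torch_dtype=torch.float16,
)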
When trying to run evaluation, even just calling the processor with:
inputs = processor(prompt, raw_image, return_tensors='pt').to(DEVICE, DTYPE)
I get:
ValueError Traceback (most recent call last)
Cell In[64], line 15
13 raw_image = sample['image']
---> 15 inputs = processor(prompt, raw_image, return_tensors='pt').to(DEVICE, DTYPE)
16 output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
17 print('model: ', processor.decode(output[0][2:], skip_special_tokens=True))
File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2858, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2856 if not self._in_target_context_manager:
2857 self._switch_to_input_mode()
-> 2858 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
2859 if text_target is not None:
2860 self._switch_to_target_mode()
File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2922, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2916 raise ValueError(
2917 "text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) "
2918 "or `List[List[str]]` (batch of pretokenized examples)."
2919 )
2921 if text_pair is not None and not _is_valid_text_input(text_pair):
-> 2922 raise ValueError(
2923 "text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) "
2924 "or `List[List[str]]` (batch of pretokenized examples)."
2925 )
2927 if is_split_into_words:
2928 is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
ValueError: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
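For what it's worth, the traceback above goes straight into PreTrainedTokenizerBase.__call__, so the PIL image appears to be passed as text_pair to a plain tokenizer rather than handled by a LLaVA processor; that would be consistent with the repo missing a LLaVA-style processor config. A quick (hypothetical) check:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("xtuner/llava-llama-3-8b-v1_1")

# If this prints a tokenizer class rather than LlavaProcessor, the repo lacks
# the processor config needed to handle image inputs, which would match the
# ValueError above (the image gets rejected as invalid text input).
print(type(processor))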
@RonanMcGovern Thank you very much for your feedback. We will strive to provide support as soon as possible.
Hi,
Can you please help me out with how to load and run prediction with the LLaVA-Llama-3 vision model?
We have released weights in a format similar to the LLaVA v1.5/v1.6 architecture here. You can try this model with your existing workflow!
ok, great, I'll try loading that with a LLaVA flow.
@darkshadow
@RonanMcGovern
The LlavaForConditionalGeneration model is here!
https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers
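For anyone else landing here, a minimal sketch of loading that repo with the standard LLaVA flow (the prompt template and the image path below are my assumptions; the model card is the authoritative source for the exact chat format):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Assumed Llama 3 chat template with an <image> placeholder; check the model
# card for the exact expected prompt format.
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "<image>\nWhat is shown in this image?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
raw_image = Image.open("your_image.jpg")  # placeholder: any local test image

# Keyword arguments make explicit which input is the text and which is the image.
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))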
Ok, that's great, thanks!