Is it possible to only input text in LLaVa model?
Hi,
Currently I can successfully do image question answering with the LLaVa model using the following code:
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf", device_map="auto")

def llava_describe(image):
    question = "<image> Describe this image in as much detail as possible."
    inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=200)
    answer = processor.decode(generated_ids[0][2:], skip_special_tokens=True)
    return answer
I also want to input text only into the model. However, my code doesn't work:
def llava_describe(image):
    question = "..."
    inputs = processor(images=None, text=question, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=200)
    answer = processor.decode(generated_ids[0][2:], skip_special_tokens=True)
    return answer
I keep getting this error:
Traceback (most recent call last):
  File "/workspace/llava/model.py", line 138, in <module>
    generated_text = llava_describe(image)
  File "/workspace/llava/model.py", line 48, in llava_describe
    generated_ids = model.generate(**inputs, max_new_tokens=200)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 3206, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 487, in forward
    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 303, in _merge_input_ids_with_image_features
    num_images, num_image_patches, embed_dim = image_features.shape
AttributeError: 'NoneType' object has no attribute 'shape'
Note that this task is important for me, and I really want LLaVa to support text-only input as well.
Thank you for your help!
Hey @Tizzzzy !
Currently Llava models do not support text-only input. I have been changing a lot of things in the llava models lately and will bring text-only inference back soon. It got removed accidentally, but it shouldn't have been.
Hello! I am facing the same problem. Did you find a way to solve it?
Currently, I modified modeling_llava.py at line 487 and successfully managed to input text only into the LLaVa model:
# prefill stage vs decoding stage (legacy behavior copied)
if input_ids.shape[1] != 1:
    if image_features is not None:  # add this check
        inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
            image_features, inputs_embeds, input_ids, attention_mask, labels
        )
@ZoeyYao27 this will be resolved in the next v4.47 release. And yes, if you can change the source code and install from source, the way to go is to add an extra indentation level so that image features are merged with the input ids only when pixel values are not None. The way you did it is also a good workaround until we make the release.