Is it possible to input only text to the LLaVa model?

#38
by Tizzzzy - opened

Hi,
Currently I can successfully do image question answering with the LLaVa model using the following code:

from transformers import AutoModelForImageTextToText, AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf", device_map="auto")

def llava_describe(image):
    question = "<image> Describe this image in as much detail as possible."
    # Tokenize the prompt and preprocess the image together
    inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=200)
    answer = processor.decode(generated_ids[0][2:], skip_special_tokens=True)
    return answer

I also want to pass text-only input to the model. However, this code doesn't work:

def llava_describe(image):
    question = "..."
    inputs = processor(images=None, text=question, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=200)
    answer = processor.decode(generated_ids[0][2:], skip_special_tokens=True)
    return answer

I keep getting this error:

Traceback (most recent call last):
  File "/workspace/llava/model.py", line 138, in <module>
    generated_text = llava_describe(image)
  File "/workspace/llava/model.py", line 48, in llava_describe
    generated_ids = model.generate(**inputs, max_new_tokens=200)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 3206, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 487, in forward
    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 303, in _merge_input_ids_with_image_features
    num_images, num_image_patches, embed_dim = image_features.shape
AttributeError: 'NoneType' object has no attribute 'shape'
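
The last frame of the traceback can be reproduced in isolation. The sketch below is not the transformers code: `FakeFeatures` and `unpack_shape` are hypothetical stand-ins, and the shape values are arbitrary. It only illustrates that unpacking `.shape` on `None` raises the same kind of `AttributeError` when no image features were computed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeFeatures:
    """Hypothetical stand-in for a tensor; it only carries a shape."""
    shape: Tuple[int, int, int]


def unpack_shape(image_features: Optional[FakeFeatures]) -> Tuple[int, int, int]:
    # Same pattern as the failing line: .shape is read without a None check,
    # so a text-only call (image_features is None) raises AttributeError.
    num_images, num_image_patches, embed_dim = image_features.shape
    return num_images, num_image_patches, embed_dim
```

With an image the call returns the three dimensions; with `None` it fails before any merging happens, which is why the error appears at the first forward pass inside `generate`.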

This task is important to me, and I would really like LLaVa to support text-only input as well.
Thank you for your help!

Llava Hugging Face org

Hey @Tizzzzy !

Currently, Llava models do not support text-only input. I have been changing a lot of things in the Llava models lately and will bring back text-only inference soon. It was removed accidentally and shouldn't have been.

Hello! I am facing the same problem. Did you find a way to solve it?

For now, I modified modeling_llava.py around line 487 and managed to pass text-only input to the LLaVa model:

            # prefill stage vs decoding stage (legacy behavior copied)
            if input_ids.shape[1] != 1:
                if image_features is not None:  # added: skip the merge for text-only input
                    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                        image_features, inputs_embeds, input_ids, attention_mask, labels
                    )
Llava Hugging Face org

@ZoeyYao27 This will be resolved in the next release, v4.47. And yes, if you can change the source code and install from source, the way to go is to add an extra level of indentation so that image features are merged with the input ids only when pixel values are not None. Your approach is also a good workaround until we make the release.
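
The guard described above can be sketched without the model. This is a toy illustration under stated assumptions, not the transformers implementation: `merge_image_features` and `prefill_embeddings` are hypothetical names, and plain Python lists stand in for embedding tensors.

```python
from typing import List, Optional, Sequence

IMAGE_TOKEN = "<image>"  # placeholder token, as used in the prompt above


def merge_image_features(
    tokens: Sequence[str],
    token_embeds: Sequence[List[float]],
    image_features: Sequence[List[float]],
) -> List[List[float]]:
    """Hypothetical merge: swap each <image> placeholder for the image feature rows."""
    merged: List[List[float]] = []
    for tok, emb in zip(tokens, token_embeds):
        if tok == IMAGE_TOKEN:
            merged.extend(image_features)
        else:
            merged.append(emb)
    return merged


def prefill_embeddings(
    tokens: Sequence[str],
    token_embeds: Sequence[List[float]],
    image_features: Optional[Sequence[List[float]]] = None,
) -> List[List[float]]:
    # Guard: merge only when image features exist, so text-only
    # input passes through with its embeddings unchanged.
    if image_features is not None:
        return merge_image_features(tokens, token_embeds, image_features)
    return list(token_embeds)
```

Calling `prefill_embeddings(["a", "b"], [[1.0], [2.0]])` returns the text embeddings unchanged, while passing `image_features` expands the `<image>` position into the image rows.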
