The model always responds with the same answer for different images
The response does not reflect the image I provided. It describes a person standing in front of a hotel, which is not what the image shows, so the model does not seem to be interpreting the image URL I pass in.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="cjpais/llava-1.6-mistral-7b-gguf",
    filename="llava-v1.6-mistral-7b.Q3_K.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cloudinary-marketing-res.cloudinary.com/images/w_1000,c_scale/v1679921049/Image_URL_header/Image_URL_header-png?_i=AA"
                        # "url":r"C:\Users\we\Pictures\dog.PNG"
                    }
                }
            ]
        }
    ]
)
print(response)
{'id': 'chatcmpl-65577668-465e-429d-a2f8-6468a2827afe', 'object': 'chat.completion', 'created': 1735078321, 'model': 'C:\Users\we\.cache\huggingface\hub\models--cjpais--llava-1.6-mistral-7b-gguf\snapshots\6019df415777605a8364e2668aa08b7e354bf0ba\.\llava-v1.6-mistral-7b.Q3_K.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' The image shows a person wearing a white shirt and black pants, standing in front of a building with a sign that reads "HOTEL". The person appears to be looking at the camera, and the overall setting suggests that this might be a hotel entrance or lobby. The lighting in the image is bright, and the person is the main focus of the photograph. '}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 74, 'total_tokens': 80}}
It only supports base64-encoded images, IIRC. It's been a long time since I worked with this particular model.
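In case it helps, here is a minimal sketch of the base64 route. `image_to_data_uri`, `build_messages`, and `load_llava` are hypothetical helper names, not part of llama-cpp-python. One more thing worth checking: `prompt_tokens` is only 6 in the output above, which suggests the image never reached the model at all. As far as I know, LLaVA models in llama-cpp-python need a vision chat handler plus the CLIP projector (mmproj) file, so `load_llava` assumes `Llava16ChatHandler` and a file matching `*mmproj*` in the same repo. Treat that part as an untested assumption.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Read a local image file and return it as a base64 data URI."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None:
        mime = "application/octet-stream"  # fallback when the extension is unknown
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def build_messages(data_uri, prompt="Describe this image in one sentence."):
    """Build the same message structure as in the question, with a data URI as the image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }
    ]


def load_llava():
    """Load the model with a vision chat handler (assumed setup, not verified here)."""
    # Imported lazily so the helpers above work without llama-cpp-python installed.
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import Llava16ChatHandler

    # Assumption: the repo ships a CLIP projector file whose name matches "*mmproj*".
    chat_handler = Llava16ChatHandler.from_pretrained(
        repo_id="cjpais/llava-1.6-mistral-7b-gguf",
        filename="*mmproj*",
    )
    return Llama.from_pretrained(
        repo_id="cjpais/llava-1.6-mistral-7b-gguf",
        filename="llava-v1.6-mistral-7b.Q3_K.gguf",
        chat_handler=chat_handler,
        n_ctx=4096,  # assumption: image embeddings need more room than the default context
    )
```

Usage would look roughly like `llm = load_llava()` followed by `llm.create_chat_completion(messages=build_messages(image_to_data_uri(r"C:\Users\we\Pictures\dog.PNG")))`. If `prompt_tokens` still comes back as 6, the image is probably still being ignored.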