
MonoQwen2-VL-v0.1

Model Overview

MonoQwen2-VL-v0.1 is a multimodal reranker fine-tuned with LoRA from Qwen2-VL-2B and optimized for pointwise image-query relevance assessment using the MonoT5 objective: given an image and a query in the VLM prompt, the model is asked to generate "True" if the image is relevant to the query and "False" otherwise. At inference time, a relevance score is obtained by comparing the logits of these two tokens; this score can be used to rerank the candidates returned by a first-stage retriever (such as DSE or ColPali) or to filter them with a threshold.
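
As a minimal sketch of this scoring rule (illustrative only; the file names and the 0.5 threshold below are arbitrary assumptions, not part of the model card), the score is simply the softmax probability of the "True" token over the pair of "True"/"False" logits:

import torch

def relevance_score(true_logit: float, false_logit: float) -> float:
    # Probability of "True" after renormalizing over the two answer tokens
    return torch.softmax(torch.tensor([true_logit, false_logit]), dim=-1)[0].item()

# Rerank first-stage candidates by score, or keep only those above a threshold
scores = {"page_a.png": relevance_score(3.2, -1.0), "page_b.png": relevance_score(-0.5, 2.1)}
reranked = sorted(scores, key=scores.get, reverse=True)
kept = [page for page, score in scores.items() if score >= 0.5]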

The model was trained on the ColPali training set, with negatives mined using DSE.

How to Use the Model

Below is a quick example showing how to score a single image against a user query with this model:

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load processor and model
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
    # attn_implementation="flash_attention_2",
    # torch_dtype=torch.bfloat16,
)

# Define query and load image
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Construct the prompt and prepare input
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply chat template and tokenize
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run inference to obtain logits
with torch.no_grad():
    outputs = model(**inputs)
    logits_for_last_token = outputs.logits[:, -1, :]

# Convert tokens and calculate relevance score
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

# Extract and display probabilities
true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()

print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")

This example demonstrates how to use the model to assess the relevance of an image with respect to a query. It outputs the probability that the image is relevant ("True") or not relevant ("False").

Note: this example requires peft to be installed in your environment (pip install peft). If you don't want to use peft, you can use model.load_adapter on the original Qwen2-VL-2B model.
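
Building on the example above, a typical usage is to rerank a set of candidate pages returned by a first-stage retriever. The sketch below is an illustration under stated assumptions: the score_image helper and the candidate_paths list are hypothetical, and it runs one forward pass per candidate rather than batching.

def score_image(image, query):
    # Return P("True"), i.e. the probability that the image is relevant to the query
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        "answer True or False. The query is: {query}"
    ).format(query=query)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits_for_last_token = model(**inputs).logits[:, -1, :]
    true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
    probs = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)
    return probs[0, 0].item()

# Hypothetical candidates from a first-stage retriever (e.g. DSE or ColPali)
candidate_paths = ["page_1.png", "page_2.png", "page_3.png"]
scored = [(path, score_image(Image.open(path), query)) for path in candidate_paths]
reranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
print(reranked)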

Performance Metrics

The model has been evaluated on the ViDoRe benchmark by retrieving the top 10 candidates with MrLight_dse-qwen2-2b-mrl-v1 and reranking them with MonoQwen2-VL-v0.1. The table below summarizes the resulting NDCG@5 scores:

| Dataset | MrLight_dse-qwen2-2b-mrl-v1 | MonoQwen2-VL-v0.1 reranking |
|---|---|---|
| vidore/arxivqa_test_subsampled | 85.6 | 89.0 |
| vidore/docvqa_test_subsampled | 57.1 | 59.7 |
| vidore/infovqa_test_subsampled | 88.1 | 93.2 |
| vidore/tabfquad_test_subsampled | 93.1 | 96.0 |
| vidore/shiftproject_test | 82.0 | 93.0 |
| vidore/syntheticDocQA_artificial_intelligence_test | 97.5 | 100.0 |
| vidore/syntheticDocQA_energy_test | 92.9 | 97.7 |
| vidore/syntheticDocQA_government_reports_test | 96.0 | 98.0 |
| vidore/syntheticDocQA_healthcare_industry_test | 96.4 | 99.3 |
| vidore/tatdqa_test | 69.4 | 79.0 |
| Mean | 85.8 | 90.5 |

License

This LoRA model is licensed under the Apache 2.0 license.
