Florence-2 PixelProse LoRA Adapter

This repository contains a LoRA adapter for the microsoft/Florence-2-base-ft model, trained on the tomg-group-umd/pixelprose dataset. It is designed to improve the model's captioning, producing more detailed and descriptive image captions.

Usage

To use this LoRA adapter, load it on top of the Florence-2-base-ft model with the PEFT library. For example:

from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
from peft import PeftModel
import requests

def caption(image):
    # Load the base Florence-2 model and its processor (both use custom code from the Hub).
    base_model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
    prompt = "<MORE_DETAILED_CAPTION>"
    adapter_name = "NikshepShetty/Florence-2-pixelprose"
    # Attach the LoRA adapter weights to the base model.
    model = PeftModel.from_pretrained(base_model, adapter_name, trust_remote_code=True)
    # Tokenize the task prompt and preprocess the image.
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    # Generate with deterministic beam search (no sampling).
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    # Special tokens are kept so that post-processing can parse the raw output.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Convert the raw text into the final result for this task.
    parsed_answer = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))

    print(parsed_answer)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # ensure a 3-channel image
caption(image)

This code demonstrates how to:

  1. Load the base Florence-2 model
  2. Load the LoRA adapter
  3. Process an image and generate a detailed caption
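The example above reloads the model on every call to caption. When captioning several images, it is usually cheaper to load the model and adapter once and reuse them. The following is a minimal sketch of that idea, not part of the original card: the helper names load_captioner, extract_caption, and caption_many are hypothetical, and it assumes that post_process_generation returns a dictionary keyed by the task tag, as in the Florence-2 examples.

```python
TASK = "<MORE_DETAILED_CAPTION>"
BASE_ID = "microsoft/Florence-2-base-ft"
ADAPTER_ID = "NikshepShetty/Florence-2-pixelprose"


def load_captioner(base_id=BASE_ID, adapter_id=ADAPTER_ID):
    """Load the base model, processor, and LoRA adapter once (hypothetical helper)."""
    # Heavy imports are kept local so the pure helpers below work without them.
    from transformers import AutoProcessor, AutoModelForCausalLM
    from peft import PeftModel

    base_model = AutoModelForCausalLM.from_pretrained(base_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
    model = PeftModel.from_pretrained(base_model, adapter_id)
    return model, processor


def extract_caption(parsed_answer, task=TASK):
    """Pull the caption string out of the parsed result.

    Assumption: the parsed result is a dict of the form {task: caption_text}.
    """
    return parsed_answer[task].strip()


def caption_many(images, model, processor, task=TASK):
    """Caption a list of PIL images with an already loaded model and processor."""
    captions = []
    for image in images:
        image = image.convert("RGB")
        inputs = processor(text=task, images=image, return_tensors="pt")
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            do_sample=False,
            num_beams=3,
        )
        text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        parsed = processor.post_process_generation(
            text, task=task, image_size=(image.width, image.height)
        )
        captions.append(extract_caption(parsed, task))
    return captions


if __name__ == "__main__":
    import requests
    from PIL import Image

    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
    image = Image.open(requests.get(url, stream=True).raw)
    model, processor = load_captioner()
    print(caption_many([image], model, processor))
```

The generation settings are copied from the example above; only the loading step has moved out of the per-image path.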

Note: Make sure you have the required libraries installed: transformers, peft, einops, flash_attn, timm, Pillow, and requests.
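A quick way to check the environment before running the example is to look up each library's import name. The sketch below uses only the standard library; the helper name missing_modules is hypothetical, and the only name that differs from the list in the note is Pillow, which is imported as PIL.

```python
import importlib.util

# Import names for the libraries listed in the note (Pillow is imported as "PIL").
REQUIRED_MODULES = ["transformers", "peft", "einops", "flash_attn", "timm", "PIL", "requests"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only confirms that each module can be located; it does not import the modules or check their versions.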

Evaluation results

On 1,000 images from the foundation-multimodal-models/DetailCaps-4870 dataset, our LoRA adapter improves on the base Florence-2 model across all five metrics when prompted with the <MORE_DETAILED_CAPTION> task tag:

Metric    Base Model  Adapted Model  Improvement
CAPTURE   0.546       0.555          +1.6%
METEOR    0.213       0.250          +17.4%
BLEU      0.110       0.155          +40.9%
CIDEr     0.031       0.039          +25.8%
ROUGE-L   0.275       0.298          +8.4%
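The Improvement column appears to be the relative change from the base score to the adapted score. The short sketch below recomputes it from the scores in the table; the function names are illustrative only.

```python
# Scores copied from the table above: metric -> (base model, adapted model).
SCORES = {
    "CAPTURE": (0.546, 0.555),
    "METEOR": (0.213, 0.250),
    "BLEU": (0.110, 0.155),
    "CIDEr": (0.031, 0.039),
    "ROUGE-L": (0.275, 0.298),
}


def relative_improvement(base, adapted):
    """Relative change from base to adapted, in percent."""
    return (adapted - base) / base * 100.0


def improvement_table(scores):
    """Map each metric to its relative improvement, rounded to one decimal place."""
    return {
        name: round(relative_improvement(base, adapted), 1)
        for name, (base, adapted) in scores.items()
    }


if __name__ == "__main__":
    for name, pct in improvement_table(SCORES).items():
        print(f"{name}: {pct:+.1f}%")
```

For example, BLEU goes from 0.110 to 0.155, and (0.155 - 0.110) / 0.110 is about 40.9%.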

These results indicate that our LoRA adapter improves the detailed captioning of the Florence-2 base model on every metric measured. The largest relative gains are in BLEU (+40.9%), CIDEr (+25.8%), and METEOR (+17.4%); CAPTURE improves by a smaller +1.6%.
