BLIP Image Captioning

Model Description

BLIP_image_captioning is a model based on the BLIP (Bootstrapping Language-Image Pre-training) architecture, specifically designed for image captioning tasks. The model has been fine-tuned on the "image-in-words400" dataset, which consists of images and their corresponding descriptive captions. This model leverages both visual and textual data to generate accurate and contextually relevant captions for images.

Model Details

  • Model Architecture: BLIP (Bootstrapping Language-Image Pre-training)
  • Base Model: Salesforce/blip-image-captioning-base
  • Fine-tuning Dataset: mouwiya/image-in-words400
  • Number of Parameters: approximately 247 million (F32 weights; verifiable as shown below)
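
The parameter count can be checked directly from the loaded checkpoint. A minimal sketch, assuming only the transformers library and access to the Hugging Face Hub:

from transformers import BlipForConditionalGeneration

# Load the fine-tuned checkpoint and count its parameters
model = BlipForConditionalGeneration.from_pretrained("Mouwiya/BLIP_image_captioning")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")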

Training Data

The model was fine-tuned on a shuffled subset of the "image-in-words400" dataset; a total of 400 examples were used to allow for faster iteration and development, as sketched below.
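
A minimal sketch of how such a subset can be prepared with the datasets library; the split name ("train") and random seed (42) are illustrative assumptions, not the exact values used during fine-tuning:

from datasets import load_dataset

# Load the captioning dataset and take a shuffled subset of 400 examples
# (split name and seed are assumptions for illustration)
dataset = load_dataset("mouwiya/image-in-words400", split="train")
subset = dataset.shuffle(seed=42).select(range(400))
print(subset)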

Training Procedure

  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 16
  • Epochs: 3
  • Evaluation Metric: BLEU Score
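
The sketch below illustrates a plain PyTorch fine-tuning loop using the hyperparameters above. It is not the exact training script: the train_dataset variable (yielding PIL image / caption pairs) and the collate function are assumptions for illustration.

import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def collate(batch):
    # batch: list of (PIL image, caption string) pairs -- assumed structure
    images, captions = zip(*batch)
    return processor(images=list(images), text=list(captions),
                     padding=True, return_tensors="pt")

loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch in loader:
        # For captioning, the tokenized captions also serve as the labels
        outputs = model(pixel_values=batch["pixel_values"],
                        input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()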

Usage

To use this model for image captioning, you can load it using the Hugging Face transformers library and perform inference as shown below:

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load the processor and model
model_name = "Mouwiya/BLIP_image_captioning"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Example usage
image_url = "URL_OF_THE_IMAGE"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
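
Generation settings such as beam search and a length cap can be tuned for caption quality; the values below are illustrative, not the settings used during evaluation.

# Optional: beam search with a modest length cap (illustrative values)
outputs = model.generate(**inputs, num_beams=4, max_length=40)
print(processor.decode(outputs[0], skip_special_tokens=True))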

Evaluation

The model was evaluated on a subset of the "image-in-words400" dataset using the BLEU score. The evaluation results are as follows:

  • Average BLEU Score: 0.35

This score indicates the model's ability to generate captions that closely match the reference descriptions in terms of overlapping n-grams.
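
A sketch of how a BLEU score can be computed with the evaluate library; the predictions and references lists (generated and ground-truth captions) are placeholders for the actual evaluation data.

import evaluate

bleu = evaluate.load("bleu")
# predictions: list of generated captions; references: list of reference captions (assumed available)
results = bleu.compute(predictions=predictions,
                       references=[[ref] for ref in references])
print(f"BLEU: {results['bleu']:.2f}")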

Limitations

  • Dataset Size: The model was fine-tuned on a relatively small subset of the dataset, which may limit its generalization capabilities.
  • Domain Specificity: The model was trained on a single dataset and may not perform as well on images from other domains.

Contact

Mouwiya S. A. Al-Qaisieh mo3awiya@gmail.com
