Blip2-Typhoon1.5-COCO
Model Description
Blip2-Typhoon1.5-COCO is an image captioning model that generates descriptive captions for images. It combines a BLIP2 encoder with a Typhoon decoder. The base models are:
- Encoder: Salesforce/blip2-opt-2.7b-coco
- Decoder: scb10x/llama-3-typhoon-v1.5-8b
The BLIP2 encoder extracts visual features from images, while the Typhoon decoder generates natural language descriptions based on these features.
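To see how the parameters are split between the visual side and the language side, one option is to count the parameters in each top-level submodule of the loaded model. The sketch below is illustrative only: `parameter_counts` is a hypothetical helper name, it relies only on the standard `torch.nn.Module` methods `named_children()` and `parameters()`, and the submodule names it prints depend on the installed Transformers version.

```python
from typing import Dict


def parameter_counts(model) -> Dict[str, int]:
    """Count parameters in each top-level submodule of a torch.nn.Module.

    Parameters registered directly on the model itself (rather than on a
    child module) are not included in the result.
    """
    return {
        name: sum(p.numel() for p in child.parameters())
        for name, child in model.named_children()
    }


# Usage, assuming `model` was loaded as shown in the Usage section:
# for name, count in parameter_counts(model).items():
#     print(f"{name}: {count / 1e6:.1f}M parameters")
```

The largest entry is expected to be the language model, since the decoder is an 8B-parameter model.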
Training Data
This model was trained on the COCO 2017 dataset, a widely used benchmark for image captioning. The dataset pairs a diverse set of images with multiple human-written captions per image, which exposes the model to varied ways of describing the same scene.
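As an illustration of the "multiple captions per image" structure, the sketch below groups reference captions by image. It assumes the standard COCO captions annotation layout (an `images` list and an `annotations` list linked by `image_id`). The helper names are hypothetical, and this is not necessarily how the training data for this model was prepared.

```python
import json
from collections import defaultdict
from typing import Dict, List


def group_captions_by_image(coco: dict) -> Dict[str, List[str]]:
    """Map each image file name to the list of its reference captions."""
    # Image records and caption records live in separate lists,
    # linked through the image id.
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann["caption"].strip())
    return dict(grouped)


def load_coco_captions(path: str) -> Dict[str, List[str]]:
    """Read a COCO captions JSON file; the path is supplied by the caller."""
    with open(path, "r", encoding="utf-8") as f:
        return group_captions_by_image(json.load(f))


# Tiny in-memory example that follows the same layout as the annotation file.
sample = {
    "images": [{"id": 1, "file_name": "000000000001.jpg"}],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "A dog runs on the beach. "},
        {"image_id": 1, "id": 11, "caption": "A brown dog running along the shore."},
    ],
}
print(group_captions_by_image(sample))
```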
Training Details
- Dataset: COCO 2017
- Encoder: Salesforce/blip2-opt-2.7b-coco
- Decoder: scb10x/llama-3-typhoon-v1.5-8b
- Training Framework: Hugging Face Transformers
- Hardware: High-performance GPUs for efficient training
Usage
The Blip2-Typhoon1.5-COCO model can be used to generate captions for a wide variety of images. Here's how to use the model:
from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# Use a GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the processor and the model, then move the model to the device
processor = Blip2Processor.from_pretrained("MagiBoss/Blip2-Typhoon1.5-COCO")
model = Blip2ForConditionalGeneration.from_pretrained("MagiBoss/Blip2-Typhoon1.5-COCO", torch_dtype=torch.bfloat16)
model.to(device)
# Prepare an image
image = Image.open("Your image...").convert("RGB")
# Generate a caption
inputs = processor(images=image, return_tensors="pt", padding=True).to(device, torch.bfloat16)
outputs = model.generate(**inputs, max_length=55, pad_token_id=processor.tokenizer.pad_token_id)
# batch_decode returns one string per image; take the first for a single image
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print("Generated Caption:", caption)
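To caption several images at once, the same processor and model calls can be wrapped in a small helper. This is a minimal sketch: `caption_batch` is a hypothetical name, it assumes `model` and `processor` have been loaded as in the snippet above, and it reuses the `max_length=55` setting from that snippet.

```python
from typing import List, Optional, Sequence


def caption_batch(
    model,
    processor,
    images: Sequence,
    device: str = "cpu",
    dtype: Optional[object] = None,
    max_length: int = 55,
) -> List[str]:
    """Caption a sequence of PIL images with a single generate() call."""
    inputs = processor(images=list(images), return_tensors="pt")
    # Move the inputs to the target device; if a dtype is given
    # (for example torch.bfloat16), pass it along as well.
    inputs = inputs.to(device, dtype) if dtype is not None else inputs.to(device)
    outputs = model.generate(**inputs, max_length=max_length)
    # One decoded string per input image, with surrounding whitespace removed.
    decoded = processor.batch_decode(outputs, skip_special_tokens=True)
    return [text.strip() for text in decoded]


# Usage, assuming `image_a` and `image_b` are RGB PIL images:
# captions = caption_batch(model, processor, [image_a, image_b],
#                          device=device, dtype=torch.bfloat16)
```

Batching trades memory for throughput, so the batch size that fits depends on the available GPU memory.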
Performance
The Blip2-Typhoon1.5-COCO model is reported to achieve state-of-the-art performance on the COCO 2017 dataset, producing captions that are both accurate and descriptive; benchmark scores are not listed in this card.
Limitations and Future Work
While the model performs well on a wide range of images, there are limitations to its understanding and generation capabilities, especially in cases involving abstract concepts or highly specialized knowledge. Future work may include fine-tuning the model on more diverse datasets or integrating additional contextual information to enhance caption generation.
Acknowledgements
This model builds on the work of the Salesforce and Typhoon teams. The COCO dataset was instrumental in training this model.
Citation
If you use this model in your research, please cite:
@misc{Blip2-Typhoon1.5-COCO,
author = {MagiBoss},
title = {Blip2-Typhoon1.5-COCO},
year = {2024},
publisher = {Hugging Face},
note = {https://huggingface.co/MagiBoss/Blip2-Typhoon1.5-COCO}
}