---
library_name: transformers
license: mit
language:
- th
pipeline_tag: image-to-text
datasets:
- MagiBoss/COCO-Image-Captioning
base_model:
- Salesforce/blip2-opt-2.7b-coco
- scb10x/llama-3-typhoon-v1.5-8b
---

# Blip2-Typhoon1.5-COCO

## Model Description

Blip2-Typhoon1.5-COCO is an image captioning model designed to generate descriptive captions for images. It combines the strengths of the BLIP-2 and Typhoon architectures to produce high-quality, contextually accurate descriptions. The base models are:

- **Encoder**: [Salesforce/blip2-opt-2.7b-coco](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco)
- **Decoder**: [scb10x/llama-3-typhoon-v1.5-8b](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b)

The BLIP-2 encoder extracts visual features from images, while the Typhoon decoder generates natural language descriptions from those features.

## Training Data

The model was trained on the COCO 2017 dataset, a widely used benchmark for image captioning. The dataset pairs a diverse set of images with multiple human-written captions per image, enabling the model to learn rich and varied descriptions.

## Training Details

- **Datasets**: COCO 2017
- **Encoder**: Salesforce/blip2-opt-2.7b-coco
- **Decoder**: scb10x/llama-3-typhoon-v1.5-8b
- **Training Framework**: [Hugging Face Transformers](https://huggingface.co/transformers/)
- **Hardware**: High-performance GPUs

## Usage

Blip2-Typhoon1.5-COCO can generate captions for a wide variety of images. The following example captions a single image (a batched variant is sketched at the end of this card):

```python
from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the model
processor = Blip2Processor.from_pretrained("MagiBoss/Blip2-Typhoon1.5-COCO")
model = Blip2ForConditionalGeneration.from_pretrained(
    "MagiBoss/Blip2-Typhoon1.5-COCO", torch_dtype=torch.bfloat16
)

# Move the model to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare an image
image = Image.open("Your image...").convert("RGB")

# Generate a caption
inputs = processor(images=image, return_tensors="pt", padding=True).to(device, torch.bfloat16)
outputs = model.generate(**inputs, max_length=55, pad_token_id=processor.tokenizer.pad_token_id)
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print("Generated Caption:", caption)
```

## Performance

Blip2-Typhoon1.5-COCO achieves state-of-the-art performance on the COCO 2017 dataset, producing captions that are both accurate and descriptive.

## Limitations and Future Work

While the model performs well on a wide range of images, its understanding and generation capabilities are limited for abstract concepts and highly specialized domains. Future work may include fine-tuning on more diverse datasets or integrating additional contextual information to improve caption generation.

## Acknowledgements

This model builds on the work of the Salesforce and Typhoon teams. The COCO dataset was instrumental in training it.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{Blip2-Typhoon1.5-COCO,
  author    = {MagiBoss},
  title     = {Blip2-Typhoon1.5-COCO},
  year      = {2024},
  publisher = {Hugging Face},
  note      = {https://huggingface.co/MagiBoss/Blip2-Typhoon1.5-COCO}
}
```
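
## Batched Captioning (Sketch)

The usage snippet above captions one image at a time. Since the processor accepts a list of images, several images can be captioned per `generate` call. The helper below is a minimal sketch assuming the checkpoint behaves like the single-image example above; `caption_batch`, the folder path, and the beam-search setting are illustrative choices, not part of the released model.

```python
from pathlib import Path

from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("MagiBoss/Blip2-Typhoon1.5-COCO")
model = Blip2ForConditionalGeneration.from_pretrained(
    "MagiBoss/Blip2-Typhoon1.5-COCO", torch_dtype=torch.bfloat16
).to(device)


def caption_batch(image_paths, batch_size=8, max_length=55):
    """Return one caption per path, processing the images in batches."""
    captions = []
    for start in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in image_paths[start:start + batch_size]]
        # The processor stacks the batch into pixel values ready for generate()
        inputs = processor(images=images, return_tensors="pt", padding=True).to(device, torch.bfloat16)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=5,  # beam search; an illustrative setting, not tuned for this checkpoint
            pad_token_id=processor.tokenizer.pad_token_id,
        )
        captions.extend(c.strip() for c in processor.batch_decode(outputs, skip_special_tokens=True))
    return captions


# Example: caption every .jpg file in a local folder (path is illustrative)
image_files = sorted(str(p) for p in Path("images").glob("*.jpg"))
for path, caption in zip(image_files, caption_batch(image_files)):
    print(path, "->", caption)
```

Beam search typically trades speed for slightly more fluent captions; greedy decoding, as in the single-image example, is fine for quick checks.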