---
library_name: transformers
license: mit
language:
- th
pipeline_tag: image-to-text
datasets:
- MagiBoss/COCO-Image-Captioning
base_model:
- Salesforce/blip2-opt-2.7b-coco
- scb10x/llama-3-typhoon-v1.5-8b
---

# Blip2-Typhoon1.5-COCO

## Model Description

Blip2-Typhoon1.5-COCO is an image captioning model designed to generate descriptive captions for images. It combines the strengths of the BLIP-2 and Typhoon architectures to produce high-quality, contextually accurate descriptions. The base models are:

- **Encoder**: [Salesforce/blip2-opt-2.7b-coco](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco)
- **Decoder**: [scb10x/llama-3-typhoon-v1.5-8b](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b)

The BLIP-2 encoder extracts visual features from images, while the Typhoon decoder generates natural language descriptions from those features.

## Training Data

The model was trained on the COCO 2017 dataset, a widely used benchmark for image captioning. The dataset pairs a diverse set of images with multiple human-written captions per image, enabling the model to learn rich and varied descriptions.

## Training Details

- **Datasets**: COCO 2017
- **Encoder**: Salesforce/blip2-opt-2.7b-coco
- **Decoder**: scb10x/llama-3-typhoon-v1.5-8b
- **Training Framework**: [Hugging Face Transformers](https://huggingface.co/transformers/)
- **Hardware**: High-performance GPUs

## Usage

Blip2-Typhoon1.5-COCO can generate captions for a wide variety of images. The following example captions a single image (a batched variant is sketched at the end of this card):

```python
from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the model
processor = Blip2Processor.from_pretrained("MagiBoss/Blip2-Typhoon1.5-COCO")
model = Blip2ForConditionalGeneration.from_pretrained(
    "MagiBoss/Blip2-Typhoon1.5-COCO", torch_dtype=torch.bfloat16
)

# Move the model to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare an image
image = Image.open("Your image...").convert("RGB")

# Generate a caption
inputs = processor(images=image, return_tensors="pt", padding=True).to(device, torch.bfloat16)
outputs = model.generate(**inputs, max_length=55, pad_token_id=processor.tokenizer.pad_token_id)
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print("Generated Caption:", caption)
```

## Performance

Blip2-Typhoon1.5-COCO achieves state-of-the-art performance on the COCO 2017 dataset, producing captions that are both accurate and descriptive.

## Limitations and Future Work

While the model performs well on a wide range of images, its understanding and generation capabilities are limited for abstract concepts and highly specialized domains. Future work may include fine-tuning on more diverse datasets or integrating additional contextual information to improve caption generation.

## Acknowledgements

This model builds on the work of the Salesforce and Typhoon teams. The COCO dataset was instrumental in training it.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{Blip2-Typhoon1.5-COCO,
  author    = {MagiBoss},
  title     = {Blip2-Typhoon1.5-COCO},
  year      = {2024},
  publisher = {Hugging Face},
  note      = {https://huggingface.co/MagiBoss/Blip2-Typhoon1.5-COCO}
}
```
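
## Batched Captioning (Sketch)

The usage snippet above captions one image at a time. Since the processor accepts a list of images, several images can be captioned per `generate` call. The helper below is a minimal sketch assuming the checkpoint behaves like the single-image example above; `caption_batch`, the folder path, and the beam-search setting are illustrative choices, not part of the released model.

```python
from pathlib import Path

from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("MagiBoss/Blip2-Typhoon1.5-COCO")
model = Blip2ForConditionalGeneration.from_pretrained(
    "MagiBoss/Blip2-Typhoon1.5-COCO", torch_dtype=torch.bfloat16
).to(device)


def caption_batch(image_paths, batch_size=8, max_length=55):
    """Return one caption per path, processing the images in batches."""
    captions = []
    for start in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in image_paths[start:start + batch_size]]
        # The processor stacks the batch into pixel values ready for generate()
        inputs = processor(images=images, return_tensors="pt", padding=True).to(device, torch.bfloat16)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=5,  # beam search; an illustrative setting, not tuned for this checkpoint
            pad_token_id=processor.tokenizer.pad_token_id,
        )
        captions.extend(c.strip() for c in processor.batch_decode(outputs, skip_special_tokens=True))
    return captions


# Example: caption every .jpg file in a local folder (path is illustrative)
image_files = sorted(str(p) for p in Path("images").glob("*.jpg"))
for path, caption in zip(image_files, caption_batch(image_files)):
    print(path, "->", caption)
```

Beam search typically trades speed for slightly more fluent captions; greedy decoding, as in the single-image example, is fine for quick checks.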