Model Card: Image Captioning Model

Model Description:
This model generates natural-language captions for input images. Its architecture is based on BLIP (Bootstrapping Language-Image Pre-training), which pairs a Vision Transformer image encoder with a transformer-based text decoder. The decoder attends to the encoded image features and generates the caption token by token.

Intended Uses:
The model is suited to applications that need automatic image captions, such as social media, e-commerce, and image search. It can also support assistive technologies for visually impaired users by producing textual descriptions of images.

Potential Limitations and Biases:
Performance depends heavily on the quality and diversity of the training data, and the model may reproduce biases present in that data. For example, if certain demographics are over- or under-represented in the training data, captions for images of people from those demographics may be skewed or stereotyped. The model may also produce inappropriate or offensive captions. Evaluate and monitor the model on a variety of datasets, and address fairness and ethical considerations before and during deployment.

Training Parameters and Experimental Info:
The model was trained on the COCO (Common Objects in Context) dataset, which contains over 330,000 images and 2.5 million labeled object instances, along with human-written captions. A pre-trained BLIP model was fine-tuned on COCO for 10 epochs using the Adam optimizer with a learning rate of 1e-4.

Evaluation Results:
The model was evaluated on the COCO validation set using the METEOR, BLEU, ROUGE, and CIDEr metrics.
The model achieved a METEOR score of 0.27, a BLEU-4 score of 0.34, a ROUGE-L score of 0.53, and a CIDEr score of 0.84. These metrics measure overlap with human reference captions, so the scores indicate that the generated captions broadly agree with the references. They do not measure caption diversity. Performance may vary with image characteristics and is bounded by the quality and diversity of the training data.
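
To make the BLEU-4 figure above easier to interpret, the following is a minimal sketch of how a corpus-level BLEU score can be computed from clipped n-gram precision and a brevity penalty. It is an illustration only: the model card does not state which evaluation toolkit produced the reported scores, and the function names `ngrams` and `corpus_bleu`, the whitespace tokenization, and the example captions are all assumptions introduced here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references_list, max_n=4):
    """Corpus-level BLEU with uniform weights (hypothetical helper).

    candidates: one token list per image.
    references_list: for each image, a list of reference token lists.
    """
    matched = [0] * max_n  # clipped n-gram matches for each order
    total = [0] * max_n    # candidate n-gram counts for each order
    cand_len = 0
    ref_len = 0

    for cand, refs in zip(candidates, references_list):
        cand_len += len(cand)
        # Effective reference length: the reference closest in length
        # to the candidate (ties resolved toward the shorter one).
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]

        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            # Highest count of each n-gram in any single reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            # Clip candidate counts so repeated words are not over-rewarded.
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if min(matched) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize candidates shorter than the references.
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_precision)


# Made-up captions, tokenized by whitespace for simplicity.
candidate = "the cat sat on the mat".split()
reference = "the cat sat on the red mat".split()
score = corpus_bleu([candidate], [[reference]])
print(f"BLEU-4: {score:.3f}")
```

Because the score is a geometric mean over 1-gram to 4-gram precision, a single missing word lowers several n-gram orders at once, which is one reason BLEU-4 values for captioning typically sit well below 1.0 even for reasonable captions.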