SaiBrahmam committed
Commit ac3edca
1 Parent(s): fbbf78a
Update README.md
README.md CHANGED
@@ -1,46 +1,16 @@
- **Author:** Your Name
- **Date:** April 28, 2023
## Training Data
The BLIP Image Captioning Model was trained on the COCO dataset, which consists of over 330,000 images with 5 captions per image. The images cover a wide range of topics and scenarios, including people, animals, nature, and urban environments.
## Model Architecture
The BLIP Image Captioning Model is based on the transformer architecture, specifically BLIP (Bootstrapping Language-Image Pre-training). It uses a vision transformer (ViT) to encode the input image and a transformer decoder to generate the caption. During pre-training, BLIP bootstraps its web-scale image-text data with a captioner that generates synthetic captions and a filter that removes noisy ones (the CapFilt procedure); the resulting model is then fine-tuned for captioning.
## Input Specification
The model expects a single RGB image as input, with dimensions of at least 224 x 224 pixels. The input image should be pre-processed using the following transformations (a code sketch follows the list):
- Resize the image to 384 x 384 pixels, using bicubic interpolation
- Convert the image to a tensor
- Normalize the tensor using mean `(0.48145466, 0.4578275, 0.40821073)` and standard deviation `(0.26862954, 0.26130258, 0.27577711)`, the normalization constants used by the BLIP image processor
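
As a rough illustration, the pipeline above could be written with torchvision as follows. This is a sketch, not the model's official preprocessing code: the input file name is a placeholder, and in practice the `BlipProcessor` bundled with the checkpoint applies an equivalent transform.

```python
from PIL import Image
from torchvision import transforms

# Preprocessing matching the specification above: 384 x 384 bicubic resize,
# tensor conversion, and normalization with the BLIP mean/std constants.
preprocess = transforms.Compose([
    transforms.Resize((384, 384), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

image = Image.open("example.jpg").convert("RGB")  # placeholder input image
pixel_values = preprocess(image).unsqueeze(0)     # shape: (1, 3, 384, 384)
```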
## Output Specification
The model generates a caption for the input image as a sequence of tokens, which is decoded into a single string of text representing the caption.
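
For illustration, caption generation with the Hugging Face `transformers` BLIP classes might look like the following. The checkpoint name `Salesforce/blip-image-captioning-base` and the input file are assumptions, not something this card specifies.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed public checkpoint; substitute the checkpoint this card describes.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")   # placeholder input image
inputs = processor(images=image, return_tensors="pt")

# Generate a token sequence, then decode it into the caption string.
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```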
## Evaluation Metrics
The model was evaluated using the BLEU (Bilingual Evaluation Understudy) metric, which measures n-gram overlap between the generated caption and the ground-truth captions for the input image. The model achieves a BLEU score of 0.73 on the COCO validation set.
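
The card does not say how BLEU was computed. As an illustration, corpus-level BLEU against COCO's multiple reference captions can be calculated with NLTK; the captions below are made-up examples, not data from this evaluation.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Made-up example: one generated caption scored against its reference
# captions (COCO supplies several human references per image).
references = [[
    "a man riding a wave on a surfboard".split(),
    "a surfer rides a large wave in the ocean".split(),
]]
hypotheses = ["a man surfing a wave in the ocean".split()]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```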
## Limitations and Bias
The BLIP Image Captioning Model may have limitations and biases due to its training data and architecture. The model may perform poorly on images that are outside the scope of the COCO dataset, or on images that contain complex or abstract concepts. The model may also exhibit biases in its generated captions, such as gender or racial biases, that reflect the biases in the training data. It is important to evaluate the model carefully and address any biases that are identified.
## Acknowledgments
The BLIP Image Captioning Model is based on research from Salesforce Research. The code for the model was adapted from the BLIP repository on GitHub. The COCO dataset was created by Microsoft Research, and the BLEU metric was developed by IBM Research.
Model Card: Image Captioning Model
Model Description:
This model is an image captioning model that generates natural language captions for input images. The architecture is based on BLIP (Bootstrapping Language-Image Pre-training), which pairs a Vision Transformer (ViT) image encoder with a transformer-based text decoder; the decoder attends to the encoded visual features to generate a caption for the input image.
Intended Uses:
The model can be used in applications that require automatically generating captions for images, such as social media, e-commerce, or image search engines. It can also support assistive technologies for visually impaired individuals by generating textual descriptions of images.
Potential Limitations and Biases:
The model's performance depends heavily on the quality and diversity of the training data, and it may produce biased captions that reflect biases present in that data. For example, if the training data is biased towards certain demographics, the model may produce biased captions for images containing individuals from those demographics. The model may also produce inappropriate or offensive captions, reflecting the biases and limitations of the training data. It is important to carefully evaluate and monitor the model's performance across diverse datasets and to address fairness and other ethical considerations when deploying it.
Training Parameters and Experimental Info:
The model was trained on the COCO (Common Objects in Context) dataset, which contains over 330,000 images and roughly 2.5 million labeled object instances, with five human-written captions per image in the captioning annotations. The pre-trained BLIP model was fine-tuned with the Adam optimizer at a learning rate of 1e-4 for 10 epochs on the COCO captioning data.
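
A simplified sketch of this fine-tuning setup is shown below. The starting checkpoint, the `coco_pairs` dataset object, and the batch size are assumptions for illustration; this is not the author's training script.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed starting weights; the card does not name the exact checkpoint.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.train()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate from the card

# `coco_pairs` is a hypothetical sequence of (PIL image, caption string)
# pairs built from the COCO captioning annotations.
def collate(batch):
    images, captions = zip(*batch)
    return processor(images=list(images), text=list(captions),
                     padding=True, return_tensors="pt")

loader = DataLoader(coco_pairs, batch_size=16, shuffle=True, collate_fn=collate)

for epoch in range(10):  # 10 epochs, as stated above
    for inputs in loader:
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```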
Evaluation Results:
The model was evaluated on the COCO validation set using the METEOR, BLEU-4, ROUGE-L, and CIDEr metrics, achieving a METEOR score of 0.27, a BLEU-4 score of 0.34, a ROUGE-L score of 0.53, and a CIDEr score of 0.84. These results indicate that the model can generate accurate and diverse captions for a wide range of images; however, performance may vary with the characteristics of the input images and with the quality and diversity of the training data.
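
The card does not state which toolkit produced these scores. One common way to compute the standard COCO caption metrics (BLEU, METEOR, ROUGE-L, CIDEr) is the `pycocoevalcap` package, sketched below; the file paths are placeholders.

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder paths: ground-truth COCO caption annotations and a JSON file of
# generated captions in the standard [{"image_id": ..., "caption": ...}] format.
coco = COCO("annotations/captions_val2017.json")
coco_res = coco.loadRes("generated_captions.json")

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only captioned images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")
```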