# Model Card for vit-gpt2-image-captioning
## Model Details
This model is a VisionEncoderDecoderModel that pairs a ViT image encoder with a GPT-2 text decoder to generate captions for images. It was fine-tuned with additional context information to help it produce more meaningful captions.
- **Base Model**: nlpconnect/vit-gpt2-image-captioning
- **Processor**: ViTImageProcessor
- **Tokenizer**: GPT-2 Tokenizer
- **Generated Caption Example**: "{generated_text}"
## Intended Use
This model is intended for generating captions for stock-related images, with an initial context provided for more accurate descriptions.
## Limitations
- The model might generate incorrect or biased descriptions depending on the input image or context.
- It requires specific context inputs for the best performance.
## How to Use
```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Load the fine-tuned model together with its image processor and tokenizer.
model = VisionEncoderDecoderModel.from_pretrained("your_username/your_model_name")
processor = ViTImageProcessor.from_pretrained("your_username/your_model_name")
tokenizer = AutoTokenizer.from_pretrained("your_username/your_model_name")
```
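The snippet above only loads the checkpoint. A minimal end-to-end captioning sketch is given below. The repo id `your_username/your_model_name`, the image path, and the context string are placeholders, and seeding the GPT-2 decoder with the context via `decoder_input_ids` is one assumed way to supply it, not necessarily how this model was fine-tuned:

```python
from PIL import Image
import torch
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Placeholder repo id -- replace with the actual model repository.
repo_id = "your_username/your_model_name"
model = VisionEncoderDecoderModel.from_pretrained(repo_id)
processor = ViTImageProcessor.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Preprocess an input image (path is a placeholder).
image = Image.open("stock_photo.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Optionally seed the decoder with a context prefix (assumed convention).
context_ids = tokenizer("A stock photo of", return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(
        pixel_values,
        decoder_input_ids=context_ids,
        max_length=32,
        num_beams=4,
    )

caption = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

Beam search (`num_beams=4`) is a common default for this family of captioning models; greedy decoding also works if you drop that argument.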
## License
This model is licensed under the same terms as the original nlpconnect/vit-gpt2-image-captioning.