metadata
language:
- fa
library_name: hezar
tags:
- image-to-text
- hezar
metrics:
- wer
pipeline_tag: image-to-text
datasets:
- hezarai/flickr30k-fa
A Persian image captioning model with a ViT + RoBERTa encoder-decoder architecture, trained on the flickr30k-fa dataset (created by Sajjad Ayoubi). The ViT encoder was initialized from google/vit-base-patch16-224 (https://huggingface.co/google/vit-base-patch16-224) and the RoBERTa decoder from HooshvareLab/roberta-fa-zwnj-base (https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base).
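The metadata above lists wer (word error rate) as the evaluation metric. As a rough illustration of what that number measures, the sketch below computes a word-level edit distance between a reference caption and a generated one, divided by the reference length. It assumes plain whitespace tokenization; the function name `word_error_rate` and the sample captions are hypothetical, and this is not necessarily how the authors computed their scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace; no other normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical captions, for illustration only.
reference_caption = "یک سگ در پارک می دود"
generated_caption = "یک سگ در خیابان می دود"
print(word_error_rate(reference_caption, generated_caption))
```

Lower is better: 0.0 means the generated caption matches the reference word for word, and values above 1.0 are possible when the hypothesis is much longer than the reference.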
Usage
pip install hezar
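from hezar.models import Model

# Load the pretrained captioning model from the Hugging Face Hub
model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
# Generate caption(s) for a local image file
captions = model.predict("example_image.jpg")
print(captions)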
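To caption several images and work with plain strings, a small wrapper around `predict` can help. The sketch below is not part of hezar: `extract_captions` and `caption_images` are hypothetical helper names, and the code assumes each item returned by `predict` is either a string or a dictionary with a `"text"` key. Check the output of `print(captions)` above against that assumption for your installed version.

```python
from typing import Any, Dict, Iterable, List


def extract_captions(predictions: Iterable[Any]) -> List[str]:
    """Pull caption strings out of a predict() result.

    Assumption: each item is a plain string or a dict with a "text" key.
    """
    captions = []
    for item in predictions:
        if isinstance(item, str):
            captions.append(item)
        elif isinstance(item, dict) and "text" in item:
            captions.append(str(item["text"]))
        else:
            raise TypeError(f"unexpected prediction item: {item!r}")
    return captions


def caption_images(model: Any, image_paths: List[str]) -> Dict[str, List[str]]:
    """Caption images one at a time and map each path to its caption(s).

    `model` is any object with a predict(path) method, such as the one
    loaded in the Usage section.
    """
    results = {}
    for path in image_paths:
        results[path] = extract_captions(model.predict(path))
    return results
```

With the model loaded as shown in the Usage section, `caption_images(model, ["example_image.jpg"])` would return a dictionary keyed by image path. Calling `predict` once per image keeps the sketch simple and makes it easy to see which file caused a failure.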