tarekziade/distilvit


This model is a variation of https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

Read the blog post here: https://ziade.org/2024/03/17/distilvit-image-captioning-model
The training code is here: https://github.com/tarekziade/distilvit
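
As a usage sketch (not taken from the model card itself), the model can probably be loaded through the `transformers` `image-to-text` pipeline. This assumes a `transformers` release that still ships that task, plus `torch` and `pillow` installed. The helper names `load_captioner` and `caption` are hypothetical and only exist for illustration.

```python
MODEL_ID = "tarekziade/distilvit"


def load_captioner():
    """Build an image-to-text pipeline for the model.

    Assumption: the installed transformers version provides the
    "image-to-text" task. Weights are downloaded on first use.
    """
    # Imported lazily so the module can be loaded without transformers.
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_ID)


def caption(image, captioner=None):
    """Return one caption for `image` (a file path, URL, or PIL image).

    `captioner` is any callable that returns a list of dicts with a
    "generated_text" key, which is the output shape of the
    image-to-text pipeline. Passing one in avoids reloading the model.
    """
    if captioner is None:
        captioner = load_captioner()
    outputs = captioner(image)
    return outputs[0]["generated_text"].strip()
```

Calling `caption("photo.jpg")` (a hypothetical local file) would download the weights and return a short caption. Reusing one `captioner` across calls avoids reloading the model each time.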

Results after 3 epochs (and ~45 hours of training):

eval_loss: 0.19939416646957397
eval_rouge1: 43.006
eval_rouge2: 16.9939
eval_rougeL: 38.8923
eval_rougeLsum: 38.8877
eval_gen_len: 11.327256736227712
eval_runtime: 1816.5255
eval_samples_per_second: 13.77
eval_steps_per_second: 1.721
train_runtime: 46263.3695
train_samples_per_second: 38.373
train_steps_per_second: 4.797
train_loss: 0.05974134062104816
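
The throughput figures above can be combined to estimate quantities the list does not state directly. The sketch below assumes the metrics follow the Hugging Face `Trainer` convention, where samples per second divided by steps per second gives the samples processed per step, and runtime times samples per second gives the total samples processed. The resulting numbers are estimates derived from the reported values, not figures from the author.

```python
# Values copied from the reported results.
metrics = {
    "eval_runtime": 1816.5255,
    "eval_samples_per_second": 13.77,
    "eval_steps_per_second": 1.721,
    "train_runtime": 46263.3695,
    "train_samples_per_second": 38.373,
    "train_steps_per_second": 4.797,
}
EPOCHS = 3  # from "Results after 3 epochs"


def samples_per_step(prefix):
    """Samples processed per step, i.e. the effective batch size."""
    return (metrics[f"{prefix}_samples_per_second"]
            / metrics[f"{prefix}_steps_per_second"])


def total_samples(prefix):
    """Total samples processed during the run (runtime * throughput)."""
    return (metrics[f"{prefix}_runtime"]
            * metrics[f"{prefix}_samples_per_second"])


if __name__ == "__main__":
    # Both ratios come out close to 8 samples per step.
    print("train samples/step:", round(samples_per_step("train"), 2))
    print("eval samples/step:", round(samples_per_step("eval"), 2))
    # Roughly 25,000 evaluation samples.
    print("eval samples:", round(total_samples("eval")))
    # Assumption: train throughput covers all epochs, so divide by EPOCHS.
    print("train samples per epoch:", round(total_samples("train") / EPOCHS))
```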


Safetensors model size: 182M params, tensor type F32


Model tree for tarekziade/distilvit

Base model: distilbert/distilgpt2
Quantized (14): this model