---
license: apache-2.0
datasets:
- atasoglu/flickr8k-turkish
language:
- tr
metrics:
- rouge
library_name: transformers
pipeline_tag: image-to-text
tags:
- image-to-text
- image-captioning
base_model:
- google/vit-base-patch16-224
- ytu-ce-cosmos/turkish-gpt2-medium
---

# vit-base-patch16-224-turkish-gpt2-medium

This vision encoder-decoder model uses [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) as the encoder and [ytu-ce-cosmos/turkish-gpt2-medium](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-medium) as the decoder. It was fine-tuned on the [flickr8k-turkish](https://huggingface.co/datasets/atasoglu/flickr8k-turkish) dataset to generate image captions in Turkish.
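
Under the hood, a `VisionEncoderDecoderModel` pairs the ViT encoder with the GPT-2 decoder through newly initialized cross-attention layers. As a minimal, hypothetical sketch of how the two base checkpoints can be combined before fine-tuning (the exact training configuration used for this model is not documented in this card):

```py
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Combine the pretrained ViT encoder with the Turkish GPT-2 decoder.
# The decoder's cross-attention weights are freshly initialized and
# must be learned during fine-tuning on the captioning data.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224",
    "ytu-ce-cosmos/turkish-gpt2-medium",
)
tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-gpt2-medium")

# GPT-2 defines no pad token; reusing EOS is a common convention (assumption).
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```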

## Usage

```py
import torch
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "atasoglu/vit-base-patch16-224-turkish-gpt2-medium"
img = Image.open("example.jpg")

# Load the image processor, tokenizer, and fine-tuned model from the Hub.
feature_extractor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)
model.to(device)

# Preprocess the image into the pixel values expected by the ViT encoder.
features = feature_extractor(images=[img], return_tensors="pt")
pixel_values = features.pixel_values.to(device)

# Generate token ids with the GPT-2 decoder, then decode them to text.
with torch.no_grad():
    output_ids = model.generate(pixel_values, max_new_tokens=20)
generated_captions = tokenizer.batch_decode(
    output_ids,
    skip_special_tokens=True,
)

print(generated_captions)
```
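
The card lists ROUGE under `metrics`. As a rough sketch of how generated captions could be scored against reference captions with the `evaluate` library (the ROUGE variant and preprocessing behind the reported results are not specified here, and the captions below are made up):

```py
import evaluate

# Hypothetical predictions/references; in practice, generate predictions
# with the model as shown above and pair them with the dataset captions.
predictions = ["çimenlerin üzerinde koşan bir köpek"]   # "a dog running on the grass"
references = ["bir köpek çimenlerin üzerinde koşuyor"]  # "a dog is running on the grass"

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))
```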