outsu committed on
Commit 012f8b5 · verified · 1 Parent(s): 6e3609b

First version of TeLVE!


# TeLVE v1.0
![TeLVE v1.png](https://cdn-uploads.huggingface.co/production/uploads/63417787a7582111c3f50df8/XrMTQ_yPOlqQJkwGCt58D.png)

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/mugla.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,69 @@
- ---
- license: cc-by-4.0
- ---
+ # TeLVE: Turkish efficient Language Vision Engine 🧿
+ [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
+ [![Models: v1.0](https://img.shields.io/badge/Models-v1.0-blue)](https://huggingface.co/outsu/TeLVE)
+ ## First Turkish VLM ever!
+
+ TeLVE is the first Visual Language Model specifically designed for Turkish language understanding and image description generation. Built on Vision Transformer (ViT) and BERT pre-trained encoder architectures, it bridges the gap in Turkish visual-linguistic processing.
+ ![TeLVE logo](teLVE_logo.png)
+
+ ## Model Description
+
+ TeLVE combines:
+ - 🖼️ Vision Transformer (ViT-base-patch16-224)
+ - 📝 Turkish BERT (dbmdz/bert-base-turkish-cased)
+ - 🔄 Cross-attention mechanism for vision-language fusion
+
+ ### Version Logs
+ - **TeLVE v1.0**: Trained on the Unsplash Lite dataset
+
+ ## Usage
+
+ The model can be used in two ways:
+
+ ### Inference (imagine.py)
+ ```bash
+ # Generate captions for images
+ python imagine.py
+ ```
+ This script:
+ - Loads a trained TeLVE model
+ - Takes images from the `images` directory
+ - Generates a Turkish caption for each image
+ - Outputs the results to the console
+
+ ### Training (main.py)
+ Users can train their own models with the ViT and BERT encoders.
+ ```bash
+ # Train a new model
+ python main.py
+ ```
+
+ This script:
+ - Loads and preprocesses image-caption pairs
+ - Initializes the ViT and BERT encoders
+ - Trains the combined model
+ - Saves the model and tokenizer
+
+
+ ## Performance
+ Performance scores will be evaluated in a future release.
+ <!--
+ | Model Version | Dataset | BLEU-4 | METEOR | CIDEr |
+ |--------------|---------|---------|---------|--------|
+ | TeLVE v1.0 | Unsplash | *TBD* | *TBD* | *TBD* |
+ | TeLVE v1.1 | Unsplash+Pexels | *TBD* | *TBD* | *TBD* |-->
+
+ ## Citation
+
+ ```bibtex
+ @software{telve2024,
+   author = {Öğüt Su Karagün},
+   title = {TeLVE: Turkish efficient Language Vision Engine},
+   year = {2024},
+   url = {https://huggingface.co/outsu/TeLVE}
+ }
+ ```
+
+ ## License
+ This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
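
For readers who want to see how the two encoders are wired together before reading the full scripts below, here is a minimal, illustrative sketch of the cross-attention fusion described in the README. It mirrors the `ImageCaptioningModel` defined in `imagine.py` and `main.py`; the dummy image tensor and the single `[CLS]` prompt token are placeholders for illustration, not repository code.

```python
# Minimal sketch of the vision-language fusion: ViT patch embeddings are projected
# to BERT's hidden size and consumed through cross-attention by a Turkish BERT decoder.
import torch
import torch.nn as nn
from transformers import ViTModel, BertConfig, BertLMHeadModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
bert_config = BertConfig.from_pretrained("dbmdz/bert-base-turkish-cased")
bert_config.is_decoder = True            # BERT runs as a causal decoder
bert_config.add_cross_attention = True   # so it can attend to the image features
decoder = BertLMHeadModel.from_pretrained("dbmdz/bert-base-turkish-cased", config=bert_config)
project = nn.Linear(vit.config.hidden_size, bert_config.hidden_size)

pixel_values = torch.randn(1, 3, 224, 224)                       # dummy preprocessed image
image_features = project(vit(pixel_values).last_hidden_state)    # (1, 197, hidden_size)
input_ids = torch.tensor([[2]])                                  # [CLS] id in the shipped tokenizer
out = decoder(input_ids=input_ids,
              encoder_hidden_states=image_features,
              return_dict=True)
print(out.logits.shape)                                          # (1, 1, vocab_size)
```

Decoding then proceeds token by token from `[CLS]`, exactly as the greedy loop in `imagine.py` does.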
images/mugla.jpg ADDED

Git LFS Details

  • SHA256: 65b8124c02dc5afefbfd1ac848b633c67301215a3dc4f9c0b8c84790572bf7ea
  • Pointer size: 132 Bytes
  • Size of remote file: 4.32 MB
imagine.py ADDED
@@ -0,0 +1,103 @@
import torch
import torch.nn as nn
from torchvision import transforms
from transformers import ViTModel, BertTokenizerFast, BertConfig, BertLMHeadModel
from PIL import Image
import os

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define constants
VIT_MODEL_NAME = "google/vit-base-patch16-224"
BERT_MODEL_NAME = "dbmdz/bert-base-turkish-cased"
MAX_LENGTH = 128

class ImageCaptioningModel(nn.Module):
    def __init__(self, vit_model, bert_model):
        super(ImageCaptioningModel, self).__init__()
        self.vit = vit_model
        self.bert = bert_model
        # Project ViT patch embeddings to BERT's hidden size for cross-attention
        self.linear = nn.Linear(self.vit.config.hidden_size, self.bert.config.hidden_size)

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        image_features = self.vit(pixel_values).last_hidden_state
        image_features = self.linear(image_features)

        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            encoder_hidden_states=image_features,
                            labels=labels,
                            return_dict=True)

        return outputs.loss, outputs.logits

def load_model(model_path):
    # Initialize the model components
    vit_model = ViTModel.from_pretrained(VIT_MODEL_NAME)
    bert_config = BertConfig.from_pretrained(BERT_MODEL_NAME)
    bert_config.is_decoder = True
    bert_config.add_cross_attention = True
    bert_model = BertLMHeadModel.from_pretrained(BERT_MODEL_NAME, config=bert_config)

    # Create the combined model and load the trained weights
    model = ImageCaptioningModel(vit_model, bert_model)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.to(device)
    model.eval()
    return model

def generate_caption(model, image_path, tokenizer):
    # Prepare the image (ImageNet normalization statistics)
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    image = Image.open(image_path).convert('RGB')
    image = transform(image).unsqueeze(0).to(device)

    # Generate the caption with greedy decoding, starting from [CLS] and stopping at [SEP]
    with torch.no_grad():
        input_ids = torch.tensor([[tokenizer.cls_token_id]]).to(device)
        attention_mask = torch.tensor([[1]]).to(device)

        for _ in range(MAX_LENGTH):
            _, logits = model(image, input_ids, attention_mask)
            next_token = logits[:, -1, :].argmax(dim=-1)

            if next_token.item() == tokenizer.sep_token_id:
                break

            input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
            attention_mask = torch.cat([attention_mask, torch.tensor([[1]]).to(device)], dim=1)

    caption = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return caption

def main():
    model_path = "./models/TeLVE_v1.0.pth"
    tokenizer_path = "./tokenizer"

    # Check if the model and tokenizer exist
    if not os.path.exists(model_path) or not os.path.exists(tokenizer_path):
        print("Model or tokenizer not found. Please make sure you have trained the model and saved it correctly.")
        return

    # Load the model and tokenizer
    model = load_model(model_path)
    tokenizer = BertTokenizerFast.from_pretrained(tokenizer_path)

    # Generate captions for images in a specified directory
    image_dir = "./images"  # Change this to the directory containing your test images
    for image_file in os.listdir(image_dir):
        if image_file.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(image_dir, image_file)
            caption = generate_caption(model, image_path, tokenizer)
            print(f"Image: {image_file}")
            print(f"Generated Caption: {caption}")
            print("---")

if __name__ == "__main__":
    main()
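
The functions defined above can also be driven from another script. Below is a minimal sketch, assuming the layout added in this commit (`models/TeLVE_v1.0.pth`, `tokenizer/`, `images/mugla.jpg`); the driver script itself is not part of the repository.

```python
# Hypothetical driver: caption a single image with the helpers from imagine.py.
# Paths follow the layout added in this commit; adjust them if your checkout differs.
from transformers import BertTokenizerFast
from imagine import load_model, generate_caption

model = load_model("./models/TeLVE_v1.0.pth")
tokenizer = BertTokenizerFast.from_pretrained("./tokenizer")
caption = generate_caption(model, "./images/mugla.jpg", tokenizer)
print(caption)
```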
main.py ADDED
@@ -0,0 +1,167 @@
import os
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import ViTModel, BertTokenizerFast, BertConfig, BertLMHeadModel
from PIL import Image, ImageFile
import pandas as pd
from tqdm import tqdm

# Increase the maximum image size limit to avoid DecompressionBombWarning
Image.MAX_IMAGE_PIXELS = None
# Allow loading truncated images
ImageFile.LOAD_TRUNCATED_IMAGES = True

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define constants
VIT_MODEL_NAME = "google/vit-base-patch16-224"
BERT_MODEL_NAME = "dbmdz/bert-base-turkish-cased"  # Using a Turkish BERT model
MODEL_NAME = "TeLVE_v1.0.pth"
MAX_LENGTH = 128
BATCH_SIZE = 8
EPOCHS = 5
LEARNING_RATE = 2e-5

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataframe, img_dir, tokenizer):
        self.dataframe = dataframe
        self.img_dir = img_dir
        self.tokenizer = tokenizer
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        img_path = os.path.join(self.img_dir, row['photo_id'] + ".jpg")

        try:
            image = Image.open(img_path).convert('RGB')
            image = self.transform(image)
        except (FileNotFoundError, IOError):
            # Return None if the image is not found or cannot be opened
            return None

        caption = row['ai_description']

        # Check if the caption is a valid string
        if not isinstance(caption, str):
            return None  # Skip the example if the caption is not valid

        encoding = self.tokenizer(
            caption,
            add_special_tokens=True,
            max_length=MAX_LENGTH,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'pixel_values': image,
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': encoding['input_ids'].squeeze()  # Use input_ids as labels for calculating the loss
        }


class ImageCaptioningModel(nn.Module):
    def __init__(self, vit_model, bert_model):
        super(ImageCaptioningModel, self).__init__()
        self.vit = vit_model
        self.bert = bert_model
        # Project ViT patch embeddings to BERT's hidden size for cross-attention
        self.linear = nn.Linear(self.vit.config.hidden_size, self.bert.config.hidden_size)

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        image_features = self.vit(pixel_values).last_hidden_state
        image_features = self.linear(image_features)

        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            encoder_hidden_states=image_features,
                            labels=labels,
                            return_dict=True)

        return outputs.loss, outputs.logits

def collate_fn(batch):
    # Filter out None values (skipped images)
    batch = list(filter(lambda x: x is not None, batch))
    if len(batch) == 0:
        return None
    return {key: torch.stack([item[key] for item in batch]) for key in batch[0]}

def train_vlm_model():
    # Load and preprocess the dataset, trying common Turkish text encodings
    encodings = ['utf-8', 'iso-8859-9', 'windows-1254']
    for encoding in encodings:
        try:
            df = pd.read_csv('./datasets/' + MODEL_NAME + '.tsv000', sep='\t', encoding=encoding)
            print(f"Successfully read the file with {encoding} encoding.")
            break
        except UnicodeDecodeError:
            print(f"Failed to read with {encoding} encoding. Trying next...")
    else:
        raise ValueError("Could not read the file with any of the specified encodings.")

    # Initialize the tokenizer
    tokenizer = BertTokenizerFast.from_pretrained(BERT_MODEL_NAME)

    # Create the dataset and dataloader
    dataset = ImageCaptioningDataset(df, '../download/images', tokenizer)
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

    # Initialize the model components
    vit_model = ViTModel.from_pretrained(VIT_MODEL_NAME)
    bert_config = BertConfig.from_pretrained(BERT_MODEL_NAME)
    bert_config.is_decoder = True
    bert_config.add_cross_attention = True
    bert_model = BertLMHeadModel.from_pretrained(BERT_MODEL_NAME, config=bert_config)

    # Create the combined model
    model = ImageCaptioningModel(vit_model, bert_model)
    model.to(device)

    # Define the optimizer
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

    # Training loop
    model.train()
    for epoch in range(EPOCHS):
        total_loss = 0
        progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{EPOCHS}")
        for batch in progress_bar:
            if batch is None:
                continue

            pixel_values = batch['pixel_values'].to(device)
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()
            loss, _ = model(pixel_values, input_ids, attention_mask, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            progress_bar.set_postfix({'loss': loss.item()})

        print(f"Epoch {epoch+1}/{EPOCHS}, Average Loss: {total_loss/len(dataloader)}")

    # Save the model and tokenizer
    os.makedirs("./models", exist_ok=True)
    torch.save(model.state_dict(), "./models/" + MODEL_NAME)
    tokenizer.save_pretrained("./tokenizer")

if __name__ == "__main__":
    train_vlm_model()
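
`train_vlm_model()` reads a tab-separated file with at least `photo_id` and `ai_description` columns, matching the Unsplash Lite export format, from the path built above. The sketch below writes a file with that layout using pandas; the rows are invented placeholders, not real training data.

```python
# Hypothetical example of the TSV layout main.py expects: one row per image, with
# 'photo_id' matching <photo_id>.jpg in the image directory and 'ai_description'
# holding the caption text. The two rows below are made up for illustration.
import os
import pandas as pd

os.makedirs("./datasets", exist_ok=True)
df = pd.DataFrame({
    "photo_id": ["abc123", "def456"],                       # invented ids
    "ai_description": ["örnek açıklama bir", "örnek açıklama iki"],  # invented captions
})
df.to_csv("./datasets/TeLVE_v1.0.pth.tsv000", sep="\t", index=False, encoding="utf-8")
```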
models/TeLVE_v1.0.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c79764aa75a603efead82246db2078c4d2c07edbdf218ec8719f7817f5728c68
size 904212666
teLVE_logo.png ADDED
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "max_len": 512,
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
tokenizer/vocab.txt ADDED
The diff for this file is too large to render. See raw diff
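
As a quick sanity check, the tokenizer files added above load directly with `BertTokenizerFast`, which is how `imagine.py` and `main.py` consume them. The snippet below is illustrative and assumes a local checkout with the `tokenizer/` directory.

```python
# Load the shipped tokenizer and confirm the special-token ids used by the decoding loop.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./tokenizer")
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)  # 2 3 0 per tokenizer_config.json
print(tokenizer("Muğla'da gün batımı")["input_ids"])  # WordPiece ids wrapped in [CLS] ... [SEP]
```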