---
library_name: transformers
datasets:
- ucsahin/Turkish-VLM-Mix-Benchmark
language:
- tr
pipeline_tag: image-text-to-text
license: apache-2.0
---

<!-- # TraVisionLM - Fast and Native Turkish Visual Language Model -->
<div style="text-align: center;">
    <img src="logo-no-background.png" alt="logo" style="width: 70%; height: auto;">
</div>
<!-- Provide a quick summary of what the model is/does. -->

## English
# 🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is the very first fast and compact (875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟

✨ Built to be compatible with the Transformers library, TraVisionLM is easy to load, fine-tune, and run for fast inference, all without any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖


## Türkçe
# 🎉 TraVisionLM: Türünün İlk Örneği! 🚀

🌟 Türkçe görsel dil modelinin ilk hızlı ve kompakt (875M parametre) versiyonu! Bir görüntü ve Türkçe talimat verildiğinde Türkçe yanıt üretir! 🌟

✨ Transformers kütüphanesi ile uyumlu olarak geliştirilen TraVisionLM, yüklemek, eğitmek ve dış kütüphaneler kullanmadan hızlı sonuçlar almak için kullanımı çok kolay! ⚡️

Türkçe görsel dil modelini deneyimlemeye hazır mısınız? Hadi başlayalım! 🇹🇷🖼️🤖

---

# Model Details

## English
This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. A vision projector connects the two modalities.
Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some refined adjustments to the vision projector and the causal language modeling.
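
As a rough illustration of how the two modalities meet, the sketch below works out the shapes involved. The image size (256) and patch size (16) are read off the name of the SigLIP checkpoint used in this card, and the hidden widths (768 for SigLIP base, 1280 for GPT2-large) are the standard values for those model families. The single linear projector and the PaliGemma-style prepending of image tokens are assumptions made for illustration; the actual projector in TraVisionLM may be built differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeSketch:
    """Back-of-the-envelope shapes for a SigLIP + GPT2-large pairing."""

    image_size: int = 256    # from "siglip-base-patch16-256"
    patch_size: int = 16     # from "siglip-base-patch16-256"
    vision_width: int = 768  # hidden size of SigLIP base
    text_width: int = 1280   # hidden size of GPT2-large

    @property
    def num_image_tokens(self) -> int:
        # One token per non-overlapping patch.
        per_side = self.image_size // self.patch_size
        return per_side * per_side

    def linear_projector_params(self) -> int:
        # Weights plus bias of ONE hypothetical linear layer (an assumption).
        return self.vision_width * self.text_width + self.text_width

    def sequence_length(self, num_text_tokens: int) -> int:
        # Assumes projected image tokens are prepended to the text tokens.
        return self.num_image_tokens + num_text_tokens


sketch = ShapeSketch()
print(sketch.num_image_tokens, sketch.linear_projector_params())
```

Under these assumptions, each image contributes a fixed number of tokens to the language model's input regardless of the prompt length.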

Here is a summary of the development process:

1) **Unimodal pretraining**
    - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
2) **Feature Alignment**
    - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector using 500K image-text pairs to align visual and textual features.
3) **Task Specific Training**
    - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
4) **Finetuning on Downstream Tasks**
    - Finally, the model is fine-tuned for object detection to demonstrate its versatility in various downstream tasks. Explore the fine-tuned model for object detection at [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for more details.
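
To make the feature alignment stage concrete: only the vision projector receives updates there, while the vision encoder and the language model stay frozen. Below is a minimal sketch of that selection rule. The parameter-name prefixes (`vision_tower.`, `projector.`, `language_model.`) are hypothetical and may not match the real module names in this model.

```python
from typing import Dict, Iterable

# Hypothetical prefix; the real module name in TraVisionLM may differ.
PROJECTOR_PREFIX = "projector."


def alignment_stage_trainable(param_names: Iterable[str]) -> Dict[str, bool]:
    """Map each parameter name to whether it is updated during feature alignment.

    Only projector parameters are trainable; everything else stays frozen.
    """
    return {name: name.startswith(PROJECTOR_PREFIX) for name in param_names}


# Illustrative (made-up) parameter names.
example_names = [
    "vision_tower.encoder.layers.0.mlp.fc1.weight",
    "projector.linear.weight",
    "projector.linear.bias",
    "language_model.h.0.attn.c_attn.weight",
]

for name, trainable in alignment_stage_trainable(example_names).items():
    print(f"{name}: {'train' if trainable else 'frozen'}")
```

With a PyTorch module, the same rule could be applied by setting `param.requires_grad` for each `(name, param)` pair returned by `model.named_parameters()`.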


## Türkçe
Bu model, [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) görsel kodlayıcısını ve [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) dil modelini birleştiren çok modlu büyük bir dil modelidir. Görsel projektör, iki modaliteyi bir araya getirir.
Mimarisi, [PaliGemma](https://arxiv.org/pdf/2407.07726) ile yakından benzerlik gösterir, ancak görsel projektör ve neden-sonuç dil modellemesinde bazı uyarlamalar yapılmıştır.

Geliştirme sürecinin özeti:

1) **Tek Modalite Ön Eğitimi**
    - Bu aşamada, her iki modaliteyi sıfırdan eğitmek yerine, [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) modelinin görsel kodlayıcısını ve [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large) modelinin dil kodlayıcısını kullanıyorum.
2) **Özellik Uyarlama**
    - [LLaVA eğitim tarifesi](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train) izlenerek, sadece görsel projektörü 500K görüntü-metin çiftleri ile eğiterek görsel ve metin özelliklerini uyumlu hale getiriyorum.
3) **Görev Spesifik Eğitim**
    - Bu adımda, uyumlulaştırılmış model, kısa açıklama, detaylı açıklama ve basit görsel soru cevaplama gibi görevler için daha fazla eğitilmiştir; 1M'den fazla resim-istek-tamamlanma üçlüsünden oluşan veri seti kullanılmıştır.
4) **İndirgeme Görevlerinde İnce Ayar**
    - Son olarak, modelin çeşitli görevlerdeki çok yönlülüğünü göstermek amacıyla nesne tespiti için ince ayarı yapılmıştır. Nesne tespiti için ince ayar yapılmış modele detaylar için [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) adresinden ulaşabilirsiniz.


### Model Description
<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Turkish*
- **License:** *Apache license 2.0*

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [ucsahin/TraVisionLM-base](https://huggingface.co/ucsahin/TraVisionLM-base)
- **Paper [optional]:** More info on this later.
- **Demo [optional]:** [More Information Needed]

---

# Friendly Reminder:
First of all, thanks for your interest if you plan to use this model. I developed this model to primarily show that you can build 

# Kullanıcılar için Önemli Bir Hatırlatma:

---


## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]


## Use Cases

The cases below describe the tasks for which the TraVisionLM visual language model can be used directly and indirectly. Be sure to also check the out-of-scope uses section.

### Direct Use Cases
 - **Short Captioning**
  
 - **Detailed Captioning**

 - **Visual Question Answering**

   
### Indirect Use Cases
 - (*Video-Text-to-Text*) The model can be adapted to question answering over your videos. Without any change to the architecture, video frames can be sampled and the model can be asked to generate an answer for each frame.
 - (*Retrieval*) The model can be used as-is, without any modification, to retrieve the image that best matches a text query.
 - (*Finetuning*) For all remaining tasks that the model architecture supports, such as image classification, the model can be trained in a way that is compatible with the Transformers library. For an example, see [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft).

*I plan to share examples of these indirect use cases as I find the time. In the meantime, I look forward to support and collaboration requests from the community.* 🤝💪
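
The Video-Text-to-Text idea above (sample frames, then query the model once per frame) can be sketched as follows. This is only an outline under stated assumptions: `answer_fn` is a hypothetical stand-in for whatever function sends one image and one prompt to the model, and sampling the midpoint of equal-width segments is just one reasonable sampling strategy, not the author's method.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples evenly spaced frame indices."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # Split the video into equal-width segments and take each midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def answer_per_frame(
    frames: Sequence[Frame],
    prompt: str,
    answer_fn: Callable[[Frame, str], str],  # hypothetical model call
    num_samples: int = 8,
) -> List[str]:
    """Ask the same question about each sampled frame."""
    indices = sample_frame_indices(len(frames), num_samples)
    return [answer_fn(frames[i], prompt) for i in indices]


# Usage with a dummy answer function in place of the real model.
dummy_frames = [f"frame-{i}" for i in range(100)]
answers = answer_per_frame(dummy_frames, "Açıkla", lambda f, p: f"{p}: {f}", 4)
print(answers)
```

How the per-frame answers are combined into a single response is left open here, since the card does not specify it.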

### Out-of-Scope Uses
This model is not suitable for the following scenarios:
 - Although the model answers simple questions about your images, it is not suitable for complex multi-turn chat scenarios. No history is kept, and the model does not use your previous questions as context. For this task, however, you can prepare a chat template and easily train the model accordingly.
 - The model does not accept multiple image inputs. For example, it is not suited to answering questions that compare two different images. Adding this capability would require changes to the architecture. For a model of this kind, see [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (English only).
 - The model has not been trained for optical character recognition (OCR), segmentation, or multi-object recognition. Visual language models that achieve acceptable results on these tasks, such as [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) and [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large), have been trained on billions of documents and images.
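
For the multi-turn limitation listed above, a chat template is essentially a fixed way of flattening earlier turns into one prompt string. The sketch below shows one possible shape. The `Soru:` and `Cevap:` tags are hypothetical and chosen only for illustration; the released model has not been trained on this format, so it would need fine-tuning on data laid out the same way before such prompts could be expected to work.

```python
from typing import List, Tuple

# Hypothetical role tags; not part of the released model.
USER_TAG = "Soru:"
ASSISTANT_TAG = "Cevap:"


def build_chat_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Flatten earlier (question, answer) turns plus a new question into one prompt."""
    lines: List[str] = []
    for past_question, past_answer in history:
        lines.append(f"{USER_TAG} {past_question}")
        lines.append(f"{ASSISTANT_TAG} {past_answer}")
    lines.append(f"{USER_TAG} {question}")
    # End with the assistant tag so the model continues with its answer.
    lines.append(ASSISTANT_TAG)
    return "\n".join(lines)


history = [("Bu nedir?", "Bir kedi.")]
print(build_chat_prompt(history, "Rengi ne?"))
```

Training examples and inference prompts would have to use the identical layout for the model to pick up the convention.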


## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]



#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

More information to come.



### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]




## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]


## Model Card Contact

If you have questions or suggestions regarding the model, I would prefer that you reach me directly via Hugging Face (e.g., by opening an issue). For other matters, or if you have ideas for collaboration on future projects, contact me at sahin.umitcan@gmail.com.

Modelle ilgili sorularınız veya önerileriniz varsa, doğrudan bana Hugging Face üzerinden (örneğin, bir issue açarak) ulaşmanızı tercih ederim. Diğer konular veya gelecekteki projelerde işbirliği için herhangi bir fikriniz varsa, bana sahin.umitcan@gmail.com adresinden ulaşabilirsiniz.