---
library_name: transformers
datasets:
- ucsahin/Turkish-VLM-Mix-Benchmark
language:
- tr
pipeline_tag: image-text-to-text
license: apache-2.0
---

# 🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is the very first fast and compact (875M parameters) visual language model on Hugging Face that responds in Turkish to Turkish instructions given an image input! 🌟

✨ Built to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and use for lightning-fast inference, all without needing any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖

---

# Model Details

This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. A vision projector connects the two modalities. Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some adjustments to the vision projector and the causal language modeling.

Here's a summary of the development process:

1) **Unimodal pretraining** - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
2) **Feature Alignment** - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector using 500K image-text pairs to align visual and textual features.
3) **Task Specific Training** - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
4) **Finetuning on Downstream Tasks** - Finally, the model is fine-tuned for object detection to demonstrate its versatility in various downstream tasks. See the fine-tuned object detection model at [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for more details.

### Model Description

- **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Turkish*
- **License:** *Apache license 2.0*

### Model Sources

- **Repository:** [ucsahin/TraVisionLM-base](https://huggingface.co/ucsahin/TraVisionLM-base)
- **Paper:** More info on this later.
- **Demo:** [More Information Needed]

---

# Friendly Reminder:

First of all, thanks for your interest if you plan to use this model. I developed this model primarily to show that you can build

---

## Uses

The tasks for which the TraVisionLM visual language model can be used directly and indirectly are listed below. Be sure to also check the out-of-scope uses.

### Direct Use

- **Short Captioning**
- **Detailed Captioning**
- **Visual Question Answering**

### Downstream Use

- (*Video-Text-to-Text*) The model can be adapted for question answering about your videos. Without any change to the architecture, video frames can be sampled and the model can be asked to generate an answer for each frame.
- (*Retrieval*) The model can be used directly, without any modification, to retrieve the image that best matches a text query.
- (*Finetuning*) For all remaining tasks that the model architecture supports, such as image classification, the model can be trained in a way that is compatible with the Transformers library. For an example, see [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft).

I plan to share posts on these downstream use cases as time permits. In the meantime, I look forward to support or collaboration requests from the community. 🤝💪

### Out-of-Scope Use

This model is not suitable for the following scenarios:

- Although the model answers simple questions about your images, it is not suitable for complex multi-turn chat scenarios. No history is kept, and the model does not use your earlier questions as context. For this task, however, you can prepare a chat template and easily train the model accordingly.
- The model does not accept multiple image inputs. For example, it is not suited to answering questions that compare two different images. Adding this capability requires changes to the architecture. For a model of this kind, see [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (English only).
- The model has not been trained for optical character recognition (OCR), segmentation, or multi-object recognition. To achieve acceptable performance on these tasks, visual language models such as [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) and [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) have been trained on billions of documents and images.

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

More information will come.

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Model Card Contact

If you have questions or suggestions regarding the model, I would prefer that you reach me directly via Hugging Face (e.g. by opening an issue). For other matters, or if you have ideas for collaboration on future projects, reach me at sahin.umitcan@gmail.com.
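
## Sketch: Per-Frame Video Question Answering

The video adaptation mentioned among the downstream uses (sample frames from a video, then ask the model about each sampled frame) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the author's implementation: `sample_frame_indices`, `answer_per_frame`, and the `answer_fn` callable are hypothetical names, and `answer_fn` stands in for whatever function runs the model on a single image and a Turkish question. Decoding the video into frames is left out.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to `num_samples` evenly spaced frame indices from a video."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames)
    # Split the video into equal-width segments and take each segment's centre.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def answer_per_frame(
    frames: Sequence[Frame],
    question: str,
    answer_fn: Callable[[Frame, str], str],  # hypothetical single-image model call
    num_samples: int = 8,
) -> List[Tuple[int, str]]:
    """Ask the same question about each sampled frame.

    Returns (frame_index, answer) pairs in temporal order. No history is
    shared between frames, matching the single-image, single-turn usage
    described in this card.
    """
    indices = sample_frame_indices(len(frames), num_samples)
    return [(i, answer_fn(frames[i], question)) for i in indices]
```

How the per-frame answers are combined (for example, majority vote or simple concatenation) is a separate design choice that this card does not specify.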