ucsahin
/

TraVisionLM-base

@@ -41,7 +41,7 @@ Türkçe görsel dil modelini deneyimlemeye hazır mısınız? Hadi başlayalım
 This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. The vision projector connects the two modalities together.
 Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some refined adjustments to the vision projector and the causal language modeling.
-Here's a glimpse into the development process:
 1) **Unimodal pretraining**
     - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
@@ -64,9 +64,9 @@ Geliştirme sürecinin özeti:
 2) **Özellik Uyarlama**
     - [LLaVA eğitim tarifesi](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train) izlenerek, sadece görsel projektörü 500K görüntü-metin çiftleri ile eğiterek görsel ve metin özelliklerini uyumlu hale getiriyorum.
 3) **Görev Spesifik Eğitim**
-    - Uyumlulaştırılmış model, kısa açıklama, detaylı açıklama ve basit görsel soru cevaplama gibi görevler için daha fazla eğitim alıyor; 1M'den fazla görüntü-istek-tamamlanma üçlüsü kullanılıyor.
 4) **İndirgeme Görevlerinde İnce Ayar**
-    - Son olarak, modelin çeşitli görevlerdeki çok yönlülüğünü göstermek amacıyla nesne tespiti için ince ayar yapılmıştır. Nesne tespiti için ince ayar yapılmış modeli daha fazla detay için ucsahin/TraVisionLM-Object-Detection-ft adresinden keşfedebilirsiniz.
 ### Model Description
@@ -85,8 +85,17 @@ Geliştirme sürecinin özeti:
 - **Paper [optional]:** More info on this later.
 - **Demo [optional]:** [More Information Needed]
-## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
@@ -107,6 +116,33 @@ Geliştirme sürecinin özeti:
 [More Information Needed]
 ## Bias, Risks, and Limitations
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
@@ -164,9 +200,6 @@ Use the code below to get started with the model.
 [More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 [More Information Needed]
@@ -178,7 +211,7 @@ Use the code below to get started with the model.
 ### Results
-[More Information Needed]
@@ -191,11 +224,9 @@ Use the code below to get started with the model.
 [More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
@@ -207,20 +238,9 @@ Use the code below to get started with the model.
 [More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
 ## Model Card Contact
-[More Information Needed]

 This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. The vision projector connects the two modalities together.
 Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some refined adjustments to the vision projector and the causal language modeling.
+Here's the summary of the development process:
 1) **Unimodal pretraining**
     - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
 2) **Özellik Uyarlama**
     - [LLaVA eğitim tarifesi](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train) izlenerek, sadece görsel projektörü 500K görüntü-metin çiftleri ile eğiterek görsel ve metin özelliklerini uyumlu hale getiriyorum.
 3) **Görev Spesifik Eğitim**
+    - Bu adımda, uyumlulaştırılmış model, kısa açıklama, detaylı açıklama ve basit görsel soru cevaplama gibi görevler için daha fazla eğitilmiştir; 1M'den fazla resim-istek-tamamlanma üçlüsünden oluşan veri seti kullanılmıştır.
 4) **İndirgeme Görevlerinde İnce Ayar**
+    - Son olarak, modelin çeşitli görevlerdeki çok yönlülüğünü göstermek amacıyla nesne tespiti için ince ayarı yapılmıştır. Nesne tespiti için ince ayar yapılmış modele detaylar için [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) adresinden ulaşabilirsiniz.
 ### Model Description
 - **Paper [optional]:** More info on this later.
 - **Demo [optional]:** [More Information Needed]
+---
+# Friendly Reminder:
+First of all, thanks for your interest if you plan to use this model. I developed this model to primarily show that you can build
+# Kullanıcılar için Önemli Bir Hatırlatma:
+---
+## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
 [More Information Needed]
+## Türkçe: Kullanım Alanları
+Aşağıda TraVisionLM görsel dil modelinin, hangi görevler için doğrudan ve dolaylı kullanılabileceği durumlar verilmiştir. Ayrıca alan dışı kullanımlar kısmına da göz atmayı unutmayın.
+### Doğrudan Kullanım Alanları
+ - **Kısa Açıklama**
+ - **Detaylı Açıklama**
+ - **Görsel Soru Cevaplama**
+### Dolaylı Kullanım Alanları
+ - (*Video-Text-to-Text*) Model videolarınızla ilgili soru cevap görevi için adapte edilebilir. Mimariye hiçbir değişiklik yapmadan, video kareleri örneklenerek, her bir kare üzerinden modele cevap ürettirilebilir.
+ - (*Retrieval*) Metne dayalı en uygun görüntü alma görevi için model, herhangi bir değişiklik yapılmadan doğrudan kullanılabilir.
+ - (*Finetuning*) Model mimarisini destekleyen görsel sınıflandırma gibi geri kalan bütün görevler için model Transformers kütüphanesiyle uyumlu bir şekilde eğitilebilir. Bir örnek için [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) adresine bakabilirsiniz.
+```Zaman buldukça bu dolaylı kullanım uygulamaları ile paylaşımlar yapmayı planlıyorum. Bu sürede topluluktan da destek ya da işbirliği isteklerini dört gözle bekliyorum``` 🤝💪
+### Alan-dışı Kullanımlar
+Bu modelin aşağıdaki senaryolar için kullanımı uygun değildir:
+ - Model, resimlerinizle ilgili basit sorulara cevap verse de, çok turlu kompleks chat senaryoları için uygun değildir. Geçmiş bilgisi tutulmamaktadır, model daha önce sorduğunuz soruları kontekst olarak kullanmamaktadır. Fakat bu görev için, bir chat şablonu hazırlayıp bu doğrultuda modeli kolayca eğitebilirsiniz.
+ - Model çoklu görsel girdi kabul etmemektedir. Örneğin, iki farklı resmi karşılaştıran sorulara cevap vermeye uygun değildir. Bu özelliği kazandırmak için mimariye değişiklikler yapmak gerekmektedir. Bu tarz bir model için [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (sadece ingilizce) modeline bakabilirsiniz.
+ - Model, karakter ve yazı tanıma (OCR), segmentasyon ve çoklu obje tanıma görevleri için eğitilmemiştir. Bu görevlerde kabul edilebilir başarılar alabilmek için [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) ve [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) gibi görsel dil modelleri milyarlarca doküman ve resimle eğitilmiştir.
 ## Bias, Risks, and Limitations
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 [More Information Needed]
 [More Information Needed]
 ### Results
+More information will come
 [More Information Needed]
+## Citation
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 [More Information Needed]
 ## Model Card Contact
+If you have questions or suggestions regarding the model, I prefer if you would reach me directly via Hugging Face (e.g. opening an issue). But if you have specific things in your mind or any ideas for collaboration on future projects, reach me at sahin.umitcan@gmail.com
+Modelle ilgili sorularınız veya önerileriniz varsa, doğrudan bana Hugging Face üzerinden (örneğin, bir issue açarak) ulaşmanızı tercih ederim. Diğer konular veya gelecekteki projelerde işbirliği için herhangi bir fikriniz varsa, bana sahin.umitcan@gmail.com adresinden ulaşabilirsiniz.