Upload README.md with huggingface_hub
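The commit title indicates the file was pushed with the `huggingface_hub` client. For reference, here is a minimal sketch of an equivalent upload call; the repo id and local path are illustrative assumptions, not the exact values used for this commit.

```python
# Minimal sketch: pushing a model card with huggingface_hub.
# The repo_id and local path are placeholders, not the actual values
# behind this commit.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` by default
api.upload_file(
    path_or_fileobj="README.md",        # local file to upload
    path_in_repo="README.md",           # destination path inside the repo
    repo_id="OpenGVLab/InternVL3-1B",   # assumed target model repo
    repo_type="model",
    commit_message="Upload README.md with huggingface_hub",
)
```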
README.md
CHANGED
@@ -19,7 +19,7 @@ tags:
 
 # InternVL3-1B
 
-[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](
+[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
@@ -66,9 +66,9 @@ Notably, in InternVL3, we integrate the [Variable Visual Position Encoding (V2PE
 
 ### Native Multimodal Pre-Training
 
-We propose a [Native Multimodal Pre-Training](
+We propose a [Native Multimodal Pre-Training](https://huggingface.co/papers/2504.10479) approach that consolidates language and vision learning into a single pre-training stage.
 In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules.
-Please see [our paper](
+Please see [our paper](https://huggingface.co/papers/2504.10479) for more details.
 
 ### Supervised Fine-Tuning
 
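The "Native Multimodal Pre-Training" paragraph in the updated hunk describes interleaving multimodal samples (image-text, video-text, interleaved sequences) with large-scale text-only corpora in a single pre-training stage. The snippet below is only a toy sketch of such an interleaved sampling loop, with assumed in-memory stand-ins for the corpora and an arbitrary mixing ratio; it is not the authors' training pipeline.

```python
# Toy illustration of mixing text-only and multimodal samples into one
# pre-training stream. The corpora and the 40% multimodal ratio are
# assumptions for illustration, not values from the InternVL3 paper.
import random
from itertools import cycle
from typing import Iterable, Iterator

def interleaved_stream(text_corpus: Iterable, multimodal_corpus: Iterable,
                       p_multimodal: float = 0.4, seed: int = 0) -> Iterator[dict]:
    """Yield a single mixed stream in which each step draws either a
    text-only sample or a multimodal sample, so both data types train
    the same model within one pre-training stage."""
    rng = random.Random(seed)
    text_it, mm_it = cycle(text_corpus), cycle(multimodal_corpus)
    while True:
        if rng.random() < p_multimodal:
            yield {"modality": "image-text", "sample": next(mm_it)}
        else:
            yield {"modality": "text", "sample": next(text_it)}

# Example with small in-memory stand-ins for the two corpora:
texts = [f"text document {i}" for i in range(10)]
pairs = [{"image": f"img_{i}.jpg", "caption": f"caption {i}"} for i in range(10)]
stream = interleaved_stream(texts, pairs)
batch = [next(stream) for _ in range(4)]  # one interleaved mini-batch
```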