Upload README.md with huggingface_hub
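The commit title indicates the file was pushed with the `huggingface_hub` client. For reference, here is a minimal sketch of an equivalent upload call; the repo id and local path are illustrative assumptions, not the exact values used for this commit.

```python
# Minimal sketch: pushing a model card with huggingface_hub.
# The repo_id and local path are placeholders, not the actual values
# behind this commit.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` by default
api.upload_file(
    path_or_fileobj="README.md",        # local file to upload
    path_in_repo="README.md",           # destination path inside the repo
    repo_id="OpenGVLab/InternVL3-1B",   # assumed target model repo
    repo_type="model",
    commit_message="Upload README.md with huggingface_hub",
)
```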
README.md
CHANGED
@@ -19,7 +19,7 @@ tags:
 
 # InternVL3-1B
 
-[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](
+[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
@@ -66,9 +66,9 @@ Notably, in InternVL3, we integrate the [Variable Visual Position Encoding (V2PE
 
 ### Native Multimodal Pre-Training
 
-We propose a [Native Multimodal Pre-Training](
+We propose a [Native Multimodal Pre-Training](https://huggingface.co/papers/2504.10479) approach that consolidates language and vision learning into a single pre-training stage.
 In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules.
-Please see [our paper](
+Please see [our paper](https://huggingface.co/papers/2504.10479) for more details.
 
 ### Supervised Fine-Tuning
 
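The "Native Multimodal Pre-Training" paragraph in the updated hunk describes interleaving multimodal samples (image-text, video-text, interleaved sequences) with large-scale text-only corpora in a single pre-training stage. The snippet below is only a toy sketch of such an interleaved sampling loop, with assumed in-memory stand-ins for the corpora and an arbitrary mixing ratio; it is not the authors' training pipeline.

```python
# Toy illustration of mixing text-only and multimodal samples into one
# pre-training stream. The corpora and the 40% multimodal ratio are
# assumptions for illustration, not values from the InternVL3 paper.
import random
from itertools import cycle
from typing import Iterable, Iterator

def interleaved_stream(text_corpus: Iterable, multimodal_corpus: Iterable,
                       p_multimodal: float = 0.4, seed: int = 0) -> Iterator[dict]:
    """Yield a single mixed stream in which each step draws either a
    text-only sample or a multimodal sample, so both data types train
    the same model within one pre-training stage."""
    rng = random.Random(seed)
    text_it, mm_it = cycle(text_corpus), cycle(multimodal_corpus)
    while True:
        if rng.random() < p_multimodal:
            yield {"modality": "image-text", "sample": next(mm_it)}
        else:
            yield {"modality": "text", "sample": next(text_it)}

# Example with small in-memory stand-ins for the two corpora:
texts = [f"text document {i}" for i in range(10)]
pairs = [{"image": f"img_{i}.jpg", "caption": f"caption {i}"} for i in range(10)]
stream = interleaved_stream(texts, pairs)
batch = [next(stream) for _ in range(4)]  # one interleaved mini-batch
```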