Update README.md
README.md CHANGED
@@ -7,6 +7,8 @@ pipeline_tag: image-text-to-text
 tags:
 - multimodal
 - aria
+base_model:
+- rhymes-ai/Aria-Base-8K
 ---
 <!-- <p align="center">
 <br>Aria</br>
@@ -25,7 +27,7 @@ This checkpoint is one of base models of [Aria](https://huggingface.co/rhymes-ai
 
 <img src="./aria-stages.png" alt="Aria Training Stages" style="width: 100%;">
 
-Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/
+Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/rhymes-ai/Aria-Base-8K).
 
 <!--
 - Aria is the **first open multimodal native MoE** model, capable of seamlessly handling various input modalities within a MoE architecture.
@@ -33,12 +35,12 @@ Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/teowu/Ari
 - Compared to similar or even larger models, Aria boasts **faster speeds** and **lower costs**. This high efficiency stems from its ability to activate only 3.9B parameters during inference – the **fewest** among models with comparable performance.
 -->
 
-## Aria-Base-
+## Aria-Base-64K
 
 - **Base Model After Long-Context Pre-training**: This model is the checkpoint after the long-context pre-training stage, with 33B tokens (21B multimodal, 12B language, 69% long-form) trained in this stage. The stage lasts 1,000 iterations, with all sequences packed to a length of 65,536 with Megatron-LM and a global batch size of 512; the learning rate is held constant at `3.5e-5` throughout.
 - **Appropriate for Video and Long-document Fine-tuning**: This model is recommended for long-form continued pre-training or fine-tuning, e.g. on video QA or long-document QA datasets. When resources are limited, it is also possible to post-train this model on short instruction-tuning datasets and transfer to long-form QA scenarios.
 - **Understanding of Hundreds of Images**: This model is capable of understanding up to 250 high-resolution images or up to 500 mid-resolution images.
-- **Strong Base Performance on Language and Multimodal Scenarios**: This model retains strong base performance as [Aria-Base-8K](https://huggingface.co/
+- **Strong Base Performance on Language and Multimodal Scenarios**: This model retains base performance as strong as [Aria-Base-8K](https://huggingface.co/rhymes-ai/Aria-Base-8K).
 - ***Limited Chat Template Availability***: Only a small percentage of the training data (around 3%) is re-formatted with the chat template, so the model may not be optimal when used directly with chat templates.
 
 <!-- # Model Info
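A quick sanity check of the long-context stage described in the hunk above: the stated iteration count, global batch size, and packing length reproduce the quoted ~33B-token budget, and the 250/500-image figures line up with the 64K context length. The per-image visual token counts in the second check (≈256 for high resolution, ≈128 for mid resolution) are an assumption based on the Aria report, not something stated in this card. A minimal sketch:

```python
# 1) Token budget for the long-context stage:
#    1,000 iterations x 512 packed sequences x 65,536 tokens per sequence.
iterations = 1_000
global_batch_size = 512
packed_seq_len = 65_536
stage_tokens = iterations * global_batch_size * packed_seq_len
print(f"~{stage_tokens / 1e9:.1f}B tokens")  # ~33.6B, consistent with the ~33B quoted above

# 2) Image capacity within a 64K context (assumed visual token counts, see note above).
context_len = 65_536
print(context_len // 256)  # 256 high-res images -> "up to 250" leaves headroom for text tokens
print(context_len // 128)  # 512 mid-res images  -> "up to 500" likewise
```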
@@ -68,7 +70,7 @@ import torch
 from PIL import Image
 from transformers import AutoModelForCausalLM, AutoProcessor
 
-model_id_or_path = "
+model_id_or_path = "rhymes-ai/Aria-Base-64K"
 
 model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
 
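The final hunk shows only the start of the card's inference snippet. Below is a minimal sketch of how it typically continues for Aria-family checkpoints, assuming the processor and chat-template flow published for the instruction-tuned Aria model; the image path and prompt are illustrative placeholders, not part of this diff. Because this base checkpoint saw little chat-template data, outputs may read more like free-form continuations than chat answers.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "rhymes-ai/Aria-Base-64K"

# Load model and processor (remote code is required for the Aria architecture).
model = AutoModelForCausalLM.from_pretrained(
    model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# Placeholder image and prompt (replace with your own inputs).
image = Image.open("sample_image.png")
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "Describe the image.", "type": "text"},
        ],
    }
]

# Build the prompt with the chat template and preprocess image + text together.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
    )

# Decode only the newly generated tokens.
output_ids = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(output_ids, skip_special_tokens=True))
```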