Update README.md
README.md CHANGED
@@ -7,6 +7,8 @@ pipeline_tag: image-text-to-text
 tags:
 - multimodal
 - aria
+base_model:
+- rhymes-ai/Aria-Base-8K
 ---
 <!-- <p align="center">
 <br>Aria</br>
@@ -25,7 +27,7 @@ This checkpoint is one of base models of [Aria](https://huggingface.co/rhymes-ai
 
 <img src="./aria-stages.png" alt="Aria Training Stages" style="width: 100%;">
 
-Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/
+Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/rhymes-ai/Aria-Base-8K).
 
 <!--
 - Aria is the **first open multimodal native MoE** model, capable of seamlessly handling various input modalities within a MoE architecture.
@@ -33,12 +35,12 @@ Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/teowu/Ari
 - Compared to similar or even larger models, Aria boasts **faster speeds** and **lower costs**. This high efficiency stems from its ability to activate only 3.9B parameters during inference – the **fewest** among models with comparable performance.
 -->
 
-## Aria-Base-
+## Aria-Base-64K
 
 - **Base Model After Long-Context Pre-training**: This model is the checkpoint after the long-context pre-training stage, with 33B tokens (21B multimodal, 12B language, 69% long-form) trained in this stage. The stage lasts 1,000 iterations, with all sequences packed to a length of 65,536 with Megatron-LM and a global batch size of 512; the learning rate is held constant at `3.5e-5` throughout.
 - **Appropriate for Video and Long-document Fine-tuning**: This model is recommended for long-form continued pre-training or fine-tuning, e.g. on video QA or long-document QA datasets. When resources are limited, it is also possible to post-train this model on short instruction-tuning datasets and transfer to long-form QA scenarios.
 - **Understanding of Hundreds of Images**: This model is capable of understanding up to 250 high-resolution images or up to 500 mid-resolution images.
-- **Strong Base Performance on Language and Multimodal Scenarios**: This model retains strong base performance as [Aria-Base-8K](https://huggingface.co/
+- **Strong Base Performance on Language and Multimodal Scenarios**: This model retains base performance as strong as [Aria-Base-8K](https://huggingface.co/rhymes-ai/Aria-Base-8K).
 - ***Limited Chat Template Availability***: Only a small percentage of the training data (around 3%) is re-formatted with the chat template, so the model may not be optimal when used directly with chat templates.
 
 <!-- # Model Info
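A quick sanity check of the long-context stage described in the hunk above: the stated iteration count, global batch size, and packing length reproduce the quoted ~33B-token budget, and the 250/500-image figures line up with the 64K context length. The per-image visual token counts in the second check (≈256 for high resolution, ≈128 for mid resolution) are an assumption based on the Aria report, not something stated in this card. A minimal sketch:

```python
# 1) Token budget for the long-context stage:
#    1,000 iterations x 512 packed sequences x 65,536 tokens per sequence.
iterations = 1_000
global_batch_size = 512
packed_seq_len = 65_536
stage_tokens = iterations * global_batch_size * packed_seq_len
print(f"~{stage_tokens / 1e9:.1f}B tokens")  # ~33.6B, consistent with the ~33B quoted above

# 2) Image capacity within a 64K context (assumed visual token counts, see note above).
context_len = 65_536
print(context_len // 256)  # 256 high-res images -> "up to 250" leaves headroom for text tokens
print(context_len // 128)  # 512 mid-res images  -> "up to 500" likewise
```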
@@ -68,7 +70,7 @@ import torch
 from PIL import Image
 from transformers import AutoModelForCausalLM, AutoProcessor
 
-model_id_or_path = "
+model_id_or_path = "rhymes-ai/Aria-Base-64K"
 
 model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
 
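The final hunk shows only the start of the card's inference snippet. Below is a minimal sketch of how it typically continues for Aria-family checkpoints, assuming the processor and chat-template flow published for the instruction-tuned Aria model; the image path and prompt are illustrative placeholders, not part of this diff. Because this base checkpoint saw little chat-template data, outputs may read more like free-form continuations than chat answers.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "rhymes-ai/Aria-Base-64K"

# Load model and processor (remote code is required for the Aria architecture).
model = AutoModelForCausalLM.from_pretrained(
    model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# Placeholder image and prompt (replace with your own inputs).
image = Image.open("sample_image.png")
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "Describe the image.", "type": "text"},
        ],
    }
]

# Build the prompt with the chat template and preprocess image + text together.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
    )

# Decode only the newly generated tokens.
output_ids = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(output_ids, skip_special_tokens=True))
```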