---
license: apache-2.0
datasets:
- neulab/PangeaInstruct
language:
- am
- ar
- bg
- bn
- cs
- de
- el
- en
- es
- fa
- fr
- ga
- hi
- id
- ig
- it
- iw
- ja
- jv
- ko
- nl
- mn
- ms
- no
- pl
- pt
- ro
- ru
- si
- su
- sw
- ta
- te
- th
- tr
- uk
- ur
- vi
- zh
base_model:
- Qwen/Qwen2-7B-Instruct
---
# Pangea-7B Model Card

[Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages](https://neulab.github.io/Pangea/)

🇪🇹 🇸🇦 🇧🇬 🇧🇩 🇨🇿 🇩🇪 🇬🇷 🇬🇧 🇺🇸 🇪🇸 🇮🇷 🇫🇷 🇮🇪 🇮🇳 🇮🇩 🇳🇬 🇮🇹 🇮🇱 🇯🇵 🇮🇩 🇰🇷 🇳🇱 🇲🇳 🇲🇾 🇳🇴 🇵🇱 🇵🇹 🇧🇷 🇷🇴 🇷🇺 🇱🇰 🇮🇩 🇰🇪 🇹🇿 🇱🇰 🇹🇭 🇹🇷 🇺🇦 🇵🇰 🇻🇳 🇨🇳 🇹🇼

[🏠 Homepage](https://neulab.github.io/Pangea/) | [🤖 Pangea-7B](https://huggingface.co/neulab/Pangea-7B) | [📊 PangeaIns](https://huggingface.co/datasets/neulab/PangeaInstruct) | [🧪 PangeaBench](https://huggingface.co/collections/neulab/pangea-6713c3b0d78a453906eb2ed8) | [💻 Github](https://github.com/neulab/Pangea/tree/main) | [📄 Arxiv](https://arxiv.org/abs/2410.16153) | [📕 PDF](https://arxiv.org/pdf/2410.16153) | [🖥️ Demo](https://huggingface.co/spaces/neulab/Pangea)

## Model details

- **Model:** Pangea is a fully open-source Multilingual Multimodal Multicultural LLM.
- **Date:** Pangea-7B was trained in 2024.
- **Training Dataset:** [6M PangeaIns](https://huggingface.co/datasets/neulab/PangeaInstruct).
- **Architecture:** Pangea-7B follows the architecture of [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT), with a [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) backbone.

## Uses

Pangea-7B follows the architecture of [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

You can either (1) follow the same model-loading procedure as [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) (an example of loading Pangea-7B directly is shown in the Python code below), or (2) use our Hugging Face version of Pangea-7B: [Pangea-7B-hf](https://huggingface.co/neulab/Pangea-7B-hf).

### Direct Use

First, clone and install [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e ".[train]"
```

Then load Pangea-7B using the following code:

```python
from llava.model.builder import load_pretrained_model

model_path = 'neulab/Pangea-7B'
model_name = 'Pangea-7B-qwen'
args = {"multimodal": True}
# The second positional argument is the base model; None means no separate base checkpoint.
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, **args)
```

Next, define some helper functions for using the model:

```python
import re
from typing import Dict

import torch
import transformers
from PIL import Image

from llava.constants import (
    IGNORE_INDEX,
    IMAGE_TOKEN_INDEX,
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
)
from llava.utils import disable_torch_init


def preprocess_qwen(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False,
    max_len=2048,
    system_message: str = "You are a helpful assistant.",
) -> Dict:
    roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}

    # Token IDs for the <|im_start|> / <|im_end|> delimiters and the role headers.
    im_start, im_end = tokenizer.additional_special_tokens_ids
    nl_tokens = tokenizer("\n").input_ids
    _system = tokenizer("system").input_ids + nl_tokens
    _user = tokenizer("user").input_ids + nl_tokens
    _assistant = tokenizer("assistant").input_ids + nl_tokens

    input_ids = []
    source = sources
    # Drop the first turn if it does not come from the human.
    if roles[source[0]["from"]] != roles["human"]:
        source = source[1:]

    input_id, target = [], []
    # System prompt: <|im_start|>system\n{system_message}<|im_end|>\n
    system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
    input_id += system
    # Mask the system prompt in the targets, keeping only the delimiters and trailing newline.
    target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
    assert len(input_id) == len(target)

    for j, sentence in enumerate(source):
        role = roles[sentence["from"]]
        if has_image and sentence["value"] is not None and "<image>" in sentence["value"]:
            num_image = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"]))
            # Split the text around each image placeholder.
            texts = sentence["value"].split('<image>')
            _input_id = tokenizer(role).input_ids + nl_tokens
            for i, text in enumerate(texts):
                _input_id += tokenizer(text).input_ids
                if i