
Model Card for Aya Vision 32B

C4AI Aya Vision 32B is an open-weights research release of a 32-billion parameter model with advanced capabilities optimized for a variety of vision-language use cases, including OCR, captioning, visual reasoning, summarization, question answering, code, and more. It is a multilingual model trained to excel at vision and language tasks in 23 languages.

This model card corresponds to the 32-billion parameter version of the Aya Vision model. We also released an 8-billion parameter version, which you can find here.

Try it: Aya Vision in Action

Before downloading the weights, you can try Aya Vision 32B chat in the Cohere playground or our dedicated Hugging Face Space for interactive exploration.

WhatsApp Integration

You can also talk to Aya Vision through the popular messaging service WhatsApp. Use this link to open a WhatsApp chatbox with Aya Vision.

If you don’t have WhatsApp installed on your machine, you may need to install it first; if you have it on your phone, you can follow the on-screen instructions to link your phone with WhatsApp Web. Once linked, you should see a text window you can use to chat with the model. More details about our WhatsApp integration are available here.

Example Notebook

You can check out the following notebook to understand how to use Aya Vision for different use cases.

How to Use Aya Vision

Please install transformers from the source repository that includes the necessary changes for this model:

# pip install 'git+https://github.com/huggingface/transformers.git@v4.49.0-AyaVision'
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereForAI/aya-vision-32b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the aya-vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
            {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},  # "What does the text in the image say?"
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs, 
    max_new_tokens=300, 
    do_sample=True, 
    temperature=0.3,
)

print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

You can also use the model directly through the transformers pipeline abstraction:

from transformers import pipeline

pipe = pipeline(model="CohereForAI/aya-vision-32b", task="image-text-to-text", device_map="auto")

# Format message with the aya-vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
            {"type": "text", "text": "Bu resimde hangi anıt gösterilmektedir?"},  # "Which monument is shown in this picture?"
        ],
    },
]
outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)

print(outputs)

Model Details

Input: Model accepts input text and images.

Output: Model generates text.

Model Architecture: This is a vision-language model that uses a state-of-the-art multilingual language model, Aya Expanse 32B, trained with the Aya Expanse recipe and paired with the SigLIP2-patch14-384 vision encoder through a multimodal adapter for vision-language understanding.
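
The two components described above can be inspected directly from the released checkpoint's configuration. This is a minimal sketch, not part of the official examples; it assumes the config exposes LLaVA-style vision_config and text_config sub-configs, which is how vision-language models are typically laid out in transformers.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("CohereForAI/aya-vision-32b")

# The vision tower (SigLIP2) and the language model (Aya Expanse 32B) are stored
# as nested sub-configs; attribute names may vary between transformers versions,
# so we use getattr with a default rather than assuming they exist.
print(type(config).__name__)
print(getattr(config, "vision_config", None))
print(getattr(config, "text_config", None))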

Image Processing: We use 169 visual tokens to encode a single image tile with a resolution of 364x364 pixels. Input images of arbitrary size are mapped to the nearest supported resolution based on their aspect ratio. Aya Vision uses up to 12 input tiles plus a thumbnail (resized to 364x364), for a maximum of 2,197 image tokens.
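
As a back-of-the-envelope illustration of this token budget (not the official preprocessor), the sketch below estimates the image token count for a given input size. The grid-selection heuristic (pick the tile grid whose aspect ratio best matches the image) and the assumption that a thumbnail is always added are simplifications for illustration only.

TILE_SIZE = 364
TOKENS_PER_TILE = 169
MAX_TILES = 12

def image_token_count(width: int, height: int) -> int:
    # Enumerate all tile grids with at most MAX_TILES tiles.
    grids = [(c, r) for c in range(1, MAX_TILES + 1)
                    for r in range(1, MAX_TILES + 1) if c * r <= MAX_TILES]
    # Pick the grid whose aspect ratio is closest to the image's (assumption).
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - width / height))
    num_tiles = cols * rows
    # Tiles plus the thumbnail each cost TOKENS_PER_TILE visual tokens.
    return (num_tiles + 1) * TOKENS_PER_TILE

# Worst case: 12 tiles + thumbnail = 13 * 169 = 2197 image tokens.
print(image_token_count(4 * TILE_SIZE, 3 * TILE_SIZE))  # 2197
print(image_token_count(364, 364))                      # 1 tile + thumbnail -> 338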

Languages covered: The model has been trained on 23 languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese (Simplified and Traditional), Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.

Context length: Aya Vision 32B supports a context length of 16K.

For more details about how the model was trained, check out our blogpost.

Evaluation

We evaluated Aya Vision 32B against Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B using the Aya Vision Benchmark and m-WildVision. Win rates were determined using claude-3-7-sonnet-20250219 as a judge, chosen for its superior judging performance compared to other models.

We also evaluated Aya Vision 32B’s performance on text-only input against the same models using m-ArenaHard, a challenging open-ended generation evaluation, measured by win rates with gpt-4o-2024-11-20 as the judge.
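
For readers unfamiliar with judge-based win rates, the sketch below shows one common way such pairwise verdicts are aggregated. It is a generic illustration; the exact protocol used for the Aya Vision Benchmark, m-WildVision, and m-ArenaHard numbers is described in the blog post, not here, and the tie-handling convention is an assumption.

from collections import Counter

def win_rate(verdicts: list[str]) -> float:
    """verdicts: one of 'win', 'loss', or 'tie' per prompt, as decided by the judge model.
    Ties are counted as half a win, a common convention for pairwise comparisons."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return (counts["win"] + 0.5 * counts["tie"]) / total if total else 0.0

print(win_rate(["win", "win", "tie", "loss"]))  # 0.625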

Model Card Contact

For errors or additional questions about details in this model card, contact info@for.ai.

Terms of Use

We hope that this release will make community-based research efforts more accessible by providing the weights of a highly performant 32-billion parameter vision-language model to researchers all over the world.

This model is governed by a CC-BY-NC License with an acceptable use addendum, and also requires adhering to C4AI's Acceptable Use Policy.
