Add link to paper, update pipeline tag

#3
by nielsr - opened
Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -4,7 +4,7 @@ language:
 - en
 base_model:
 - meta-llama/Llama-3.2-11B-Vision-Instruct
-pipeline_tag: visual-question-answering
+pipeline_tag: image-text-to-text
 library_name: transformers
 ---
 # Model Card for Model ID
@@ -13,6 +13,8 @@ library_name: transformers
 
 Llama-3.2V-11B-cot is the first version of [LLaVA-o1](https://github.com/PKU-YuanGroup/LLaVA-o1), which is a visual language model capable of spontaneous, systematic reasoning.
 
+The model was proposed in [LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://huggingface.co/papers/2411.10440).
+
 ## Model Details
 
 <!-- Provide a longer summary of what this model is. -->
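
Since this PR switches the `pipeline_tag` to `image-text-to-text`, the snippet below is a minimal sketch of loading the model through that transformers pipeline. The repo id, image URL, and generation settings are illustrative assumptions, not part of this PR or the model card.

```python
# Minimal sketch of using the "image-text-to-text" pipeline the updated tag points to.
# Assumptions: the repo id and image URL below are placeholders for illustration.
from transformers import pipeline

model_id = "Xkev/Llama-3.2V-11B-cot"  # assumed repo id for this model card; adjust to the actual repo

# The pipeline task matches the new pipeline_tag in the README metadata.
pipe = pipeline("image-text-to-text", model=model_id, device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image step by step."},
        ],
    }
]

# Chat-style input lets the processor apply the model's chat template (including image tokens).
out = pipe(text=messages, max_new_tokens=256, return_full_text=False)
print(out[0]["generated_text"])
```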