Multi-image but single-turn example (#3)
Opened by pcuenq (HF staff)

README.md CHANGED
````diff
@@ -74,22 +74,10 @@ messages = [
         "role": "user",
         "content": [
             {"type": "image"},
-            {"type": "text", "text": "What do we see in this image?"}
-        ]
-    },
-    {
-        "role": "assistant",
-        "content": [
-            {"type": "text", "text": "This image shows a city skyline with prominent landmarks."}
-        ]
-    },
-    {
-        "role": "user",
-        "content": [
             {"type": "image"},
-            {"type": "text", "text": "
+            {"type": "text", "text": "Can you describe the two images?"}
         ]
-    }
+    },
 ]
 
 # Prepare inputs
@@ -108,6 +96,9 @@ print(generated_texts[0])
 ```
 
 
+> The first image shows a statue of the Statue of Liberty in front of a city skyline. The statue is green and is on a pedestal. The city skyline includes many tall buildings and skyscrapers. The sky is clear and blue. The water in the foreground is calm and blue. The second image shows a bee on a pink flower. The flower is surrounded by green leaves.
+
+
 ### Model optimizations
 
 **Precision**: For better performance, load and run the model in half-precision (`torch.float16` or `torch.bfloat16`) if your hardware supports it.
````
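The single-turn, multi-image message structure this PR introduces can be sanity-checked without loading the model: each `{"type": "image"}` placeholder in the conversation must be matched by one image passed to the processor. A minimal sketch (the commented-out processor calls assume a `transformers` `AutoProcessor`; variable names like `image1`/`image2` are illustrative, not from the diff):

```python
# The single-turn, multi-image conversation from the updated README.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"},
        ],
    },
]

# Count the image placeholders: the number of images handed to the
# processor must match this value.
num_image_slots = sum(
    1
    for message in messages
    for part in message["content"]
    if part["type"] == "image"
)
print(num_image_slots)  # prints 2

# With transformers, the flow would then look roughly like (not run here):
# processor = AutoProcessor.from_pretrained(model_id)
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
```

Because everything fits in a single user turn, the model describes both images in one response, as the quoted sample output above shows.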
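One way to apply the "Precision" tip above is to pick `bfloat16` when the hardware supports it and fall back to `float16` otherwise. A minimal sketch; the model-loading call is only in comments, and `model_id` stands in for whatever checkpoint the README uses:

```python
import torch

# Prefer bfloat16 on GPUs that support it (wider exponent range, more
# numerically robust than float16); otherwise fall back to float16.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16

# The model would then be loaded in half precision, e.g.:
# from transformers import AutoModelForVision2Seq
# model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=dtype)
print(dtype)
```

On CPU-only machines this selects `torch.float16`; the dtype choice only pays off on hardware with native half-precision support.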