Add FT tutorial link
README.md CHANGED

@@ -35,7 +35,7 @@ SmolVLM is a compact open multimodal model that accepts arbitrary sequences of i
 
 SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.
 
-To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
+To fine-tune SmolVLM on a specific task, you can follow the [fine-tuning tutorial](https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb).
 <!-- todo: add link to fine-tuning tutorial -->
 
 ### Technical Summary