HuggingFaceTB/SmolVLM-Instruct

Add FT tutorial link

#22

by merve HF staff - opened 11 days ago

base: refs/heads/main ← from: refs/pr/22

README.md +1 -1

@@ -35,7 +35,7 @@ SmolVLM is a compact open multimodal model that accepts arbitrary sequences of i
 SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.
-To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
+To fine-tune SmolVLM on a specific task, you can follow the [fine-tuning tutorial](https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb).
 <!-- todo: add link to fine-tuning tutorial -->
 ### Technical Summary