Update README.md
README.md CHANGED
@@ -43,7 +43,7 @@ SmolVLM2-256M-Video is a lightweight multimodal model designed to analyze video
 
 SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input consists of text queries along with video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
 
-To fine-tune SmolVLM2 on a specific task, you can follow [the fine-tuning tutorial](
+To fine-tune SmolVLM2 on a specific task, you can follow [the fine-tuning tutorial](https://github.com/huggingface/smollm/blob/main/vision/finetuning/Smol_VLM_FT.ipynb).
 
 ## Evaluation
 
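For context, below is a minimal inference sketch along the lines of the usage described in the diffed README paragraph. It is not part of this commit; the checkpoint id `HuggingFaceTB/SmolVLM2-256M-Video-Instruct`, the image URL, and the exact `transformers` chat-template calls are assumptions based on the standard image-text-to-text workflow in recent `transformers` releases, not taken from this diff.

```python
# Hypothetical inference sketch for SmolVLM2 (not part of this commit).
# Assumes a recent `transformers` release with image-text-to-text support
# and the HuggingFaceTB/SmolVLM2-256M-Video-Instruct checkpoint (assumed id).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu")

# Interleave an image with a text query, as the README paragraph describes.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/image.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# apply_chat_template with tokenize=True returns processed tensors ready for generate().
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```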