dmedhi/flosmolv

@@ -1,10 +1,38 @@
 ---
-pipeline_tag: text-generation
 tags:
-- model_hub_mixin
-- pytorch_model_hub_mixin
 ---
-This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
-- Library: [More Information Needed]
-- Docs: [More Information Needed]

 ---
+pipeline_tag: image-text-to-text
 tags:
+- florence2
+- smollm
+- custom_code
+license: apache-2.0
 ---
+## FloSmolV
+A vision-language model for **image-text-to-text** generation, built by combining [HuggingFaceTB/SmolLM-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-360M-Instruct) and [microsoft/Florence-2-base](https://huggingface.co/microsoft/Florence-2-base).
+**Florence-2-base** generates text (captions) from input images quickly. That caption can then be passed to a language model to
+answer questions. **SmolLM-360M-Instruct**, from the Hugging Face team, generates text responses to input queries quickly. FloSmolV combines the two models into a
+Visual Question Answering model that produces answers from images.
+## Usage
+### Transformers
+Make sure to install the necessary dependencies first.
+```bash
+pip install -qU transformers accelerate einops bitsandbytes flash_attn timm
+```
+```python
+# Load a sample image from Pixabay
+from PIL import Image
+import requests
+url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg"
+img = Image.open(requests.get(url, stream=True).raw)
+# Download the model. trust_remote_code=True is needed because the repo ships
+# custom model code; .cuda() moves the model to a CUDA GPU.
+from transformers import AutoModelForCausalLM
+model = AutoModelForCausalLM.from_pretrained("dmedhi/flosmolv", trust_remote_code=True).cuda()
+# Ask a question about the image
+model(img, "what is the object in the image?")
+```
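
The model card describes a two-stage flow: a captioner turns the image into text, and a language model answers the question from that text. The sketch below illustrates that flow only. It is not the code shipped in `dmedhi/flosmolv`; the names `build_prompt` and `answer_question` and the prompt template are hypothetical, and the captioner and language model are passed in as plain callables so the example runs without downloading any weights.

```python
from typing import Any, Callable


def build_prompt(caption: str, question: str) -> str:
    """Combine an image caption and a user question into one text prompt.

    The template is an assumption made for illustration; the actual
    prompt format used by FloSmolV is not stated in the model card.
    """
    return (
        f"Image description: {caption.strip()}\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any], str],
    llm: Callable[[str], str],
) -> str:
    """Caption the image first, then let the language model answer."""
    caption = captioner(image)                # stage 1: image -> text
    prompt = build_prompt(caption, question)  # merge caption and question
    return llm(prompt)                        # stage 2: text -> answer


if __name__ == "__main__":
    # Stand-ins for Florence-2 and SmolLM so the sketch runs offline.
    fake_captioner = lambda image: "A red cable car above a valley."
    fake_llm = lambda prompt: "(language model reply for: " + prompt.splitlines()[1] + ")"
    print(answer_question(None, "What is the object in the image?", fake_captioner, fake_llm))
```

Keeping the two stages behind plain callables makes it possible to swap in a real captioner and language model later, or to test the prompt construction on its own.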