I would like to propose adding support for multimodal capabilities to the Hugging Face API, enabling the use of vision models alongside text models. This would allow developers to pass both image and text data into models that can process both modalities.

codelion changed pull request status to merged

I had to roll back the changes because the image inputs are not passed to the client in a way that is compatible with the OpenAI SDK. Happy to look at it again if you manage to fix it.
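For context, the OpenAI SDK expects image and text inputs interleaved as typed content parts within a single user message, rather than as separate arguments. A minimal sketch of that message shape (the model name and image URL below are placeholders, not values from this thread):

```python
def build_multimodal_messages(text: str, image_url: str) -> list:
    """Build a chat message whose content mixes text and image parts,
    following the OpenAI Chat Completions multimodal format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


messages = build_multimodal_messages(
    "Describe this image.", "https://example.com/cat.png"
)

# A client call would then look like (model name is illustrative):
# client.chat.completions.create(model="gpt-4o", messages=messages)
print(messages[0]["content"][1]["type"])
```

An integration that passes images any other way (e.g. as a top-level parameter) would not round-trip through an OpenAI-compatible client, which matches the incompatibility described above.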
