Generate text based on images and prompts
OpenAI's Deep Research, but open
Blind vote on HF TTS models!