Engage in multimodal conversations with images and videos
Transcribe audio or YouTube videos to text
Transfer portrait styles to images and videos