Engage in multimodal conversations with images and videos
Transcribe or translate audio and YouTube videos
Transfer portrait styles to images and videos