metadata

title: MLM CLIP PHI 2 LLAVA Chatbot
emoji: 🐠
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: mit

Multimodal AI Assistant

This Gradio app, hosted on Hugging Face Spaces, is an interactive demonstration of a multimodal AI system that accepts image, text, and audio inputs and generates descriptive text outputs.

Overview

The app integrates several AI models (OpenAI CLIP for vision, Microsoft Phi-2 for language, and WhisperX for speech) to interpret different types of input and generate human-like descriptions.

Features

Image Input: Upload an image to receive a descriptive text output.
Text Input: Enter text to see how the AI interprets and describes it.
Audio Input: Upload or record speech; the app transcribes it with WhisperX and generates a text description from the transcript.

How to Use

Upload or Enter Data: Upload your image or audio file, or type your text into the provided field.
Submit for Analysis: Click the 'Submit' button to send your input to the AI model.
View Output: The descriptive text output will be displayed on the screen.
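
The steps above can be sketched as a small Gradio interface. This is a minimal illustration, not the app's actual `app.py`: the `describe` function is a hypothetical stand-in for the model call, and the component labels are assumptions.

```python
from typing import Optional


def describe(image_path: Optional[str], text: Optional[str], audio_path: Optional[str]) -> str:
    """Hypothetical stand-in for the model call: reports which inputs were received."""
    received = []
    if image_path:
        received.append("image")
    if text and text.strip():
        received.append("text")
    if audio_path:
        received.append("audio")
    if not received:
        return "Please provide an image, some text, or an audio clip."
    return "Received: " + ", ".join(received)


def build_demo():
    """Build the interface; gradio is imported lazily so describe() works without it."""
    import gradio as gr

    return gr.Interface(
        fn=describe,  # called when the user clicks Submit
        inputs=[
            gr.Image(type="filepath", label="Image"),
            gr.Textbox(label="Text"),
            gr.Audio(type="filepath", label="Audio"),
        ],
        outputs=gr.Textbox(label="Description"),
        title="Multimodal AI Assistant",
    )

# To serve the demo locally: build_demo().launch()
```

In a real deployment, `describe` would be replaced by a function that passes the inputs to the model and returns its generated text.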

Technologies Used

AI Model: A LLaVA-style vision-language model that combines OpenAI CLIP image features with the Microsoft Phi-2 language model to generate text from various inputs.
WhisperX: Speech recognition for converting spoken audio to text.
Gradio: An easy-to-use interface for creating customizable ML demos.
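
As a rough sketch of the speech-to-text step, the snippet below transcribes an audio file with WhisperX and joins the resulting segments into plain text. The model size, device, compute type, and batch size are assumptions chosen for light CPU use, and `join_segments` and `transcribe` are hypothetical helper names; the app's own settings may differ.

```python
from typing import Dict, List


def join_segments(segments: List[Dict]) -> str:
    """Concatenate the non-empty 'text' fields of transcription segments."""
    return " ".join(
        seg["text"].strip() for seg in segments if seg.get("text", "").strip()
    )


def transcribe(audio_path: str, model_name: str = "small", device: str = "cpu") -> str:
    """Transcribe an audio file with WhisperX and return plain text."""
    import whisperx  # imported lazily; requires the whisperx package

    # compute_type and batch_size are assumed values, not taken from the app
    model = whisperx.load_model(model_name, device, compute_type="int8")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=8)
    return join_segments(result["segments"])
```

The returned transcript can then be passed to the language model in the same way as typed text.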

Contact

Training and implementation code is available on [GitHub](https://github.com/Delve-ERAV1/Phi-2-Vision-Language/tree/main/Phi-2-CLIP-LLAVA-Instruction-Finetuning).

Acknowledgements

This project is inspired by advancements in multimodal AI, particularly the Large Language and Vision Assistant (LLaVA) project. Our implementation demonstrates the integration of language, vision, and audio processing in one coherent system.

COCO 2017 Dataset
Instruct150K Dataset
OpenAI CLIP
LLaVA 1.5: https://github.com/haotian-liu
Microsoft Phi-2