---
title: MLM CLIP PHI 2 LLAVA Chatbot
emoji: 🐠
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: mit
---

# Multimodal AI Assistant

This Gradio app, hosted on Hugging Face Spaces, is an interactive demonstration of a multimodal AI system that processes image, text, and audio inputs and generates descriptive text output.

## Overview

The app combines a CLIP vision encoder with Microsoft's Phi-2 language model in a LLaVA-style architecture, with WhisperX handling speech input, to interpret several types of data and generate human-like descriptions.
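
As a rough sketch of that wiring (the repository linked under Source Code below contains the actual training and inference code), a LLaVA-style model projects CLIP patch embeddings into the language model's embedding space and prepends them to the prompt. The checkpoints and the untrained projection layer here are illustrative stand-ins, not the project's weights:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor, CLIPVisionModel

# Stand-in checkpoints; the actual project fine-tunes its own weights.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# LLaVA-style projection from CLIP's hidden size to Phi-2's hidden size.
# In the real project this layer is learned; here it is randomly initialised.
projection = torch.nn.Linear(vision_tower.config.hidden_size, phi2.config.hidden_size)

# Encode the image into patch embeddings, dropping the CLS token.
pixels = image_processor(images=Image.open("photo.jpg"), return_tensors="pt").pixel_values
patches = vision_tower(pixel_values=pixels).last_hidden_state[:, 1:]
image_embeds = projection(patches)  # (1, num_patches, phi2_hidden)

# Prepend the projected image tokens to the text prompt and generate.
prompt_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids
prompt_embeds = phi2.get_input_embeddings()(prompt_ids)
inputs_embeds = torch.cat([image_embeds, prompt_embeds], dim=1)
output = phi2.generate(inputs_embeds=inputs_embeds, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

This snippet only shows the data flow; in the trained system the projection (and parts of the language model) have been instruction-fine-tuned.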

### Features

- **Image Input**: Upload an image to receive a descriptive text output.
- **Text Input**: Enter text to see how the AI interprets and describes it.
- **Audio Input**: Using WhisperX, the app transcribes spoken-language input and generates a descriptive text response (see the interface sketch below).
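
The interface itself is only a few lines of Gradio. This is a hedged sketch of how `app.py` is likely wired up, with `describe` as a hypothetical stand-in for the real inference function:

```python
import gradio as gr

def describe(image, text, audio):
    # Hypothetical stand-in: the real app runs the CLIP + Phi-2 pipeline
    # (and WhisperX for the audio path) and returns its generated text.
    return "A descriptive caption would appear here."

demo = gr.Interface(
    fn=describe,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(label="Text"),
        gr.Audio(type="filepath", label="Audio"),
    ],
    outputs=gr.Textbox(label="Description"),
    title="Multimodal AI Assistant",
)

if __name__ == "__main__":
    demo.launch()
```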

## How to Use

1. **Upload or Enter Data**: Upload your image or audio file, or type your text into the provided field.
2. **Submit for Analysis**: Click the 'Submit' button to send your input to the AI model.
3. **View Output**: The descriptive text output will be displayed on the screen.
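
You can also call the Space programmatically with `gradio_client` (Gradio 4.x pairs with `gradio_client` >= 1.0). The Space id, argument order, and `api_name` below are assumptions; the Space's "Use via API" page lists the real signature:

```python
from gradio_client import Client, handle_file

# Space id and api_name are placeholders; check "Use via API" on the Space.
client = Client("your-username/MLM-CLIP-PHI-2-LLAVA-Chatbot")
result = client.predict(
    handle_file("photo.jpg"),    # image input
    "What is in this picture?",  # text input
    None,                        # audio input (omitted here)
    api_name="/predict",
)
print(result)
```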

## Technologies Used

- **AI Model**: A CLIP vision encoder paired with Microsoft's Phi-2 language model, instruction-fine-tuned in the style of LLaVA (sketched above in the Overview).
- **WhisperX**: Speech-to-text transcription for the audio input path (see the sketch after this list).
- **Gradio**: An easy-to-use framework for building customizable ML demos.
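
For the audio path, WhisperX transcribes speech before the text reaches the language model. A minimal sketch, where the model size, batch size, and file name are arbitrary choices rather than the app's actual settings:

```python
import whisperx

device = "cuda"  # or "cpu" (use compute_type="int8" on CPU)
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("question.wav")
result = model.transcribe(audio, batch_size=16)

# Join the transcribed segments into a prompt for the language model.
question = " ".join(segment["text"].strip() for segment in result["segments"])
print(question)
```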

## Source Code

Training and implementation code is available on [GitHub](https://github.com/Delve-ERAV1/Phi-2-Vision-Language/tree/main/Phi-2-CLIP-LLAVA-Instruction-Finetuning).


## Acknowledgements

This project is inspired by advancements in multimodal AI, particularly the Large Language and Vision Assistant (LLaVA) project. Our implementation demonstrates the integration of language, vision, and audio processing in one coherent system.


- COCO 2017 Dataset: [cocodataset.org](https://cocodataset.org/)
- Instruct150K Dataset: [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
- OpenAI CLIP: [github.com/openai/CLIP](https://github.com/openai/CLIP)
- LLaVA 1.5: [github.com/haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA)
- Microsoft Phi-2: [huggingface.co/microsoft/phi-2](https://huggingface.co/microsoft/phi-2)