Sijuade committed on
Commit
300318a
1 Parent(s): ce0165c

Update README.md

Files changed (1): README.md +41 -1
README.md CHANGED
@@ -10,4 +10,44 @@ pinned: false
  license: mit
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Multimodal AI Assistant
+
+ This Gradio app, hosted on Hugging Face Spaces, is an interactive demonstration of a multimodal AI system that processes image, text, and audio inputs to generate descriptive text outputs.
+
+ ## Overview
+
+ The app integrates advanced AI models to analyze and interpret various types of data, showcasing the power of multimodal AI systems in understanding and generating human-like descriptions.
+
+ ### Features
+
+ - **Image Input**: Upload an image to receive a descriptive text output.
+ - **Text Input**: Enter text to see how the AI interprets and describes it.
+ - **Audio Input**: Using WhisperX, the app transcribes spoken language inputs and generates text descriptions.
+
+ ## How to Use
+
+ 1. **Upload or Enter Data**: Upload your image or audio file, or type your text into the provided field.
+ 2. **Submit for Analysis**: Click the 'Submit' button to send your input to the AI model.
+ 3. **View Output**: The descriptive text output will be displayed on the screen.
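The flow above can be sketched as a small dispatch step that picks whichever input was provided. This is an illustrative sketch only; `route_input` is a hypothetical name, not the app's actual code:

```python
# Hypothetical sketch of the app's input dispatch: pick the first
# provided modality and hand it to the model as (modality, payload).

def route_input(image=None, text=None, audio=None):
    """Return (modality, payload) for the first non-empty input."""
    if image is not None:
        return "image", image
    if audio is not None:
        # In the real app, audio would first be transcribed
        # (the README names WhisperX for this step).
        return "audio", audio
    if text:
        return "text", text
    raise ValueError("Provide an image, an audio clip, or some text.")

print(route_input(text="Describe a sunset over the ocean."))
# -> ('text', 'Describe a sunset over the ocean.')
```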
+
+ ## Technologies Used
+
+ - **AI Model**: A state-of-the-art model capable of understanding and generating text from various inputs.
+ - **WhisperX**: Advanced audio processing for converting speech to text.
+ - **Gradio**: An easy-to-use interface for creating customizable ML demos.
+
+ ## Source Code
+
+ Training and implementation code is on [GitHub](https://github.com/Delve-ERAV1/Phi-2-Vision-Language/tree/main/Phi-2-CLIP-LLAVA-Instruction-Finetuning).
+
+ ## Acknowledgements
+
+ This project is inspired by advancements in multimodal AI, particularly the Large Language and Vision Assistant (LLaVA) project. Our implementation demonstrates the integration of language, vision, and audio processing in one coherent system.
+
+ - COCO 2017 Dataset
+ - Instruct150K Dataset: [Link to dataset or reference]
+ - OpenAI CLIP
+ - LLaVA 1.5: https://github.com/haotian-liu
+ - Microsoft Phi-2