Sijuade committed on
Commit
300318a
1 Parent(s): ce0165c

Update README.md

Files changed (1): README.md +41 -1
README.md CHANGED
@@ -10,4 +10,44 @@ pinned: false
  license: mit
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Multimodal AI Assistant
+
+ This Gradio app, hosted on Hugging Face Spaces, is an interactive demonstration of a multimodal AI system that processes image, text, and audio inputs to generate descriptive text outputs.
+
+ ## Overview
+
+ The app integrates advanced AI models to analyze and interpret various types of data, showcasing the power of multimodal AI systems in understanding and generating human-like descriptions.
+
+ ### Features
+
+ - **Image Input**: Upload an image to receive a descriptive text output.
+ - **Text Input**: Enter text to see how the AI interprets and describes it.
+ - **Audio Input**: Using WhisperX, the app transcribes spoken language inputs and generates text descriptions.
+
+ ## How to Use
+
+ 1. **Upload or Enter Data**: Upload your image or audio file, or type your text into the provided field.
+ 2. **Submit for Analysis**: Click the 'Submit' button to send your input to the AI model.
+ 3. **View Output**: The descriptive text output will be displayed on the screen.
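The flow above can be sketched as a small dispatch step that picks whichever input was provided. This is an illustrative sketch only; `route_input` is a hypothetical name, not the app's actual code:

```python
# Hypothetical sketch of the app's input dispatch: pick the first
# provided modality and hand it to the model as (modality, payload).

def route_input(image=None, text=None, audio=None):
    """Return (modality, payload) for the first non-empty input."""
    if image is not None:
        return "image", image
    if audio is not None:
        # In the real app, audio would first be transcribed
        # (the README names WhisperX for this step).
        return "audio", audio
    if text:
        return "text", text
    raise ValueError("Provide an image, an audio clip, or some text.")

print(route_input(text="Describe a sunset over the ocean."))
# -> ('text', 'Describe a sunset over the ocean.')
```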
+
+ ## Technologies Used
+
+ - **AI Model**: A state-of-the-art model capable of understanding and generating text from various inputs.
+ - **WhisperX**: Advanced audio processing for converting speech to text.
+ - **Gradio**: An easy-to-use interface for creating customizable ML demos.
+
+ ## Source Code
+
+ Training and implementation code is on [GitHub](https://github.com/Delve-ERAV1/Phi-2-Vision-Language/tree/main/Phi-2-CLIP-LLAVA-Instruction-Finetuning).
+
+ ## Acknowledgements
+
+ This project is inspired by advancements in multimodal AI, particularly the Large Language and Vision Assistant (LLaVA) project. Our implementation demonstrates the integration of language, vision, and audio processing in one coherent system.
+
+ - COCO 2017 Dataset
+ - Instruct150K Dataset: [Link to dataset or reference]
+ - OpenAI CLIP
+ - LLaVA 1.5: https://github.com/haotian-liu
+ - Microsoft Phi-2