---
title: MultiModal Phi2
emoji: π
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.35.2
app_file: app.py
pinned: false
license: mit
---
ERA-CAPSTONE
🤗 Space Link
Tasks:
Make a multi-modal LLM that can take these inputs:
- :heavy_check_mark: Text
- :heavy_check_mark: Image
- :heavy_check_mark: Audio
Training:
Image:
:heavy_check_mark: Use the original Instruct 150k dataset and use CLIP to get the image embeddings.
:heavy_check_mark: Add a projection layer that maps these CLIP embeddings into something that can be fed to the Phi model (see the projector sketch right after these bullets).
:heavy_check_mark: Add an adapter and train it with QLoRA on the Instruct 150k dataset.
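The projection step can be as small as a two-layer MLP that maps frozen CLIP patch features into Phi-2's embedding space. A minimal sketch, assuming the usual hidden sizes (1024 for clip-vit-large-patch14-336, 2560 for Phi-2); `ClipToPhiProjector` is a hypothetical name, not the class used in this repo.

```python
import torch
import torch.nn as nn

class ClipToPhiProjector(nn.Module):  # hypothetical name, for illustration only
    """Map CLIP image features into Phi-2's token-embedding space."""
    def __init__(self, clip_dim: int = 1024, phi_dim: int = 2560):
        super().__init__()
        # Two-layer MLP projector; the repo's actual adapter layout may differ.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, phi_dim),
            nn.GELU(),
            nn.Linear(phi_dim, phi_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, clip_dim) from the CLIP vision tower
        return self.proj(image_features)
```

In a LLaVA-style setup the projected features are concatenated with the text token embeddings, and only the projector plus the QLoRA adapter weights are trained while CLIP and Phi-2 stay frozen.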
Audio:
:heavy_check_mark: Use Whisper to perform ASR.
:heavy_check_mark: Add a projection layer for the Whisper output (a sketch follows these bullets).
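As with the image path, the Whisper-side projection can be a single linear layer into Phi-2's hidden size. A minimal sketch, assuming whisper-tiny's 384-dimensional encoder features and Phi-2's 2560-dimensional embeddings; note that the classes described later in this README route the Whisper transcription through the Phi-2 tokenizer instead, so this is only one possible placement of the projection.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):  # hypothetical name, for illustration only
    """Map Whisper-derived audio features into Phi-2's embedding space."""
    def __init__(self, audio_dim: int = 384, phi_dim: int = 2560):
        super().__init__()
        self.proj = nn.Linear(audio_dim, phi_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, audio_dim), e.g. whisper-tiny encoder states
        return self.proj(audio_features)
```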
Text:
:heavy_check_mark: Accept any text prompt and generate the related details.
:heavy_check_mark: The output remains text, based on the multimodal inputs: text, image, and audio.
:heavy_check_mark: The deployment page should look like ChatGPT, where we can send text, images, or audio (live recording or file upload); a minimal Gradio sketch follows.
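A minimal Gradio sketch of such a page, consistent with the `sdk: gradio` / `sdk_version: 3.35.2` front matter above. The `respond` stub stands in for the actual model call wired up in app.py.

```python
import gradio as gr

def respond(text, image, audio_path):
    # Placeholder: the deployed app forwards these inputs to the multimodal
    # model and returns its text answer.
    has_image = "yes" if image is not None else "no"
    has_audio = "yes" if audio_path else "no"
    return f"echo: {text} (image={has_image}, audio={has_audio})"

demo = gr.Interface(
    fn=respond,
    inputs=[
        gr.Textbox(label="Message"),
        gr.Image(type="pil", label="Image (optional)"),
        # Gradio 3.x Audio takes uploaded files by default; a microphone
        # source can be enabled via source="microphone" for live recording.
        gr.Audio(type="filepath", label="Audio (optional)"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="MultiModal Phi2",
)

if __name__ == "__main__":
    demo.launch()
```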
Phi2 : Pretraining LLM from Scratch
Details
- Model used: Microsoft Phi2
- Dataset used: Tiny Stories dataset (100k samples) & real-time data (100k samples) generated from a finetuned Phi2 model via Ollama
- Pretraining approach: pretraining with QLoRA (a minimal setup sketch follows below)
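A minimal QLoRA setup sketch with transformers, peft, and bitsandbytes. The rank, alpha, dropout, and target module names below are illustrative assumptions, not the values used in this repo.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter on top of the quantized weights (hyperparameters are assumptions)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed Phi-2 module names
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
# Train with a causal-LM objective on Tiny Stories plus the Ollama-generated samples.
```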
Training Loss Curve
Training Logs
Phi2 : Multimodal Finetuning
Details
- LLM Backbone: Phi2
- Vision Tower: clip-vit-large-patch14-336
- Audio Model: Whisper Tiny
- Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
- Finetuning Dataset: Instruct 150k dataset based on COCO
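The three checkpoints listed above can be loaded directly from the Hugging Face Hub; a short sketch is below, with the repo's own wrapper classes described next.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# LLM backbone
phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)
phi2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Vision tower
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Audio model
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
```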
class AudioLanguageConnector:
- This class prepares and tokenizes audio-related text data using the "microsoft/phi-2" model's tokenizer. Audio start and end marker tokens are added around the input text to provide context for audio-related processing. The tokenized output is then returned as a tensor. This class acts as a connector that puts text data into a format suitable for the specified audio model.
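A minimal sketch of what such a connector might look like. The audio start/end marker strings and the class name are placeholders, not the repo's actual identifiers.

```python
from transformers import AutoTokenizer

class AudioLanguageConnectorSketch:  # illustrative only, not the repo's implementation
    def __init__(self, model_id: str = "microsoft/phi-2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    def __call__(self, transcript: str):
        # "<audio_start>" / "<audio_end>" are placeholder markers wrapping the
        # transcribed text so the language model can tell it came from audio.
        text = f"<audio_start> {transcript} <audio_end>"
        return self.tokenizer(text, return_tensors="pt").input_ids
```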
class WhisperWithProjection:
- This class encapsulates the steps needed to transcribe audio: it uses the pre-trained "openai/whisper-tiny" model to convert audio files into text transcriptions.
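A sketch of that transcription step using the transformers Whisper API; the class and method names here are illustrative.

```python
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

class WhisperTranscriberSketch:  # illustrative only
    def __init__(self, model_id: str = "openai/whisper-tiny"):
        self.processor = WhisperProcessor.from_pretrained(model_id)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_id)

    def transcribe(self, audio_path: str) -> str:
        waveform, sr = torchaudio.load(audio_path)
        waveform = waveform.mean(dim=0)  # collapse to mono
        if sr != 16000:
            waveform = torchaudio.functional.resample(waveform, sr, 16000)
        inputs = self.processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
        predicted_ids = self.model.generate(inputs.input_features)
        return self.processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```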
class MultiModalPhi2:
- This class takes input text, audio, and images and constructs a conversation prompt with appropriate formatting for the model. It tokenizes the prompt, preprocesses the image, concatenates audio embeddings when available, and generates new tokens with the pre-trained model, taking all input modalities into account. It then decodes and returns the generated output, handling special tokens and potential length mismatches.
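A simplified sketch of that generation flow, assuming the image features have already been projected to Phi-2's hidden size and the audio transcript has already been tokenized; the real class adds conversation formatting and special-token handling on top of this.

```python
import torch

def multimodal_generate_sketch(phi2, tokenizer, prompt_text,
                               image_embeds=None, audio_ids=None, max_new_tokens=128):
    """Splice projected image embeddings and audio-transcript tokens in front of
    the text prompt, then let Phi-2 generate from the combined embeddings."""
    embed_layer = phi2.get_input_embeddings()

    prompt_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
    parts = [embed_layer(prompt_ids)]            # (1, T_text, hidden)
    if audio_ids is not None:
        parts.insert(0, embed_layer(audio_ids))  # transcript tokens embedded like text
    if image_embeds is not None:
        parts.insert(0, image_embeds)            # already projected to Phi-2's hidden size
    inputs_embeds = torch.cat(parts, dim=1)

    output_ids = phi2.generate(
        inputs_embeds=inputs_embeds,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```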
Pretraining
Training Loss Curve and Learning Rate
Training Logs
Finetuning
Training Loss Curve and Learning Rate
Training Logs
Results
Deployed on HF
Text & Image:
Audio & Image:
Question Asked: Tell me about this image
Future Scope:
- Incorporating the original LLaVA model's finetuning on the larger set of BLIP captions (558k) could lead to significant improvements.
- Using GPTQ or AWQ quantization could reduce latency and make the model more efficient.