app_file: app.py
pinned: false
license: mit
---
# ERA-CAPSTONE

🤗 [**Space Link**](https://huggingface.co/spaces/GunaKoppula/MultiModal-Phi2)

### Tasks:
1. Make a multimodal LLM that can take these inputs:
   - :heavy_check_mark: Text
   - :heavy_check_mark: Image
   - :heavy_check_mark: Audio
2. Training:
   - Image:
     - :heavy_check_mark: Use the original Instruct 150k dataset and use CLIP to get the image embeddings.
     - :heavy_check_mark: Add a projection layer that maps the CLIP embeddings to something that can be fed to the Phi model (a minimal sketch of this projection follows the list).
     - :heavy_check_mark: Add an adapter and train it with QLoRA on the Instruct 150k dataset.
   - Audio:
     - :heavy_check_mark: Use Whisper to perform ASR.
     - :heavy_check_mark: Add a projection layer for the Whisper output.
   - Text:
     - :heavy_check_mark: Give any text to generate the related details.
3. :heavy_check_mark: The output remains text, generated from the multimodal inputs - text, image, and audio.
4. :heavy_check_mark: The deployment page should look like ChatGPT, where we can send images and text or upload audio (live recording or file).
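A minimal sketch of the CLIP-to-Phi projection step from task 2, assuming ViT-L/14-336 patch features of width 1024 and Phi-2's hidden size of 2560; the class name, two-layer MLP shape, and dimensions are illustrative, not the exact layers used in this repo.

```python
import torch
import torch.nn as nn

class ClipToPhiProjector(nn.Module):
    """Illustrative projection from CLIP image features into Phi-2's embedding space."""

    def __init__(self, clip_dim: int = 1024, phi_dim: int = 2560):
        super().__init__()
        # A small MLP is a common choice for the vision-language connector.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, phi_dim),
            nn.GELU(),
            nn.Linear(phi_dim, phi_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, clip_dim) from the CLIP vision tower
        return self.proj(image_features)  # (batch, num_patches, phi_dim)

# Usage: the projected tokens are concatenated with text embeddings before Phi-2.
projector = ClipToPhiProjector()
dummy_clip_features = torch.randn(1, 576, 1024)  # 24x24 patches for a 336px image
print(projector(dummy_clip_features).shape)      # torch.Size([1, 576, 2560])
```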
## Phi2 : Pretraining LLM from Scratch
### Details
1. Model used: [Microsoft Phi2](https://huggingface.co/microsoft/phi-2)
2. Dataset used: Tiny Stories dataset (100k samples) & realtime data (100k samples) generated from a finetuned Phi2 model via Ollama
3. Pretraining approach: pretraining with QLoRA (a minimal configuration sketch follows this list)
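A minimal sketch of the QLoRA setup described above, using Hugging Face `transformers`, `peft`, and `bitsandbytes`; the rank, target modules, and other hyperparameters are assumptions for illustration, not the exact values used in this repo.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load Phi-2 in 4-bit: QLoRA keeps the base weights quantized and trains only LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", quantization_config=bnb_config, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Attach LoRA adapters; the target modules below are assumed Phi-2 projection layers.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```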
### Training Loss Curve
<img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/1692461c-de43-4b50-87d7-bdc0d72b5f69.type" width="500">

### Training Logs
![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/2672a350-7786-4773-b1bc-ea99a3e7091c)
## Phi2 : Multimodal Finetuning
### Details
1. LLM Backbone: [Phi2](https://huggingface.co/microsoft/phi-2)
2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
3. Audio Model: [Whisper Tiny](https://huggingface.co/openai/whisper-tiny)
4. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
5. Finetuning Dataset: [Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
```python
class AudioLanguageConnector:
```
- This class prepares and tokenizes audio-related text using the `microsoft/phi-2` tokenizer. It wraps the input text with `<audio_start>` and `<audio_end>` tokens to mark the audio context and returns the tokenized output as a tensor, acting as a connector that formats text so it can be combined with the audio model's output.
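A minimal sketch of what such a connector might look like, assuming the `microsoft/phi-2` tokenizer and the marker tokens described above; the method names and defaults are illustrative, not the repo's exact implementation.

```python
from transformers import AutoTokenizer

class AudioLanguageConnector:
    """Wraps transcribed audio text with marker tokens and tokenizes it for Phi-2."""

    def __init__(self, model_name: str = "microsoft/phi-2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        # Register the audio marker tokens so they map to dedicated ids.
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": ["<audio_start>", "<audio_end>"]}
        )

    def __call__(self, text: str):
        wrapped = f"<audio_start> {text} <audio_end>"
        # Return PyTorch tensors ready to be embedded alongside the other modalities.
        return self.tokenizer(wrapped, return_tensors="pt")

# Usage:
# connector = AudioLanguageConnector()
# tokens = connector("a dog is playing in the park")
# print(tokens["input_ids"].shape)
```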
```python
class WhisperWithProjection:
```
- This class encapsulates the audio transcription step. It uses the pretrained `openai/whisper-tiny` model to convert audio files into text transcriptions.
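A minimal sketch of the transcription step, using the `transformers` ASR pipeline with `openai/whisper-tiny`; the class shape is illustrative, and the repo's actual class may additionally project the Whisper output before handing it to Phi-2.

```python
from transformers import pipeline

class WhisperWithProjection:
    """Transcribes an audio file with whisper-tiny (the projection step is omitted here)."""

    def __init__(self, model_name: str = "openai/whisper-tiny"):
        self.asr = pipeline("automatic-speech-recognition", model=model_name)

    def transcribe(self, audio_path: str) -> str:
        # Returns the plain-text transcription of the audio file.
        return self.asr(audio_path)["text"]

# Usage:
# whisper = WhisperWithProjection()
# print(whisper.transcribe("question.wav"))
```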
```python
class MultiModalPhi2:
```
- This class takes input text, audio, and images and builds a conversation prompt with the formatting the model expects. It tokenizes the prompt, preprocesses the image, concatenates the audio embeddings if they are available, and generates new tokens with the pretrained model, conditioning on all of the provided modalities. It then decodes and returns the generated output, handling special tokens and potential length mismatches.
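A minimal end-to-end sketch of how these pieces could fit together, assuming the CLIP vision tower from the details above, an untrained projector standing in for the finetuned one, a transcription string from `WhisperWithProjection`, and plain greedy generation; class and method names are illustrative, not the exact implementation in this repo.

```python
import torch
from PIL import Image
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
)

class MultiModalPhi2:
    """Illustrative glue: CLIP image features + transcribed audio + text -> Phi-2 generation."""

    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
        self.phi = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
        self.processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
        # Untrained projector standing in for the finetuned connector (CLIP 1024 -> Phi-2 2560).
        self.projector = torch.nn.Linear(1024, 2560)

    @torch.no_grad()
    def generate(self, text: str, image: Image.Image = None,
                 audio_text: str = None, max_new_tokens: int = 64) -> str:
        prompt = f"Question: {text}"
        if audio_text:  # transcription produced by WhisperWithProjection
            prompt = f"<audio_start> {audio_text} <audio_end>\n" + prompt
        prompt += "\nAnswer:"

        ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        embeds = self.phi.get_input_embeddings()(ids)

        if image is not None:
            pixels = self.processor(images=image, return_tensors="pt").pixel_values
            patches = self.vision(pixels).last_hidden_state        # (1, 577, 1024)
            image_embeds = self.projector(patches).to(embeds.dtype)
            # Prepend the projected image tokens to the text embeddings (LLaVA-style).
            embeds = torch.cat([image_embeds, embeds], dim=1)

        mask = torch.ones(embeds.shape[:2], dtype=torch.long)
        out = self.phi.generate(inputs_embeds=embeds, attention_mask=mask,
                                max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)
```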
### Pretraining
#### Training Loss Curve and Learning Rate
<img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/c9e205b9-44aa-4ef3-b7da-f6d69b5f0f2a.type" width="400"> <img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/a82cf4b6-0cc4-47d9-ad7e-f504677a5074.type" width="393">

#### Training Logs
![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/83cbd14a-9626-410c-99be-5757c089c9b2)

### Finetuning
#### Training Loss Curve and Learning Rate
<img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/ceb9d187-14cb-4372-8370-bdbf7f7a3812.type" width="388"> <img src="https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/5d1fe7b3-5cec-46c8-aaac-a1e3ae5b7f6c.type" width="400">

#### Training Logs
![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/3aebd889-d120-466f-8751-9c7e37023ab1)

### Results
![image](https://github.com/GunaKoppula/ERAV1-CAPSTONE/assets/61241928/4b54c0bd-b078-4dc9-932a-49640d0297dc)
### Deployed on HF
#### Text & Image:

#### Audio & Image:
**Question Asked: Tell me about this image**
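The Space exposes a chat-style page that accepts text, an image, and audio, as described in task 4. A minimal Gradio sketch of such a page, assuming Gradio 4.x; the components and the placeholder handler are illustrative, not the exact `app.py` in this repo.

```python
import gradio as gr

def respond(text, image, audio_path):
    """Placeholder handler: wire in MultiModalPhi2 / WhisperWithProjection here, e.g.
    audio_text = whisper.transcribe(audio_path) if audio_path else None
    return multimodal.generate(text, image=image, audio_text=audio_text)
    """
    received = [name for name, value in
                [("text", text), ("image", image), ("audio", audio_path)] if value]
    return f"Received inputs: {', '.join(received) or 'none'}"

with gr.Blocks(title="MultiModal Phi2") as demo:
    gr.Markdown("## MultiModal Phi2 - text, image, and audio in; text out")
    with gr.Row():
        image_in = gr.Image(type="pil", label="Image")
        audio_in = gr.Audio(sources=["microphone", "upload"], type="filepath", label="Audio")
    text_in = gr.Textbox(label="Question")
    answer = gr.Textbox(label="Answer")
    gr.Button("Submit").click(respond, inputs=[text_in, image_in, audio_in], outputs=answer)

if __name__ == "__main__":
    demo.launch()
```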
### Future Scope:
- Incorporating the larger set of BLIP captions (558k) used by the original LLaVA model could lead to significant improvements.
- Quantizing the model with GPTQ or AWQ can reduce latency and make it more efficient (a hedged quantization sketch follows).
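A sketch of one way that quantization could be done, using the `GPTQConfig` integration in `transformers` (with `optimum` and `auto-gptq` installed); the bit width, calibration dataset, and output path are assumptions, and AWQ via `autoawq` would be an equally valid route.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "microsoft/phi-2"  # in practice, the merged finetuned multimodal checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# 4-bit GPTQ quantization with a standard calibration set.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

quantized.save_pretrained("phi-2-multimodal-gptq")
tokenizer.save_pretrained("phi-2-multimodal-gptq")
```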