---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- multi-modal
- speech-language
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
- MLCommons/ml_spoken_words
- Ar4ikov/iemocap_audio_text_splitted
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- type: wer
value: 6.73
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- type: wer
value: 9.13
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: wer
value: 24.47
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: ML Spoken Words
type: MLCommons/ml_spoken_words
split: test
args:
language: en
metrics:
- type: wer
value: 36.12
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: IEMOCAP
type: Ar4ikov/iemocap_audio_text_splitted
split: test
args:
language: en
metrics:
- type: wer
value: 44.15
name: Test WER
- task:
type: audio-classification
name: Audio Classification
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: accuracy
value: 62.51
name: Test Age Accuracy
- type: accuracy
value: 64.57
name: Test Accent Accuracy
---
# SpeechLLM
![](./speechllm.png)
SpeechLLM is a multi-modal LLM trained to predict the metadata of a speaker's turn in a conversation. The speechllm-2B model couples the HubertX audio encoder with the TinyLlama LLM, and predicts the following:
1. **SpeechActivity** : if the audio signal contains speech (True/False)
2. **Transcript** : ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
## Usage
```python
# Load the model directly from the Hugging Face Hub
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],  # [Optional] pass either audio_path or audio_tensor
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)
# Model Generation
'''
{
"SpeechActivity" : "True",
"Transcript": "Yes, I got it. I'll make the payment now.",
"Gender": "Female",
"Emotion": "Neutral",
"Age": "Young",
"Accent" : "America",
}
'''
```
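The generation above is a Python-dict-like string (note the trailing comma, which is not valid JSON). Assuming `generate_meta` returns it as plain text, `ast.literal_eval` is a convenient way to turn it into a dict; a minimal sketch using the sample output:

```python
import ast

# raw stands in for the string returned by model.generate_meta(...),
# e.g. the sample generation shown above
raw = '''{
    "SpeechActivity" : "True",
    "Transcript": "Yes, I got it. I'll make the payment now.",
    "Gender": "Female",
    "Emotion": "Neutral",
    "Age": "Young",
    "Accent" : "America",
}'''

# ast.literal_eval parses Python dict literals, including the trailing comma
meta = ast.literal_eval(raw)
print(meta["Transcript"], "|", meta["Emotion"])
```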
Try the model in this [Google Colab notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing).
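As the snippet above notes, the model expects 16 kHz mono audio. If a recording has a different sample rate or multiple channels, it can be downmixed and resampled with torchaudio before being passed as `audio_tensor`; a minimal sketch (the file name is a placeholder):

```python
import torchaudio
import torchaudio.functional as F

# Load the recording; waveform has shape [channels, frames]
waveform, sample_rate = torchaudio.load("recording.wav")

# Downmix to mono by averaging the channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to the 16 kHz rate the model expects
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)

# waveform can now be passed as audio_tensor=waveform to model.generate_meta
```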
## Model Details
- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [HubertX](https://huggingface.co/facebook/hubert-xlarge-ll60k), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 2.1B
- **Checkpoint:** 2,000k steps (batch size = 1)
- **Adapters:** r=4, alpha=8 (see the configuration sketch below)
- **Learning rate:** 1e-4
- **Gradient accumulation steps:** 8
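
The adapter values above follow the LoRA parameterization. Purely as an illustration of what that configuration looks like in `peft` (this is not the authors' training code; the target modules and dropout are assumptions), a sketch:

```python
from peft import LoraConfig

# Illustrative sketch only: r and lora_alpha mirror the Model Details above;
# target_modules and lora_dropout are assumptions, not taken from this card.
lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,                    # assumed
    task_type="CAUSAL_LM",
)
```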
## Checkpoint Results
| **Dataset** | **Type** | **Word Error Rate (%)** | **Gender Acc** | **Age Acc** | **Accent Acc** |
|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
| **librispeech-test-clean** | Read Speech | 6.73 | 0.9536 | | |
| **librispeech-test-other** | Read Speech | 9.13 | 0.9099 | | |
| **CommonVoice test** | Diverse Accent, Age | 24.47 | 0.8680 | 0.6251 | 0.6457 |
| **ML Spoken Words test** | Short Utterance | 36.12 | 0.6587 | | |
| **IEMOCAP test** | Emotional Speech | 44.15 | 0.7557 | | |
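
For context, WER figures like those above are typically computed with a standard scorer such as `jiwer`; a minimal sketch with placeholder transcripts:

```python
from jiwer import wer

# Placeholder reference transcript and model hypothesis
reference = "yes i got it i will make the payment now"
hypothesis = "yes i got it i'll make the payment now"

# WER = (substitutions + deletions + insertions) / reference word count
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")
```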