File size: 5,032 Bytes
49babd2
7eacacf
 
49babd2
7eacacf
8252cc5
 
 
49babd2
d9290bd
 
c59fd46
b8490f7
49babd2
 
8036924
d9290bd
 
 
7eacacf
 
 
 
 
 
 
 
 
 
 
 
a1cd3c4
7eacacf
 
 
 
 
 
 
 
 
 
 
 
 
a1cd3c4
7eacacf
 
 
 
 
 
 
 
 
 
 
 
8036924
7eacacf
58a9e2e
 
 
 
 
87e8c4b
58a9e2e
 
 
 
 
 
 
b8490f7
 
 
 
 
 
 
 
 
 
 
 
 
8036924
 
 
 
 
 
 
 
 
 
 
a1cd3c4
8036924
 
a1cd3c4
8036924
1b27420
 
 
 
d6b4cc8
 
5813766
60d6884
 
 
 
 
 
183a43c
1b27420
 
 
 
581af0f
1b27420
 
ce68ebb
6a10328
1b27420
 
 
 
 
 
 
477312d
 
1b27420
 
 
 
477312d
 
1b27420
 
 
198fbb9
 
74cd139
 
ece6d31
4892e48
393959a
 
ece6d31
 
 
 
 
9c1f1a8
74cd139
1b27420
 
e87dbd8
 
 
 
 
23d1400
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- multi-modal
- speech-language
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
- MLCommons/ml_spoken_words
- Ar4ikov/iemocap_audio_text_splitted
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.73
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 9.13
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 24.47
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: ML Spoken Words
      type: MLCommons/ml_spoken_words
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 36.12
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: IEMOCAP
      type: Ar4ikov/iemocap_audio_text_splitted
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 44.15
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 62.51
      name: Test Age Accuracy
    - type: accuracy
      value: 64.57
      name: Test Accent Accuracy
---

# SpeechLLM

![](./speechllm.png)

SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-2B model is based on HubertX audio encoder and TinyLlama LLM. The model predicts the following:
1. **SpeechActivity** : if the audio signal contains speech (True/False)
2. **Transcript** : ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage
```python
# Load model directly from huggingface
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
	audio_path="path-to-audio.wav", #16k Hz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[1], # [Optional] either audio_path or audio_tensor directly
	instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
	max_new_tokens=500, 
	return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```

Try the model in [Google Colab Notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing).

## Model Details

- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [HubertX](https://huggingface.co/facebook/hubert-xlarge-ll60k), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 2.1 B
- **Checkpoint:** 2000 k steps (bs=1)
- **Adapters:** r=4, alpha=8
- **lr** : 1e-4
- **gradient accumulation steps:** 8


## Checkpoint Result

|         **Dataset**        |       **Type**      | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
| **librispeech-test-clean** | Read Speech         |         6.73        |     0.9536     |             |                |
| **librispeech-test-other** | Read Speech         |         9.13        |     0.9099     |             |                |
| **CommonVoice test**       | Diverse Accent, Age |        24.27        |     0.8680     |    0.6251   |     0.6457     |
| **ML Spoken Words test**   | Short Utterance     |        36.12        |     0.6587     |             |                |
| **IEMOCAP test**           | Emotional Speech    |        44.15        |     0.7557     |             |                |