---
language:
- en
license: apache-2.0
library_name: transformers
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.36
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 10.47
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 24.47
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 60.61
      name: Test Age Accuracy
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 61.56
      name: Test Accent Accuracy
---

# SpeechLLM

[The model is still training; we will release the latest checkpoints soon.]

SpeechLLM is a multi-modal LLM trained to predict metadata about a speaker's turn in a conversation. The speechllm-2B model is based on the HubertX acoustic encoder and the TinyLlama LLM. It predicts the following:
1. **SpeechActivity**: whether the audio signal contains speech (True/False)
2. **Transcript**: ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage
```python
# Load the model directly from the Hugging Face Hub.
# trust_remote_code=True is required because generate_meta is defined in the model repository.
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

# Request the metadata fields for a single audio file.
model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Example model output
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```
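
The sample output above is JSON-like text, so downstream code will usually want it as a dictionary. The following is a minimal sketch of one way to do that, assuming `generate_meta` returns a string shaped like the sample (the return type is not documented here). `parse_meta` is a hypothetical helper, not part of the model's API; it tolerates the trailing comma seen in the sample, which strict JSON parsers reject.

```python
import json
import re


def parse_meta(raw: str) -> dict:
    """Parse a JSON-like metadata string into a dict (hypothetical helper)."""
    # Keep only the outermost {...} span in case extra text surrounds it.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    body = raw[start:end + 1]
    # Drop a trailing comma before a closing brace, as in the sample output.
    # Caveat: this would also alter a transcript containing the text ", }".
    body = re.sub(r",\s*}", "}", body)
    return json.loads(body)


# The sample string is copied from the example output above.
sample = '''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''

meta = parse_meta(sample)
# Values are strings in the sample, so compare against "True" explicitly.
has_speech = meta["SpeechActivity"] == "True"
print(meta["Transcript"], has_speech)
```

Note that every value in the sample is a string, including `SpeechActivity`, so it is compared against `"True"` instead of being treated as a boolean.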

## Model Details

- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [HubertX](https://huggingface.co/facebook/hubert-xlarge-ll60k), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 2.1 B
- **Checkpoint:** 2000 k steps (bs=1)
- **Adapters:** r=4, alpha=8
- **Learning rate:** 1e-4
- **Gradient accumulation steps:** 8


## Checkpoint Result

|       **Dataset**      | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
|:----------------------:|:----------------------:|:-------------:|:----------:|:-------------:|
| librispeech-test-clean | 0.0736                 | 0.9490        |            |               |
| librispeech-test-other | 0.1047                 | 0.9099        |            |               |
| CommonVoice test       | 0.2447                 | 0.8680        | 0.6061     | 0.6156        |
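
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, insertions, deletions) between the reference and the hypothesis, divided by the number of reference words. The sketch below implements that standard definition in plain Python. It is not the authors' evaluation script: the text normalization used for the numbers above is not stated, and the simple whitespace tokenization here is an assumption.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
score = wer("i will pay now", "i will play now")
print(score)
```

Under this definition, a value of 0.0736 in the table corresponds to roughly 7.4 word errors per 100 reference words.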