---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
language:
- en
metrics:
- wer
library_name: transformers
model-index:
- name: SpeechLLM
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 12.3
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 18.9
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 25.01
---

# SpeechLLM

SpeechLLM is a multi-modal LLM trained to predict metadata about the speaker's turn in a conversation. The model pairs a HubertX acoustic encoder with a TinyLlama LLM and predicts the following:

1. **SpeechActivity**: whether the audio signal contains speech (True/False)
2. **Transcript**: the ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage

```python
# Load the model directly from the Hugging Face Hub
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Example model generation:
'''
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}
'''
```
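
The snippet below is a minimal post-processing sketch, not part of the published API: it assumes `generate_meta` returns the JSON-like string shown above and parses it into a Python dict, falling back to `ast.literal_eval` when the output is not strict JSON (e.g. it contains a trailing comma).

```python
import ast
import json

# Assumes `model` was loaded as above and that generate_meta returns the
# JSON-like string shown in the usage example.
raw_output = model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

try:
    meta = json.loads(raw_output)        # strict JSON
except json.JSONDecodeError:
    meta = ast.literal_eval(raw_output)  # Python-literal fallback, tolerates trailing commas

print(meta["Transcript"], meta["Gender"], meta["Emotion"])
```
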
## Checkpoint Results

| Dataset                | Word Error Rate (%) | Gender Accuracy (%) |
|:----------------------:|:-------------------:|:-------------------:|
| librispeech-test-clean | 12.30               | 87.78               |
| librispeech-test-other | 18.90               | 89.08               |
| CommonVoice test       | 25.01               | 87.53               |
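
A rough evaluation loop along the following lines can be used to sanity-check the WER numbers above. This is a sketch rather than the authors' evaluation script: it assumes the `datasets`, `evaluate`, and `soundfile` packages are installed, that each LibriSpeech example can be written to a temporary wav file for `generate_meta`'s `audio_path` argument, and that the `Transcript` field can be parsed out of the returned string as in the snippet above.

```python
import ast
import json

import soundfile as sf
from datasets import load_dataset
from evaluate import load

wer_metric = load("wer")
testset = load_dataset("librispeech_asr", "clean", split="test")

predictions, references = [], []
for example in testset.select(range(100)):  # small slice for a quick check
    # Write the raw waveform to disk so it can be passed by path.
    sf.write("tmp.wav", example["audio"]["array"], example["audio"]["sampling_rate"])
    raw = model.generate_meta(
        audio_path="tmp.wav",
        instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
        max_new_tokens=500,
        return_special_tokens=False
    )
    try:
        transcript = json.loads(raw)["Transcript"]
    except json.JSONDecodeError:
        transcript = ast.literal_eval(raw)["Transcript"]
    predictions.append(transcript.lower())
    references.append(example["text"].lower())

print("WER:", wer_metric.compute(predictions=predictions, references=references))
```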