metadata

license: apache-2.0
datasets:
  - mozilla-foundation/common_voice_16_1
  - openslr/librispeech_asr
language:
  - en
metrics:
  - wer
library_name: transformers
model-index:
  - name: SpeechLLM
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (clean)
          type: librispeech_asr
          config: clean
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 12.3
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (other)
          type: librispeech_asr
          config: other
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 18.9
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 16.1
          type: common_voice_16_1
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 25.01

SpeechLLM

SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. The SpeechLLM model is based on the HubertX acoustic encoder and the TinyLlama LLM. Given an audio clip, the model predicts the following:

SpeechActivity: whether the audio signal contains speech (True/False)
Transcript: ASR transcript of the audio
Gender: gender of the speaker (Female/Male)
Age: age group of the speaker (Young/Middle-Age/Senior)
Accent: accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
Emotion: emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
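
The fields listed above can be held in a small typed record when the predictions feed downstream code. The following is a minimal sketch and not part of the SpeechLLM code base: `SpeakerMetadata` and `ALLOWED_VALUES` are hypothetical names, and the label sets are copied from the list above.

```python
from dataclasses import dataclass

# Label sets copied from the field list above; all names here are hypothetical.
ALLOWED_VALUES = {
    "Gender": {"Female", "Male"},
    "Age": {"Young", "Middle-Age", "Senior"},
    "Accent": {"Africa", "America", "Celtic", "Europe", "Oceania",
               "South-Asia", "South-East-Asia"},
    "Emotion": {"Happy", "Sad", "Anger", "Neutral", "Frustrated"},
}


@dataclass
class SpeakerMetadata:
    speech_activity: bool
    transcript: str
    gender: str
    age: str
    accent: str
    emotion: str

    def invalid_fields(self):
        """Return the categorical fields whose value is not an allowed label."""
        values = {"Gender": self.gender, "Age": self.age,
                  "Accent": self.accent, "Emotion": self.emotion}
        return [name for name, value in values.items()
                if value not in ALLOWED_VALUES[name]]
```

Checking `invalid_fields()` before using a prediction is one way to catch generations that drift outside the documented label sets.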

Usage

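# Load the model directly from the Hugging Face Hub
from transformers import AutoModel

# trust_remote_code=True lets transformers run the custom model code shipped in the repository
model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)

# Request all six metadata fields for a single audio file
output = model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False,
)
print(output)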

# Example model generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
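
The sample generation above is JSON-like but ends with a trailing comma, which strict JSON parsers reject. The sketch below shows one way to turn such a string into a dictionary. It assumes that `generate_meta` returns a string shaped like the sample; `parse_generation` is a hypothetical helper, not part of the model's API.

```python
import json
import re


def parse_generation(text: str) -> dict:
    """Parse a JSON-like generation string into a dict.

    Assumes the output resembles the sample above, including a possible
    trailing comma before the closing brace.
    """
    # Keep only the outermost {...} span in case extra text surrounds it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in generation")
    body = text[start:end + 1]
    # Drop a comma that is followed only by whitespace and a closing brace.
    body = re.sub(r",\s*}", "}", body)
    return json.loads(body)


sample = '''
{ "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
meta = parse_generation(sample)
print(meta["Transcript"])
```

The comma cleanup is a plain regular expression, so a transcript that itself contains a comma directly followed by a closing brace would be altered; a stricter parser would be needed if that case matters.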

Checkpoint Result

Dataset	Word Error Rate (%)	Gender Accuracy (%)
librispeech-test-clean	12.30	87.78
librispeech-test-other	18.90	89.08
CommonVoice test	25.01	87.53
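
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a plain illustration of that definition; `word_error_rate` is a hypothetical helper, and the text normalization and tooling used to produce the numbers in the table are not stated here, so this may not reproduce them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.4f}")
```

The function returns a fraction; multiply by 100 to express it as a percentage.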