---
language:
- en
license: apache-2.0
library_name: transformers
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
- MLCommons/ml_spoken_words
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.73
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 9.13
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 24.47
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: ML Spoken Words
      type: MLCommons/ml_spoken_words
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 36.12
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 62.51
      name: Test Age Accuracy
    - type: accuracy
      value: 64.57
      name: Test Accent Accuracy
---

# SpeechLLM

[The model is still training; we will release the latest checkpoints soon.]

SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. The speechllm-2B model is based on the HubertX audio encoder and the TinyLlama LLM. The model predicts the following:

1. **SpeechActivity**: whether the audio signal contains speech (True/False)
2. **Transcript**: ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage

```python
# Load the model directly from the Hugging Face Hub
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```

Try the model in this [Google Colab notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing).

## Model Details

- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [HubertX](https://huggingface.co/facebook/hubert-xlarge-ll60k), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 2.1B parameters
- **Checkpoint:** 2000k steps (batch size 1)
- **Adapters:** r=4, alpha=8
- **Learning rate:** 1e-4
- **Gradient accumulation steps:** 8

## Checkpoint Result

| **Dataset**            | **Word Error Rate (%)** | **Gender Acc** | **Age Acc** | **Accent Acc** |
|:----------------------:|:-----------------------:|:--------------:|:-----------:|:--------------:|
| librispeech-test-clean | 6.73                    | 0.9536         |             |                |
| librispeech-test-other | 9.13                    | 0.9099         |             |                |
| CommonVoice test       | 24.27                   | 0.8680         | 0.6251      | 0.6457         |
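
The word error rates above are reported without the scoring code. As a rough illustration of how such a number is usually obtained, the sketch below computes WER as the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the text normalization (lowercasing and whitespace splitting) are assumptions made for this example; the authors' evaluation may normalize text differently, so this is not expected to reproduce the table exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: lowercase + whitespace tokenization; not necessarily the
    normalization used to produce the results in this model card.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only
    ref_text = "yes i got it i will make the payment now"
    hyp_text = "yes i got it i will make payment now"
    print(f"WER: {100 * word_error_rate(ref_text, hyp_text):.2f}%")
```

The function returns a fraction; multiplying by 100 gives a value on the same scale as the Word Error Rate column.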