---
language:
- en
license: apache-2.0
library_name: transformers
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 10.47
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 24.47
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 60.61
      name: Test Age Accuracy
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 61.56
      name: Test Accent Accuracy
---

# SpeechLLM

[The model is still training; we will release the latest checkpoints soon.]

SpeechLLM is a multi-modal LLM trained to predict the metadata of a speaker's turn in a conversation. The model is based on the HubertX acoustic encoder and the TinyLlama LLM. It predicts the following:
1. SpeechActivity: whether the audio signal contains speech (True/False)
2. Transcript: ASR transcript of the audio
3. Gender of the speaker (Female/Male)
4. Age of the speaker (Young/Middle-Age/Senior)
5. Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
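
Because every field except the transcript comes from a closed label set, a prediction can be sanity-checked before it is used downstream. The following is a minimal sketch, not part of the model's own code: the label sets are copied from the list above, while `ALLOWED_LABELS` and `find_invalid_fields` are hypothetical names introduced only for illustration.

```python
# Label sets copied from the list above; the dictionary layout is illustrative.
ALLOWED_LABELS = {
    "SpeechActivity": {"True", "False"},
    "Gender": {"Female", "Male"},
    "Age": {"Young", "Middle-Age", "Senior"},
    "Accent": {
        "Africa", "America", "Celtic", "Europe",
        "Oceania", "South-Asia", "South-East-Asia",
    },
    "Emotion": {"Happy", "Sad", "Anger", "Neutral", "Frustrated"},
}


def find_invalid_fields(prediction):
    """Return the names of fields whose value is outside the documented label set.

    "Transcript" is free text, so it is never flagged. Fields missing from
    the prediction are ignored rather than reported.
    """
    return [
        key
        for key, allowed in ALLOWED_LABELS.items()
        if key in prediction and prediction[key] not in allowed
    ]


example = {"SpeechActivity": "True", "Gender": "Female", "Age": "Teen"}
print(find_invalid_fields(example))  # ['Age']
```

A check like this is useful mainly because a generative model is not guaranteed to stay within the label vocabulary it was trained on.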

## Usage
```python
# Load the model directly from the Hugging Face Hub.
# trust_remote_code=True runs the custom modeling code shipped in the model repo.
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

# Request all six metadata fields for a single audio file.
output = model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False,
)
print(output)

# Example model generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```
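
The example generation above is JSON-like but ends with a trailing comma, which strict JSON parsers reject. The sketch below shows one way to turn such a string into a Python dictionary. It assumes the generation is available as plain text; `parse_generation` is a hypothetical helper, not part of the model's API.

```python
import json
import re


def parse_generation(text):
    """Parse a JSON-like generation into a dict, tolerating a trailing comma."""
    # Keep only the outermost {...} span in case extra text surrounds it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    body = text[start : end + 1]
    # Drop a comma that directly precedes a closing brace (invalid in strict JSON).
    # Caveat: this would also alter a transcript that literally contains ", }".
    body = re.sub(r",\s*}", "}", body)
    return json.loads(body)


sample = '''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
result = parse_generation(sample)
print(result["Gender"], result["Accent"])
```

If the generation is malformed in other ways, `json.loads` raises `json.JSONDecodeError`, so callers may want to catch that and fall back to the raw string.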

## Model Details

- Model size: 2.1B parameters
- Checkpoint: 2000k steps (batch size 1)
- Adapters: r=4, alpha=8
- Learning rate: 1e-4
- Gradient accumulation steps: 8
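
As a rough illustration of what these settings imply, the arithmetic can be worked through in a few lines. This sketch relies on an assumption the card does not confirm: that the adapters are LoRA-style, where the low-rank update is scaled by `alpha / r`. The variable names are illustrative.

```python
# Values copied from the Model Details list above.
batch_size = 1          # "bs=1"
grad_accum_steps = 8    # gradient accumulation steps
adapter_rank = 4        # r
adapter_alpha = 8       # alpha

# Number of examples contributing to one optimizer update.
effective_batch_size = batch_size * grad_accum_steps

# Assumption: LoRA-style adapters, where the low-rank update is
# multiplied by alpha / r before being added to the frozen weights.
lora_scaling = adapter_alpha / adapter_rank

print(effective_batch_size)  # 8
print(lora_scaling)          # 2.0
```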


## Checkpoint Result

| Dataset | Word Error Rate | Gender Acc | Age Acc | Accent Acc |
|:----------------------:|:----------------------:|:-------------:|:----------:|:-------------:|
| librispeech-test-clean | 0.0736 | 0.9490 | | |
| librispeech-test-other | 0.1047 | 0.9099 | | |
| CommonVoice test | 0.2447 | 0.8680 | 0.6061 | 0.6156 |
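
For readers unfamiliar with the metric, word error rate is reported here as a fraction (0.0736 corresponds to 7.36%). Below is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. It is not the evaluation script used for this table, and the card does not describe any text normalization (casing, punctuation) applied before scoring, so `word_error_rate` is only an illustrative helper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("i will make the payment now", "i will make payment now"))
```

Established packages exist for this metric; a hand-rolled version is shown only to make the definition concrete.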