---
language:
- bm
license: apache-2.0
base_model: oza75/whisper-bambara-asr-001
tags:
- asr
- generated_from_trainer
datasets:
- oza75/bambara-tts
metrics:
- wer
model-index:
- name: Whisper Medium Bambara
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Bambara voices
      type: oza75/bambara-tts
    metrics:
    - name: Wer
      type: wer
      value: 5.400219298245614
---

# Whisper Medium Bambara

This model is a fine-tuned version of [oza75/whisper-bambara-asr-001](https://huggingface.co/oza75/whisper-bambara-asr-001) on the Bambara voices dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0646
- Wer: 5.4002

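
For context, word error rate (WER) counts the word-level substitutions, deletions, and insertions needed to turn a hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal, dependency-free illustration of that definition. It assumes the figure reported above is a percentage, and it is not the evaluation code that produced it; the function name `word_error_rate` and the example sentences are placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref)


# Placeholder sentences: one substituted word out of four reference words.
print(word_error_rate("one two three four", "one two tree four"))  # 25.0
```

On this scale, a WER of about 5.4 would correspond to roughly one word error per 18 to 19 reference words.
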
## Usage

To use this model, first define a custom tokenizer class, because the default Whisper tokenizer does not support Bambara.

**IMPORTANT: The following code also updates the Whisper tokenizer's `TO_LANGUAGE_CODE` constant in place so that it includes Bambara. This is not the ideal approach, but it is effective. Without this modification, generation will fail.**

```python
from typing import List

from tokenizers import AddedToken
from transformers import WhisperTokenizer, WhisperProcessor
import transformers.models.whisper.tokenization_whisper as whisper_tokenization
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE, TASK_IDS

CUSTOM_TO_LANGUAGE_CODE = {**TO_LANGUAGE_CODE, "bambara": "bm"}

# IMPORTANT: We update the Whisper tokenizer constants to add the Bambara language.
# Not ideal, but at least it works.
whisper_tokenization.TO_LANGUAGE_CODE.update(CUSTOM_TO_LANGUAGE_CODE)


class BambaraWhisperTokenizer(WhisperTokenizer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Register the Bambara language token as a special token
        self.add_tokens(AddedToken(content="<|bm|>", lstrip=False, rstrip=False, normalized=False, special=True))

    @property
    def prefix_tokens(self) -> List[int]:
        bos_token_id = self.convert_tokens_to_ids("<|startoftranscript|>")
        translate_token_id = self.convert_tokens_to_ids("<|translate|>")
        transcribe_token_id = self.convert_tokens_to_ids("<|transcribe|>")
        notimestamps_token_id = self.convert_tokens_to_ids("<|notimestamps|>")

        # Resolve the language to its code, accepting a name ("bambara") or a code ("bm")
        if self.language is not None:
            self.language = self.language.lower()
            if self.language in CUSTOM_TO_LANGUAGE_CODE:
                language_id = CUSTOM_TO_LANGUAGE_CODE[self.language]
            elif self.language in CUSTOM_TO_LANGUAGE_CODE.values():
                language_id = self.language
            else:
                is_language_code = len(self.language) == 2
                raise ValueError(
                    f"Unsupported language: {self.language}. Language should be one of:"
                    f" {list(CUSTOM_TO_LANGUAGE_CODE.values()) if is_language_code else list(CUSTOM_TO_LANGUAGE_CODE.keys())}."
                )

        if self.task is not None:
            if self.task not in TASK_IDS:
                raise ValueError(f"Unsupported task: {self.task}. Task should be in: {TASK_IDS}")

        # Assemble the prefix: start token, language, task, then the no-timestamps marker
        bos_sequence = [bos_token_id]
        if self.language is not None:
            bos_sequence.append(self.convert_tokens_to_ids(f"<|{language_id}|>"))
        if self.task is not None:
            bos_sequence.append(transcribe_token_id if self.task == "transcribe" else translate_token_id)
        if not self.predict_timestamps:
            bos_sequence.append(notimestamps_token_id)
        return bos_sequence
```
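
To make the role of `prefix_tokens` concrete, the following dependency-free sketch mirrors its ordering logic using token strings rather than ids. It is a simplified illustration and not part of the model's code: `build_prefix` is a hypothetical helper, and its language table contains only the Bambara entry.

```python
from typing import List, Optional

# Illustrative subset of the language table; the real one is TO_LANGUAGE_CODE.
LANGUAGE_CODES = {"bambara": "bm"}


def build_prefix(
    language: Optional[str] = "bambara",
    task: Optional[str] = "transcribe",
    predict_timestamps: bool = False,
) -> List[str]:
    """Assemble the decoder prompt in the same order as prefix_tokens."""
    prefix = ["<|startoftranscript|>"]

    if language is not None:
        language = language.lower()
        if language in LANGUAGE_CODES:
            code = LANGUAGE_CODES[language]  # full name, e.g. "bambara"
        elif language in LANGUAGE_CODES.values():
            code = language  # already a code, e.g. "bm"
        else:
            raise ValueError(f"Unsupported language: {language}")
        prefix.append(f"<|{code}|>")

    if task is not None:
        if task not in ("transcribe", "translate"):
            raise ValueError(f"Unsupported task: {task}")
        prefix.append(f"<|{task}|>")

    if not predict_timestamps:
        prefix.append("<|notimestamps|>")

    return prefix


print(build_prefix())
# ['<|startoftranscript|>', '<|bm|>', '<|transcribe|>', '<|notimestamps|>']
```

The `<|bm|>` entry is the token that the custom tokenizer adds; the other three are token strings the tokenizer code above already looks up.
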

Then define the pipeline:

```python
import torch
from transformers import pipeline

# Determine the appropriate device (GPU or CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Define the model checkpoint and language
model_checkpoint = "oza75/whisper-bambara-asr-001"
language = "bambara"

# Load the custom tokenizer designed for Bambara and the ASR model
# (BambaraWhisperTokenizer is the class defined in the previous snippet)
tokenizer = BambaraWhisperTokenizer.from_pretrained(model_checkpoint, language=language, device=device)
pipe = pipeline(model=model_checkpoint, tokenizer=tokenizer, device=device)


def transcribe(audio):
    """
    Transcribes the provided audio file into text using the configured ASR pipeline.

    Args:
        audio: The path to the audio file to transcribe.

    Returns:
        A string representing the transcribed text.
    """
    # Use the pipeline to perform transcription
    text = pipe(audio)["text"]
    return text


# Replace path_to_the_audio with the path to your own audio file
transcribe(path_to_the_audio)
```

## Intended uses & limitations

This checkpoint is intended to be used **ONLY for research purposes!**

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 8e-06
- train_batch_size: 64
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
- mixed_precision_training: Native AMP

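
The linear scheduler with warmup listed above can be pictured with a short sketch. It assumes the usual shape of such a schedule (a linear ramp from 0 to the peak rate over the warmup steps, then a linear decay to 0) and a total of about 620 optimizer steps. That total is inferred from the training results table in this card (roughly 62 steps per epoch over 10 epochs) rather than stated anywhere. `linear_lr` and `TOTAL_STEPS` are illustrative names.

```python
PEAK_LR = 8e-06       # learning_rate from the list above
WARMUP_STEPS = 500    # lr_scheduler_warmup_steps from the list above
TOTAL_STEPS = 620     # assumption: ~62 steps/epoch x 10 epochs


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 up to the peak rate
        return PEAK_LR * step / WARMUP_STEPS
    # Decay: fall linearly from the peak rate to 0 at the last step
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 560, 620):
    print(step, linear_lr(step))
```

If the step count assumed here is right, the learning rate was still ramping up for most of the run, since 500 of the roughly 620 steps are warmup.
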

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Wer     |
|:-------------:|:------:|:----:|:---------------:|:-------:|
| 0.0733        | 0.4032 | 25   | 0.0621          | 6.4145  |
| 0.0625        | 0.8065 | 50   | 0.0576          | 7.0724  |
| 0.0631        | 1.2097 | 75   | 0.0554          | 7.2094  |
| 0.0371        | 1.6129 | 100  | 0.0549          | 7.3739  |
| 0.0453        | 2.0161 | 125  | 0.0533          | 10.1425 |
| 0.0244        | 2.4194 | 150  | 0.0548          | 7.5658  |
| 0.0231        | 2.8226 | 175  | 0.0582          | 7.6206  |
| 0.0159        | 3.2258 | 200  | 0.0577          | 6.2226  |
| 0.0097        | 3.6290 | 225  | 0.0581          | 7.5932  |
| 0.0071        | 4.0323 | 250  | 0.0590          | 7.3739  |
| 0.0042        | 4.4355 | 275  | 0.0609          | 6.0033  |
| 0.0066        | 4.8387 | 300  | 0.0610          | 5.1809  |
| 0.0042        | 5.2419 | 325  | 0.0600          | 7.2368  |
| 0.0036        | 5.6452 | 350  | 0.0622          | 8.6623  |
| 0.0084        | 6.0484 | 375  | 0.0738          | 6.6886  |
| 0.0087        | 6.4516 | 400  | 0.0677          | 7.2643  |
| 0.0077        | 6.8548 | 425  | 0.0748          | 7.4013  |
| 0.0082        | 7.2581 | 450  | 0.0751          | 8.0318  |
| 0.0097        | 7.6613 | 475  | 0.0719          | 8.1963  |
| 0.0114        | 8.0645 | 500  | 0.0746          | 8.3607  |
| 0.0071        | 8.4677 | 525  | 0.0691          | 6.8805  |
| 0.0075        | 8.8710 | 550  | 0.0659          | 6.0581  |
| 0.0034        | 9.2742 | 575  | 0.0647          | 5.4002  |
| 0.0032        | 9.6774 | 600  | 0.0646          | 5.4002  |

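
Because the logged WER fluctuates between evaluations, it can help to scan the table programmatically. The sketch below copies the step, validation loss, and WER columns from the table above and reports the rows with the lowest WER and the lowest validation loss. It only reads the logged numbers. This card does not say whether any checkpoint other than the final one was saved or published.

```python
# (step, validation_loss, wer) copied from the training results table
RESULTS = [
    (25, 0.0621, 6.4145), (50, 0.0576, 7.0724), (75, 0.0554, 7.2094),
    (100, 0.0549, 7.3739), (125, 0.0533, 10.1425), (150, 0.0548, 7.5658),
    (175, 0.0582, 7.6206), (200, 0.0577, 6.2226), (225, 0.0581, 7.5932),
    (250, 0.0590, 7.3739), (275, 0.0609, 6.0033), (300, 0.0610, 5.1809),
    (325, 0.0600, 7.2368), (350, 0.0622, 8.6623), (375, 0.0738, 6.6886),
    (400, 0.0677, 7.2643), (425, 0.0748, 7.4013), (450, 0.0751, 8.0318),
    (475, 0.0719, 8.1963), (500, 0.0746, 8.3607), (525, 0.0691, 6.8805),
    (550, 0.0659, 6.0581), (575, 0.0647, 5.4002), (600, 0.0646, 5.4002),
]

best_wer = min(RESULTS, key=lambda row: row[2])   # row with the lowest WER
best_loss = min(RESULTS, key=lambda row: row[1])  # row with the lowest validation loss
final = RESULTS[-1]                               # last logged evaluation

print(f"lowest WER:      step {best_wer[0]}, WER {best_wer[2]}")
print(f"lowest val loss: step {best_loss[0]}, loss {best_loss[1]}")
print(f"final row:       step {final[0]}, WER {final[2]}, loss {final[1]}")
```

In these logs the lowest WER (5.1809, step 300) and the lowest validation loss (0.0533, step 125) fall on different rows, and both occur before the final evaluation reported at the top of this card.
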
### Framework versions

- Transformers 4.40.1
- Pytorch 2.2.0+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1