---
license: mit
datasets:
- ThinkingMachinesDataScience/Ratchada-STT
language:
- en
- th
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
tags:
- finance
---

# Ratchada-Fang-Thon-Whisper

## Model Description

Ratchada-Fang-Thon-Whisper is a fine-tuned version of the Whisper model, adapted for Thai speech recognition in financial contexts. It is designed to transcribe Thai audio accurately, particularly financial terminology and discussions.

![Image](https://huggingface.co/ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper/resolve/main/tools.jpg)

[Whisper](https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013) is a state-of-the-art transformer model that transcribes speech signals into text with high accuracy and low latency. We fine-tuned it with the Hugging Face Whisper implementation on our own GPU infrastructure, using a varied custom dataset of audio recordings and transcripts.

We monitored the training process and evaluated model performance with TensorBoard, a visualization tool for machine learning experiments.

### Key Features

- Specialized in Thai language transcription
- Fine-tuned for financial domain vocabulary
- Based on the Whisper medium model architecture
- Supports long-form transcription

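This card does not say how long-form transcription is configured. One common approach with the Transformers ASR pipeline is chunked inference through the `chunk_length_s` argument, since Whisper processes audio in 30-second windows. The sketch below assumes that approach; the helper names `long_form_pipeline_kwargs` and `transcribe_long` are hypothetical, and the 30-second default is an assumption rather than a setting confirmed for this model.

```python
from typing import Any, Dict

MODEL_ID = "ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper"


def long_form_pipeline_kwargs(chunk_length_s: float = 30.0) -> Dict[str, Any]:
    """Keyword arguments for a chunked ASR pipeline (the values are assumptions)."""
    return {
        "task": "automatic-speech-recognition",
        "model": MODEL_ID,
        # Split long recordings into windows of this many seconds.
        "chunk_length_s": chunk_length_s,
        "generate_kwargs": {"language": "th", "task": "transcribe"},
    }


def transcribe_long(path: str) -> str:
    """Transcribe a long recording; the model is downloaded on first use."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline

    pipe = pipeline(**long_form_pipeline_kwargs())
    return pipe(path)["text"]
```

Calling `transcribe_long("path/to/long_audio.wav")` returns the concatenated transcript of all chunks.
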
### Model Details

- Model Type: WhisperForConditionalGeneration
- Language: Thai
- Task: Automatic Speech Recognition (ASR)
- License: MIT

## Usage

### Standard Pipeline (Recommended)

You can use this model with the standard Transformers pipeline:

```python
import torch
from transformers import pipeline

# Use the first GPU if one is available, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper",
    device=device,
    generate_kwargs={"language": "th", "task": "transcribe"},
)

# The input can be a path to an audio file or a NumPy array holding the waveform.
result = pipe("path/to/audio/file.wav")
print(result["text"])
```

**Note**: Audio input should be resampled to 16 kHz (**sample_rate=16_000**) beforehand.

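One way to catch a mismatched sample rate before transcription is to read it from the WAV header. The sketch below uses Python's standard `wave` module; the helper names `wav_sample_rate` and `is_16khz_wav` are hypothetical and not part of this model or of Transformers, and the approach only covers uncompressed PCM WAV files.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sample rate recommended in the note above


def wav_sample_rate(path: str) -> int:
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz_wav(path: str) -> bool:
    """Hypothetical helper: True if the file is already sampled at 16 kHz."""
    return wav_sample_rate(path) == TARGET_SAMPLE_RATE
```

If the check fails, resample the audio with an audio library such as librosa or torchaudio before passing it to the model.
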
### Transformers Directly

You can also use this model directly through the Transformers module:

```python
import librosa
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from ratchada_processor import tokenize_text  # post-processor, strongly recommended

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# `waveform` is a NumPy array from an audio library; librosa is one option (an assumption here), torchaudio is another.
waveform, _ = librosa.load("path/to/audio/file.wav", sr=16000)

input_features = processor(
    waveform.squeeze(), sampling_rate=16000, return_tensors="pt"
).input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

# batch_decode returns one string per batch item; take the first one.
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# Split the text into components and post-process each one (see the GitHub repository).
processed_text = tokenize_text(transcription)
result = "".join(processed_text)
print(result)
```

**Note**: This method requires you to apply a post-processor to the model output yourself. The post-processor is available in the following package on PyPI:

```bash
python3 -m pip install ratchada-util
```

## Training

### Training Data

This model was fine-tuned on a proprietary dataset, ThinkingMachinesDataScience/Ratchada-STT, which contains Thai speech audio from financial contexts.

### Training Procedure

The model was fine-tuned from the biodatlab/whisper-th-medium-combined checkpoint, a Thai-specific version of the Whisper medium model.
After each model prediction, post-processing code is applied to refine the results.

## Limitations and Bias

1. The model is specifically trained on Thai financial audio data and may not perform as well on general Thai speech or other domains.
2. There might be biases present in the training data, which could affect the model's performance on certain types of speech or accents.

## Evaluation Results

Using our **own evaluation algorithm**, the model performs as follows (lower is better for every column):

| Model                | WER          | CER (jiwer)  | Deletions | Substitutions | Insertions |
|----------------------|--------------|--------------|-----------|---------------|------------|
| **RATFT-WHISPER**    | **0.332685** | **0.272674** | 1884      | 1806          | 5466       |
| WHISPER-LARGE-V3     | 0.392162     | 0.318666     | 2499      | 1489          | 6752       |
| THON-WHISPER         | 0.474360     | 0.405920     | 1722      | 2603          | 8597       |
| WHISPER-LARGE        | 0.593637     | 0.578926     | 5441      | 1500          | 9433       |
| WHISPER-LARGE-V2     | 0.595292     | 0.652592     | 4924      | 1866          | 9580       |
| WHISPER-MEDIUM       | 0.643084     | 0.66565      | 7471      | 1312          | 9090       |
| WHISPER-SMALL        | 0.667453     | 0.603361     | 4397      | 1817          | 12028      |
| WHISPER-BASE         | 0.791954     | 0.73896      | 3362      | 1906          | 16252      |

**Note**: CER is computed with [jiwer](https://pypi.org/project/jiwer/), a library for evaluating automatic speech recognition systems.

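For reference, WER and CER are both edit-distance ratios: substitutions plus deletions plus insertions, divided by the reference length, counted over words or over characters. The following is a minimal sketch of that standard definition, not the evaluation algorithm used for the table above. The function names are illustrative, and tokens are assumed to be segmented already; Thai is written without spaces between words, so WER depends on the word tokenizer chosen.

```python
from typing import Sequence, Tuple


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for a minimum-cost alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    table = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # delete every reference token
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # insert every hypothesis token
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = table[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = table[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = table[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            table[i][j] = min(substitution, deletion, insertion)
    _, subs, dels, ins = table[rows - 1][cols - 1]
    return subs, dels, ins


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """(S + D + I) / len(ref); pass word lists for WER, character lists for CER."""
    if not ref:
        raise ValueError("reference must not be empty")
    return sum(edit_counts(ref, hyp)) / len(ref)


# WER on pre-tokenized words: one substitution and one deletion over four words.
print(error_rate("a b c d".split(), "a x c".split()))  # 0.5
# CER on characters: "kitten" -> "sitting" needs three edits over six characters.
print(error_rate(list("kitten"), list("sitting")))  # 0.5
```

Because insertions are counted against the reference length, both rates can exceed 1.0 when the hypothesis is much longer than the reference.
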
## Ethical Considerations

Users should be aware that this model is designed for transcribing Thai speech in financial contexts. It should not be used for making financial decisions without human verification. Always cross-check important financial information obtained from this model.

## Citations

If you use this model in your research, please cite:

```
@misc{Ratchada-Fang-Thon-Whisper,
  author = {ThinkingMachinesDataScience},
  title = {Ratchada-Fang-Thon-Whisper: Thai Financial Speech Recognition Model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://huggingface.co/ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper}}
}
```

## Contacts

For questions and feedback about this model, please contact us through the [ThinkingMachinesDataScience](https://github.com/thinkingmachines/set-speechtotext-poc) GitHub repository for this project.