	---
	license: mit
	datasets:
	- ThinkingMachinesDataScience/Ratchada-STT
	language:
	- en
	- th
	metrics:
	- wer
	- cer
	pipeline_tag: automatic-speech-recognition
	tags:
	- finance
	---

	# Ratchada-Fang-Thon-Whisper

	## Model Description

	Ratchada-Fang-Thon-Whisper is a fine-tuned version of the Whisper model, specifically adapted for Thai speech recognition in financial contexts. This model is designed to transcribe Thai audio with high accuracy, particularly for financial terminology and discussions.

	![Image](https://huggingface.co/ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper/resolve/main/tools.jpg)

	[Whisper](https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013) is a state-of-the-art transformer model that transcribes speech into text with high accuracy and low latency. We fine-tuned the model with the Hugging Face Whisper implementation on our own GPU infrastructure, using a custom dataset of varied audio recordings and transcripts.

	We also monitored the training process and evaluated model performance with TensorBoard, a visualization tool for machine learning experiments.

	### Key Features

	- Specialized in Thai language transcription
	- Fine-tuned for financial domain vocabulary
	- Based on the Whisper medium model architecture
	- Supports long-form transcription
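
The card does not describe how long-form transcription is implemented. One common approach, sketched here as an assumption rather than as this model's actual method, builds on the fact that Whisper processes audio in 30-second windows: a longer recording is split into overlapping windows that are transcribed one at a time. The helper name `chunk_bounds` and the 5-second overlap are hypothetical choices for illustration, and merging the overlapping transcripts is left out.

```python
from typing import List, Tuple


def chunk_bounds(
    n_samples: int,
    sample_rate: int = 16_000,
    chunk_s: float = 30.0,   # Whisper's input window length in seconds
    overlap_s: float = 5.0,  # hypothetical overlap between neighbouring windows
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")

    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 70-second recording at 16 kHz is covered by three windows.
print(chunk_bounds(70 * 16_000))
# [(0, 480000), (400000, 880000), (800000, 1120000)]
```

The Transformers speech recognition pipeline offers a comparable windowing strategy through its `chunk_length_s` argument, so a hand-written splitter like this is mainly useful for understanding or for custom inference loops.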

	### Model Details

	- Model Type: WhisperForConditionalGeneration
	- Language: Thai
	- Task: Automatic Speech Recognition (ASR)
	- License: MIT

	## Usage

	### Standard Pipeline (Recommended)

	You can use this model with the standard Transformers pipeline:

	```python
	import torch
	from transformers import pipeline

	# Use the first GPU if one is available, otherwise fall back to the CPU
	device = 0 if torch.cuda.is_available() else "cpu"

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model="ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper",
	    device=device,
	    generate_kwargs={"language": "th", "task": "transcribe"},
	)

	# Accepts a path to an audio file or a NumPy array holding the waveform
	result = pipe("path/to/audio/file.wav")
	print(result["text"])
	```

	Note: We recommend resampling the audio input to 16 kHz (`sample_rate=16_000`) beforehand.
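
As a minimal sketch of how to check that recommendation before transcribing, the snippet below reads the sample rate from a WAV file header with Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical, and the sketch assumes uncompressed PCM WAV input; other formats need an audio library.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate recommended in the note above


def wav_sample_rate(path: str) -> int:
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target
```

If a file does need resampling, one option is `librosa.load(path, sr=16000)`, which resamples while loading.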

	### Transformers Directly

	You can also use this model directly through the Transformers library:

	```python
	import librosa  # assumption: librosa is installed; torchaudio works as well
	import torch
	from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
	from ratchada_processor import tokenize_text  # post-processor, strongly recommended

	model_id = "ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper"
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	processor = AutoProcessor.from_pretrained(model_id)
	model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

	# waveform must be a NumPy array sampled at 16 kHz
	waveform, _ = librosa.load("path/to/audio/file.wav", sr=16000)

	input_features = processor(
	    waveform.squeeze(), sampling_rate=16000, return_tensors="pt"
	).input_features.to(device)

	with torch.no_grad():
	    predicted_ids = model.generate(input_features)

	# batch_decode returns one string per batch item; take the first (only) item
	transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

	# Split the text into tokens and post-process them (see the GitHub repository)
	processed_text = tokenize_text(transcription)
	result = "".join(processed_text)

	print(result)
	```

	Note: With this method you must apply the post-processor to the model output yourself. The post-processor is available in the following PyPI package:

	```bash
	python3 -m pip install ratchada-util
	```

	## Training

	### Training Data

	This model was fine-tuned on a proprietary dataset: ThinkingMachinesDataScience/Ratchada-STT. The dataset contains Thai speech audio from financial contexts.

	### Training Procedure

	The model was fine-tuned from the biodatlab/whisper-th-medium-combined checkpoint, which is a Thai-specific version of the Whisper medium model.
	Post-processing code is applied to each model prediction to refine the result.

	## Limitations and Bias

	1. The model is specifically trained on Thai financial audio data and may not perform as well on general Thai speech or other domains.
	2. There might be biases present in the training data, which could affect the model's performance on certain types of speech or accents.

	## Evaluation Results

	The table below reports the performance of this model and several baselines, measured with our own evaluation algorithm. Lower is better for every metric.

	| Model            | WER      | CER (jiwer) | Deletions | Substitutions | Insertions |
	|------------------|----------|-------------|-----------|---------------|------------|
	| RATFT-WHISPER    | 0.332685 | 0.272674    | 1884      | 1806          | 5466       |
	| WHISPER-LARGE-V3 | 0.392162 | 0.318666    | 2499      | 1489          | 6752       |
	| THON-WHISPER     | 0.474360 | 0.405920    | 1722      | 2603          | 8597       |
	| WHISPER-LARGE    | 0.593637 | 0.578926    | 5441      | 1500          | 9433       |
	| WHISPER-LARGE-V2 | 0.595292 | 0.652592    | 4924      | 1866          | 9580       |
	| WHISPER-MEDIUM   | 0.643084 | 0.66565     | 7471      | 1312          | 9090       |
	| WHISPER-SMALL    | 0.667453 | 0.603361    | 4397      | 1817          | 12028      |
	| WHISPER-BASE     | 0.791954 | 0.73896     | 3362      | 1906          | 16252      |

	Note: CER is computed with [jiwer](https://pypi.org/project/jiwer/), a Python package for evaluating automatic speech recognition systems.
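
The evaluation algorithm itself is not published in this card, so the following is only a minimal sketch of how WER and CER are conventionally defined: the edit distance (substitutions + deletions + insertions) between reference and hypothesis, divided by the reference length. All function names are illustrative. Because Thai is written without spaces between words, the sketch assumes the caller passes already-segmented word lists for WER.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum number of substitutions, deletions, insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref_words: Sequence[str], hyp_words: Sequence[str]) -> float:
    """Word error rate over pre-segmented word lists."""
    return error_rate(ref_words, hyp_words)


def cer(ref_text: str, hyp_text: str) -> float:
    """Character error rate over raw strings."""
    return error_rate(list(ref_text), list(hyp_text))


# One substitution and one insertion against a four-word reference.
print(wer(["the", "bank", "raised", "rates"],
          ["the", "bank", "raise", "rates", "today"]))  # 0.5
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is why insertion counts are worth reporting alongside the rates.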

	## Ethical Considerations
	Users should be aware that this model is designed for transcribing Thai speech in financial contexts. It should not be used for making financial decisions without human verification. Always cross-check important financial information obtained from this model.

	## Citations
	If you use this model in your research, please cite:
	```
	@misc{Ratchada-Fang-Thon-Whisper,
	  author = {ThinkingMachinesDataScience},
	  title = {Ratchada-Fang-Thon-Whisper: Thai Financial Speech Recognition Model},
	  year = {2023},
	  publisher = {GitHub},
	  journal = {GitHub repository},
	  howpublished = {\url{https://huggingface.co/ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper}}
	}
	```

	## Contacts
	For questions and feedback about this model, please contact us through the [ThinkingMachinesDataScience](https://github.com/thinkingmachines/set-speechtotext-poc) GitHub repository for this project.