Model's Improvement

This model card highlights the improvement over the base model: a reduction in word error rate (WER) from 72% to 54%. This improvement reflects the efficacy of the fine-tuning process on Hindi speech data.

Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker

This model is a fine-tuned version of theainerd/Wav2Vec2-large-xlsr-hindi on the Common Voice 13 and 17 datasets. It is specifically optimized for Hindi speech recognition, with a notable improvement in transcription accuracy, achieving a Word Error Rate (WER) of 54%, compared to the base model’s WER of 72% on the same dataset.
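To put the headline numbers in perspective, the short calculation below converts the two reported WER figures into an absolute and a relative reduction. It uses only the values stated above; the function name is illustrative.

```python
def wer_reduction(base_wer, new_wer):
    """Return (absolute, relative) WER reduction. Inputs are fractions, e.g. 0.72."""
    absolute = base_wer - new_wer
    relative = absolute / base_wer
    return absolute, relative


# WER values reported in this card: base model 72%, fine-tuned model 54%.
absolute, relative = wer_reduction(0.72, 0.54)
print(f"Absolute reduction: {absolute * 100:.0f} percentage points")
print(f"Relative reduction: {relative * 100:.0f}%")
```

In other words, the fine-tuned model makes about a quarter fewer word errors than the base model on this test set.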

Model description

Wav2Vec2, originally developed by Facebook AI, is pretrained with self-supervised learning on large unlabeled speech datasets and is then fine-tuned on labeled data. This approach lets the model learn speech representations without transcripts before it is trained to map them to Hindi text. Fine-tuning on Common Voice Hindi data allows the model to better capture the language's nuances, improving transcription quality over the base model.

Intended uses & limitations

This model is ideal for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.

Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Hindi Common Voice dataset
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Function to process the dataset
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
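The `torch.argmax` step followed by `processor.batch_decode` amounts to greedy CTC decoding: pick the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that rule on plain Python lists. The toy vocabulary and the blank id are assumptions made for illustration and do not reflect the model's actual vocabulary.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions, then remove blanks."""
    # groupby merges runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical 4-token vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "क", 2: "म", 3: "ल"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]  # per-frame argmax ids
print(greedy_ctc_decode(frames, toy_vocab))
```

The blank between two identical ids is what allows a genuinely doubled character to survive the collapse step.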

Evaluation

The model can be evaluated as follows on the Hindi test data of Common Voice.

import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset and metrics
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
# The hyphen is escaped so it is matched literally instead of forming the range "!" to ";"
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Function to preprocess the data
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
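The WER reported above is the word-level edit distance between predictions and references, divided by the number of reference words. As a rough, dependency-free illustration of that definition (not the implementation behind the `wer` metric), it can be sketched as follows. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        r, h = ref.split(), hyp.split()
        # prev[j] holds the edit distance between the reference words
        # processed so far and the first j hypothesis words.
        prev = list(range(len(h) + 1))
        for i in range(1, len(r) + 1):
            curr = [i] + [0] * len(h)
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                curr[j] = min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            prev = curr
        total_edits += prev[len(h)]
        total_words += len(r)
    return total_edits / total_words


# Made-up example: one substitution in the first pair, one deletion in the second.
refs = ["यह एक परीक्षण है", "नमस्ते दुनिया"]
hyps = ["यह एक परीक्षा है", "नमस्ते"]
print("WER: {:.2f}".format(100 * word_error_rate(refs, hyps)))
```

Because the score is normalized by reference length, a WER of 54% means roughly one word edit for every two reference words.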

Limitations:

The model may face challenges with dialectal or regional variations within Hindi.
Performance can degrade with noisy audio or overlapping speech.
It is not intended for real-time transcription due to latency considerations.

Training and evaluation data

The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model’s ability to generalize across different speech patterns. Evaluation was performed on a carefully curated subset, ensuring a reliable benchmark for ASR performance in Hindi.

Training procedure

Hyperparameters and setup:

The following hyperparameters were used during training:

Learning rate: 1e-4
Batch size: 16 (per device)
Gradient accumulation steps: 2
Evaluation strategy: steps
Max steps: 2500
Mixed precision: FP16
Save steps: 500
Evaluation steps: 500
Logging steps: 500
Warmup steps: 500
Save total limit: 1

Training output

Global step: 2500
Training runtime: Approximately 1 hour 21 minutes
Epochs: 5-6
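
The batch size, gradient accumulation, and step count listed above determine the effective batch size and the total number of training examples processed. The arithmetic below assumes a single training device, which this card does not state; with more devices the totals scale proportionally.

```python
# Values taken from the hyperparameter list in this card.
per_device_batch_size = 16
gradient_accumulation_steps = 2
max_steps = 2500

# Assumption: one training device (not stated in the card).
num_devices = 1

# Examples consumed per optimizer update, then over the whole run.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
examples_processed = effective_batch_size * max_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples processed over {max_steps} steps: {examples_processed}")
```

Under that assumption, the reported 5-6 epochs would correspond to roughly 13,000 to 16,000 training examples per epoch.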

Training results

Step	Training Loss	Validation Loss	WER
500	5.603000	0.987691	0.7556
1000	0.720300	0.667561	0.6196
1500	0.507000	0.592814	0.5844
2000	0.431100	0.549786	0.5439
2500	0.395600	0.537703	0.5428
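
Reading the WER column as a sequence shows where most of the gain occurs. The snippet below copies the values from the table and prints the change between consecutive checkpoints; the variable names are illustrative.

```python
# (step, WER) pairs copied from the training results table.
checkpoints = [
    (500, 0.7556),
    (1000, 0.6196),
    (1500, 0.5844),
    (2000, 0.5439),
    (2500, 0.5428),
]

# WER improvement at each checkpoint relative to the previous one.
deltas = [
    (step, round(prev_wer - wer, 4))
    for (_, prev_wer), (step, wer) in zip(checkpoints, checkpoints[1:])
]

for step, gain in deltas:
    print(f"Step {step}: WER improved by {gain * 100:.2f} points")
```

Most of the improvement arrives by step 1000, and the last 500 steps contribute only about 0.1 points, which suggests the run was close to converging at 2500 steps.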

Framework versions

Transformers: 4.42.4
PyTorch: 2.3.1+cu121
Datasets: 2.20.0
Tokenizers: 0.19.1
