Swahili MMS TTS - Finetuned Model

This is a fine-tuned version of the Facebook MMS (Massively Multilingual Speech) model for Swahili Text-to-Speech (TTS). The model was fine-tuned to improve Swahili pronunciation and performance using custom audio datasets.

Model Details

Model Name: Swahili MMS TTS - Finetuned
Languages Supported: Swahili
Base Model: Facebook MMS
Use Case: Text-to-Speech for the Swahili language, generating spoken audio from Swahili text.

Training Details

The model was fine-tuned on a custom dataset of Swahili voice samples to improve the fluency and accuracy of the original MMS model in Swahili. This resulted in better pronunciation and more natural-sounding Swahili speech.

The fine-tuning code and process are available in the GitHub repository.

How to Use

You can load and use the model directly from the Hugging Face model hub in two ways: by loading the model and tokenizer yourself, or through the pipeline API.

1. Download and Run the Model Directly

Load the model and tokenizer with VitsModel and AutoTokenizer, then run inference without the Hugging Face pipeline helper. Here's how:

import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "Benjamin-png/swahili-mms-tts-finetuned"
text = "Habari, karibu kwenye mfumo wetu wa kusikiliza kwa Kiswahili."
audio_file_path = "swahili_speech.wav"

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = VitsModel.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 1: Tokenize the input text
inputs = tokenizer(text, return_tensors="pt").to(device)

# Step 2: Generate waveform
with torch.no_grad():
    output = model(**inputs).waveform

# Step 3: Convert PyTorch tensor to NumPy array
output_np = output.squeeze().cpu().numpy()

# Step 4: Write to WAV file
scipy.io.wavfile.write(audio_file_path, rate=model.config.sampling_rate, data=output_np)
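The call above stores the waveform as floating-point samples, which some audio players may not handle. As a minimal sketch, assuming the waveform is mono and roughly within [-1.0, 1.0], the helpers below convert it to 16-bit PCM and write it with Python's standard wave module. The names float_to_pcm16 and write_pcm16_wav are hypothetical and not part of transformers or scipy.

```python
import wave

import numpy as np


def float_to_pcm16(waveform):
    """Scale a float waveform in [-1.0, 1.0] to little-endian 16-bit integers."""
    # Flatten so that a (1, N) batch of one utterance becomes a 1-D array.
    samples = np.asarray(waveform, dtype=np.float32).reshape(-1)
    # Clip out-of-range peaks so they do not wrap around after the cast.
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767.0).astype("<i2")


def write_pcm16_wav(path, waveform, sampling_rate):
    """Write a mono 16-bit PCM WAV file and return the number of samples."""
    pcm = float_to_pcm16(waveform)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumption)
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bits
        wav_file.setframerate(int(sampling_rate))
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

With the variables from the example above, this could be called as write_pcm16_wav("swahili_speech_pcm16.wav", output_np, model.config.sampling_rate).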

2. Using the `pipeline` API

from transformers import pipeline

# Load the fine-tuned model
tts = pipeline("text-to-speech", model="Benjamin-png/swahili-mms-tts-finetuned")

# Generate speech from text
speech = tts("Habari, karibu kwenye mfumo wetu wa kusikiliza kwa Kiswahili.")
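The text-to-speech pipeline returns a dictionary with an "audio" entry (the waveform) and a "sampling_rate" entry. The exact shape of the audio array can differ between transformers versions, so the sketch below flattens it before use. The helper name unpack_tts_output is hypothetical, and the single-utterance, mono input is an assumption.

```python
import numpy as np


def unpack_tts_output(speech):
    """Return (1-D float32 waveform, sampling rate) from a pipeline result dict."""
    # Flatten so that both (N,) and (1, N) outputs become a 1-D array.
    audio = np.asarray(speech["audio"], dtype=np.float32).reshape(-1)
    sampling_rate = int(speech["sampling_rate"])
    return audio, sampling_rate
```

The returned pair can then be written to disk, for example with scipy.io.wavfile.write("swahili_speech.wav", rate=sampling_rate, data=audio).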

Saving and Playing the Audio

The snippets below reuse output_np and model from the direct-inference example (method 1) to save the waveform with soundfile and then play it back:

Saving the Audio

import soundfile as sf

# Save the audio as a WAV file
sf.write("swahili_speech.wav", output_np, model.config.sampling_rate)

Playing the Audio

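You can play the audio using pydub. Playback relies on a backend being available on your system (simpleaudio, pyaudio, or ffplay from FFmpeg):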

from pydub import AudioSegment
from pydub.playback import play

# Load and play the generated audio
audio = AudioSegment.from_wav("swahili_speech.wav")
play(audio)

Make sure to install the required libraries:

pip install torch transformers numpy soundfile scipy pydub
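To confirm the installation before running the examples, a small check like the following can help. It is only a sketch: missing_packages is a hypothetical helper, and it assumes each package's import name matches its pip name, which holds for the six listed here.

```python
import importlib.util

# Packages from the pip command above; import names match the pip names here.
REQUIRED = ["torch", "transformers", "numpy", "soundfile", "scipy", "pydub"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```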

Example Notebook

If you're interested in reproducing the fine-tuning process or using the model for similar purposes, you can check out the Google Colab notebook that outlines the entire process:

Google Colab Notebook

The notebook includes detailed steps on how to fine-tune the MMS model for Swahili TTS.

GitHub Repository

For further exploration and code snippets, visit the GitHub repository where you’ll find additional scripts, datasets, and instructions for customizing the model.

License

This project is licensed under the terms of the Apache License 2.0.