	---
	tags:
	- audio
	- text-to-speech
	- onnx
	base_model:
	- hexgrad/Kokoro-82M
	inference: false
	language: en
	license: apache-2.0
	library_name: txtai
	---

	# Kokoro Base (82M) Model for ONNX

	[Kokoro 82M](https://huggingface.co/hexgrad/Kokoro-82M) exported to ONNX. The model is the same ONNX file that's in the base repository. The voices file is from [this repository](https://github.com/thewh1teagle/kokoro-onnx/releases/tag/model-files).

	## Usage with txtai

	[txtai](https://github.com/neuml/txtai) has a built-in Text to Speech (TTS) pipeline that makes using this model easy.

	_Note: This requires txtai >= 8.3.0. Install from GitHub until that release._

	```python
	import soundfile as sf

	from txtai.pipeline import TextToSpeech

	# Build pipeline
	tts = TextToSpeech("NeuML/kokoro-base-onnx")

	# Generate speech
	speech, rate = tts("Say something here")

	# Write to file
	sf.write("out.wav", speech, rate)
	```

	## Usage with ONNX

	This model can also be run directly with ONNX provided the input text is tokenized. Tokenization can be done with [ttstokenizer](https://github.com/neuml/ttstokenizer). `ttstokenizer` is a permissively licensed library with no external dependencies (such as espeak).

	Note that the txtai pipeline has additional functionality such as batching large inputs together that would need to be duplicated with this method.

	```python
	import json
	import numpy as np
	import onnxruntime
	import soundfile as sf

	from ttstokenizer import IPATokenizer

	# This example assumes the files have been downloaded locally
	with open("kokoro-base-onnx/voices.json", "r", encoding="utf-8") as f:
	    voices = json.load(f)

	# Create model
	model = onnxruntime.InferenceSession(
	    "kokoro-base-onnx/model.onnx",
	    providers=["CPUExecutionProvider"]
	)

	# Create tokenizer
	tokenizer = IPATokenizer()

	# Tokenize inputs
	inputs = tokenizer("Say something here")

	# Get speaker array
	speaker = np.array(voices["af"], dtype=np.float32)

	# Generate speech
	outputs = model.run(None, {
	    "tokens": [[0, *inputs, 0]],
	    "style": speaker[len(inputs)],
	    "speed": np.ones(1, dtype=np.float32) * 1.0
	})

	# Write to file
	sf.write("out.wav", outputs[0], 24000)
	```
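
The note above mentions that the txtai pipeline batches large inputs. When running the model directly, one simple way to approximate this is to split long text into sentence-sized chunks and synthesize each chunk separately. The sketch below illustrates that idea under stated assumptions and is not how txtai implements it: `split_text` and `max_chars` are hypothetical names, and the character limit is an arbitrary stand-in for whatever input length the model can handle.

```python
import re

def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    Hypothetical helper for illustration; max_chars is an arbitrary limit.
    """
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue

        # Try to append the sentence to the chunk being built
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue

        # The chunk is full: store it and start a new one
        if current:
            chunks.append(current)

        # A single sentence longer than max_chars is cut at fixed offsets
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence

    if current:
        chunks.append(current)

    return chunks
```

Each chunk could then be tokenized and passed to `model.run` as in the example above, with the resulting audio arrays joined (for instance with `np.concatenate`) before writing to file.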

	## Speaker reference

	The Kokoro model has a number of built-in speakers.

	When using this model, set a `speaker` id from the reference table below.

	| SPEAKER | GENDER | NATIONALITY | EXAMPLE |
	|:------------|:---------|:---------------|:-----------------|
	| af | F | American | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/af.mp3) |
	| af_bella | F | American | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/af_bella.mp3) |
	| af_nicole | F | American | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/af_nicole.mp3) |
	| af_sarah | F | American | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/af_sarah.mp3) |
	| af_sky | F | American | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/af_sky.mp3) |
	| am_adam | M | American | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/am_adam.mp3) |
	| am_michael | M | American | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/am_michael.mp3) |
	| bf_emma | F | British | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/bf_emma.mp3) |
	| bf_isabella | F | British | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/bf_isabella.mp3) |
	| bm_george | M | British | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/bm_george.mp3) |
	| bm_lewis | M | British | [Link](https://huggingface.co/NeuML/kokoro-base-onnx/resolve/main/examples/bm_lewis.mp3) |

	The following example shows how to set a speaker id when using txtai:

	```python
	speech, rate = tts("Say something here", speaker="af_sky")
	```
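
The speaker ids in the reference table appear to encode their own metadata: the first letter lines up with nationality (`a` for American, `b` for British) and the second with gender (`f` or `m`). The sketch below decodes an id on that basis. This pattern is inferred only from the table, not from documented Kokoro or txtai behavior, and `describe_speaker` is a hypothetical helper name.

```python
# Mappings inferred from the speaker reference table (an assumption, not a documented API)
NATIONALITY = {"a": "American", "b": "British"}
GENDER = {"f": "F", "m": "M"}

def describe_speaker(speaker):
    """Return (gender, nationality) inferred from a speaker id such as 'bf_emma'."""
    # The two-letter prefix comes before the optional underscore and name
    prefix = speaker.split("_", 1)[0]

    if len(prefix) != 2 or prefix[0] not in NATIONALITY or prefix[1] not in GENDER:
        raise ValueError(f"Unrecognized speaker id: {speaker!r}")

    return GENDER[prefix[1]], NATIONALITY[prefix[0]]
```

A check like this can catch a mistyped speaker id before it is passed to the pipeline or used as a key into the voices file.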