Step-Audio-Tokenizer

Step-Audio LLM is the industry’s first 130-billion-parameter humanlike unified end-to-end model that integrates multimodal speech understanding and generation capabilities, including singing voice synthesis, tool utilization, role-play, and multilingual/dialectal comprehension and synthesis.

This repository provides the speech tokenizer component of Step-Audio LLM, which consists of two tokenizers. The linguistic tokenizer takes the output of the Paraformer encoder and quantizes it into discrete tokens at a rate of 16.7 Hz. The semantic tokenizer is CosyVoice’s tokenizer, which is designed to efficiently encode the features needed to generate natural and expressive speech and operates at a rate of 25 Hz.
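To give a feel for what these token rates mean in practice, the following minimal sketch estimates how many tokens each tokenizer would emit for a clip of a given length. It is plain arithmetic on the two rates quoted above and does not call the tokenizer itself. The names `approx_token_count` and `token_budget` are hypothetical helpers introduced for illustration, and the sketch assumes the rounded figure of 16.7 Hz is used as-is, so real counts may differ slightly.

```python
# Back-of-the-envelope token counts from the rates stated in this README.
# Assumption: the rates are treated as exact constants; the real tokenizer
# may frame or pad audio differently, so treat the results as estimates.

LINGUISTIC_RATE_HZ = 16.7  # Paraformer-based linguistic tokens per second
SEMANTIC_RATE_HZ = 25.0    # CosyVoice-based semantic tokens per second


def approx_token_count(duration_s: float, rate_hz: float) -> int:
    """Hypothetical helper: estimated token count for a clip of duration_s seconds."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return round(duration_s * rate_hz)


def token_budget(duration_s: float) -> dict:
    """Hypothetical helper: estimated counts for both token streams."""
    return {
        "linguistic": approx_token_count(duration_s, LINGUISTIC_RATE_HZ),
        "semantic": approx_token_count(duration_s, SEMANTIC_RATE_HZ),
    }


if __name__ == "__main__":
    # Print the estimated token counts for a 10-second clip.
    print(token_budget(10.0))
```

An estimate like this can help when sizing the context length needed for a given amount of audio, since both token streams grow linearly with clip duration.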

More information

For more information, please refer to our repository: Step-Audio.
