PicoAudio: Enabling Precise Timing and Frequency Controllability of Audio Events in Text-to-audio Generation

Contributions:

A data simulation pipeline tailored to controllable audio generation frameworks;
A timing-controllable audio generation framework that enables precise control over the timing and frequency of sound events;
Arbitrary precise timing-related control, achieved by integrating large language models.

Inference

You can try the demo on Huggingface Online Inference or Github Demo. Alternatively, use the "inference.py" script provided on Huggingface Inference to generate audio. Huggingface Online Inference uses Gemini as the preprocessor; a GPT preprocessing script consistent with the paper is also provided in "llm_preprocess.py".

Simulated Dataset

Simulated data can be downloaded from (1) HuggingfaceDataset or (2) BaiduNetDisk with the extraction code "pico".
The metadata is stored in "data/meta_data/{}.json". One instance looks as follows:

{
  "filepath": "data/multi_event_test/syn_1.wav",
  "onoffCaption": "cat meowing at 0.5-2.0, 3.0-4.5 and whistling at 5.0-6.5 and explosion at 7.0-8.0, 8.5-9.5",
  "frequencyCaption": "cat meowing two times and whistling one times and explosion two times"
}

where:

"filepath" indicates the path to the audio file.
"frequencyCaption" contains information about the occurrence frequency.
"onoffCaption" contains the onset and offset times of each event.
For the test file "test-frequency-control_onoffFromGpt_{}.json", the "onoffCaption" is derived from the "frequencyCaption" by GPT-4 and is used for evaluation in the frequency control task.
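
To make the caption format concrete, the following is a minimal sketch of how an "onoffCaption" string could be split into individual events and how per-event occurrence counts could be recovered from it. It is not part of the PicoAudio code base: the function names parse_onoff_caption and count_events are hypothetical, and the parsing assumes that captions follow the pattern of the example instance above (clauses joined by " and ", each of the form "<event> at <on>-<off>, <on>-<off>, ...").

```python
import re
from collections import Counter

# Matches one "<onset>-<offset>" pair such as "0.5-2.0".
SEGMENT = re.compile(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)")


def parse_onoff_caption(caption):
    """Split an onoffCaption into (event, onset, offset) tuples.

    Hypothetical helper: assumes clauses are joined by " and " and that
    each clause reads "<event> at <on>-<off>, <on>-<off>, ...".
    """
    events = []
    for clause in caption.split(" and "):
        # The text before the last " at " is taken as the event name.
        name, _, times = clause.rpartition(" at ")
        for onset, offset in SEGMENT.findall(times):
            events.append((name.strip(), float(onset), float(offset)))
    return events


def count_events(events):
    """Count how many segments each event name has."""
    return Counter(name for name, _, _ in events)


if __name__ == "__main__":
    caption = (
        "cat meowing at 0.5-2.0, 3.0-4.5 and whistling at 5.0-6.5 "
        "and explosion at 7.0-8.0, 8.5-9.5"
    )
    for event in parse_onoff_caption(caption):
        print(event)
    print(dict(count_events(parse_onoff_caption(caption))))
```

For the example instance, the counts obtained this way (two, one, two) agree with its "frequencyCaption". Event names that themselves contain " and " would need a stricter parser.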

Training

Download data into the "data" folder. The training and inference code can be found in the "picoaudio" folder.

cd picoaudio
pip install -r requirements.txt

To start training:

  accelerate launch runner/controllable_train.py

Acknowledgement

Our code refers to AudioLDM and Tango. We thank the authors for open-sourcing their code.
