Add details on fine-tuning

#11
by SkyyySi - opened

For this model to really be useful, it should be possible to locally fine-tune it, e.g. to produce different genres of music that were underrepresented in the dataset.

Seems like there are instructions already: https://github.com/Stability-AI/stable-audio-tools?tab=readme-ov-file#fine-tuning

That said, this is only for full fine-tunes. Being able to train and share, like, LoRAs or something like that would be dope.

Hello @SkyyySi (and everyone else)

When I try to fine-tune the model, it throws a ValueError:

    ValueError: Conditioner key prompt not found in batch metadata

I pass --pretrained-ckpt-path with an unwrapped model checkpoint.

EDIT: My fault, I had an invalid config file for my dataset. Sorry

@ImusingX what did you use as your model-config along with the checkpoint?

@ImusingX Please share what you did, as I am stuck on that as well.


My dataset config had custom_metadata_module at the root of the JSON file, but it has to be inside each entry of the datasets key.

As in, I had:

    {
    "dataset_type": "audio_dir",
    "datasets": [
        {
           ...
        }
    ],
    "custom_metadata_module": "/mnt/e/tts/ok/stable-audio-tools-main/mydata/metadata.py",  <--- here
    "random_crop": true
    }

but it had to be:

    {
    "dataset_type": "audio_dir",
    "datasets": [
        {
           ...
            "custom_metadata_module": "/mnt/e/tts/ok/stable-audio-tools-main/mydata/metadata.py"  <--- should be here
        }
    ],
    "random_crop": true
    }

@ImusingX Thank you, I was able to get it running by adding a custom metadata module (I tried just placing .json files next to each audio file, but that didn't seem to work).

However, as it turns out, 24 GB of VRAM seems to not be enough for full fine-tuning. I tried it on Arch Linux with the latest PyTorch for ROCm nightly build on an RX 7900 XTX, but I always get an out-of-memory error, even after stopping every other graphics process.

Unless someone develops a LoRA / embedding trainer for this, training on consumer hardware seems to be out of the question for now.

Stability AI org

No official LoRA training in stable-audio-tools yet, but there is a repo made by a community member to train LoRAs for stable-audio-tools models. I haven't tried it myself, but I've heard of others getting it to work: https://github.com/NeuralNotW0rk/LoRAW
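For anyone unfamiliar with why LoRA helps with the memory problem above: it freezes the pretrained weights and trains only a low-rank update, W_eff = W + (alpha/r) * B @ A, so far fewer parameters (and optimizer states) live in VRAM than in a full fine-tune. A minimal NumPy sketch of the idea, not LoRAW's actual API; all names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

# Frozen pretrained weight: never updated during LoRA training.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors. B starts at zero so the adapted
# layer initially behaves exactly like the pretrained one.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; only A and B
    # (d_out*r + r*d_in params) get gradients, not W (d_out*d_in).
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((2, d_in))
# With B == 0 the adapter is a no-op: output matches the base layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

Since only A and B are trained, the adapter can also be shared on its own as a small file, which is what makes LoRA swapping practical.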

When fine-tuning, I always just get noise in the demo audio files. I used these configs as a base, with a pre-trained laion_clap checkpoint I found. But no joy.


Use LoRAW (https://github.com/NeuralNotW0rk/LoRAW); that works very well for me.

@ImusingX what is the rest of your config with LoRAW, and do you need weights for that too?

I realised the training config used for this model checkpoint is in the Hugging Face files, as model_config.json! This is what I missed.
