|
--- |
|
tags: |
|
- music-generation |
|
- transformer |
|
- pytorch |
|
- audio |
|
- music |
|
- piano |
|
license: mit |
|
--- |
|
# Compose & Embellish: Piano Performance Generation Pipeline |
|
Trained model weights and training datasets for the paper: |
|
* Shih-Lun Wu and Yi-Hsuan Yang, "[Compose & Embellish: Well-Structured Piano Performance Generation via A Two-Stage Approach](https://arxiv.org/abs/2209.08212)," _Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)_, 2023.
|
|
|
**Note:** The materials here should be used in conjunction with our [model implementation GitHub repo](https://github.com/slSeanWU/Compose_and_Embellish).
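
If you just want to sanity-check a downloaded checkpoint before wiring it into the repo's scripts, here is a minimal sketch that loads it as a raw PyTorch state dict and tallies its parameters. It assumes the `.bin` files are plain `torch.save`d state dicts (if a file instead wraps the weights in a larger training-checkpoint dict, look for a nested key and adjust accordingly):

```python
import torch

# Minimal sanity check: load the raw state dict and count parameters.
# Assumes the .bin file is a plain torch.save'd state dict.
ckpt_path = "embellish_model_gpt2_pop1k7_loss0.398.bin"
state_dict = torch.load(ckpt_path, map_location="cpu")

total = 0
for name, value in state_dict.items():
    if torch.is_tensor(value):          # skip any non-tensor metadata
        print(f"{name}: {tuple(value.shape)}")
        total += value.numel()

# Should land near the 38.2M figure quoted below for this checkpoint.
print(f"~{total / 1e6:.1f}M parameters")
```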
|
|
|
## Model characteristics |
|
### Stage 1: "Compose" model |
|
Generates a **melody and chord progression** from scratch.
|
|
|
- Model backbone: 12-layer Transformer w/ relative positional encoding |
|
- Num trainable params: 41.3M |
|
- Token vocabulary: [Revamped MIDI-derived events](https://arxiv.org/abs/2002.00212) (**REMI**) w/ slight modifications |
|
- Pretraining dataset: subset of [Lakh MIDI full](https://colinraffel.com/projects/lmd/) (**LMD-full**), 14,934 songs

  - melody extraction (and data filtering) done by **matching lyrics to tracks**: [create_dataset.py](https://github.com/gulnazaki/lyrics-melody/blob/main/pre-processing/create_dataset.py)

  - structural segmentation done with **A\* search**: [hierarchical-structure-analysis](https://github.com/Dsqvival/hierarchical-structure-analysis)
|
- Finetuning dataset: subset of [AILabs.tw Pop1K7](https://github.com/YatingMusic/compound-word-transformer) (**Pop1K7**), 1,591 songs

  - melody extraction done with the **skyline algorithm** (sketched after this list): [analyzer.py](https://github.com/wazenmai/MIDI-BERT/blob/CP/melody_extraction/skyline/analyzer.py)

  - structural segmentation done in the same way as for the pretraining dataset
|
- Training sequence length: 2400 |
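
The skyline heuristic linked in the list above is simple enough to sketch in a few lines: at each onset, keep only the highest-pitched note, then truncate overlaps so the extracted line stays monophonic. The toy version below works on plain `(onset, offset, pitch)` tuples; the linked `analyzer.py` operates on real MIDI data and applies additional filtering:

```python
from typing import List, Tuple

Note = Tuple[float, float, int]  # (onset, offset, MIDI pitch)

def skyline(notes: List[Note]) -> List[Note]:
    """Toy skyline melody extraction: at every onset keep the
    highest-pitched note, then truncate to stay monophonic."""
    melody: List[Note] = []
    for onset in sorted({n[0] for n in notes}):
        starting = [n for n in notes if n[0] == onset]
        melody.append(max(starting, key=lambda n: n[2]))
    truncated: List[Note] = []
    for i, (onset, offset, pitch) in enumerate(melody):
        if i + 1 < len(melody):            # cut at the next melody onset
            offset = min(offset, melody[i + 1][0])
        truncated.append((onset, offset, pitch))
    return truncated
```

For example, `skyline([(0.0, 2.0, 60), (0.0, 1.0, 72), (1.0, 2.0, 76)])` keeps the two upper notes and returns `[(0.0, 1.0, 72), (1.0, 2.0, 76)]`.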
|
### Stage 2: "Embellish" model |
|
Generates **accompaniment, timing, and dynamics** conditioned on Stage 1 outputs (a schematic sketch of the two-stage flow follows the list below).
|
- `embellish_model_gpt2_pop1k7_loss0.398.bin` |
|
- Model backbone: 12-layer **GPT-2 Transformer** ([implementation](https://huggingface.co/docs/transformers/en/model_doc/gpt2)) |
|
- Num trainable params: 38.2M |
|
- `embellish_model_pop1k7_loss0.399.bin` (requires the `fast-transformers` package, which is outdated as of Jul. 2024)
|
- Model backbone: 12-layer **Performer** ([paper](https://arxiv.org/abs/2009.14794), [implementation](https://github.com/idiap/fast-transformers)) |
|
- Num trainable params: 38.2M |
|
- Token vocabulary: [Revamped MIDI-derived events](https://arxiv.org/abs/2002.00212) (**REMI**) w/ slight modifications (an illustrative token fragment is shown after this list)
|
- Training dataset: [AILabs.tw Pop1K7](https://github.com/YatingMusic/compound-word-transformer) (**Pop1K7**), 1,747 songs
|
- Training sequence length: 3072 |
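
For intuition about the REMI-style vocabularies used by both stages, here is a hand-written event fragment in the spirit of the original REMI paper (bar, sub-beat position, chord, tempo, and note events). The event names and quantization grids are schematic, not this repo's exact modified vocabulary:

```python
# Schematic REMI-style token sequence for one bar (illustrative only).
events = [
    "Bar_None",         # bar line
    "Position_1/16",    # metrical position on a 16th-note grid
    "Chord_C_maj",      # chord annotation
    "Tempo_110",        # tempo event
    "Velocity_20",      # quantized dynamics bin
    "Note-On_60",       # MIDI pitch 60 (middle C)
    "Duration_8",       # note length in grid units
    "Position_9/16",
    "Velocity_18",
    "Note-On_64",
    "Duration_4",
]

# Models consume integer ids, so a vocabulary maps each event to an id.
vocab = {tok: i for i, tok in enumerate(sorted(set(events)))}
token_ids = [vocab[tok] for tok in events]
print(token_ids)
```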
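
The two-stage flow itself reduces to ordinary autoregressive sampling in which Stage 1's lead-sheet tokens seed the Stage 2 prompt. The sketch below is schematic: the model arguments stand in for the companion repo's actual classes, and the repo's real conditioning scheme arranges lead-sheet and performance tokens as described in the paper rather than by this naive prompt-then-continue loop:

```python
import torch

@torch.no_grad()
def sample(model, prompt_ids, max_len, temperature=1.1):
    """Generic autoregressive sampling loop (schematic).
    Assumes model(ids) returns logits of shape [batch, seq, vocab]."""
    ids = list(prompt_ids)
    while len(ids) < max_len:
        logits = model(torch.tensor([ids]))[0, -1]
        probs = torch.softmax(logits / temperature, dim=-1)
        ids.append(torch.multinomial(probs, 1).item())
    return ids

def two_stage_generate(compose_model, embellish_model, bos_id=0):
    # Stage 1 ("Compose"): a lead sheet from scratch, up to the
    # Stage 1 training length.
    lead_sheet = sample(compose_model, [bos_id], max_len=2400)
    # Stage 2 ("Embellish"): the lead sheet seeds the prompt; sampling
    # continues with performance events up to the Stage 2 length.
    return sample(embellish_model, lead_sheet, max_len=3072)
```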
|
|
|
## BibTeX
|
If you find the materials useful, please consider citing our work: |
|
```bibtex
|
@inproceedings{wu2023compembellish, |
|
title={{Compose \& Embellish}: Well-Structured Piano Performance Generation via A Two-Stage Approach}, |
|
author={Wu, Shih-Lun and Yang, Yi-Hsuan}, |
|
booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)}, |
|
year={2023}, |
|
url={https://arxiv.org/pdf/2209.08212.pdf} |
|
} |
|
``` |