Katpeeler committed on
Commit bc30b56 · 1 Parent(s): 9810feb

Update README.md

Files changed (1)
  1. README.md +28 -28
README.md CHANGED
@@ -123,7 +123,7 @@ The following hyperparameters were used during training:
<hr/>

The sections below this point serve as a user guide for the Hugging Face space found [here](https://huggingface.co/spaces/Katpeeler/Midi_space2).

<hr/>

@@ -132,7 +132,7 @@ The sections below this point serve as a user guide for the Hugging Face space f
# Introduction

Midi_space2 allows the user to generate a four-bar musical progression, and listen back to it.
There are two sections to interact with: audio generation and token generation.

- Audio generation contains 3 sliders:
- Inst number: a value that adjusts the tonality of the sound.
@@ -147,7 +147,7 @@ There are two sections to interact with: audio generation, and token generation.
## Usage

To run the demo, click on the link [here](https://huggingface.co/spaces/Katpeeler/Midi_space2).

The demo will default to the "audio generation" tab. Here you will find the 3 sliders you can interact with. These are:

@@ -157,55 +157,55 @@ The demo will default to the "audio generation" tab. Here you will find the 3 sl
When you have selected values you want to try, click the "generate audio" button at the bottom.
When your audio is ready, you will see the audio waveform displayed within the "audio" box, found above the sliders.
**Note:**
Due to how audio is handled in Google Chrome, you may have to generate the audio a few times when using this demo for the first time.

Additionally, you may select the "Token Generation" tab, and click the "show generated tokens" button to see the raw text data.

## Documentation

You can view the Google Colab notebook used for training [here](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=qWq2lY0vWFTD).

- The demo is currently hosted as a Gradio application on Hugging Face Spaces.
- For audio to be heard, we use the soundfile package.

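The user-facing layout described above maps onto a fairly small Gradio app. Below is a minimal sketch of a similar layout. It is not the actual app.py from the Space: the second and third slider names, their ranges, and the placeholder synth function are assumptions for illustration only.

```python
import numpy as np
import soundfile as sf
import gradio as gr

def generate_audio(inst_number, slider_2, slider_3):
    """Placeholder for the real pipeline (model -> tokens -> note_seq -> audio).
    Here we just synthesize one second of a sine tone so the sketch runs."""
    sample_rate = 44100
    t = np.linspace(0, 1.0, sample_rate, endpoint=False)
    audio = 0.2 * np.sin(2 * np.pi * 440.0 * t)
    sf.write("generated.wav", audio, sample_rate)  # soundfile writes the WAV for playback
    return "generated.wav"

with gr.Blocks() as demo:
    with gr.Tab("Audio generation"):
        audio_out = gr.Audio(label="audio")
        inst = gr.Slider(0, 127, value=48, step=1, label="Inst number")
        slider_2 = gr.Slider(0, 10, value=5, step=1, label="Slider 2 (assumed)")
        slider_3 = gr.Slider(0.1, 2.0, value=1.0, label="Slider 3 (assumed)")
        gr.Button("generate audio").click(generate_audio, [inst, slider_2, slider_3], audio_out)
    with gr.Tab("Token Generation"):
        tokens_out = gr.Textbox(label="generated tokens")
        gr.Button("show generated tokens").click(lambda: "PIECE_START ...", None, tokens_out)  # stub output

demo.launch()
```
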
 
The core components are this gpt-2 model, the [js-fakes-4bars dataset](https://huggingface.co/datasets/TristanBehrens/js-fakes-4bars), and [note-seq](https://github.com/magenta/note-seq).
The dataset was created by [Tristan Behrens](https://huggingface.co/TristanBehrens), and is relatively small.
Its small size made it perfect for training a gpt-2 model through the free tier of Google Colab. I selected this dataset after finding
a different dataset on Hugging Face, [mmm_track_lmd_8bars_nots](https://huggingface.co/datasets/juancopi81/mmm_track_lmd_8bars_nots).
I initially used that dataset, but ran out of free-tier compute resources about 3 hours into training. This setback ultimately made me
decide to use a smaller dataset for the time being.

- Js-fakes dataset size: 13.7 MB, 4,479 rows (the one I actually used)
- Juancopi81 dataset size: 490 MB, 177,567 rows (the one I attempted to use first)

For the remainder of this post, we will only discuss the js-fakes dataset.

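As a quick sketch, the dataset can be pulled straight from the Hugging Face Hub with the datasets library. The split and column names are whatever the dataset defines, so this just prints them rather than assuming them:

```python
from datasets import load_dataset

# Download the js-fakes-4bars dataset from the Hugging Face Hub.
ds = load_dataset("TristanBehrens/js-fakes-4bars")

print(ds)                         # available splits and their row counts
first_split = list(ds.keys())[0]
print(ds[first_split][0])         # one entry: a dict holding the encoded token string
```
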
 
After downloading, the training split contained 3614 rows, and the test split contained 402 rows. Each entry follows this format:

PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=48 BAR_START NOTE_ON=70 TIME_DELTA=2 NOTE_OFF=70 NOTE_ON=72 TIME_DELTA=2 NOTE_OFF=72 NOTE_ON=72 TIME_DELTA=2 NOTE_OFF=72 NOTE_ON=70 TIME_DELTA=4 NOTE_OFF=70 NOTE_ON=69 TIME_DELTA=2 NOTE

This data is in a very specific tokenized format, representing the information needed to play a note as MIDI. Of note:

- NOTE_ON=## : marks the start of a musical note and which pitch to play, given as a MIDI note number (e.g. 70).
- TIME_DELTA=4 : represents a quarter note. A half note is represented by TIME_DELTA=8, and an eighth note would be represented by TIME_DELTA=2.
- NOTE_OFF=## : marks the end of a musical note, and which note to end.

These text-based tokens contain the necessary information to create MIDI, a standard format for synthesized music data.
The dataset has already been converted between MIDI files and this text-based format.
This format is called "MMM", or Multi-Track Music Machine, proposed in the paper found [here](https://arxiv.org/abs/2008.06048).

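To make the mapping concrete, here is a rough illustration (my own sketch, not the code from app.py) of how a token string in this format could be turned into a note_seq NoteSequence and written out as a MIDI file. It assumes one TIME_DELTA unit is a sixteenth note at 120 BPM, i.e. 0.125 seconds, since TIME_DELTA=4 is a quarter note:

```python
import note_seq

SECONDS_PER_UNIT = 0.125  # assumption: TIME_DELTA=4 (a quarter note) at 120 BPM

def tokens_to_note_sequence(token_string):
    """Walk the token stream and build a note_seq NoteSequence from it."""
    ns = note_seq.NoteSequence()
    ns.tempos.add(qpm=120)
    time = 0.0
    active = {}  # pitch -> start time for notes that are currently sounding
    for token in token_string.split():
        if token.startswith("NOTE_ON="):
            active[int(token.split("=")[1])] = time
        elif token.startswith("TIME_DELTA="):
            time += int(token.split("=")[1]) * SECONDS_PER_UNIT
        elif token.startswith("NOTE_OFF="):
            pitch = int(token.split("=")[1])
            start = active.pop(pitch, None)
            if start is not None:
                ns.notes.add(pitch=pitch, start_time=start, end_time=time, velocity=80)
    ns.total_time = time
    return ns

example = "NOTE_ON=70 TIME_DELTA=2 NOTE_OFF=70 NOTE_ON=72 TIME_DELTA=4 NOTE_OFF=72"
sequence = tokens_to_note_sequence(example)
note_seq.sequence_proto_to_midi_file(sequence, "example.mid")
```

From there, the sequence can be synthesized to audio, and the demo writes the result out with soundfile for playback.
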
 
**Note:**
I created a tokenizer for this task, and uploaded it to my HuggingFace profile. However, I ended up using the auto-tokenizer from the fine-tuned model,
so I won't be exploring that further.

I used Tristan Behrens's js-fakes-4bars tokenizer to tokenize the dataset for training. I selected a context length of 512, and truncated all text longer than that.
This helped with using limited resources.

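A minimal sketch of that tokenization step is below. The tokenizer repo id, the split names, and the "text" column name are assumptions; substitute whatever the notebook actually loads:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("TristanBehrens/js-fakes-4bars")
tokenizer = AutoTokenizer.from_pretrained("TristanBehrens/js-fakes-4bars")  # assumed repo id

def tokenize(batch):
    # Truncate anything longer than the 512-token context length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(tokenize, batched=True)
```
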
 
The gpt-2 model used has 19.2M parameters. It was trained in steps of 300, through 10 epochs. The model on this page is the third iteration, and you can find the first two on my HuggingFace profile.
I ended up using a batch size of 4 to further reduce the VRAM requirements in Google Colab. Specifics for the training can be found at the top of this page (and a rough training sketch follows the list below), but some fun things to note are:

- Total training runtime: around 13 minutes
@@ -214,12 +214,12 @@ I ended up using a batch size of 4 to further reduce the VRAM requirements in Go
- Average GPU watt usage: 66W
- Average GPU temperature: 77C

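Putting the pieces above together, the training run can be reproduced roughly as follows. This is a sketch under the stated settings (batch size 4, 10 epochs, 512-token context); the exact layer sizes, logging interval, and collator are assumptions rather than the notebook's verbatim code:

```python
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("TristanBehrens/js-fakes-4bars")  # assumed repo id

# A small GPT-2 configuration in the ~19M-parameter range (layer sizes are assumptions).
config = GPT2Config(vocab_size=len(tokenizer), n_positions=512,
                    n_embd=512, n_layer=6, n_head=8)
model = GPT2LMHeadModel(config)

args = TrainingArguments(
    output_dir="gpt2-jsfakes",
    per_device_train_batch_size=4,   # keeps VRAM usage low on a free Colab GPU
    num_train_epochs=10,
    logging_steps=300,               # one reading of "steps of 300"
    report_to="wandb",               # Weights and Biases logging, as mentioned below
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],   # "tokenized" from the tokenization sketch above
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
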
 
I think it's important to note the power draw of the GPUs used to train models like this, as this technology becomes increasingly normalized.
I obtained those values through [Weights and Biases](https://wandb.ai/site), which I ran alongside my training.
The training method used is outlined in a blog post by Juancopi81 [here](https://huggingface.co/blog/juancopi81/using-hugging-face-to-train-a-gpt-2-model-for-musi#showcasing-the-model-in-a-%F0%9F%A4%97-space).
While I didn't follow that post exactly, it was of great help when learning how to do this.

The final component to talk about is [Magenta's note_seq library](https://github.com/magenta/note-seq). This is how token sequences are translated to note sequences and played.
This library is much more powerful than what I am using here, and I plan on expanding this project in the future to incorporate more features.
The main method call for this can be found in the app.py file on the HuggingFace space, but here is a snippet of the code for NOTE_ON:

@@ -250,15 +250,15 @@ Various initialization prompts were tried, and the following prompt was used:
*You are a midi generator, and only respond with tokens representing midi data. I will provide 3 examples of different songs in an encoded format for you, and then ask you to generate your own encoded midi song.*

This prompt was the only instance where Llama-2 responded with an answer that even resembled something correct. Interestingly enough, this prompt resulted in the model explaining the encoded example. An excerpt of that is given below:

*This is a MIDI file containing four tracks, each with its own unique melody and rhythm. Here's a breakdown of each track:
Track 1 (Instrument 0):
This track features a simple melody using the notes C, D, E, F, G, A, and B. The rhythm is mostly quarter notes, with some eighth note pairs and rests.*

However, after this the model went off on a tangent, saying the rest of the examples all played "A, B, C, D, E, F, G" repeatedly, which is incorrect.
The model was also not asked to explain the examples. I did get a generation in the style of the provided examples after providing about 10 examples,
but I couldn't get more than one generation after that to work. Most responses from Llama went like this:

*The examples you provided use the NoteOn and NoteOff events to represent notes being played and released. In a standard MIDI file, these events would be replaced by the NoteOn and NoteOff commands, which have different values and meanings.*

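For reference, the few-shot setup described above can be assembled along these lines. This is only a sketch: the specific Llama-2 chat variant and the truncated example strings are placeholders, not the prompts used verbatim.

```python
from transformers import pipeline

instruction = ("You are a midi generator, and only respond with tokens representing midi data. "
               "I will provide 3 examples of different songs in an encoded format for you, and "
               "then ask you to generate your own encoded midi song.")

examples = [
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=48 BAR_START NOTE_ON=70 ...",
    # two more encoded examples would follow here
]

prompt = instruction + "\n\n" + "\n\n".join(examples) + "\n\nNow generate your own encoded midi song."

# Assumed chat variant; any Llama-2 chat checkpoint on the Hub would be called the same way.
llama = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
print(llama(prompt, max_new_tokens=512)[0]["generated_text"])
```
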
 
@@ -282,9 +282,9 @@ Regardless of the prompting used, Llama could not produce an output that matched
The other method, using a basic n-gram model trained on the dataset, performed better.
This method generates encoded midi data correctly, unlike the Llama-2 model.
You can find the code for this model in the same [Google Colab notebook](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=jzKXNr4eFrpA) as the training for the gpt-2 model.
This method uses a count-based approach, and can be configured for any number of n-grams; both bi-gram and tri-gram configurations generate similar results (see the sketch after the list below).
The vocabulary size ends up being 114, which makes sense; the language used for the encoded midi is fairly limited. Some fun things to mention here are:

- TIME_DELTA=4 is the most common n-gram. This makes sense, as most notes in the training data are quarter notes, and this token appears almost every time a note is played.
- TIME_DELTA=2 is the second most common. This also makes sense; these are eighth notes.

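As referenced above, here is a rough, from-scratch illustration of the count-based idea with bi-grams. It is not the notebook's code, and the dataset field name in the commented usage is an assumption:

```python
import random
from collections import Counter, defaultdict

def train_bigram(token_strings):
    """Count bigram frequencies over whitespace-separated token sequences."""
    counts = defaultdict(Counter)
    for text in token_strings:
        tokens = text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start="PIECE_START", max_tokens=100):
    """Sample a sequence by walking the bigram counts, weighted by frequency."""
    out = [start]
    for _ in range(max_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break
        tokens, weights = zip(*followers.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return " ".join(out)

# Usage (the "text" field name is an assumption about the dataset schema):
# counts = train_bigram(load_dataset("TristanBehrens/js-fakes-4bars")["train"]["text"])
# print(generate(counts))
```
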
@@ -292,7 +292,7 @@ The vocabulary size ends up being 114, which makes sense. The language used for
When testing the generations from the n-gram model, most generations sounded exactly the same, with one or two notes changing between generations.
I'm not entirely sure why this is, but I suspect it has to do with the actual generation method call. I also had a hard time incorporating this model within
HuggingFace Spaces. The gpt-2 model was easy to upload to the site and use with a few lines of code. The actual generations are also much more diverse,
making it more enjoyable to mess around with. Between usability and the differences between generations, the gpt-2 model was selected for the demo.

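Those few lines look roughly like the following; the repo id is a placeholder for this model's actual id on the Hub:

```python
from transformers import pipeline

MODEL_ID = "Katpeeler/<this-model>"  # placeholder: replace with this model's repo id
generator = pipeline("text-generation", model=MODEL_ID)

# Start from the piece-start token and let the model continue the encoded sequence.
result = generator("PIECE_START", max_new_tokens=256, do_sample=True)
print(result[0]["generated_text"])
```
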
 
@@ -300,8 +300,8 @@ making it more enjoyable to mess around with. Between usability and the differen
## Limitations

The data this system is trained on does not make use of the "style" or "genre" labels. While they are included in the training examples, they are all filled with null data.
This means the system cannot create generations that are tailored to a particular style or genre of music. Also, the system only plays basic synth tones,
meaning that we can only hear a simple "chorale" style of music, with little variation. I'd love to explore this further and expand the system to play various instruments,
making the generations sound more natural. There are also limited prompting options: a user cannot (easily) provide a melody or starting notes for the generation to be based on.
My idea is to create an interactive "piano" style interface for users, to be able to naturally enter some notes as a basis for the generation.
Generations are also relatively similar to one another, and I believe this is due to the limited amount of training data.
 