 
**Model type:** MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive, transformer-based language model for music modeling. The model comes in different sizes: 300M, 1.5B and 3.3B parameters; and two variants: a model trained for the text-to-music generation task and a model trained for melody-guided music generation.

**Paper or resources for more information:** More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284).

**Citation details:**
```
@misc{copet2023simple,
      title={Simple and Controllable Music Generation},
      year={2023},
      eprint={2306.05284},
      archivePrefix={arXiv},
}
```

**License:** Code is released under MIT, model weights are released under CC-BY-NC 4.0.

**Where to send questions or comments about the model:** Questions and comments about MusicGen can be sent via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue.
 
## Intended use

**Primary intended use:** The primary use of MusicGen is research on AI-based music generation, including:

- Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science

**Primary intended users:** The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.

**Out-of-scope use cases:** The model should not be used in downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive, or content that propagates historical or current stereotypes.

## Metrics
## Training datasets

The model was trained on licensed data using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), the [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.

## Evaluation results

Below are the objective metrics obtained on MusicCaps with the released model. Note that for the publicly released models, we had all the datasets go through a state-of-the-art music source separation method, namely the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs), in order to keep only the instrumental part. This explains the difference in objective metrics with the models used in the paper.

| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|---|---|---|---|---|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| **facebook/musicgen-medium** | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |

More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284), in the Results section.
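The chroma cosine similarity column compares pitch-class content of the generated audio against the reference melody. A minimal NumPy sketch of such a frame-wise metric follows; the function name and toy chromagrams are illustrative only, not the evaluation code used for the paper:

```python
import numpy as np

def chroma_cosine_similarity(chroma_ref: np.ndarray, chroma_gen: np.ndarray) -> float:
    """Mean frame-wise cosine similarity between two chromagrams.

    Each input has shape (num_frames, 12): one energy value per pitch class
    (C, C#, ..., B) for every analysis frame.
    """
    eps = 1e-8  # guard against division by zero on silent frames
    dot = np.sum(chroma_ref * chroma_gen, axis=1)
    norms = np.linalg.norm(chroma_ref, axis=1) * np.linalg.norm(chroma_gen, axis=1)
    return float(np.mean(dot / (norms + eps)))

rng = np.random.default_rng(0)
ref = rng.random((100, 12))
print(chroma_cosine_similarity(ref, ref))  # identical chromagrams -> ~1.0

a = np.zeros((4, 12)); a[:, 0] = 1.0  # all energy on pitch class C
b = np.zeros((4, 12)); b[:, 7] = 1.0  # all energy on pitch class G
print(chroma_cosine_similarity(a, b))  # disjoint pitch classes -> 0.0
```

A score near 1 means the generated audio follows the reference melody's pitch-class profile frame by frame; uncorrelated content scores close to 0.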
## Limitations and biases

**Data:** The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data; we believe that scaling the model on larger datasets can further improve its performance.

**Mitigations:** Vocals have been removed from the data source, first using the corresponding tags, and then using a state-of-the-art music source separation method, namely the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs).
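The HT-Demucs step above can be reproduced at small scale with the `demucs` command-line tool (`pip install demucs`). This sketch only builds the command rather than running it, and the input file name is hypothetical; the actual preprocessing pipeline is not public:

```python
import shlex

# Hypothetical input file, standing in for a track from the training sources.
track = "example_track.wav"

# `--two-stems=vocals` makes HT-Demucs output a "vocals" stem and a
# "no_vocals" (instrumental) stem; only the instrumental part is kept.
cmd = ["demucs", "--two-stems=vocals", "-o", "separated", track]
print(shlex.join(cmd))  # demucs --two-stems=vocals -o separated example_track.wav
```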
  **Limitations:**