jamesdon reach-vb HF staff committed on
Commit
0321c8f
0 Parent(s):

Duplicate from reach-vb/musicgen-small-endpoint

Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,34 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tflite filter=lfs diff=lfs merge=lfs -text
29
+ *.tgz filter=lfs diff=lfs merge=lfs -text
30
+ *.wasm filter=lfs diff=lfs merge=lfs -text
31
+ *.xz filter=lfs diff=lfs merge=lfs -text
32
+ *.zip filter=lfs diff=lfs merge=lfs -text
33
+ *.zst filter=lfs diff=lfs merge=lfs -text
34
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,212 @@
1
+ ---
2
+ inference: false
3
+ tags:
4
+ - musicgen
5
+ license: cc-by-nc-4.0
6
+ duplicated_from: facebook/musicgen-small
7
+ ---
8
+
9
+ # MusicGen - Small - 300M
10
+
11
+ MusicGen is a text-to-music model capable of generating high-quality music samples conditioned on text descriptions or audio prompts.
12
+ It is a single-stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.
13
+ Unlike existing methods, like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass.
14
+ By introducing a small delay between the codebooks, we show we can predict them in parallel, thus having only 50 auto-regressive steps per second of audio.
15
+
16
+ MusicGen was published in [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by *Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez*.
17
+
18
+ Four checkpoints are released:
19
+ - [**small** (this checkpoint)](https://huggingface.co/facebook/musicgen-small)
20
+ - [medium](https://huggingface.co/facebook/musicgen-medium)
21
+ - [large](https://huggingface.co/facebook/musicgen-large)
22
+ - [melody](https://huggingface.co/facebook/musicgen-melody)
23
+
24
+ ## Example
25
+
26
+ Try out MusicGen yourself!
27
+
28
+ * Audiocraft Colab:
29
+
30
+ <a target="_blank" href="https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing">
31
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
32
+ </a>
33
+
34
+ * Hugging Face Colab:
35
+
36
+ <a target="_blank" href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/MusicGen.ipynb">
37
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
38
+ </a>
39
+
40
+ * Hugging Face Demo:
41
+
42
+ <a target="_blank" href="https://huggingface.co/spaces/facebook/MusicGen">
43
+ <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
44
+ </a>
45
+
46
+ ## 🤗 Transformers Usage
47
+
48
+ You can run MusicGen locally with the 🤗 Transformers library from version 4.31.0 onwards.
49
+
50
+ 1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main:
51
+
52
+ ```
53
+ pip install git+https://github.com/huggingface/transformers.git
54
+ ```
55
+
56
+ 2. Run the following Python code to generate text-conditional audio samples:
57
+
58
+ ```py
59
+ from transformers import AutoProcessor, MusicgenForConditionalGeneration
60
+
61
+
62
+ processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
63
+ model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
64
+
65
+ inputs = processor(
66
+ text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
67
+ padding=True,
68
+ return_tensors="pt",
69
+ )
70
+
71
+ audio_values = model.generate(**inputs, max_new_tokens=256)
72
+ ```
73
+
74
+ 3. Listen to the audio samples either in an ipynb notebook:
75
+
76
+ ```py
77
+ from IPython.display import Audio
78
+
79
+ sampling_rate = model.config.audio_encoder.sampling_rate
80
+ Audio(audio_values[0].numpy(), rate=sampling_rate)
81
+ ```
82
+
83
+ Or save them as a `.wav` file using a third-party library, e.g. `scipy`:
84
+
85
+ ```py
86
+ import scipy.io.wavfile
87
+
88
+ sampling_rate = model.config.audio_encoder.sampling_rate
89
+ scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
90
+ ```
91
+
92
+ For more details on using the MusicGen model for inference using the 🤗 Transformers library, refer to the [MusicGen docs](https://huggingface.co/docs/transformers/model_doc/musicgen).
93
+
94
+ ## Audiocraft Usage
95
+
96
+ You can also run MusicGen locally through the original [Audiocraft library](https://github.com/facebookresearch/audiocraft):
97
+
98
+ 1. First install the [`audiocraft` library](https://github.com/facebookresearch/audiocraft)
99
+ ```
100
+ pip install git+https://github.com/facebookresearch/audiocraft.git
101
+ ```
102
+
103
+ 2. Make sure to have [`ffmpeg`](https://ffmpeg.org/download.html) installed:
104
+ ```
105
+ apt-get install ffmpeg
106
+ ```
107
+
108
+ 3. Run the following Python code:
109
+
110
+ ```py
111
+ from audiocraft.models import MusicGen
112
+ from audiocraft.data.audio import audio_write
113
+
114
+ model = MusicGen.get_pretrained("small")
115
+ model.set_generation_params(duration=8) # generate 8 seconds.
116
+
117
+ descriptions = ["happy rock", "energetic EDM"]
118
+
119
+ wav = model.generate(descriptions) # generates 2 samples.
120
+
121
+ for idx, one_wav in enumerate(wav):
122
+     # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
123
+     audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
124
+ ```
125
+
126
+ ## Model details
127
+
128
+ **Organization developing the model:** The FAIR team of Meta AI.
129
+
130
+ **Model date:** MusicGen was trained between April 2023 and May 2023.
131
+
132
+ **Model version:** This is version 1 of the model.
133
+
134
+ **Model type:** MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. The model comes in different sizes (300M, 1.5B and 3.3B parameters) and two variants: a model trained for the text-to-music generation task and a model trained for melody-guided music generation.
135
+
136
+ **Paper or resources for more information:** More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284).
137
+
138
+ **Citation details**:
139
+ ```
140
+ @misc{copet2023simple,
141
+ title={Simple and Controllable Music Generation},
142
+ author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
143
+ year={2023},
144
+ eprint={2306.05284},
145
+ archivePrefix={arXiv},
146
+ primaryClass={cs.SD}
147
+ }
148
+ ```
149
+
150
+ **License:** Code is released under MIT; model weights are released under CC-BY-NC 4.0.
151
+
152
+ **Where to send questions or comments about the model:** Questions and comments about MusicGen can be sent via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue.
153
+
154
+ ## Intended use
155
+ **Primary intended use:** The primary use of MusicGen is research on AI-based music generation, including:
156
+
157
+ - Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science
158
+ - Generation of music guided by text or melody to understand current abilities of generative AI models by machine learning amateurs
159
+
160
+ **Primary intended users:** The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.
161
+
162
+ **Out-of-scope use cases:** The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
163
+
164
+ ## Metrics
165
+
166
+ **Model performance measures:** We used the following objective measures to evaluate the model on a standard music benchmark:
167
+
168
+ - Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish)
169
+ - Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST)
170
+ - CLAP Score between audio embedding and text embedding extracted from a pre-trained CLAP model
171
+
172
+ Additionally, we ran qualitative studies with human participants, evaluating the performance of the model along the following axes:
173
+
174
+ - Overall quality of the music samples;
175
+ - Text relevance to the provided text input;
176
+ - Adherence to the melody for melody-guided music generation.
177
+
178
+ More details on performance measures and human studies can be found in the paper.
179
+
180
+ **Decision thresholds:** Not applicable.
181
+
182
+ ## Evaluation datasets
183
+
184
+ The model was evaluated on the [MusicCaps benchmark](https://www.kaggle.com/datasets/googleai/musiccaps) and on an in-domain held-out evaluation set, with no artist overlap with the training set.
185
+
186
+ ## Training datasets
187
+
188
+ The model was trained using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.
189
+
190
+ ## Quantitative analysis
191
+
192
+ More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284), in the Experimental Setup section.
193
+
194
+ ## Limitations and biases
195
+
196
+ **Data:** The data sources used to train the model are created by music professionals and covered by legal agreements with the rights holders. The model is trained on 20K hours of data; we believe that scaling the model on larger datasets can further improve its performance.
197
+
198
+ **Mitigations:** All vocals have been removed from the data source using a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs). The model is therefore not able to produce vocals.
199
+
200
+ **Limitations:**
201
+
202
+ - The model is not able to generate realistic vocals.
203
+ - The model has been trained with English descriptions and will not perform as well in other languages.
204
+ - The model does not perform equally well for all music styles and cultures.
205
+ - The model sometimes generates the end of a song, collapsing to silence.
206
+ - It is sometimes difficult to assess what types of text descriptions provide the best generations. Prompt engineering may be required to obtain satisfying results.
207
+
208
+ **Biases:** The source of data is potentially lacking diversity, and not all music cultures are equally represented in the dataset. The model may not perform equally well on the wide variety of music genres that exist. The generated samples from the model will reflect the biases from the training data. Further work on this model should include methods for balanced and just representations of cultures, for example, by scaling the training data to be both diverse and inclusive.
209
+
210
+ **Risks and harms:** Biases and limitations of the model may lead to generation of samples that may be considered as biased, inappropriate or offensive. We believe that providing the code to reproduce the research and train new models will make it possible to broaden the application to new and more representative data.
211
+
212
+ **Use cases:** Users must be aware of the biases, limitations and risks of the model. MusicGen is a model developed for artificial intelligence research on controllable music generation. As such, it should not be used for downstream applications without further investigation and mitigation of risks.
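Note on the README above: the model card states that a small delay between the four codebooks lets them be predicted in parallel. The sketch below illustrates one such delay pattern on toy data. It assumes codebook `k` is shifted right by `k` steps and that gaps are filled with the pad id 2048 from the decoder config in this commit. The function names are hypothetical, and this is not the Transformers or Audiocraft implementation.

```python
from typing import List

PAD_ID = 2048  # pad_token_id of the decoder config in this commit


def apply_delay_pattern(codes: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Shift codebook k right by k steps and pad the gaps (illustrative only)."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        left = [pad_id] * k
        right = [pad_id] * (total_steps - k - num_frames)
        delayed.append(left + list(row) + right)
    return delayed


def undo_delay_pattern(delayed: List[List[int]], num_frames: int) -> List[List[int]]:
    """Recover the aligned codes by dropping the per-codebook shift."""
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Toy input: 4 codebooks, 6 frames; values encode (codebook, frame) as k * 10 + t.
codes = [[k * 10 + t for t in range(6)] for k in range(4)]
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed, num_frames=6)
```

With 6 frames and 4 codebooks the delayed grid has 6 + 4 - 1 = 9 steps, so the overhead compared with one step per frame is `num_codebooks - 1` steps.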
compression_state_dict.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:37d256b525d4117f8bbf790ab448a8e9f4746cd401900fe1ae72e154b1513a30
3
+ size 236001935
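Note on the pointer above: the three added lines are a Git LFS pointer, not the binary itself. As a minimal sketch, assuming the pointer consists only of space-separated `version`, `oid` and `size` lines as shown in this diff, it can be parsed as follows. `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, object]:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from the compression_state_dict.bin diff.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:37d256b525d4117f8bbf790ab448a8e9f4746cd401900fe1ae72e154b1513a30
size 236001935
"""

pointer = parse_lfs_pointer(POINTER)
size_mib = pointer["size_bytes"] / 2**20  # size of the real file in MiB
```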
config.json ADDED
@@ -0,0 +1,298 @@
1
+ {
2
+ "_commit_hash": null,
3
+ "architectures": [
4
+ "MusicgenForConditionalGeneration"
5
+ ],
6
+ "audio_encoder": {
7
+ "_name_or_path": "facebook/encodec_32khz",
8
+ "add_cross_attention": false,
9
+ "architectures": [
10
+ "EncodecModel"
11
+ ],
12
+ "audio_channels": 1,
13
+ "bad_words_ids": null,
14
+ "begin_suppress_tokens": null,
15
+ "bos_token_id": null,
16
+ "chunk_length_s": null,
17
+ "chunk_size_feed_forward": 0,
18
+ "codebook_dim": 128,
19
+ "codebook_size": 2048,
20
+ "compress": 2,
21
+ "cross_attention_hidden_size": null,
22
+ "decoder_start_token_id": null,
23
+ "dilation_growth_rate": 2,
24
+ "diversity_penalty": 0.0,
25
+ "do_sample": false,
26
+ "early_stopping": false,
27
+ "encoder_no_repeat_ngram_size": 0,
28
+ "eos_token_id": null,
29
+ "exponential_decay_length_penalty": null,
30
+ "finetuning_task": null,
31
+ "forced_bos_token_id": null,
32
+ "forced_eos_token_id": null,
33
+ "hidden_size": 128,
34
+ "id2label": {
35
+ "0": "LABEL_0",
36
+ "1": "LABEL_1"
37
+ },
38
+ "is_decoder": false,
39
+ "is_encoder_decoder": false,
40
+ "kernel_size": 7,
41
+ "label2id": {
42
+ "LABEL_0": 0,
43
+ "LABEL_1": 1
44
+ },
45
+ "last_kernel_size": 7,
46
+ "length_penalty": 1.0,
47
+ "max_length": 20,
48
+ "min_length": 0,
49
+ "model_type": "encodec",
50
+ "no_repeat_ngram_size": 0,
51
+ "norm_type": "weight_norm",
52
+ "normalize": false,
53
+ "num_beam_groups": 1,
54
+ "num_beams": 1,
55
+ "num_filters": 64,
56
+ "num_lstm_layers": 2,
57
+ "num_residual_layers": 1,
58
+ "num_return_sequences": 1,
59
+ "output_attentions": false,
60
+ "output_hidden_states": false,
61
+ "output_scores": false,
62
+ "overlap": null,
63
+ "pad_mode": "reflect",
64
+ "pad_token_id": null,
65
+ "prefix": null,
66
+ "problem_type": null,
67
+ "pruned_heads": {},
68
+ "remove_invalid_values": false,
69
+ "repetition_penalty": 1.0,
70
+ "residual_kernel_size": 3,
71
+ "return_dict": true,
72
+ "return_dict_in_generate": false,
73
+ "sampling_rate": 32000,
74
+ "sep_token_id": null,
75
+ "suppress_tokens": null,
76
+ "target_bandwidths": [
77
+ 2.2
78
+ ],
79
+ "task_specific_params": null,
80
+ "temperature": 1.0,
81
+ "tf_legacy_loss": false,
82
+ "tie_encoder_decoder": false,
83
+ "tie_word_embeddings": true,
84
+ "tokenizer_class": null,
85
+ "top_k": 50,
86
+ "top_p": 1.0,
87
+ "torch_dtype": "float32",
88
+ "torchscript": false,
89
+ "transformers_version": "4.31.0.dev0",
90
+ "trim_right_ratio": 1.0,
91
+ "typical_p": 1.0,
92
+ "upsampling_ratios": [
93
+ 8,
94
+ 5,
95
+ 4,
96
+ 4
97
+ ],
98
+ "use_bfloat16": false,
99
+ "use_causal_conv": false,
100
+ "use_conv_shortcut": false
101
+ },
102
+ "decoder": {
103
+ "_name_or_path": "",
104
+ "activation_dropout": 0.0,
105
+ "activation_function": "gelu",
106
+ "add_cross_attention": false,
107
+ "architectures": null,
108
+ "attention_dropout": 0.0,
109
+ "bad_words_ids": null,
110
+ "begin_suppress_tokens": null,
111
+ "bos_token_id": 2048,
112
+ "chunk_size_feed_forward": 0,
113
+ "classifier_dropout": 0.0,
114
+ "cross_attention_hidden_size": null,
115
+ "decoder_start_token_id": null,
116
+ "diversity_penalty": 0.0,
117
+ "do_sample": false,
118
+ "dropout": 0.1,
119
+ "early_stopping": false,
120
+ "encoder_no_repeat_ngram_size": 0,
121
+ "eos_token_id": null,
122
+ "exponential_decay_length_penalty": null,
123
+ "ffn_dim": 4096,
124
+ "finetuning_task": null,
125
+ "forced_bos_token_id": null,
126
+ "forced_eos_token_id": null,
127
+ "hidden_size": 1024,
128
+ "id2label": {
129
+ "0": "LABEL_0",
130
+ "1": "LABEL_1"
131
+ },
132
+ "initializer_factor": 0.02,
133
+ "is_decoder": false,
134
+ "is_encoder_decoder": false,
135
+ "label2id": {
136
+ "LABEL_0": 0,
137
+ "LABEL_1": 1
138
+ },
139
+ "layerdrop": 0.0,
140
+ "length_penalty": 1.0,
141
+ "max_length": 20,
142
+ "max_position_embeddings": 2048,
143
+ "min_length": 0,
144
+ "model_type": "musicgen_decoder",
145
+ "no_repeat_ngram_size": 0,
146
+ "num_attention_heads": 16,
147
+ "num_beam_groups": 1,
148
+ "num_beams": 1,
149
+ "num_codebooks": 4,
150
+ "num_hidden_layers": 24,
151
+ "num_return_sequences": 1,
152
+ "output_attentions": false,
153
+ "output_hidden_states": false,
154
+ "output_scores": false,
155
+ "pad_token_id": 2048,
156
+ "prefix": null,
157
+ "problem_type": null,
158
+ "pruned_heads": {},
159
+ "remove_invalid_values": false,
160
+ "repetition_penalty": 1.0,
161
+ "return_dict": true,
162
+ "return_dict_in_generate": false,
163
+ "scale_embedding": false,
164
+ "sep_token_id": null,
165
+ "suppress_tokens": null,
166
+ "task_specific_params": null,
167
+ "temperature": 1.0,
168
+ "tf_legacy_loss": false,
169
+ "tie_encoder_decoder": false,
170
+ "tie_word_embeddings": false,
171
+ "tokenizer_class": null,
172
+ "top_k": 50,
173
+ "top_p": 1.0,
174
+ "torch_dtype": null,
175
+ "torchscript": false,
176
+ "transformers_version": "4.31.0.dev0",
177
+ "typical_p": 1.0,
178
+ "use_bfloat16": false,
179
+ "use_cache": true,
180
+ "vocab_size": 2048
181
+ },
182
+ "is_encoder_decoder": true,
183
+ "model_type": "musicgen",
184
+ "text_encoder": {
185
+ "_name_or_path": "t5-base",
186
+ "add_cross_attention": false,
187
+ "architectures": [
188
+ "T5ForConditionalGeneration"
189
+ ],
190
+ "bad_words_ids": null,
191
+ "begin_suppress_tokens": null,
192
+ "bos_token_id": null,
193
+ "chunk_size_feed_forward": 0,
194
+ "cross_attention_hidden_size": null,
195
+ "d_ff": 3072,
196
+ "d_kv": 64,
197
+ "d_model": 768,
198
+ "decoder_start_token_id": 0,
199
+ "dense_act_fn": "relu",
200
+ "diversity_penalty": 0.0,
201
+ "do_sample": false,
202
+ "dropout_rate": 0.1,
203
+ "early_stopping": false,
204
+ "encoder_no_repeat_ngram_size": 0,
205
+ "eos_token_id": 1,
206
+ "exponential_decay_length_penalty": null,
207
+ "feed_forward_proj": "relu",
208
+ "finetuning_task": null,
209
+ "forced_bos_token_id": null,
210
+ "forced_eos_token_id": null,
211
+ "id2label": {
212
+ "0": "LABEL_0",
213
+ "1": "LABEL_1"
214
+ },
215
+ "initializer_factor": 1.0,
216
+ "is_decoder": false,
217
+ "is_encoder_decoder": true,
218
+ "is_gated_act": false,
219
+ "label2id": {
220
+ "LABEL_0": 0,
221
+ "LABEL_1": 1
222
+ },
223
+ "layer_norm_epsilon": 1e-06,
224
+ "length_penalty": 1.0,
225
+ "max_length": 20,
226
+ "min_length": 0,
227
+ "model_type": "t5",
228
+ "n_positions": 512,
229
+ "no_repeat_ngram_size": 0,
230
+ "num_beam_groups": 1,
231
+ "num_beams": 1,
232
+ "num_decoder_layers": 12,
233
+ "num_heads": 12,
234
+ "num_layers": 12,
235
+ "num_return_sequences": 1,
236
+ "output_attentions": false,
237
+ "output_hidden_states": false,
238
+ "output_past": true,
239
+ "output_scores": false,
240
+ "pad_token_id": 0,
241
+ "prefix": null,
242
+ "problem_type": null,
243
+ "pruned_heads": {},
244
+ "relative_attention_max_distance": 128,
245
+ "relative_attention_num_buckets": 32,
246
+ "remove_invalid_values": false,
247
+ "repetition_penalty": 1.0,
248
+ "return_dict": true,
249
+ "return_dict_in_generate": false,
250
+ "sep_token_id": null,
251
+ "suppress_tokens": null,
252
+ "task_specific_params": {
253
+ "summarization": {
254
+ "early_stopping": true,
255
+ "length_penalty": 2.0,
256
+ "max_length": 200,
257
+ "min_length": 30,
258
+ "no_repeat_ngram_size": 3,
259
+ "num_beams": 4,
260
+ "prefix": "summarize: "
261
+ },
262
+ "translation_en_to_de": {
263
+ "early_stopping": true,
264
+ "max_length": 300,
265
+ "num_beams": 4,
266
+ "prefix": "translate English to German: "
267
+ },
268
+ "translation_en_to_fr": {
269
+ "early_stopping": true,
270
+ "max_length": 300,
271
+ "num_beams": 4,
272
+ "prefix": "translate English to French: "
273
+ },
274
+ "translation_en_to_ro": {
275
+ "early_stopping": true,
276
+ "max_length": 300,
277
+ "num_beams": 4,
278
+ "prefix": "translate English to Romanian: "
279
+ }
280
+ },
281
+ "temperature": 1.0,
282
+ "tf_legacy_loss": false,
283
+ "tie_encoder_decoder": false,
284
+ "tie_word_embeddings": true,
285
+ "tokenizer_class": null,
286
+ "top_k": 50,
287
+ "top_p": 1.0,
288
+ "torch_dtype": null,
289
+ "torchscript": false,
290
+ "transformers_version": "4.31.0.dev0",
291
+ "typical_p": 1.0,
292
+ "use_bfloat16": false,
293
+ "use_cache": true,
294
+ "vocab_size": 32128
295
+ },
296
+ "torch_dtype": "float32",
297
+ "transformers_version": null
298
+ }
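Note on `config.json` above: the `audio_encoder` values can be cross-checked against the README's claim of 4 codebooks sampled at 50 Hz. The sketch below assumes the frame rate is the sampling rate divided by the product of `upsampling_ratios`, and that each code carries `log2(codebook_size)` bits. The constants are copied from the config; the variable names are illustrative.

```python
import math

# Values copied from the audio_encoder and decoder sections of config.json.
SAMPLING_RATE = 32000
UPSAMPLING_RATIOS = [8, 5, 4, 4]
CODEBOOK_SIZE = 2048
NUM_CODEBOOKS = 4

# Audio samples represented by one frame of codes.
hop_length = math.prod(UPSAMPLING_RATIOS)

# Frames of codes per second of audio.
frame_rate = SAMPLING_RATE / hop_length

# Bits needed to index one entry of a codebook.
bits_per_code = math.log2(CODEBOOK_SIZE)

# Bitrate of the token stream in kbps, to compare with target_bandwidths.
bandwidth_kbps = NUM_CODEBOOKS * bits_per_code * frame_rate / 1000
```

Under these assumptions the frame rate comes out at 50 Hz and the bitrate at 2.2 kbps, which agrees with the `target_bandwidths` entry of `2.2`.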
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 2048,
4
+ "decoder_start_token_id": 2048,
5
+ "do_sample": true,
6
+ "guidance_scale": 3.0,
7
+ "max_length": 1500,
8
+ "pad_token_id": 2048,
9
+ "transformers_version": "4.31.0.dev0"
10
+ }
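Note on `generation_config.json` above: at the 50 auto-regressive steps per second quoted in the README, `max_length` and `max_new_tokens` translate roughly into audio duration. The sketch below shows that arithmetic, plus the usual classifier-free guidance combination that a `guidance_scale` of 3.0 suggests. Both helpers are hypothetical, the durations are approximate (a few steps are consumed by the start token and codebook delay), and the guidance formula is the standard formulation rather than a copy of the library code.

```python
from typing import List

FRAME_RATE_HZ = 50  # auto-regressive steps per second of audio, per the README


def tokens_to_seconds(num_tokens: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Approximate audio duration for a given number of generated steps."""
    return num_tokens / frame_rate_hz


def apply_guidance(cond: List[float], uncond: List[float], guidance_scale: float) -> List[float]:
    """Standard classifier-free guidance: move away from the unconditional logits."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


default_cap_seconds = tokens_to_seconds(1500)    # max_length in this config
readme_example_seconds = tokens_to_seconds(256)  # max_new_tokens in the README
guided = apply_guidance([2.0, 0.0], [1.0, 1.0], guidance_scale=3.0)
```

So the default cap of 1500 corresponds to about 30 seconds, and the README's `max_new_tokens=256` to about 5 seconds.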
handler.py ADDED
@@ -0,0 +1,36 @@
+ from typing import Any, Dict, List
+ from transformers import AutoProcessor, MusicgenForConditionalGeneration
+ import torch
+
+ class EndpointHandler:
+     def __init__(self, path=""):
+         # load model and processor from path
+         self.processor = AutoProcessor.from_pretrained(path)
+         self.model = MusicgenForConditionalGeneration.from_pretrained(path).to("cuda")
+
+     def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
+         """
+         Args:
+             data (:dict:):
+                 The payload with the text prompt and generation parameters.
+         """
+         # process input
+         inputs = data.pop("inputs", data)
+         parameters = data.pop("parameters", None)
+
+         # preprocess
+         inputs = self.processor(
+             text=[inputs],
+             padding=True,
+             return_tensors="pt",).to("cuda")
+
+         # caller-supplied parameters override the default generation length
+         generation_kwargs = {"max_new_tokens": 256}
+         if parameters is not None:
+             generation_kwargs.update(parameters)
+         outputs = self.model.generate(**inputs, **generation_kwargs)
+
+         # postprocess the prediction
+         prediction = outputs[0].cpu().numpy()
+
+         return [{"generated_audio": prediction}]
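Note on `handler.py` above: the handler reads a text prompt from `inputs` and optional generation arguments from `parameters`, and returns the first generated waveform. The sketch below shows a matching request payload and one way to store returned samples as 16-bit PCM with the standard library. It assumes the samples arrive as a flat list of mono floats in [-1, 1], which depends on how the endpoint serializes the array. `build_payload` and `save_wav` are hypothetical helpers, and the prompt is made up.

```python
import io
import struct
import wave
from typing import Any, BinaryIO, Dict, List, Union

SAMPLING_RATE = 32000  # sampling_rate from preprocessor_config.json


def build_payload(prompt: str, **parameters: Any) -> Dict[str, Any]:
    """Build the JSON body the handler expects: a prompt plus optional parameters."""
    payload: Dict[str, Any] = {"inputs": prompt}
    if parameters:
        payload["parameters"] = parameters
    return payload


def save_wav(samples: List[float], target: Union[str, BinaryIO], rate: int = SAMPLING_RATE) -> None:
    """Write mono float samples in [-1, 1] as 16-bit PCM WAV."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm)


payload = build_payload("lo-fi beat with warm piano", max_new_tokens=512)

# Demonstration with four dummy samples written to an in-memory buffer.
buffer = io.BytesIO()
save_wav([0.0, 0.5, -0.5, 1.0], buffer)
```

Passing `max_new_tokens` in `parameters` replaces the handler's default of 256 rather than being sent alongside it.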
preprocessor_config.json ADDED
@@ -0,0 +1,11 @@
1
+ {
2
+ "chunk_length_s": null,
3
+ "feature_extractor_type": "EncodecFeatureExtractor",
4
+ "feature_size": 1,
5
+ "overlap": null,
6
+ "padding_side": "left",
7
+ "padding_value": 0.0,
8
+ "processor_class": "MusicgenProcessor",
9
+ "return_attention_mask": true,
10
+ "sampling_rate": 32000
11
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab9fcfa2ec342adbf3e4e893c70eadcafe83301b61da023b4493e6bdc43690d3
3
+ size 2364555677
requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ transformers==4.31.0
2
+ accelerate>=0.20.3
special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
3
+ size 791656
state_dict.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:830fe5771e2835cbd305f3ffdda22d3ff38718fbbe8274bd19dc203265de7244
3
+ size 840843863
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,112 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "clean_up_tokenization_spaces": true,
105
+ "eos_token": "</s>",
106
+ "extra_ids": 100,
107
+ "model_max_length": 512,
108
+ "pad_token": "<pad>",
109
+ "processor_class": "MusicgenProcessor",
110
+ "tokenizer_class": "T5Tokenizer",
111
+ "unk_token": "<unk>"
112
+ }