m-a-p/MusiLingo-long-v1

@@ -1,199 +1,124 @@
 ---
-library_name: transformers
-tags: []
 ---
 # Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
 ### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+license: cc-by-4.0
+language:
+- en
+tags:
+- music
+- art
 ---
 # Model Card for Model ID
 ## Model Details
 ### Model Description
+The model consists of a music encoder `MERT-v1-330M`, a natural language decoder `vicuna-7b-delta-v0`, and a linear projection layer between the two.
+This checkpoint of MusiLingo was developed on the long-form part of MusicInstruct (MI-long) and can answer long-form instructions about raw music audio, such as queries about the subjective feelings a piece evokes.
+You can use the [MI](https://huggingface.co/datasets/m-a-p/Music-Instruct) dataset to run the demo below.
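As a rough illustration of the architecture described above (separate from the demo), the bridge between encoder and decoder can be sketched as a single linear layer. This is a minimal sketch, not the MusiLingo implementation: the class name `AudioProjection` is hypothetical, and the feature sizes (1024 for the encoder, 4096 for the decoder) are assumptions that do not come from this model card.

```python
import torch
from torch import nn

# Assumed sizes, not stated in the model card: 1024-dim frame features from
# the music encoder and 4096-dim token embeddings in the language decoder.
AUDIO_DIM = 1024
LLM_DIM = 4096


class AudioProjection(nn.Module):  # hypothetical name, for illustration only
    """Single linear layer mapping audio features into the decoder's embedding space."""

    def __init__(self, audio_dim: int = AUDIO_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor):
        # audio_feats has shape (batch, frames, audio_dim)
        audio_embeds = self.proj(audio_feats)
        # Every projected frame is a real position, so the attention mask is all ones
        atts_audio = torch.ones(
            audio_embeds.shape[:-1], dtype=torch.long, device=audio_embeds.device
        )
        return audio_embeds, atts_audio


# Dummy tensor standing in for encoder output: 2 clips, 75 frames each
feats = torch.randn(2, 75, AUDIO_DIM)
audio_embeds, atts_audio = AudioProjection()(feats)
print(audio_embeds.shape, atts_audio.shape)
```

The projected frames can then be concatenated with text token embeddings along the sequence dimension, which is what the `torch.cat([bos_embeds, audio_embeds], dim=1)` call in the demo does.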
 ### Model Sources [optional]
+- **Repository:** [GitHub repo](https://github.com/zihaod/MusiLingo)
+- **Paper [optional]:** __[MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response](https://arxiv.org/abs/2309.08730)__
+<!-- - **Demo [optional]:** [More Information Needed] -->
+## Getting Started
+```
+from tqdm.auto import tqdm
+import torch
+from torch.utils.data import DataLoader
+from transformers import Wav2Vec2FeatureExtractor
+from transformers import StoppingCriteria, StoppingCriteriaList
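+class StoppingCriteriaSub(StoppingCriteria):
+    """Stop generation once the generated ids end with any of the given stop sequences."""
+    def __init__(self, stops=None, encounters=1):
+        super().__init__()
+        # avoid a shared mutable default argument
+        self.stops = stops if stops is not None else []
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
+        for stop in self.stops:
+            # compare the tail of the first sequence in the batch with the stop ids
+            if torch.all((stop == input_ids[0][-len(stop):])).item():
+                return True
+        return False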
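+# `self` is the underlying MusiLingo model; the function is called as answer(model, samples, stopping)
+def answer(self, samples, stopping, max_new_tokens=300, num_beams=1, min_length=1, top_p=0.5,
+        repetition_penalty=1.0, length_penalty=1, temperature=0.1, max_length=2000):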
+    audio = samples["audio"].cuda()
+    audio_embeds, atts_audio = self.encode_audio(audio)
+    if 'instruction_input' in samples:  # instruction dataset
+        #print('Instruction Batch')
+        instruction_prompt = []
+        for instruction in samples['instruction_input']:
+            prompt = '<Audio><AudioHere></Audio> ' + instruction
+            instruction_prompt.append(self.prompt_template.format(prompt))
+        audio_embeds, atts_audio = self.instruction_prompt_wrap(audio_embeds, atts_audio, instruction_prompt)
+    self.llama_tokenizer.padding_side = "right"
+    batch_size = audio_embeds.shape[0]
+    bos = torch.ones([batch_size, 1],
+                    dtype=torch.long,
+                    device=torch.device('cuda')) * self.llama_tokenizer.bos_token_id
+    bos_embeds = self.llama_model.model.embed_tokens(bos)
+    atts_bos = atts_audio[:, :1]
+    inputs_embeds = torch.cat([bos_embeds, audio_embeds], dim=1)
+    attention_mask = torch.cat([atts_bos, atts_audio], dim=1)
+    outputs = self.llama_model.generate(
+        inputs_embeds=inputs_embeds,
+        max_new_tokens=max_new_tokens,
+        stopping_criteria=stopping,
+        num_beams=num_beams,
+        do_sample=True,
+        min_length=min_length,
+        top_p=top_p,
+        repetition_penalty=repetition_penalty,
+        length_penalty=length_penalty,
+        temperature=temperature,
+    )
+    output_token = outputs[0]
+    if output_token[0] == 0:  # the model might output an unknown token <unk> at the beginning; remove it
+        output_token = output_token[1:]
+    if output_token[0] == 1:  # if there is a start token <s> at the beginning, remove it
+        output_token = output_token[1:]
+    output_text = self.llama_tokenizer.decode(output_token, add_special_tokens=False)
+    output_text = output_text.split('###')[0]  # remove the stop sign '###'
+    output_text = output_text.split('Assistant:')[-1].strip()
+    return output_text
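+processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
+# CMIDataset is not part of transformers; it is assumed to be imported from the MusiLingo GitHub repository code
+ds = CMIDataset(processor, 'path/to/MI_dataset', 'test', question_type='long')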
+dl = DataLoader(
+    ds,
+    batch_size=1,
+    num_workers=0,
+    pin_memory=True,
+    shuffle=False,
+    drop_last=True,
+    collate_fn=ds.collater,
+)
+stopping = StoppingCriteriaList([StoppingCriteriaSub([torch.tensor([835]).cuda(),
+                                torch.tensor([2277, 29937]).cuda()])])
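+from transformers import AutoModel
+# the repository ships custom model code, so trust_remote_code=True is assumed to be required
+model_long = AutoModel.from_pretrained("m-a-p/MusiLingo-long-v1", trust_remote_code=True)
+# answer() moves the audio to the GPU, so the model has to be there as well
+model_long.to("cuda")
+model_long.eval()
+for idx, sample in tqdm(enumerate(dl)):
+    ans = answer(model_long.model, sample, stopping, length_penalty=100, temperature=0.1)
+    txt = sample['text_input'][0]
+    print(txt)
+    print(ans)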
+```
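The stop check and the text clean-up in the demo above can be tried without a GPU or model weights. The following is a minimal sketch in plain Python that mirrors the tail comparison in `StoppingCriteriaSub.__call__` and the final `split` calls in `answer`. The helper names `ends_with_stop` and `clean_output` are hypothetical, and the sample ids and text are made up for illustration.

```python
def ends_with_stop(generated_ids, stop_sequences):
    """Return True if the generated ids end with any of the stop sequences."""
    for stop in stop_sequences:
        if len(generated_ids) >= len(stop) and generated_ids[-len(stop):] == stop:
            return True
    return False


def clean_output(raw_text):
    """Drop everything from the first '###' on, then keep the text after 'Assistant:'."""
    text = raw_text.split('###')[0]
    return text.split('Assistant:')[-1].strip()


# Same stop ids as in the demo, written as plain lists instead of CUDA tensors
stops = [[835], [2277, 29937]]

print(ends_with_stop([306, 4696, 835], stops))     # ends with the one-token stop
print(ends_with_stop([306, 2277, 29937], stops))   # ends with the two-token stop
print(ends_with_stop([306, 2277], stops))          # only half of a stop sequence
print(clean_output("Assistant: The piece feels calm and nostalgic.###Human:"))
```

Because the text is split on `'###'` first, an output that begins with `'###'` is reduced to an empty string, which is worth keeping in mind when inspecting unexpectedly blank answers.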
+# Citing This Work
+If you find the work useful for your research, please consider citing it using the following BibTeX entry:
+```
+@inproceedings{deng2024musilingo,
+  title={MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response},
+  author={Deng, Zihao and Ma, Yinghao and Liu, Yudong and Guo, Rongchen and Zhang, Ge and Chen, Wenhu and Huang, Wenhao and Benetos, Emmanouil},
+  booktitle={Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)},
+  year={2024},
+  organization={Association for Computational Linguistics}
+}
+```