## DeSTA2

[Paper](https://arxiv.org/pdf/2409.20007) | [Website](https://kehanlu.github.io/DeSTA2/) | [Github](https://github.com/kehanlu/DeSTA2) | [🤗 Model](https://huggingface.co/DeSTA-ntu/DeSTA2-8B-beta) | [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA2-Llama3-8B-Instruct)

## Quickstart

```python
from transformers import AutoModel

HF_TOKEN = "hf_..."  # your Hugging Face token, required to download Llama3 from the official Meta repo

model = AutoModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", trust_remote_code=True, token=HF_TOKEN)

messages = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "audio", "content": "<path_to_audio_file>"},
    {"role": "user", "content": "Describe the audio."}
]

generated_ids = model.chat(
    messages,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
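Since `model.chat` takes a list of role/content messages, prompts for several clips can be built programmatically. The sketch below only constructs the message lists, assuming the same role schema (`system`, `audio`, `user`) shown in the quickstart; the file names and the `build_messages` helper are hypothetical, for illustration only.

```python
# Build one DeSTA2-style message list per audio clip; the "audio" role
# carries a file path, following the schema from the quickstart above.
def build_messages(audio_path, instruction,
                   system_prompt="You are a helpful voice assistant."):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "audio", "content": audio_path},
        {"role": "user", "content": instruction},
    ]

# Hypothetical local files and tasks, for illustration only.
tasks = [
    ("clip1.wav", "Describe the audio."),
    ("clip2.wav", "Transcribe the speech."),
]

batches = [build_messages(path, prompt) for path, prompt in tasks]
print(batches[1][2]["content"])  # -> Transcribe the speech.
```

Each element of `batches` can then be passed to `model.chat` in turn.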

## Citation

If you find our work useful, please consider citing our papers:

```bibtex
@article{lu2024developing,
  title={Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data},
  author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Balam, Jagadeesh and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2409.20007},
  year={2024}
}

@inproceedings{lu24c_interspeech,
  title={DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment},
  author={Ke-Han Lu and Zhehuai Chen and Szu-Wei Fu and He Huang and Boris Ginsburg and Yu-Chiang Frank Wang and Hung-yi Lee},
  year={2024},
  booktitle={Interspeech 2024},
  pages={4159--4163},
  doi={10.21437/Interspeech.2024-457},
  issn={2958-1796}
}
```