Model Card for DiVA Llama 3
This is an end-to-end voice assistant model that accepts both speech and text as input. It was trained using a distillation loss. More details will appear in a forthcoming paper [COMING SOON]!
See the model in action compared to SALMONN and Qwen-Audio at value-nlp.github.io/DiVA-Demo.
Citation
No publication is available yet. If you use this model, please cite the BibTeX entry below:
@misc{held2024diva,
author="Held, Will and Zhang, Yanzhe and Ryan, Michael and Shi, Weiyan and Li, Ella and Yang, Diyi",
title="Distilling an End-to-End Voice Assistant from Speech Recognition Data",
year="2024",
publisher="HuggingFace",
}
Table of Contents
- Model Card for DiVA Llama 3
- Citation
- Table of Contents
- Training Details
- Environmental Impact
- Technical Specifications [optional]
- Model Card Contact
Training Details
Training Data
This model was trained on the CommonVoice corpus.
Training Procedure
This model was trained for 7k gradient steps with a batch size of 512 recordings. The learning rate was warmed up linearly over 70 steps and then decayed linearly from 5e-5 to zero.
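The schedule described above can be sketched in a few lines of plain Python. This is only an illustration, not the Levanter configuration used for training: the function name `learning_rate` is hypothetical, "7k" is read as exactly 7000 steps, and the sketch assumes that warmup starts from zero, that the 70 warmup steps count toward the 7000 total, and that the decay begins immediately after warmup.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=70, total_steps=7000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    Assumptions (not confirmed by the model card): warmup starts at zero
    and is included in total_steps; decay starts right after warmup.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Upper bound on recordings processed: one batch of 512 per gradient step.
recordings_seen = 7000 * 512

if __name__ == "__main__":
    for step in (0, 35, 70, 3535, 7000):
        print(step, learning_rate(step))
    print("recordings seen:", recordings_seen)
```

Under these assumptions the model processes 7000 × 512 = 3,584,000 recordings in total, counting any recording seen more than once each time it appears.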
Environmental Impact
- Hardware Type: V4-32 TPU
- Hours used: 8 Hours
- Cloud Provider: Google Cloud
- Compute Region: US Central C
Hardware
This model was trained on a V4 TPU on Google Cloud.
Software
This model was trained with Levanter.
Model Card Authors [optional]
Will Held