inference example

#1
by eschmidbauer - opened

Hello, thank you for sharing this code!
Do you have example inference code? I'd like to test this model on my own server.

Of course! You'll need to pip install transformers librosa accelerate wget first:

from transformers import AutoModel
import librosa
import wget

# Grab a short English speech sample (from the SD-QA dataset).
filename = wget.download("https://github.com/ffaisal93/SD-QA/raw/master/dev/eng/irl/wav_eng/-1008642825401516601516622.wav")

# DiVA expects 16 kHz audio.
speech_data, _ = librosa.load(filename, sr=16_000)

# trust_remote_code=True is required because DiVA ships its own modeling code.
model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3-v0-8b", trust_remote_code=True)

output = model.generate(audio=speech_data, text_prompt="Respond like a pirate!")
print(output)

I tested that on a Google Cloud 40GB A100 (same hardware we are hosting the demo on for now), but ymmv on other hardware. I'm just relying on HuggingFace accelerate for most of the distribution across accelerators!
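If your hardware is smaller, one option (untested on my end, so treat it as a sketch) is to ask Accelerate to shard the model automatically with device_map="auto"; whether the custom DiVA code handles this cleanly depends on its implementation:

from transformers import AutoModel

# Sketch only: device_map="auto" lets Accelerate spread the weights across
# whatever GPUs (and CPU RAM) are visible. May need adjusting for the
# remote DiVA code.
model = AutoModel.from_pretrained(
    "WillHeld/DiVA-llama-3-v0-8b",
    trust_remote_code=True,
    device_map="auto",
)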

Lmk if you hit snags with that, happy to help change stuff to make it more straightforward!

One step missing is:

import os

os.environ["HF_TOKEN"] = "********"

This is because the meta-llama/Meta-Llama-3-8B-Instruct model is gated, and it seems DiVA-llama-3-v0-8b downloads it during the steps above.

Ah, one option I'd recommend is to use huggingface-cli login instead of adding the token to your code! This will persist your access token so you don't need to add it to multiple scripts (and decreases the risk of accidentally pushing a secret with your code).

https://huggingface.co/docs/huggingface_hub/en/guides/cli
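If you prefer to stay in Python (e.g. in a notebook), the standard huggingface_hub API gives you the same flow; nothing here is DiVA-specific:

# Run once in a terminal to cache your token:
#   huggingface-cli login
#
# Or do the equivalent from Python:
from huggingface_hub import login

login()  # prompts for the token interactively and caches it locally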

The download of the audio is not working. Is this for cloning a voice? Like, can I use my own audio in English?

Hi Guilherme,

No, DiVA is not currently a text to speech model and we have no plans to support voice cloning. It takes speech as input and replies conversationally with text.

If you are looking for text-to-speech, you may want to look at TTS initiatives like https://github.com/collabora/WhisperSpeech or https://huggingface.co/parler-tts!

@WillHeld
Sorry, I have two questions I was wondering if you could answer.

1- Should I quantize it using bitsandbytes (B&B), or does that degrade performance and perplexity by a large margin? I ask because, in my experience, quantization usually doesn't take kindly to models equipped with an encoder.
2- And may I ask you to put together a notebook on fine-tuning this model (full-parameter fine-tuning if possible, since the model itself isn't too big, but PEFT is also appreciated), hopefully using the HF Trainer or PyTorch this time around, if you have the opportunity to do so?

I really appreciate it since this is such an interesting work.

Hi!

On 1) I haven't tried any quantization myself, so I don't have great signal on this! It seems people quantize Whisper, so if I were to guess, it's possible without too much degradation, but I really don't know.
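If you do want to experiment, the standard transformers + bitsandbytes path would look roughly like this; whether the DiVA remote code tolerates quantized weights is untested, so consider it a starting point only:

from transformers import AutoModel, BitsAndBytesConfig

# Untested sketch: 8-bit quantization via bitsandbytes. The Whisper-based
# encoder and the connector may or may not quantize cleanly.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModel.from_pretrained(
    "WillHeld/DiVA-llama-3-v0-8b",
    trust_remote_code=True,
    quantization_config=bnb_config,
)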

On 2) You can find all the training code here: https://github.com/Helw150/levanter/tree/will/distill but, as you hinted, it's all in JAX. Levanter supports LoRA as well for PEFT, so the functionality is all there.

Unfortunately, I don't have it in my roadmap to reproduce the full training stack in PyTorch, since I rely on the TPU Research Cloud for my compute resources. JAX is much better supported there, and models from Levanter are exported to the safetensors format, so they are easily usable from other frameworks at inference time. If you want an out-of-the-box training solution, I'd suggest using Levanter (it supports GPU as well). Here's a doc on how to get set up for audio: https://levanter.readthedocs.io/en/latest/tutorials/Training-On-Audio-Data/

If PyTorch training is a must, you should be able to place the PyTorch conversion of DiVA here into any HuggingFace/PyTorch trainer loop, though! Everything in modeling_diva.py is PyTorch and differentiable, so it will work with standard forward + backward passes.
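As a very rough sketch of what such a loop could look like (dataloader and compute_loss below are hypothetical placeholders; the real forward signature and loss live in modeling_diva.py):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3-v0-8b", trust_remote_code=True)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in dataloader:  # hypothetical: your own (audio, target text) dataset
    loss = compute_loss(model, batch)  # hypothetical: adapt to the real forward/loss in modeling_diva.py

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()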
