metrics:
- MFCC-DTW
- ZCR
- Chroma Score
- Spectral Score
model-index:
- name: SEE-2-SOUND
results:
- task:
type: spatial-audio-generation
name: Spatial Audio Generation
dataset:
type: rishitdagli/see-2-sound-eval
name: SEE-2-SOUND Evaluation Dataset
metrics:
- type: MFCC-DTW
value: 0.03 × 10^-3
name: >-
AViTAR Marginal Scene Guidance - Mel-Frequency Cepstral
Coefficient - Dynamic Time Warping
- type: ZCR
value: 0.95
name: AViTAR Marginal Scene Guidance - Zero Crossing Rate
- type: Chroma
value: 0.77
name: Chroma Feature
- type: Spectral Score
value: 0.95
name: AViTAR Marginal Scene Guidance - Spectral Score
source:
name: arXiv
url: https://arxiv.org/abs/2406.06612
tags:
- vision
- audio
- spatial audio
- audio generation
- music
- art
SEE-2-SOUND🔊: Zero-Shot Spatial Environment-to-Spatial Sound
Rishit Dagli¹ · Shivesh Prakash¹ · Rupert Wu¹ · Houman Khosravani¹,²,³
¹University of Toronto ²Temerty Centre for Artificial Intelligence Research and Education in Medicine ³Sunnybrook Research Institute
This work presents SEE-2-SOUND, a method to generate spatial audio from images, animated images, and videos to accompany the visual content. Check out our website to view some results of this work.
These checkpoints are meant to be used with our code: SEE-2-SOUND.
Installation
First, install the pip package and download these checkpoints (needs Git LFS):
pip install -e git+https://github.com/see2sound/see2sound.git#egg=see2sound
git clone https://huggingface.co/rishitdagli/see-2-sound
cd see-2-sound
View the full installation instructions, as well as tips on dependencies, in the repository README.
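If Git LFS is not installed when the repository is cloned, the checkpoint files are typically left as small text pointer stubs instead of the real weights. The sketch below is one way to detect that situation. It relies on the standard Git LFS pointer header and assumes the checkpoints use the .pth extension, as in the configuration paths used with this model; the helper names are illustrative and not part of see2sound.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line (Git LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file is a Git LFS pointer stub, not real content."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root: Path, pattern: str = "*.pth") -> list[Path]:
    """List files under root matching pattern that are still pointer stubs."""
    return sorted(
        p for p in root.rglob(pattern) if p.is_file() and is_lfs_pointer(p)
    )
```

For example, find_lfs_pointers(Path("see-2-sound")) should return an empty list after a successful clone. If it does not, running git lfs pull inside the cloned directory usually fetches the real files.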
Running the Models
Start by creating a configuration file called config.yaml:
codi_encoder: 'codi/codi_encoder.pth'
codi_text: 'codi/codi_text.pth'
codi_audio: 'codi/codi_audio.pth'
codi_video: 'codi/codi_video.pth'
sam: 'sam/sam.pth'
# H, L or B in decreasing performance
sam_size: 'H'
depth: 'depth/depth.pth'
# L, B, or S in decreasing performance
depth_size: 'L'
download: False
# Change to True if your GPU has < 40 GB VRAM
low_mem: False
fp16: False
gpu: True
steps: 500
num_audios: 3
prompt: ''
verbose: True
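Before loading the model, it can help to confirm that every checkpoint path in the configuration points at a real file. The sketch below is a minimal pre-flight check under two assumptions: the configuration stays flat (one key: value pair per line, as above), and paths are resolved relative to the directory you run from. The function names are hypothetical and the tiny parser is not a general YAML reader; how see2sound itself resolves paths is not confirmed here.

```python
from pathlib import Path

# Keys from the configuration whose values are checkpoint file paths.
CHECKPOINT_KEYS = (
    "codi_encoder", "codi_text", "codi_audio", "codi_video", "sam", "depth",
)


def read_flat_config(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        # Drop surrounding quotes so 'codi/x.pth' becomes codi/x.pth.
        config[key.strip()] = value.strip().strip("'\"")
    return config


def missing_checkpoints(config: dict[str, str], root: Path) -> list[str]:
    """Return checkpoint keys that are absent or whose file is not under root."""
    missing = []
    for key in CHECKPOINT_KEYS:
        path = config.get(key, "")
        if not path or not (root / path).is_file():
            missing.append(key)
    return missing
```

A typical call would be missing_checkpoints(read_flat_config(Path("config.yaml").read_text()), Path(".")), which returns an empty list when all six checkpoint files are in place.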
With the configuration file in place, run inference:
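import see2sound

config_file_path = "config.yaml"

# Create the model from the configuration file and set it up
model = see2sound.See2Sound(config_path=config_file_path)
model.setup()

# Generate spatial audio for test.png and save it as test.wav
model.run(path="test.png", output_path="test.wav")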
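To process several inputs in one session, the run call can be wrapped in a small loop. The sketch below is a hypothetical helper, not part of see2sound. It only assumes that run accepts the path and output_path keyword arguments shown in the inference example, and that the model can be reused across calls after a single setup(), which is not confirmed here.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    run: Callable[..., object], inputs: Iterable[str], output_dir: str
) -> list[Path]:
    """Call run once per input, writing <input stem>.wav into output_dir."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for item in inputs:
        # Name each output after its input file, e.g. test.png -> test.wav.
        out = out_dir / (Path(item).stem + ".wav")
        run(path=str(item), output_path=str(out))
        outputs.append(out)
    return outputs
```

With the model from the inference example, this could be called as run_batch(model.run, ["test.png", "clip.mp4"], "outputs"). Inputs that share a stem (for example a.png and a.mp4) would overwrite each other's output, so rename them first if that applies.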
More Information
Feel free to take a look at the full documentation for extra information and tips on running the model.