---
metrics:
  - MFCC-DTW
  - ZCR
  - Chroma Score
  - Spectral Score
model-index:
  - name: SEE-2-SOUND
    results:
      - task:
          type: spatial-audio-generation
          name: Spatial Audio Generation
        dataset:
          type: rishitdagli/see-2-sound-eval
          name: SEE-2-SOUND Evaluation Dataset
        metrics:
          - type: MFCC-DTW
            value: 0.03 × 10^-3
            name: >-
              AViTAR Marginal Scene Guidance - Mel-Frequency Cepstral
              Coefficient - Dynamic Time Warping
          - type: ZCR
            value: 0.95
            name: AViTAR Marginal Scene Guidance - Zero Crossing Rate
          - type: Chroma
            value: 0.77
            name: Chroma Feature
          - type: Spectral Score
            value: 0.95
            name: AViTAR Marginal Scene Guidance - Spectral Score
        source:
          name: arXiv
          url: https://arxiv.org/abs/2406.06612
tags:
  - vision
  - audio
  - spatial audio
  - audio generation
  - music
  - art
---

# SEE-2-SOUND🔊: Zero-Shot Spatial Environment-to-Spatial Sound

Rishit Dagli<sup>1</sup> · Shivesh Prakash<sup>1</sup> · Rupert Wu<sup>1</sup> · Houman Khosravani<sup>1,2,3</sup>

<sup>1</sup>University of Toronto    <sup>2</sup>Temerty Centre for Artificial Intelligence Research and Education in Medicine    <sup>3</sup>Sunnybrook Research Institute

[Paper PDF](https://arxiv.org/abs/2406.06612) · Project Page · Hugging Face Spaces · Hugging Face Paper

This work presents SEE-2-SOUND, a method to generate spatial audio from images, animated images, and videos to accompany the visual content. Check out our website to view some results of this work.

*Teaser figure*

These checkpoints are meant to be used with our code: SEE-2-SOUND.

## Installation

First, install the pip package and download these checkpoints (needs Git LFS):

```bash
pip install -e git+https://github.com/see2sound/see2sound.git#egg=see2sound
git clone https://huggingface.co/rishitdagli/see-2-sound
cd see-2-sound
```

See the full installation instructions, as well as tips on dependencies, in the repository README.
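
To confirm the editable install worked, a quick sanity check (this only imports the package and prints where it was installed from; it is not part of the SEE-2-SOUND API):

```python
# Verify that the see2sound package is importable after `pip install -e ...`.
import see2sound

print(see2sound.__file__)  # path of the installed package
```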

## Running the Models

Now, we can start by creating a configuration file called `config.yaml`:

```yaml
codi_encoder: 'codi/codi_encoder.pth'
codi_text: 'codi/codi_text.pth'
codi_audio: 'codi/codi_audio.pth'
codi_video: 'codi/codi_video.pth'

sam: 'sam/sam.pth'
# H, L, or B in decreasing order of performance
sam_size: 'H'

depth: 'depth/depth.pth'
# L, B, or S in decreasing order of performance
depth_size: 'L'

download: False

# Change to True if your GPU has < 40 GB of VRAM
low_mem: False
fp16: False
gpu: True
steps: 500
num_audios: 3
prompt: ''
verbose: True
```
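
Before running the model, it can help to check that the checkpoint paths in `config.yaml` resolve to real files (for example, a clone made without Git LFS leaves tiny pointer files instead of the actual weights). A minimal sketch, assuming the config above and PyYAML; this script is not part of the SEE-2-SOUND API:

```python
import os

import yaml  # PyYAML

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Keys in config.yaml that should point at checkpoint files.
checkpoint_keys = ["codi_encoder", "codi_text", "codi_audio", "codi_video", "sam", "depth"]

for key in checkpoint_keys:
    path = config[key]
    if os.path.isfile(path):
        print(f"{key}: {path} ({os.path.getsize(path) / 1e6:.1f} MB)")
    else:
        print(f"{key}: {path} is missing")
```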

Now, we can start running inference:

```python
import see2sound

config_file_path = "config.yaml"

# Create the pipeline from the config file, then load and set up the models.
model = see2sound.See2Sound(config_path=config_file_path)
model.setup()

# Generate spatial audio for an input image and write it to a WAV file.
model.run(path="test.png", output_path="test.wav")
```
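
SEE-2-SOUND also takes animated images and videos as input; a minimal sketch assuming `run` accepts a video path in the same way (the file names below are placeholders):

```python
# Hypothetical example: generate spatial audio for a video clip.
model.run(path="test.mp4", output_path="test_video.wav")
```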

## More Information

Feel free to take a look at the full documentation for extra information and tips on running the model.