---
license: mit
datasets:
- bpiyush/sound-of-water
language:
- en
base_model:
- facebook/wav2vec2-base-960h
pipeline_tag: audio-classification
tags:
- physical-property-estimation
- audio-visual
- pouring-water
---
# 🚰 The Sound of Water: Inferring Physical Properties from Pouring Liquids

This repository provides the trained model checkpoints for our paper.

<p align="center">
  <img src="./assets/pitch_on_spectrogram-compressed.gif" alt="Teaser" width="100%">
</p>

*Key insight*: As water is poured, the fundamental frequency that we hear changes predictably over time as a function of physical properties (e.g., container dimensions).

**TL;DR**: We present a method to infer physical properties of liquids from *just* the sound of pouring. We show in theory how *pitch* can be used to derive various physical properties such as container height, flow rate, etc. Then, we train a pitch detection network (`wav2vec2`) using simulated and real data. The resulting model can predict the physical properties of pouring liquids with high accuracy. The latent representations learned also encode information about liquid mass and container shape.
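To make the key insight concrete, here is a toy sketch (our own illustration, not the paper's code) of the quarter-wavelength resonance model: the air column above the water behaves like a pipe closed at one end, so the fundamental frequency is f = c / (4·l). If we further assume a cylindrical container filled at a constant rate, the air-column length is linear in time, and a simple line fit over the pitch trajectory recovers container height and flow rate. All names, constants, and the cylindrical-container assumption below are illustrative.

```python
# Toy sketch of the resonance physics (our simplification, not the paper's code).
C = 343.0  # speed of sound in air (m/s)

def fundamental_frequency(air_column_m: float) -> float:
    """Pitch of an air column closed at one end: f = c / (4 * l)."""
    return C / (4.0 * air_column_m)

def air_column_length(pitch_hz: float) -> float:
    """Invert the model: recover the air-column length from the pitch."""
    return C / (4.0 * pitch_hz)

def fit_height_and_flow(times_s, pitches_hz, area_m2):
    """Assuming a cylinder filled at constant rate Q, the air column
    shrinks linearly: l(t) = H - (Q / A) * t. A least-squares line fit
    on l(t) therefore yields container height (intercept) and flow
    rate (-slope * A)."""
    lengths = [air_column_length(f) for f in pitches_hz]
    n = len(times_s)
    mean_t = sum(times_s) / n
    mean_l = sum(lengths) / n
    slope = sum((t - mean_t) * (l - mean_l)
                for t, l in zip(times_s, lengths)) \
        / sum((t - mean_t) ** 2 for t in times_s)
    height_m = mean_l - slope * mean_t
    flow_m3s = -slope * area_m2
    return height_m, flow_m3s
```

As the container fills, the air column shrinks and the pitch rises; inverting that trend is what lets a pitch detector expose physical properties.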

arXiv link: https://arxiv.org/abs/2411.11222

## Demo

Check out the demo [here](https://huggingface.co/spaces/bpiyush/SoundOfWater). Upload a video of water pouring, and the model estimates the pitch trajectory and physical properties.


## 💻 Usage

First, clone the repository from GitHub.

```sh
git clone git@github.com:bpiyush/SoundOfWater.git
cd SoundOfWater
```

Then, install dependencies.

```sh
conda create -n sow python=3.8
conda activate sow

# Install desired torch version
# NOTE: change the version if you are using a different CUDA version
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

# Additional packages
pip install lightning==2.1.2
pip install timm==0.9.10
pip install pandas
pip install decord==0.6.0
pip install librosa==0.10.1
pip install einops==0.7.0
pip install ipywidgets jupyterlab seaborn

# if you find a package is missing, please install it with pip
```

Then, use this snippet to download the models:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bpiyush/sound-of-water-models",
    local_dir="/path/to/download/",
)
```

To run our models on examples of pouring sounds, please see the [playground notebook](https://github.com/bpiyush/SoundOfWater/blob/main/playground.ipynb).

If you would like to use our dataset for a different task, please download it from [here](https://huggingface.co/datasets/bpiyush/sound-of-water).

## Models

We provide audio models trained to detect pitch in the sound of pouring water.
We train these models in two stages:

1. **Pre-training on synthetic data**: We simulate sounds of pouring water with [DDSP](https://arxiv.org/abs/2001.04643), fitted on only 80 real samples, to generate a large corpus of synthetic pouring sounds. We then pre-train `wav2vec2` on this data.
2. **Fine-tuning on real data**: Since real recordings come without ground-truth pitch, we fine-tune the audio model using visual co-supervision from the accompanying video stream.
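The visual co-supervision idea can be sketched in a few lines (our own simplification, not the paper's training code): a water level measured from the video frames implies a target pitch via the same resonance model (f = c / (4·l), where l is the air column above the water), which can then supervise the audio model's pitch predictions. The function names and the simple L1 loss below are illustrative assumptions.

```python
# Illustrative sketch of visual co-supervision (our simplification):
# a visually estimated water level yields a pseudo ground-truth pitch.
C = 343.0  # speed of sound in air (m/s)

def pseudo_pitch_label(container_height_m: float, water_level_m: float) -> float:
    """Target pitch implied by a water level measured from the video,
    via the closed-pipe resonance model f = c / (4 * air_column)."""
    air_column = container_height_m - water_level_m
    return C / (4.0 * air_column)

def l1_pitch_loss(predicted_hz, target_hz):
    """A simple supervision signal comparing the audio model's pitch
    predictions against the visually derived pseudo labels."""
    return sum(abs(p - t) for p, t in zip(predicted_hz, target_hz)) / len(target_hz)
```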

Below, we provide checkpoints for both stages.

<table style="font-size: 12px;" class="center">
  <tr>
    <th><b> File name </b></th>
    <th><b> Description </b></th>
    <th><b> Size </b></th>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/bpiyush/sound-of-water-models">dsr9mf13_ep100_step12423_synthetic_pretrained.pth</a></td>
    <td>Pre-trained on synthetic data</td>
    <td>361M</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/bpiyush/sound-of-water-models">dsr9mf13_ep100_step12423_real_finetuned_with_cosupervision.pth</a></td>
    <td>Trained with visual co-supervision</td>
    <td>361M</td>
  </tr>
</table>


<!-- Add a citation -->
## 📜 Citation

If you find this repository useful, please consider giving it a star ⭐ and a citation:

```bibtex
@article{sound_of_water_bagad,
  title={The Sound of Water: Inferring Physical Properties from Pouring Liquids},
  author={Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M. and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2411.11222},
  year={2024}
}
```

<!-- Add acknowledgements, license, etc. here. -->
## πŸ™ Acknowledgements

* We thank Ashish Thandavan for support with infrastructure and Sindhu
Hegde, Ragav Sachdeva, Jaesung Huh, Vladimir Iashin, Prajwal KR, and Aditya Singh for useful
discussions.
* This research is funded by EPSRC Programme Grant VisualAI EP/T028572/1 and a Royal Society Research Professorship RP\R1\191132.

We also want to highlight closely related work that could be of interest:

* [Analyzing Liquid Pouring Sequences via Audio-Visual Neural Networks](https://gamma.cs.unc.edu/PSNN/). IROS (2019).
* [Human sensitivity to acoustic information from vessel filling](https://psycnet.apa.org/record/2000-13210-019). Journal of Experimental Psychology (2020).
* [See the Glass Half Full: Reasoning About Liquid Containers, Their Volume and Content](https://arxiv.org/abs/1701.02718). ICCV (2017).
* [CREPE: A Convolutional Representation for Pitch Estimation](https://arxiv.org/abs/1802.06182). ICASSP (2018).

## πŸ™…πŸ» Potential Biases

Our model is based on `wav2vec2`, which is pre-trained on large-scale speech recognition data (LibriSpeech, 960 hours). Although this dataset is smaller than the web-scale corpora behind many modern models, it may still contain undesirable biases, and these could carry over to our model.