# Supplementary Material
## Training and test data
We provide a [website](https://zju3dv.github.io/zju_mocap/) for visualization.
The multi-view videos are captured by 23 cameras. We train our model on cameras 0, 6, 12, and 18, and test it on the remaining cameras.
The following table lists the number of frames and the frame ranges used for training and testing each video. Since the video length differs across subjects, we choose an appropriate number of frames for each.
**Note that since rendering is very slow, we test our model every 30 frames. For example, although the training frame range of video 313 is "0-59", we only test our model on the 0-th and 30-th frames.**
| Video | 313 | 315 | 377 | 386 | 387 | 390 | 392 | 393 | 394 |
| :-----: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Number of frames | 1470 | 2185 | 617 | 646 | 654 | 1171 | 556 | 658 | 859 |
| Frame Range (Training) | 0-59 | 0-399 | 0-299 | 0-299 | 0-299 | 700-999 | 0-299 | 0-299 | 0-299 |
| Frame Range (Unseen human poses) | 60-1060 | 400-1400 | 300-617 | 300-646 | 300-654 | 0-700 | 300-556 | 300-658 | 300-859 |
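For concreteness, the 30-frame sampling described in the note above can be written as follows; this is a minimal sketch with assumed variable names, not code from the repository.

```python
# evaluate every 30th frame of a given range,
# e.g. the training range "0-59" of video 313
frame_start, frame_end = 0, 60
eval_frames = list(range(frame_start, frame_end, 30))  # -> [0, 30]
```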
## Evaluation metrics
**We provide our rendering results on novel views of training frames and unseen human poses [here](https://zjueducn-my.sharepoint.com/:u:/g/personal/pengsida_zju_edu_cn/Ea3VOUy204VAiVJ-V-OGd9YBxdhbtfpS-U6icD_rDq0mUQ?e=cAcylK).**
As described in the paper, we evaluate our model in terms of the PSNR and SSIM metrics.
A straightforward way to evaluate is to compute the metrics on the whole image. Since we already know the 3D bounding box of the target human, we project it to obtain a `bound_mask` and set the colors of pixels outside the mask to zero, as shown in the following figure.
![fig](https://zju3dv.github.io/neuralbody/images/bound_mask.png)
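For illustration, the `bound_mask` can be obtained roughly as follows. This is a minimal sketch rather than the repository's code; the function name, the `bounds` layout, and the camera convention (`K`, `R`, `T` with `x_cam = R @ x_world + T`) are assumptions.

```python
import cv2
import numpy as np


def get_bound_mask(bounds, K, R, T, H, W):
    """Rasterize the projection of a 3D bounding box as a binary mask.

    bounds: (2, 3) array holding the min/max corners of the box.
    K, R, T: camera intrinsics and extrinsics.
    (Hypothetical helper for illustration only.)
    """
    # enumerate the 8 corners of the axis-aligned 3D box
    min_xyz, max_xyz = bounds
    corners = np.array([[x, y, z]
                        for x in (min_xyz[0], max_xyz[0])
                        for y in (min_xyz[1], max_xyz[1])
                        for z in (min_xyz[2], max_xyz[2])])
    # world -> camera -> image coordinates
    cam_xyz = corners @ R.T + T.reshape(1, 3)
    uvw = cam_xyz @ K.T
    uv = uvw[:, :2] / uvw[:, 2:]
    # fill the convex hull of the projected corners
    mask = np.zeros((H, W), dtype=np.uint8)
    hull = cv2.convexHull(np.round(uv).astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 1)
    return mask.astype(bool)
```

The image is then masked with, e.g., `img[~mask] = 0` before computing the metrics.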
Because the zeroed background pixels match the ground truth exactly, the PSNR and SSIM metrics appear very high, as shown in the following table.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 35.21 | 0.985 | 29.02 | 0.964 |
| 315 | 33.07 | 0.988 | 25.70 | 0.957 |
| 392 | 35.76 | 0.984 | 31.53 | 0.971 |
| 393 | 33.24 | 0.979 | 28.40 | 0.960 |
| 394 | 34.31 | 0.980 | 29.61 | 0.961 |
| 377 | 33.86 | 0.985 | 30.60 | 0.977 |
| 386 | 36.07 | 0.984 | 33.05 | 0.974 |
| 390 | 34.48 | 0.980 | 30.25 | 0.964 |
| 387 | 31.39 | 0.975 | 27.68 | 0.961 |
| Average | 34.15 | 0.982 | 29.54 | 0.966 |
To overcome this problem, a solution is to compute the metrics only on pixels inside the `bound_mask`. Since the SSIM metric requires its input to be an image, we first compute the 2D box that bounds the `bound_mask` and then crop the corresponding image region.
```python
import cv2
import numpy as np
# compare_ssim lives in skimage.measure in older scikit-image;
# newer versions provide skimage.metrics.structural_similarity
from skimage.measure import compare_ssim

from lib.config import cfg  # project config (image size and downsample ratio)


def ssim_metric(rgb_pred, rgb_gt, batch):
    # mask of pixels covered by the projected 3D bounding box
    mask_at_box = batch['mask_at_box'][0].detach().cpu().numpy()
    H, W = int(cfg.H * cfg.ratio), int(cfg.W * cfg.ratio)
    mask_at_box = mask_at_box.reshape(H, W)
    # scatter the predicted and ground-truth pixels back into images
    img_pred = np.zeros((H, W, 3))
    img_pred[mask_at_box] = rgb_pred
    img_gt = np.zeros((H, W, 3))
    img_gt[mask_at_box] = rgb_gt
    # crop the 2D box that bounds the mask
    x, y, w, h = cv2.boundingRect(mask_at_box.astype(np.uint8))
    img_pred = img_pred[y:y + h, x:x + w]
    img_gt = img_gt[y:y + h, x:x + w]
    # compute SSIM on the cropped region
    ssim = compare_ssim(img_pred, img_gt, multichannel=True)
    return ssim
```
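The PSNR counterpart can be computed directly on the pixels inside the `bound_mask`; a minimal sketch (the function name and the assumption that colors lie in [0, 1] are ours, not necessarily the repository's):

```python
import numpy as np


def psnr_metric(rgb_pred, rgb_gt):
    """PSNR over the pixels inside the bound_mask.

    rgb_pred / rgb_gt: (N, 3) arrays holding only the masked pixels,
    with colors in [0, 1] (hypothetical helper for illustration).
    """
    mse = np.mean((rgb_pred - rgb_gt) ** 2)
    return -10.0 * np.log10(mse)
```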
The following table lists the corresponding results.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 30.56 | 0.971 | 23.95 | 0.905 |
| 315 | 27.24 | 0.962 | 19.56 | 0.852 |
| 392 | 29.44 | 0.946 | 25.76 | 0.909 |
| 394 | 28.44 | 0.940 | 23.80 | 0.878 |
| 393 | 27.58 | 0.939 | 23.25 | 0.893 |
| 377 | 27.64 | 0.951 | 23.91 | 0.909 |
| 386 | 28.60 | 0.931 | 25.68 | 0.881 |
| 387 | 25.79 | 0.928 | 21.60 | 0.870 |
| 390 | 27.59 | 0.926 | 23.90 | 0.870 |
| Average | 28.10 | 0.944 | 23.49 | 0.885 |
## Results of other methods on ZJU-MoCap
We save the rendering results of other methods on novel views of training frames and unseen human poses [here](https://zjueducn-my.sharepoint.com/:u:/g/personal/pengsida_zju_edu_cn/EQaPRQww70NDqEXeSG-fOeAB5JXFSWiWDW223h5nmkHvwQ?e=mdofbl), including Neural Volumes, Multi-view Neural Human Rendering, and Deferred Neural Human Rendering. **Note that we only generate novel views of training frames for Neural Volumes.**
The following table lists quantitative results of Neural Volumes.
| Video | PSNR (training frames) | SSIM (training frames) |
| :---: | :---: | :---: |
| 313 | 20.09 | 0.831 |
| 315 | 18.57 | 0.824 |
| 392 | 22.88 | 0.726 |
| 394 | 22.08 | 0.843 |
| 393 | 21.29 | 0.842 |
| 377 | 21.15 | 0.842 |
| 386 | 23.21 | 0.820 |
| 387 | 20.74 | 0.838 |
| 390 | 22.49 | 0.825 |
| Average | 21.39 | 0.821 |
The following table lists quantitative results of Multi-view Neural Human Rendering.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 26.68 | 0.935 | 23.05 | 0.893 |
| 315 | 19.81 | 0.874 | 18.88 | 0.844 |
| 392 | 24.73 | 0.902 | 23.66 | 0.893 |
| 394 | 25.01 | 0.906 | 22.87 | 0.874 |
| 393 | 23.47 | 0.894 | 22.27 | 0.885 |
| 377 | 23.79 | 0.918 | 21.94 | 0.885 |
| 386 | 25.02 | 0.879 | 23.70 | 0.853 |
| 387 | 22.65 | 0.858 | 20.97 | 0.866 |
| 390 | 23.72 | 0.873 | 22.65 | 0.858 |
| Average | 23.87 | 0.893 | 22.22 | 0.872 |
The following table lists quantitative results of Deferred Neural Human Rendering.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 25.78 | 0.929 | 22.56 | 0.889 |
| 315 | 19.44 | 0.869 | 18.38 | 0.841 |
| 392 | 24.96 | 0.905 | 24.08 | 0.900 |
| 394 | 24.84 | 0.903 | 22.67 | 0.871 |
| 393 | 23.50 | 0.896 | 22.45 | 0.888 |
| 377 | 23.74 | 0.917 | 22.07 | 0.886 |
| 386 | 24.93 | 0.877 | 23.70 | 0.851 |
| 387 | 22.44 | 0.888 | 20.64 | 0.862 |
| 390 | 24.33 | 0.881 | 22.90 | 0.864 |
| Average | 23.77 | 0.896 | 22.16 | 0.872 |