# Supplementary Material
## Training and test data
We provide a [website](https://zju3dv.github.io/zju_mocap/) for visualization.
The multi-view videos are captured by 23 cameras. We train our model on cameras 0, 6, 12, and 18, and test it on the remaining cameras.
The following table lists the number of frames and the frame ranges used for training and testing each video. Since the video length differs across subjects, we choose an appropriate number of frames for each.
**Note that since rendering is very slow, we test our model every 30 frames. For example, although the training frame range of video 313 is "0-59", we only test our model on the 0-th and 30-th frames.**
| Video | 313 | 315 | 377 | 386 | 387 | 390 | 392 | 393 | 394 |
| :-----: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Number of frames | 1470 | 2185 | 617 | 646 | 654 | 1171 | 556 | 658 | 859 |
| Frame Range (Training) | 0-59 | 0-399 | 0-299 | 0-299 | 0-299 | 700-999 | 0-299 | 0-299 | 0-299 |
| Frame Range (Unseen human poses) | 60-1060 | 400-1400 | 300-617 | 300-646 | 300-654 | 0-700 | 300-556 | 300-658 | 300-859 |
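For concreteness, the 30-frame sampling described in the note above can be written as follows; this is a minimal sketch with assumed variable names, not code from the repository.

```python
# evaluate every 30th frame of a given range,
# e.g. the training range "0-59" of video 313
frame_start, frame_end = 0, 60
eval_frames = list(range(frame_start, frame_end, 30))  # -> [0, 30]
```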
## Evaluation metrics
**We provide our rendering results on novel views of training frames and unseen human poses [here](https://zjueducn-my.sharepoint.com/:u:/g/personal/pengsida_zju_edu_cn/Ea3VOUy204VAiVJ-V-OGd9YBxdhbtfpS-U6icD_rDq0mUQ?e=cAcylK).**
As described in the paper, we evaluate our model in terms of the PSNR and SSIM metrics.
A straightforward way to evaluate is to compute the metrics on the whole image. Since we already know the 3D bounding box of the target human, we project it to obtain a `bound_mask` and set the colors of pixels outside the mask to zero, as shown in the following figure.
![fig](https://zju3dv.github.io/neuralbody/images/bound_mask.png)
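For illustration, the `bound_mask` can be obtained roughly as follows. This is a minimal sketch rather than the repository's code; the function name, the `bounds` layout, and the camera convention (`K`, `R`, `T` with `x_cam = R @ x_world + T`) are assumptions.

```python
import cv2
import numpy as np


def get_bound_mask(bounds, K, R, T, H, W):
    """Rasterize the projection of a 3D bounding box as a binary mask.

    bounds: (2, 3) array holding the min/max corners of the box.
    K, R, T: camera intrinsics and extrinsics.
    (Hypothetical helper for illustration only.)
    """
    # enumerate the 8 corners of the axis-aligned 3D box
    min_xyz, max_xyz = bounds
    corners = np.array([[x, y, z]
                        for x in (min_xyz[0], max_xyz[0])
                        for y in (min_xyz[1], max_xyz[1])
                        for z in (min_xyz[2], max_xyz[2])])
    # world -> camera -> image coordinates
    cam_xyz = corners @ R.T + T.reshape(1, 3)
    uvw = cam_xyz @ K.T
    uv = uvw[:, :2] / uvw[:, 2:]
    # fill the convex hull of the projected corners
    mask = np.zeros((H, W), dtype=np.uint8)
    hull = cv2.convexHull(np.round(uv).astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 1)
    return mask.astype(bool)
```

The image is then masked with, e.g., `img[~mask] = 0` before computing the metrics.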
Because the zeroed background pixels match the ground truth exactly, the PSNR and SSIM metrics appear very high, as shown in the following table.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 35.21 | 0.985 | 29.02 | 0.964 |
| 315 | 33.07 | 0.988 | 25.70 | 0.957 |
| 392 | 35.76 | 0.984 | 31.53 | 0.971 |
| 393 | 33.24 | 0.979 | 28.40 | 0.960 |
| 394 | 34.31 | 0.980 | 29.61 | 0.961 |
| 377 | 33.86 | 0.985 | 30.60 | 0.977 |
| 386 | 36.07 | 0.984 | 33.05 | 0.974 |
| 390 | 34.48 | 0.980 | 30.25 | 0.964 |
| 387 | 31.39 | 0.975 | 27.68 | 0.961 |
| Average | 34.15 | 0.982 | 29.54 | 0.966 |
To overcome this problem, a solution is to compute the metrics only on pixels inside the `bound_mask`. Since the SSIM metric requires its input to be an image, we first compute the 2D box that bounds the `bound_mask` and then crop the corresponding image region.
```python
import cv2
import numpy as np
# compare_ssim lives in skimage.measure in older scikit-image;
# newer versions provide skimage.metrics.structural_similarity
from skimage.measure import compare_ssim

from lib.config import cfg  # project config (image size and downsample ratio)


def ssim_metric(rgb_pred, rgb_gt, batch):
    # mask of pixels covered by the projected 3D bounding box
    mask_at_box = batch['mask_at_box'][0].detach().cpu().numpy()
    H, W = int(cfg.H * cfg.ratio), int(cfg.W * cfg.ratio)
    mask_at_box = mask_at_box.reshape(H, W)
    # scatter the predicted and ground-truth pixels back into images
    img_pred = np.zeros((H, W, 3))
    img_pred[mask_at_box] = rgb_pred
    img_gt = np.zeros((H, W, 3))
    img_gt[mask_at_box] = rgb_gt
    # crop the 2D box that bounds the mask
    x, y, w, h = cv2.boundingRect(mask_at_box.astype(np.uint8))
    img_pred = img_pred[y:y + h, x:x + w]
    img_gt = img_gt[y:y + h, x:x + w]
    # compute SSIM on the cropped region
    ssim = compare_ssim(img_pred, img_gt, multichannel=True)
    return ssim
```
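The PSNR counterpart can be computed directly on the pixels inside the `bound_mask`; a minimal sketch (the function name and the assumption that colors lie in [0, 1] are ours, not necessarily the repository's):

```python
import numpy as np


def psnr_metric(rgb_pred, rgb_gt):
    """PSNR over the pixels inside the bound_mask.

    rgb_pred / rgb_gt: (N, 3) arrays holding only the masked pixels,
    with colors in [0, 1] (hypothetical helper for illustration).
    """
    mse = np.mean((rgb_pred - rgb_gt) ** 2)
    return -10.0 * np.log10(mse)
```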
The following table lists the corresponding results.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 30.56 | 0.971 | 23.95 | 0.905 |
| 315 | 27.24 | 0.962 | 19.56 | 0.852 |
| 392 | 29.44 | 0.946 | 25.76 | 0.909 |
| 394 | 28.44 | 0.940 | 23.80 | 0.878 |
| 393 | 27.58 | 0.939 | 23.25 | 0.893 |
| 377 | 27.64 | 0.951 | 23.91 | 0.909 |
| 386 | 28.60 | 0.931 | 25.68 | 0.881 |
| 387 | 25.79 | 0.928 | 21.60 | 0.870 |
| 390 | 27.59 | 0.926 | 23.90 | 0.870 |
| Average | 28.10 | 0.944 | 23.49 | 0.885 |
## Results of other methods on ZJU-MoCap
We save the rendering results of other methods on novel views of training frames and unseen human poses [here](https://zjueducn-my.sharepoint.com/:u:/g/personal/pengsida_zju_edu_cn/EQaPRQww70NDqEXeSG-fOeAB5JXFSWiWDW223h5nmkHvwQ?e=mdofbl), including Neural Volumes, Multi-view Neural Human Rendering, and Deferred Neural Human Rendering. **Note that we only generate novel views of training frames for Neural Volumes.**
The following table lists quantitative results of Neural Volumes.
| Video | PSNR (training frames) | SSIM (training frames) |
| :---: | :---: | :---: |
| 313 | 20.09 | 0.831 |
| 315 | 18.57 | 0.824 |
| 392 | 22.88 | 0.726 |
| 394 | 22.08 | 0.843 |
| 393 | 21.29 | 0.842 |
| 377 | 21.15 | 0.842 |
| 386 | 23.21 | 0.820 |
| 387 | 20.74 | 0.838 |
| 390 | 22.49 | 0.825 |
| Average | 21.39 | 0.821 |
The following table lists quantitative results of Multi-view Neural Human Rendering.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 26.68 | 0.935 | 23.05 | 0.893 |
| 315 | 19.81 | 0.874 | 18.88 | 0.844 |
| 392 | 24.73 | 0.902 | 23.66 | 0.893 |
| 394 | 25.01 | 0.906 | 22.87 | 0.874 |
| 393 | 23.47 | 0.894 | 22.27 | 0.885 |
| 377 | 23.79 | 0.918 | 21.94 | 0.885 |
| 386 | 25.02 | 0.879 | 23.70 | 0.853 |
| 387 | 22.65 | 0.858 | 20.97 | 0.866 |
| 390 | 23.72 | 0.873 | 22.65 | 0.858 |
| Average | 23.87 | 0.893 | 22.22 | 0.872 |
The following table lists quantitative results of Deferred Neural Human Rendering.
| Video | PSNR (training frames) | SSIM (training frames) | PSNR (unseen human poses) | SSIM (unseen human poses) |
| :---: | :---: | :---: | :---: | :---: |
| 313 | 25.78 | 0.929 | 22.56 | 0.889 |
| 315 | 19.44 | 0.869 | 18.38 | 0.841 |
| 392 | 24.96 | 0.905 | 24.08 | 0.900 |
| 394 | 24.84 | 0.903 | 22.67 | 0.871 |
| 393 | 23.50 | 0.896 | 22.45 | 0.888 |
| 377 | 23.74 | 0.917 | 22.07 | 0.886 |
| 386 | 24.93 | 0.877 | 23.70 | 0.851 |
| 387 | 22.44 | 0.888 | 20.64 | 0.862 |
| 390 | 24.33 | 0.881 | 22.90 | 0.864 |
| Average | 23.77 | 0.896 | 22.16 | 0.872 |