Evaluations

To compare different generative models, we use FID, sFID, Precision, Recall, and Inception Score. These metrics can all be calculated using batches of samples, which we store in .npz (numpy) files.
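A sample batch is just a NumPy archive of images. As a minimal sketch (the file name is illustrative, and "arr_0" is simply the key np.savez assigns to an unnamed array; check the released batches for their exact layout), writing and reading such a batch looks like this:

import numpy as np

# Sketch: save a batch of generated samples (NHWC uint8 images) as an .npz file.
# Real batches hold tens of thousands of images; 8 is used here only for brevity.
samples = np.zeros((8, 256, 256, 3), dtype=np.uint8)
np.savez("samples_example.npz", samples)

# Read the batch back; np.savez stores an unnamed array under "arr_0".
with np.load("samples_example.npz") as batch:
    images = batch["arr_0"]
print(images.shape)  # (8, 256, 256, 3)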

Download batches

We provide pre-computed sample batches for the reference datasets, our diffusion models, and several baselines we compare against. These are all stored in .npz format.

Reference dataset batches contain pre-computed statistics over the whole dataset, as well as 10,000 images for computing Precision and Recall. All other batches contain 50,000 images which can be used to compute statistics and Precision/Recall.
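Before running an evaluation, it can help to peek at what a downloaded batch actually contains. The sketch below only lists the stored arrays; the exact key names for the images and the pre-computed statistics are not spelled out here, so treat them as something to inspect rather than assume:

import numpy as np

# Sketch: list every array stored in a downloaded batch, e.g. the reference
# batch used later in this document. Key names vary between reference and
# sample batches, so print them rather than hard-coding them.
with np.load("VIRTUAL_imagenet256_labeled.npz") as batch:
    for key in batch.files:
        print(key, batch[key].shape, batch[key].dtype)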

Here are links to download all of the sample and reference batches:

Run evaluations

First, generate or download a batch of samples and download the corresponding reference batch for the given dataset. For this example, we'll use ImageNet 256x256, so the reference batch is VIRTUAL_imagenet256_labeled.npz and we can use the sample batch admnet_guided_upsampled_imagenet256.npz.

Next, run the evaluator.py script. The requirements of this script can be found in requirements.txt. Pass two arguments to the script: the reference batch and the sample batch. The script will download the InceptionV3 model used for evaluations into the current working directory (if it is not already present). This file is roughly 100MB.
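Assuming a standard pip setup, the dependencies can be installed with:

$ pip install -r requirements.txt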

The output of the script will look something like this, where the first ... is a bunch of verbose TensorFlow logging:

$ python evaluator.py VIRTUAL_imagenet256_labeled.npz admnet_guided_upsampled_imagenet256.npz
...
computing reference batch activations...
computing/reading reference batch statistics...
computing sample batch activations...
computing/reading sample batch statistics...
Computing evaluations...
Inception Score: 215.8370361328125
FID: 3.9425574129223264
sFID: 6.140433703346162
Precision: 0.8265
Recall: 0.5309
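For context, FID is the Fréchet distance between Gaussian fits to the InceptionV3 activations of the reference and sample batches. The following is a minimal sketch of that distance, not the evaluator's exact implementation; the function and variable names are illustrative:

import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Matrix square root of the product of covariances; small imaginary parts
    # from numerical error are discarded.
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean)

# Usage sketch: mu/sigma are the mean and covariance of the InceptionV3 pooling
# activations for the reference and sample batches, with shapes (2048,) and
# (2048, 2048) respectively.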