For each image in the test set, you will predict a list of categories belonging to the image and classify whether the image is out-of-sample. The out-of-sample predictions are evaluated by the AUC-ROC score, rescaled as follows:

$$sAUC = 2 \cdot AUC - 1$$

Predictions of categories are evaluated according to the Mean Average Precision at 20 (MAP@20):

$$MAP@20 = \frac{1}{U} \sum_{u=1}^{U} \sum_{k=1}^{\min(n,\,20)} P(k) \times rel(k)$$

where $U$ is the number of images, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per image, and $rel(k)$ is an indicator function equaling 1 if the item at rank $k$ is a relevant (correct) label and 0 otherwise.

Once a correct label has been scored for an observation, that label is no longer considered relevant for that observation, and additional predictions of that label are skipped in the calculation. For example, if the correct label is 1 for an observation, the following predictions all score an average precision of 1.0:

[1, 2, 3, 4, 5]
[1, 1, 1, 1, 1]
[1, 2, 1, 3, 1]

The final score is the simple average:

$$\frac{1}{2}\left(sAUC + MAP@20\right)$$

You can find a Python implementation of the metric in metric.py; a rough sketch of the computation also appears at the end of this section.

Submission Format

Entries should be submitted as a csv file with each line representing a single image. The image uuid should be followed by all detected categories and an osd score:

id,categories,osd
8119e2ac-ca3a-4c3b-9e1c-c7a079a705c8,1 146 10 12 44 210,0.1
11e2891-93a3-4532-a4ea-6e22e335ae54,17 82 251,0.9
...

The id is the image's file_name as indicated in the annotation files. The categories should be a space-delimited list of integer codes, ordered by the probability that the category is present in the image; this list will be automatically truncated after 20 entries. The osd field should be a probability or confidence score indicating whether the image is drawn from a different distribution relative to the training set.
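As a concrete illustration of this format, here is a minimal sketch that writes a submission file with pandas. The predictions dictionary is purely hypothetical stand-in data; real image ids and scores would come from your model.

```python
import pandas as pd

# Hypothetical model output, keyed by image uuid (the file_name from the
# annotation files): (category code, probability) pairs plus an
# out-of-sample score for each image.
predictions = {
    "8119e2ac-ca3a-4c3b-9e1c-c7a079a705c8": {
        "cats": [(1, 0.9), (146, 0.8), (10, 0.4), (12, 0.3), (44, 0.2), (210, 0.1)],
        "osd": 0.1,
    },
}

rows = []
for image_id, pred in predictions.items():
    # Order category codes by descending probability; only the first 20
    # entries are scored, so truncating here loses nothing.
    top = sorted(pred["cats"], key=lambda t: t[1], reverse=True)[:20]
    rows.append({
        "id": image_id,
        "categories": " ".join(str(code) for code, _ in top),
        "osd": pred["osd"],
    })

pd.DataFrame(rows, columns=["id", "categories", "osd"]).to_csv(
    "submission.csv", index=False
)
```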
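For reference while developing, here is a rough sketch of the combined metric described in the evaluation section above. It is not the official metric.py; in particular, normalizing each image's average precision by min(number of relevant labels, 20) is the conventional MAP@K definition and is an assumption here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def apk(actual, predicted, k=20):
    """Average precision at k for one image. Repeated predictions of an
    already-scored label are skipped, per the rules above."""
    if not actual:
        return 0.0
    score, hits, seen = 0.0, 0.0, set()
    for i, p in enumerate(predicted[:k]):
        if p in actual and p not in seen:
            seen.add(p)
            hits += 1.0
            score += hits / (i + 1.0)  # precision at cutoff i + 1
    return score / min(len(actual), k)

def competition_score(osd_true, osd_pred, cats_true, cats_pred):
    """Mean of the rescaled AUC and MAP@20 (a sketch, not metric.py)."""
    s_auc = 2.0 * roc_auc_score(osd_true, osd_pred) - 1.0
    map20 = float(np.mean([apk(a, p) for a, p in zip(cats_true, cats_pred)]))
    return 0.5 * (s_auc + map20)
```

With actual = {1}, each of the three example prediction lists above evaluates to an average precision of 1.0 under apk, matching the worked example.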