Don't you think we should add an "Evaluation" tag for datasets that are meant to be benchmarks rather than training data?
At the very least, anyone collecting a group of datasets from an organization, or even from the whole Hub, could filter on that tag and avoid accidentally contaminating their "training" data.
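
To make the idea concrete, here is a rough sketch of the kind of filtering I have in mind. Everything in it is an assumption for illustration: the tag name `evaluation`, the record layout (plain dicts with an `id` and an optional `tags` list), and the helper `split_by_eval_tag` are all hypothetical, and it does not call any real Hub API.

```python
# Hypothetical tag name; no such tag exists yet, this is the proposal.
EVAL_TAG = "evaluation"


def split_by_eval_tag(datasets):
    """Split dataset records into (training candidates, benchmarks).

    Each record is assumed to be a dict with an "id" and an optional
    "tags" list. Tag matching is case-insensitive.
    """
    train, benchmarks = [], []
    for ds in datasets:
        tags = {t.lower() for t in ds.get("tags", [])}
        if EVAL_TAG in tags:
            benchmarks.append(ds["id"])
        else:
            train.append(ds["id"])
    return train, benchmarks


# Made-up records standing in for whatever metadata was collected.
collected = [
    {"id": "org/web-corpus", "tags": ["text"]},
    {"id": "org/qa-benchmark", "tags": ["text", "Evaluation"]},
    {"id": "org/untagged"},
]

train_ids, benchmark_ids = split_by_eval_tag(collected)
```

Note that untagged datasets fall into the training bucket here, so this only helps if benchmark authors actually apply the tag.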