import torch
import cv2
import gradio as gr
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").to(device)
model.eval()
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")


def query_image(img, text_queries, max_results):
    text_queries = text_queries.split(",")

    target_sizes = torch.Tensor([img.shape[:2]])
    inputs = processor(text=text_queries, images=img, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    # Move predictions to CPU before post-processing
    outputs.logits = outputs.logits.cpu()
    outputs.pred_boxes = outputs.pred_boxes.cpu()
    results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
    boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]

    # Sort detections by confidence, highest first
    detections = sorted(
        (
            {"score": score.tolist(), "box": box, "label": label}
            for box, score, label in zip(boxes, scores, labels)
        ),
        key=lambda det: det["score"],
        reverse=True,
    )

    # Distribution of the top-10 confidences, reported alongside the image
    score_dist = [round(det["score"], 2) for det in detections[:10]]

    # Draw the top max_results detections (slider values may arrive as floats)
    font = cv2.FONT_HERSHEY_SIMPLEX
    for det in detections[: int(max_results)]:
        box = [int(i) for i in det["box"].tolist()]
        score = round(det["score"], 2)
        img = cv2.rectangle(img, box[:2], box[2:], (255, 0, 0), 1)
        # Caption below the box, or above it when too close to the bottom edge
        y = box[3] - 10 if box[3] + 25 > img.shape[0] else box[3] + 25
        img = cv2.putText(
            img,
            f"({score}):{text_queries[det['label']]}",
            (box[0], y),
            font,
            0.5,
            (255, 0, 0),
            1,
            cv2.LINE_AA,
        )

    return (img, f"Top {len(score_dist)} score confidences: {score_dist}")


description = """
This app is a tweaked variation of Alara Dirik's OWL-ViT demo.
Use cases of this model:
1) Given an image with an object, detect it (e.g. a Where is Waldo? app)
2) Given an image with multiple instances of an object, detect them all (e.g. assisting a labeling tool with bounding box annotation)
3) Find an object within an image using either text or an image as input (e.g. an Image Search app; this would require pruning candidates with a score threshold and using the score distribution in the output, as sketched after the app code). Searching with an input image can be useful for finding things that are hard to describe in text, such as a machine part.

Links to apps/notebooks of other SOTA models for open-vocabulary or zero-shot object detection:
a) RegionCLIP
b) Colab notebook for Object-Centric-OVD

Note: Inference time depends on the input image size. Typically, images with dimensions under 500px get a response in under 5 seconds on CPU (a downscaling sketch appears after the app code).
While most of the examples showcase the model's capabilities, some illustrate its limitations, such as finding a globe, bird cage, or teapot. The model also appears to have text region detection and limited text recognition capabilities.
Images below are from the Wikipedia, COCO, and PASCAL VOC 2012 datasets.
""" demo = gr.Interface( query_image, inputs=[gr.Image(), "text",gr.Slider(1, 10, value=1)], outputs=["image","text"], server_port=80, server_name="0.0.0.0", title="Where is Waldo? (implemented with OWL-ViT)", description=description, examples=[ ["assets/Hidden_object_game_scaled.png", "bicycle", 1], ["assets/Hidden_object_game_scaled.png", "laptop", 1], ["assets/Hidden_object_game_scaled.png", "abacus", 1], ["assets/Hidden_object_game_scaled.png", "frog", 1], ["assets/Hidden_object_game_scaled.png", "bird cage", 2], ["assets/Hidden_object_game_scaled.png", "globe", 2], ["assets/Hidden_object_game_scaled.png", "teapot", 3], ["assets/bus_ovd.jpg", "license plate", 1], ["assets/bus_ovd.jpg", "sign saying ARRIVA", 1], ["assets/bus_ovd.jpg", "sign saying ARRIVAL", 1], ["assets/bus_ovd.jpg", "crossing push button", 1], ["assets/bus_ovd.jpg", "building on moutain", 2], ["assets/bus_ovd.jpg", "road marking", 3], ["assets/bus_ovd.jpg", "mirror", 1], ["assets/bus_ovd.jpg", "traffic camera", 1], ["assets/bus_ovd.jpg", "red bus,blue bus", 2], ["assets/calf.png", "snout,tail", 1], ["assets/calf.png", "hoof", 4], ["assets/calf.png", "ear", 2], ["assets/calf.png", "tag", 1], ["assets/calf.png", "hay", 1], ["assets/calf.png", "barbed wire", 1], ["assets/calf.png", "grass", 1], ["assets/calf.png", "can", 2], ["assets/road_signs.png", "STOP", 1], ["assets/road_signs.png", "STOP sign", 1], ["assets/road_signs.png", "arrow", 1], ["assets/road_signs.png", "ROAD", 1], ["assets/road_signs.png", "triangle", 1], ], ) demo.launch()