Details about this model
#1 by sushmapiraka - opened
What does this model do?
Is it just image-text cross-modal search, or does it include audio and video too?
Hi @sushmapiraka. Thanks for your interest.
This checkpoint only supports text-image search. However, our methodology [1] can learn text-audio or text-video search if sufficient data and computational resources are available.
[1] Retrieval-based Disentangled Representation Learning with Natural Language Supervision
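For readers unfamiliar with the term, here is a rough, generic sketch of what text-image search means in practice: a text query and a set of images are each mapped to vectors, and the images are ranked by similarity to the query. This is not this checkpoint's API or the method in [1]. The vectors, file names, and the `rank_images` helper below are made-up placeholders, and cosine similarity is assumed only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(text_vec, image_vecs):
    """Return image names sorted from most to least similar to the text vector."""
    scores = {name: cosine(text_vec, vec) for name, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy embeddings; a real model would produce these from raw text and pixels.
text_query = [0.9, 0.1, 0.0]
images = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
    "cat.jpg": [0.5, 0.5, 0.0],
}

ranking = rank_images(text_query, images)
print(ranking)
```

Extending this to audio or video would only change which encoder produces the vectors on the non-text side; the ranking step stays the same.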