SignCLIP: Connecting Text and Sign Language by Contrastive Learning
Abstract
We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language, for which data is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it on various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively on out-of-domain downstream tasks such as isolated sign language recognition with essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.