Demonstration of Cross-modal Retrieval (FLIP-based model)
FLIP (Facial Language Image Pretraining)
This repository is the official implementation of FaceCaption-15M.
Updates:
[24/07/20] A usage demo of FLIP has been released: OpenFace-CQUPT/FLIP-demo
[24/07/17] The pretrained FLIP model has been released: OpenFace-CQUPT/FLIP
Overview of FLIP architecture.
Fig. 1: (a) The same color represents shared parameters, and "12x" denotes 12 stacked transformer layers. In (b), (c), and (d), the FLIP-based model is applied to text-image retrieval, facial attributes prediction, and sketch-less facial image retrieval, respectively.
Training
Coming soon... (The training code is only meaningful once the dataset has been published.)
python pretrain.py > log.log
Pre-trained Models
We provide pretrained model weights:
FLIP Base: click here
FLIP Large: coming soon...
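As a quick orientation before the official usage docs, the sketch below shows the CLIP-style inference flow that FLIP follows: encode an image and a caption into a shared embedding space, L2-normalize, and compare by cosine similarity. Every class and tensor here is a self-contained stand-in, not the repository's actual API; see OpenFace-CQUPT/FLIP-demo for the real loading code.

```python
# Minimal, self-contained sketch of the CLIP-style inference flow FLIP
# follows. All names and dimensions are stand-ins (NOT the repo's API)
# so the example runs end to end.
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Stand-in for FLIP's image tower (ViT) or text tower (BERT)."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        return self.proj(x)

image_encoder = DummyEncoder(in_dim=768)   # ViT feature width, an assumption
text_encoder = DummyEncoder(in_dim=768)    # BERT feature width, an assumption

image_feat = torch.randn(1, 768)           # pretend [CLS] feature of a face photo
text_feat = torch.randn(1, 768)            # pretend [CLS] feature of a caption

with torch.no_grad():
    img_emb = nn.functional.normalize(image_encoder(image_feat), dim=-1)
    txt_emb = nn.functional.normalize(text_encoder(text_feat), dim=-1)

similarity = (img_emb @ txt_emb.T).item()  # cosine similarity in [-1, 1]
print(f"image-text similarity: {similarity:.3f}")
```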
Datasets
Download the FaceCaption-15M dataset from here.
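If you want to iterate over the image-caption pairs with PyTorch, a minimal loader might look like the sketch below. The file layout it assumes (an images directory plus a JSON list of {"image", "caption"} records) is a guess for illustration only; adapt the paths and parsing to the format documented with the FaceCaption-15M release.

```python
# Hypothetical loader for FaceCaption-15M image-caption pairs. The
# annotation format assumed here is an illustration, not the official one.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class FaceCaptionDataset(Dataset):
    def __init__(self, root: str, annotation_file: str, transform=None):
        self.root = Path(root)
        with open(annotation_file) as f:
            # assumed format: [{"image": "xxx.jpg", "caption": "..."}, ...]
            self.records = json.load(f)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(self.root / rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, rec["caption"]
```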
Results
Task1: Text-Image Retrieval
Table 1: Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned. † represents the model pretrained on the LAION-Face dataset; * represents the model pretrained on the FaceCaption dataset constructed without using LLM text generation.
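For context on how retrieval numbers like those in Table 1 are typically computed: embed every caption and image with the frozen backbones, rank the gallery by cosine similarity for each query, and report Recall@K, the fraction of queries whose true match appears in the top K. The sketch below uses random tensors in place of real FLIP embeddings.

```python
# Sketch of text->image Recall@K under a frozen-backbone protocol.
# Random tensors stand in for real FLIP caption/image embeddings.
import torch

def recall_at_k(txt_emb: torch.Tensor, img_emb: torch.Tensor, k: int) -> float:
    txt = torch.nn.functional.normalize(txt_emb, dim=-1)
    img = torch.nn.functional.normalize(img_emb, dim=-1)
    sims = txt @ img.T                                 # [N_text, N_image]
    topk = sims.topk(k, dim=-1).indices                # k best images per caption
    targets = torch.arange(txt.size(0)).unsqueeze(1)   # caption i pairs with image i
    hits = (topk == targets).any(dim=-1)               # true match in top k?
    return hits.float().mean().item()

txt_emb = torch.randn(1000, 512)  # stand-in caption embeddings
img_emb = torch.randn(1000, 512)  # stand-in image embeddings
for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(txt_emb, img_emb, k):.3f}")
```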
Task2: Facial Attributes Prediction
Table 2: Comparison with other classical models. † represents the model pretrained on the original LAION-Face dataset.
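The linear-probe protocol behind results like Table 2 amounts to training a single linear layer on top of frozen image embeddings with a binary cross-entropy loss, one logit per attribute. The sketch below illustrates one training step; the embedding size (512) and attribute count (40, the CelebA convention) are assumptions, and random tensors stand in for frozen FLIP features.

```python
# One linear-probe training step for multi-label facial attribute
# prediction. Dimensions and data are stand-ins for illustration.
import torch
import torch.nn as nn

embed_dim, num_attrs = 512, 40                 # assumed sizes
head = nn.Linear(embed_dim, num_attrs)         # the only trainable module
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

features = torch.randn(64, embed_dim)          # frozen-backbone features (no grads)
labels = torch.randint(0, 2, (64, num_attrs)).float()  # binary attribute labels

optimizer.zero_grad()
logits = head(features)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

preds = (logits.sigmoid() > 0.5).float()       # per-attribute decisions
accuracy = (preds == labels).float().mean().item()
print(f"loss={loss.item():.3f}  mean attribute accuracy={accuracy:.3f}")
```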
Task3: Sketch-Less Facial Image Retrieval (SLFIR)
Table 3: Comparative results against different baseline methods. † represents the model pretrained on the LAION-Face dataset.
Fig. 2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo within the top-5 list from a partial sketch, but our FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at each stage.
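The ranks reported at the bottom of Fig. 2 can be reproduced with a simple measurement loop: after each batch of strokes, embed the partial sketch, score it against the photo gallery, and record where the paired photo lands. The sketch below illustrates this with random tensors standing in for FLIP sketch and photo embeddings.

```python
# Sketch of the Fig. 2 measurement: track the rank of the true-match
# photo as the sketch accumulates strokes. All embeddings are stand-ins.
import torch

def rank_of_true_match(sketch_emb: torch.Tensor,
                       gallery_emb: torch.Tensor,
                       true_idx: int) -> int:
    sketch = torch.nn.functional.normalize(sketch_emb, dim=-1)
    gallery = torch.nn.functional.normalize(gallery_emb, dim=-1)
    sims = (sketch @ gallery.T).squeeze(0)            # [N_gallery]
    order = sims.argsort(descending=True)             # best match first
    return (order == true_idx).nonzero().item() + 1   # 1-based rank

gallery_emb = torch.randn(500, 512)   # stand-in photo gallery embeddings
true_idx = 42                         # index of the paired photo
for stage in range(1, 6):             # partial sketch after each stroke batch
    stage_emb = torch.randn(1, 512)   # stand-in partial-sketch embedding
    r = rank_of_true_match(stage_emb, gallery_emb, true_idx)
    print(f"stage {stage}: true match ranked {r}")
```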
Contacts
Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com
Citation
@misc{dai202415mmultimodalfacialimagetext,
  title={15M Multimodal Facial Image-Text Dataset},
  author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
  year={2024},
  eprint={2407.08515},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.08515},
}