arxiv:2207.12994

V^2L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval

Published on Jul 26, 2022
Abstract

Product retrieval is of great importance in the e-commerce domain. This paper introduces our first-place solution to the eBay eProduct Visual Search Challenge (FGVC9), which features an ensemble of about 20 vision and vision-language models. While model ensembling is common, we show that combining vision models with vision-language models brings particular benefits from their complementarity and is a key factor in our superiority. Specifically, for the vision models, we use a two-stage training pipeline that first learns from the coarse labels provided in the training set and then performs fine-grained self-supervised training, yielding a coarse-to-fine metric learning scheme. For the vision-language models, we use the textual descriptions of the training images as supervision signals for fine-tuning the image encoder (feature extractor). With these designs, our solution achieves 0.7623 MAR@10, ranking first among all competitors. The code is available at: https://github.com/WangWenhao0716/V2L
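As a rough illustration of the embedding-ensemble retrieval described in the abstract, the sketch below fuses L2-normalized features from several image encoders (vision or vision-language) and ranks a gallery by cosine similarity. The function names, the concatenation-based fusion, and the random stand-in features are assumptions for illustration only; the authors' actual pipeline, including the coarse-to-fine training and the text-supervised fine-tuning, is in the linked repository.

```python
# Minimal sketch (not the authors' released code): fuse embeddings from
# several image encoders and retrieve the top-k gallery items per query.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def ensemble_embed(image_feats):
    """image_feats: list of (N, d_i) arrays, one per model.
    Each model's features are L2-normalized before concatenation so that
    every model contributes comparably to the fused descriptor."""
    return l2_normalize(np.concatenate([l2_normalize(f) for f in image_feats], axis=1))

def retrieve_top_k(query_feats, gallery_feats, k=10):
    """Rank gallery items for each query by cosine similarity of fused embeddings."""
    sims = query_feats @ gallery_feats.T           # (num_queries, num_gallery)
    return np.argsort(-sims, axis=1)[:, :k]        # indices of top-k gallery items

# Toy usage with random features standing in for real encoder outputs.
rng = np.random.default_rng(0)
query = ensemble_embed([rng.normal(size=(5, 512)), rng.normal(size=(5, 768))])
gallery = ensemble_embed([rng.normal(size=(100, 512)), rng.normal(size=(100, 768))])
print(retrieve_top_k(query, gallery, k=10).shape)  # (5, 10)
```

In this toy setup, concatenating per-model unit-norm features is one simple way to exploit the complementarity between vision and vision-language encoders; averaging similarity scores per model would be an equally plausible alternative.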
