---
tags:
- clip
- e-commerce
- fashion
- multimodal retrieval
- transformers.js
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
language:
- en
metrics:
- precision
- recall
- MRR
---
[![GitHub](https://img.shields.io/badge/GitHub-black?logo=github)](https://github.com/marqo-ai/marqo-FashionCLIP)
# Marqo-FashionCLIP Model Card
Marqo-FashionCLIP leverages Generalised Contrastive Learning ([GCL](https://www.marqo.ai/blog/generalized-contrastive-learning-for-multi-modal-retrieval-and-ranking)), which allows the model to be trained not only on text descriptions but also on categories, styles, colors, materials, keywords and fine details, yielding highly relevant search results for fashion products.
The model was fine-tuned from ViT-B-16 (laion2b_s34b_b88k).
**Github Page**: [Marqo-FashionCLIP](https://github.com/marqo-ai/marqo-FashionCLIP)
**Blog**: [Marqo Blog](https://www.marqo.ai/blog/search-model-for-fashion)
## Usage
### OpenCLIP
The model can be used seamlessly with [OpenCLIP](https://github.com/mlfoundations/open_clip):
```python
import torch
from PIL import Image
import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')

image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
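The same embeddings also support text-to-image retrieval over a product catalog. The snippet below is a minimal sketch of that workflow; the catalog file names and the query string are hypothetical placeholders, not files shipped with this repository:
```python
import torch
from PIL import Image
import open_clip

model, _, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')
model.eval()

# Hypothetical product images; replace with paths to your own catalog.
catalog_paths = ["red-dress.jpg", "denim-jacket.jpg", "leather-boots.jpg"]
images = torch.stack([preprocess_val(Image.open(p)) for p in catalog_paths])

# A single text query describing the product a shopper is looking for.
query = tokenizer(["a floral summer dress"])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    # L2-normalize so the dot product equals cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Rank catalog images by cosine similarity to the query.
scores = (image_features @ text_features.T).squeeze(-1)
for idx in scores.argsort(descending=True).tolist():
    print(f"{catalog_paths[idx]}: {scores[idx].item():.3f}")
```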
### Transformers.js
You can also run the model in JavaScript with the [Transformers.js](https://huggingface.co/docs/transformers.js) library.
First, install it from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
```bash
npm i @huggingface/transformers
```
Then, compute embeddings as follows:
```js
import { CLIPTextModelWithProjection, CLIPVisionModelWithProjection, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
const model_id = 'Marqo/marqo-fashionCLIP';
// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);
// Load processor and vision model
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await CLIPVisionModelWithProjection.from_pretrained(model_id);
// Run tokenization
const texts = ['a hat', 'a t-shirt', 'shoes'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);
// Tensor {
//   dims: [ 3, 512 ],
//   type: 'float32',
//   data: Float32Array(1536) [ ... ],
//   size: 1536
// }
// Read image and run processor
const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
const image_inputs = await processor(image);
// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);
// Tensor {
//   dims: [ 1, 512 ],
//   type: 'float32',
//   data: Float32Array(512) [ ... ],
//   size: 512
// }
// Compute similarity scores
const normalized_text_embeds = text_embeds.normalize().tolist();
const normalized_image_embeds = image_embeds.normalize().tolist()[0];
const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
  100.0 * dot(normalized_image_embeds, text_embed)
));
console.log(text_probs);
// [0.9998498302475922, 0.000119267522939106, 0.000030902229468640687]
```
## Benchmark Results
Average evaluation results on 6 public multimodal fashion datasets ([Atlas](https://huggingface.co/datasets/Marqo/atlas), [DeepFashion (In-shop)](https://huggingface.co/datasets/Marqo/deepfashion-inshop), [DeepFashion (Multimodal)](https://huggingface.co/datasets/Marqo/deepfashion-multimodal), [Fashion200k](https://huggingface.co/datasets/Marqo/fashion200k), [KAGL](https://huggingface.co/datasets/Marqo/KAGL), and [Polyvore](https://huggingface.co/datasets/Marqo/polyvore)) are reported below:

**Text-To-Image (Averaged across 6 datasets)**

| Model | AvgRecall | Recall@1 | Recall@10 | MRR |
|----------------------------|-------------|------------|-------------|-----------|
| Marqo-FashionCLIP | **0.192** | **0.094** | **0.290** | **0.200** |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |

**Category-To-Product (Averaged across 5 datasets)**

| Model | AvgP | P@1 | P@10 | MRR |
|----------------------------|-----------|-----------|-----------|-----------|
| Marqo-FashionCLIP | **0.705** | **0.734** | 0.676 | **0.776** |
| FashionCLIP2.0 | 0.684 | 0.681 | **0.686** | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |

**Sub-Category-To-Product (Averaged across 4 datasets)**

| Model | AvgP | P@1 | P@10 | MRR |
|----------------------------|-----------|-----------|-----------|-----------|
| Marqo-FashionCLIP | **0.707** | **0.747** | **0.667** | **0.772** |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638     | 0.651     | 0.624     | 0.712     |
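
For reference, the retrieval metrics above (Recall@k, Precision@k and MRR) are standard per-query measures averaged over queries and datasets. The snippet below is a minimal sketch of how they are typically computed, assuming binary relevance judgements; the item IDs are hypothetical:
```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked_ids[:k] if item in relevant_ids) / k

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant item (0 if none is retrieved)."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking for one query: item IDs ordered by similarity score.
ranked = ["p3", "p7", "p1", "p9"]
relevant = {"p1", "p7"}
print(recall_at_k(ranked, relevant, 1))     # 0.0 (top-1 result is not relevant)
print(precision_at_k(ranked, relevant, 2))  # 0.5 (one of the top-2 is relevant)
print(mrr(ranked, relevant))                # 0.5 (first relevant item at rank 2)
```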