Xenova HF staff committed on
Commit 4a45991 · verified · 1 Parent(s): cc93df4

Update README.md

Files changed (1)
  1. README.md +53 -0
README.md CHANGED
@@ -5,6 +5,7 @@ tags:
  - fashion
  - multimodal retrieval
  - siglip
+ - transformers.js
  library_name: open_clip
  pipeline_tag: zero-shot-image-classification
  license: apache-2.0
@@ -25,6 +26,9 @@ The model was fine-tuned from ViT-B-16-SigLIP (webli).


  ## Usage
+
+ ### OpenCLIP
+
  The model can be seamlessly used with [OpenCLIP](https://github.com/mlfoundations/open_clip) by

  ```python
@@ -49,6 +53,55 @@ with torch.no_grad(), torch.cuda.amp.autocast():
  print("Label probs:", text_probs)
  ```

+ ### Transformers.js
+
+ You can also run the model in JavaScript with the [Transformers.js](https://huggingface.co/docs/transformers.js) library.
+
+ First, install it from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
+
+ ```bash
+ npm i @huggingface/transformers
+ ```
+
+ Then, compute embeddings as follows:
+ ```js
+ import { SiglipTextModel, SiglipVisionModel, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
+
+ const model_id = 'Marqo/marqo-fashionSigLIP';
+
+ // Load tokenizer and text model
+ const tokenizer = await AutoTokenizer.from_pretrained(model_id);
+ const text_model = await SiglipTextModel.from_pretrained(model_id);
+
+ // Load processor and vision model
+ const processor = await AutoProcessor.from_pretrained(model_id);
+ const vision_model = await SiglipVisionModel.from_pretrained(model_id);
+
+ // Run tokenization
+ const texts = ['a hat', 'a t-shirt', 'shoes'];
+ const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
+
+ // Compute text embeddings
+ const { text_embeds } = await text_model(text_inputs);
+
+ // Read image and run processor
+ const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
+ const image_inputs = await processor(image);
+
+ // Compute vision embeddings
+ const { image_embeds } = await vision_model(image_inputs);
+
+ // Compute similarity scores
+ const normalized_text_embeds = text_embeds.normalize().tolist();
+ const normalized_image_embeds = image_embeds.normalize().tolist()[0];
+
+ const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
+     100.0 * dot(normalized_image_embeds, text_embed)
+ ));
+ console.log(text_probs);
+ // [0.9860219105287394, 0.00777916527489097, 0.006198924196369721]
+ ```
+
  ## Benchmark Results
  Average evaluation results on 6 public multimodal fashion datasets ([Atlas](https://huggingface.co/datasets/Marqo/atlas), [DeepFashion (In-shop)](https://huggingface.co/datasets/Marqo/deepfashion-inshop), [DeepFashion (Multimodal)](https://huggingface.co/datasets/Marqo/deepfashion-multimodal), [Fashion200k](https://huggingface.co/datasets/Marqo/fashion200k), [KAGL](https://huggingface.co/datasets/Marqo/KAGL), and [Polyvore](https://huggingface.co/datasets/Marqo/polyvore)) are reported below:
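
Note on the scoring step in the added Transformers.js snippet: it L2-normalizes both embeddings, scales each image-text dot product by 100, and applies a softmax over the labels. The sketch below reproduces only that arithmetic in plain JavaScript so it can be checked without downloading the model. The helpers `normalize`, `dot`, `softmax`, and `labelProbs` are hypothetical stand-ins written for this illustration, not the library's implementations, and the toy vectors are made up.

```javascript
// Plain-JS sketch of the similarity scoring used in the snippet above.
// All helpers here are illustrative stand-ins, not Transformers.js APIs.

// Scale a vector to unit L2 norm.
function normalize(v) {
  const norm = Math.hypot(...v);
  return v.map((x) => x / norm);
}

// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((x) => x / total);
}

// Cosine similarity of one image embedding against each text embedding,
// scaled (100.0 mirrors the constant in the diff) and turned into probabilities.
function labelProbs(imageEmbed, textEmbeds, scale = 100.0) {
  const img = normalize(imageEmbed);
  const logits = textEmbeds.map((t) => scale * dot(img, normalize(t)));
  return softmax(logits);
}

// Toy example: the first "label" points the same way as the "image".
const probs = labelProbs([1, 0], [[2, 0], [0, 3], [-1, 0]]);
console.log(probs);
```

Because the scale factor is large, small differences in cosine similarity produce sharply peaked probabilities, which is consistent with the near-one-hot output shown in the README example.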