Marqo/marqo-fashionCLIP

@@ -4,6 +4,7 @@ tags:
 - e-commerce
 - fashion
 - multimodal retrieval
 library_name: open_clip
 pipeline_tag: zero-shot-image-classification
 license: apache-2.0
@@ -27,6 +28,9 @@ The model was fine-tuned from ViT-B-16 (laion2b_s34b_b88k).
 **Blog**: [Marqo Blog](https://www.marqo.ai/blog/search-model-for-fashion)
 ## Usage
 The model can be seamlessly used with [OpenCLIP](https://github.com/mlfoundations/open_clip) by
 ```python
@@ -49,7 +53,69 @@ with torch.no_grad(), torch.cuda.amp.autocast():
     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
 print("Label probs:", text_probs)
 ```
 ## Benchmark Results

 - e-commerce
 - fashion
 - multimodal retrieval
+- transformers.js
 library_name: open_clip
 pipeline_tag: zero-shot-image-classification
 license: apache-2.0
 **Blog**: [Marqo Blog](https://www.marqo.ai/blog/search-model-for-fashion)
 ## Usage
+### OpenCLIP
 The model can be seamlessly used with [OpenCLIP](https://github.com/mlfoundations/open_clip) by
 ```python
     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
 print("Label probs:", text_probs)
+```
+### Transformers.js
+You can also run the model in JavaScript with the [Transformers.js](https://huggingface.co/docs/transformers.js) library.
+First, install it from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
+```bash
+npm i @huggingface/transformers
+```
+Then, compute text and image embeddings and turn them into zero-shot label probabilities as follows:
+```js
+import { CLIPTextModelWithProjection, CLIPVisionModelWithProjection, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
+const model_id = 'Marqo/marqo-fashionCLIP';
+// Load tokenizer and text model
+const tokenizer = await AutoTokenizer.from_pretrained(model_id);
+const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);
+// Load processor and vision model
+const processor = await AutoProcessor.from_pretrained(model_id);
+const vision_model = await CLIPVisionModelWithProjection.from_pretrained(model_id);
+// Run tokenization
+const texts = ['a hat', 'a t-shirt', 'shoes'];
+const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
+// Compute text embeddings
+const { text_embeds } = await text_model(text_inputs);
+// Tensor {
+//   dims: [ 3, 512 ],
+//   type: 'float32',
+//   data: Float32Array(1536) [ ... ],
+//   size: 1536
+// }
+// Read image and run processor
+const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
+const image_inputs = await processor(image);
+// Compute vision embeddings
+const { image_embeds } = await vision_model(image_inputs);
+// Tensor {
+//   dims: [ 1, 512 ],
+//   type: 'float32',
+//   data: Float32Array(512) [ ... ],
+//   size: 512
+// }
+// Compute similarity scores
+const normalized_text_embeds = text_embeds.normalize().tolist();
+const normalized_image_embeds = image_embeds.normalize().tolist()[0];
+const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
+    100.0 * dot(normalized_image_embeds, text_embed)
+));
+console.log(text_probs);
+// [0.9998498302475922, 0.000119267522939106, 0.000030902229468640687]
 ```
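The final scoring step in the Transformers.js example (normalize, dot product, scale by 100, softmax) can be sketched in plain JavaScript to show what it computes. This is only an illustration under stated assumptions: the function names (`l2Normalize`, `dotProduct`, `softmaxStable`, `labelProbabilities`) are hypothetical and not part of Transformers.js, the 3-dimensional vectors are made-up toy data rather than real model embeddings, and the scale of 100 simply mirrors the `100.0 *` factor used in both snippets above.

```javascript
// Plain-JavaScript sketch of the scoring step; no library required.
// All function names are illustrative, not part of Transformers.js.

// Scale a vector to unit L2 length (assumes a non-zero vector).
function l2Normalize(vec) {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return vec.map((x) => x / norm);
}

// Dot product of two equal-length vectors.
function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Numerically stable softmax: subtract the largest logit before exponentiating.
function softmaxStable(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const total = exps.reduce((sum, x) => sum + x, 0);
  return exps.map((x) => x / total);
}

// Score one image embedding against several text embeddings.
// After normalization the dot product equals the cosine similarity.
function labelProbabilities(imageEmbed, textEmbeds, scale = 100.0) {
  const image = l2Normalize(imageEmbed);
  const logits = textEmbeds.map(
    (textEmbed) => scale * dotProduct(image, l2Normalize(textEmbed))
  );
  return softmaxStable(logits);
}

// Toy 3-dimensional embeddings, made up for illustration only.
const toyImage = [1, 0, 0];
const toyTexts = [
  [2, 0, 0],  // same direction as the image
  [0, 3, 0],  // orthogonal to the image
  [-1, 0, 0], // opposite direction
];

const probs = labelProbabilities(toyImage, toyTexts);
console.log(probs);
```

Because the embeddings are normalized first, each logit is a cosine similarity in [-1, 1] multiplied by the scale, so even small differences in similarity produce sharply peaked probabilities. That is consistent with the near one-hot output shown in the example above.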
 ## Benchmark Results