Xenova HF staff committed on
Commit
dd14a75
1 Parent(s): 15cdafa

Add Transformers.js example code

Files changed (1)
  1. README.md +66 -0
README.md CHANGED
@@ -4,6 +4,7 @@ tags:
4
  - e-commerce
5
  - fashion
6
  - multimodal retrieval
 
7
  library_name: open_clip
8
  pipeline_tag: zero-shot-image-classification
9
  license: apache-2.0
@@ -27,6 +28,9 @@ The model was fine-tuned from ViT-B-16 (laion2b_s34b_b88k).
27
  **Blog**: [Marqo Blog](https://www.marqo.ai/blog/search-model-for-fashion)
28
 
29
  ## Usage
30
  The model can be seamlessly used with [OpenCLIP](https://github.com/mlfoundations/open_clip) by
31
 
32
  ```python
@@ -49,7 +53,69 @@ with torch.no_grad(), torch.cuda.amp.autocast():
49
  text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
50
 
51
  print("Label probs:", text_probs)
52
 
53
  ```
54
 
55
  ## Benchmark Results
 
4
  - e-commerce
5
  - fashion
6
  - multimodal retrieval
7
+ - transformers.js
8
  library_name: open_clip
9
  pipeline_tag: zero-shot-image-classification
10
  license: apache-2.0
 
28
  **Blog**: [Marqo Blog](https://www.marqo.ai/blog/search-model-for-fashion)
29
 
30
  ## Usage
31
+
32
+ ### OpenCLIP
33
+
34
  The model can be seamlessly used with [OpenCLIP](https://github.com/mlfoundations/open_clip) by
35
 
36
  ```python
 
53
  text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
54
 
55
  print("Label probs:", text_probs)
56
+ ```
57
+
58
+ ### Transformers.js
59
+
60
+ You can also run the model in JavaScript with the [Transformers.js](https://huggingface.co/docs/transformers.js) library.
61
+
62
+ First, install it from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
63
+
64
+ ```bash
65
+ npm i @huggingface/transformers
66
+ ```
67
 
68
+ Then, compute embeddings as follows:
69
+
70
+ ```js
71
+ import { CLIPTextModelWithProjection, CLIPVisionModelWithProjection, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
72
+
73
+ const model_id = 'Marqo/marqo-fashionCLIP';
74
+
75
+ // Load tokenizer and text model
76
+ const tokenizer = await AutoTokenizer.from_pretrained(model_id);
77
+ const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);
78
+
79
+ // Load processor and vision model
80
+ const processor = await AutoProcessor.from_pretrained(model_id);
81
+ const vision_model = await CLIPVisionModelWithProjection.from_pretrained(model_id);
82
+
83
+ // Run tokenization
84
+ const texts = ['a hat', 'a t-shirt', 'shoes'];
85
+ const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
86
+
87
+ // Compute text embeddings
88
+ const { text_embeds } = await text_model(text_inputs);
89
+ // Tensor {
90
+ // dims: [ 3, 512 ],
91
+ // type: 'float32',
92
+ // data: Float32Array(1536) [ ... ],
93
+ // size: 1536
94
+ // }
95
+
96
+ // Read image and run processor
97
+ const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
98
+ const image_inputs = await processor(image);
99
+
100
+ // Compute vision embeddings
101
+ const { image_embeds } = await vision_model(image_inputs);
102
+ // Tensor {
103
+ // dims: [ 1, 512 ],
104
+ // type: 'float32',
105
+ // data: Float32Array(512) [ ... ],
106
+ // size: 512
107
+ // }
108
+
109
+
110
+ // Compute similarity scores
111
+ const normalized_text_embeds = text_embeds.normalize().tolist();
112
+ const normalized_image_embeds = image_embeds.normalize().tolist()[0];
113
+
114
+ const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
115
+ 100.0 * dot(normalized_image_embeds, text_embed)
116
+ ));
117
+ console.log(text_probs);
118
+ // [0.9998498302475922, 0.000119267522939106, 0.000030902229468640687]
119
  ```
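
For readers who want to see what the similarity step in the Transformers.js snippet computes, here is a minimal dependency-free sketch. It is not part of the commit and does not use the library's own `softmax` or `dot` helpers. The function names (`l2Normalize`, `dotProduct`, `softmaxScores`, `zeroShotProbs`) and the toy 3-dimensional embeddings are hypothetical, chosen only for illustration. Real embeddings from this model are 512-dimensional.

```js
// Plain-JavaScript sketch of the scoring step; all names here are hypothetical.

// L2-normalize a vector so that a dot product equals cosine similarity.
function l2Normalize(vec) {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return vec.map((x) => x / norm);
}

// Dot product of two equal-length vectors.
function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmaxScores(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const total = exps.reduce((sum, x) => sum + x, 0);
  return exps.map((x) => x / total);
}

// Score one image embedding against several text embeddings.
// `scale` mirrors the 100.0 factor used in the README snippet.
function zeroShotProbs(imageEmbed, textEmbeds, scale = 100.0) {
  const image = l2Normalize(imageEmbed);
  const logits = textEmbeds.map((t) => scale * dotProduct(image, l2Normalize(t)));
  return softmaxScores(logits);
}

// Toy embeddings, made up for illustration only.
const toyImage = [1, 0, 0];
const toyTexts = [[2, 0, 0], [0, 3, 0], [1, 1, 0]];
const probs = zeroShotProbs(toyImage, toyTexts);
console.log(probs);
```

Because the cosine similarities are multiplied by 100 before the softmax, even modest differences in similarity produce a very peaked distribution, which is consistent with the near-one-hot output shown in the README example.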
120
 
121
  ## Benchmark Results