Sentence Similarity
sentence-transformers
English
bert
ctranslate2
int8
float16
feature-extraction
Inference Endpoints
text-embeddings-inference
5 papers
michaelfeil committed on
Commit
a06b8c9
1 Parent(s): 66c1a34

Upload sentence-transformers/all-MiniLM-L6-v2 ctranslate fp16 weights

Files changed (1)
  1. README.md +24 -4
README.md CHANGED
@@ -38,17 +38,37 @@ Speedup inference while reducing memory by 2x-4x using int8 inference in C++ on
 
 quantized version of [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
 ```bash
-pip install hf-hub-ctranslate2>=3.0.0 ctranslate2>=3.16.0
+pip install hf-hub-ctranslate2>=2.11.0 ctranslate2>=3.16.0
 ```
 
 ```python
 # from transformers import AutoTokenizer
 model_name = "michaelfeil/ct2fast-all-MiniLM-L6-v2"
+model_name_orig = "sentence-transformers/all-MiniLM-L6-v2"
+
+from hf_hub_ctranslate2 import EncoderCT2fromHfHub
+model = EncoderCT2fromHfHub(
+    # load in int8 on CUDA
+    model_name_or_path=model_name,
+    device="cuda",
+    compute_type="int8_float16",
+)
+outputs = model.generate(
+    text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
+    max_length=64,
+)
+# perform downstream tasks on outputs
+outputs["pooler_output"]
+outputs["last_hidden_state"]
+outputs["attention_mask"]
+
+# alternatively, use the SentenceTransformer mix-in
+# for end-to-end sentence-embedding generation
+# (not pulling from this repo)
 
 from hf_hub_ctranslate2 import CT2SentenceTransformer
 model = CT2SentenceTransformer(
-    model_name, compute_type="int8_float16", device="cuda",
-    repo_contains_ct2=True
+    model_name_orig, compute_type="int8_float16", device="cuda",
 )
 embeddings = model.encode(
     ["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
@@ -62,7 +82,7 @@ scores = (embeddings @ embeddings.T) * 100
 ```
 
 Checkpoint compatible with [ctranslate2>=3.16.0](https://github.com/OpenNMT/CTranslate2)
-and [hf-hub-ctranslate2>=3.0.0](https://github.com/michaelfeil/hf-hub-ctranslate2)
+and [hf-hub-ctranslate2>=2.11.0](https://github.com/michaelfeil/hf-hub-ctranslate2)
 - `compute_type=int8_float16` for `device="cuda"`
 - `compute_type=int8` for `device="cpu"`
 