---
license: apache-2.0
tags:
- ctranslate2
---
# Fast-Inference with CTranslate2
Speed up inference by 2x-8x using int8 inference in C++.

Quantized version of [google/flan-ul2](https://huggingface.co/google/flan-ul2).
```bash
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```
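
For reference, checkpoints like this one are typically produced with CTranslate2's `ct2-transformers-converter` CLI. The exact command used for this repo is an assumption; treat the following as a sketch:
```bash
# hedged sketch: convert the original checkpoint with int8_float16 weight quantization
ct2-transformers-converter --model google/flan-ul2 \
    --output_dir ct2fast-flan-ul2 --quantization int8_float16
```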

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-ul2"
# load the encoder-decoder checkpoint in int8 on CUDA
model = TranslatorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
```
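
The same wrapper also runs on CPU-only machines with the `int8` compute type listed above. A minimal sketch, assuming the same API as the CUDA example:
```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-ul2"
# assumption: int8 on CPU, per the compute_type list above
model = TranslatorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cpu",
    compute_type="int8",
)
outputs = model.generate(
    text=["Translate to german: How are you doing?"],
    max_decoding_length=32,
)
print(outputs)
```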

# Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.