---
license: apache-2.0
tags:
- ctranslate2
---
# Fast-Inference with Ctranslate2
Speed up inference by 2x-8x using int8 inference in C++.

This is a quantized version of [google/flan-ul2](https://huggingface.co/google/flan-ul2).

```bash
pip install "hf_hub_ctranslate2>=2.0.6" "ctranslate2>=3.13.0"
```

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-ul2"
# load in int8 on CUDA
model = TranslatorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16"
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5
)
print(outputs)
```

# Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to the original Hugging Face repo.
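
For CPU-only machines, the compatibility notes above point to `compute_type="int8"` with `device="cpu"`. The snippet below is a minimal sketch of that variant; it assumes the same `TranslatorCT2fromHfHub` loader and `generate` call shown in the CUDA example.

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

# Minimal sketch: same API as above, but CPU-only inference with int8 weights.
model = TranslatorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-flan-ul2",
    device="cpu",
    compute_type="int8",
)
outputs = model.generate(
    text=["Translate to german: How are you doing?"],
    max_decoding_length=32,
    beam_size=1,  # a small beam keeps CPU latency reasonable
)
print(outputs)
```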