Fast Inference with CTranslate2

Speed up inference by 2x-8x using int8 inference in C++

Quantized version of google/flan-ul2

pip install "hf_hub_ctranslate2>=2.0.6" "ctranslate2>=3.13.0"

Checkpoint compatible with ctranslate2 and hf-hub-ctranslate2

  • compute_type=int8_float16 for device="cuda"
  • compute_type=int8 for device="cpu" (see the CPU sketch after the example below)
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-ul2"
model = TranslatorCT2fromHfHub(
        # load in int8 on CUDA
        model_name_or_path=model_name, 
        device="cuda",
        compute_type="int8_float16"
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5
)
print(outputs)
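
The same wrapper runs on CPU by switching the device and compute type, as listed above. A minimal sketch (only the device, compute type, and decoding parameters differ from the CUDA example):

from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model = TranslatorCT2fromHfHub(
        # load in int8 on CPU
        model_name_or_path="michaelfeil/ct2fast-flan-ul2",
        device="cpu",
        compute_type="int8"
)
outputs = model.generate(
    text=["Translate to german: How are you doing?"],
    max_decoding_length=32,
    beam_size=1
)
print(outputs)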

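Since the checkpoint is also compatible with plain ctranslate2, it can be loaded directly with the ctranslate2 API. The sketch below is illustrative and not part of the original card: it assumes the converted weights are fetched with huggingface_hub.snapshot_download and that tokenization is done with the original google/flan-ul2 tokenizer.

import ctranslate2
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# download the converted CTranslate2 weights from the Hub
model_path = snapshot_download("michaelfeil/ct2fast-flan-ul2")
# tokenizer of the original model (assumption: reuse google/flan-ul2's tokenizer)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

translator = ctranslate2.Translator(model_path, device="cpu", compute_type="int8")

prompt = "Translate to german: How are you doing?"
source_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = translator.translate_batch([source_tokens], beam_size=5, max_decoding_length=32)

output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens), skip_special_tokens=True))
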
Licence and other remarks:

This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repository.
