---
title: Optimum-Nvidia - TensorRT-LLM optimized inference engines
emoji: π
colorFrom: green
colorTo: yellow
sdk: static
pinned: false
---
[Optimum-Nvidia](https://github.com/huggingface/optimum-nvidia) allows you to easily leverage Nvidia's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) inference toolkit
through a seamless integration following the huggingface/transformers API.
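As a quick illustration, here is a minimal sketch of that transformers-style usage; the `optimum.nvidia.pipelines.pipeline` entry point and the `meta-llama/Llama-2-7b-chat-hf` checkpoint are assumptions drawn from the Optimum-Nvidia project rather than from this card:

```python
# Minimal sketch: Optimum-Nvidia mirrors the transformers pipeline API.
# The import path and the example checkpoint below are assumptions, not part of this card.
from optimum.nvidia.pipelines import pipeline

# Builds (or loads) a TensorRT-LLM engine behind a transformers-like interface.
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf")

print(pipe("TensorRT-LLM engines make inference", max_new_tokens=32))
```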
This organisation holds prebuilt TensorRT-LLM compatible engines for various foundational models, which you can use, fork and deploy to get started as fast as possible and benefit from
out-of-the-box peak performance on Nvidia hardware.
Prebuilt engines are built (as much as possible) with the best options available, and updated models are pushed as new features land in the TensorRT-LLM repository.
This can include (but is not limited to):
- Leveraging `float8` quantization on supported hardware (H100/L4/L40/RTX 40xx), see the sketch after this list
- Enabling `float8` or `int8` KV cache
- Enabling in-flight batching for dynamic batching when used in combination with Nvidia Triton Inference Server
- Enabling XQA attention kernels
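For instance, here is a hedged sketch of requesting a `float8` engine when exporting a model with Optimum-Nvidia; the `use_fp8` argument is the option we believe `from_pretrained` exposes for this, and the checkpoint name is only an example:

```python
# Sketch under assumptions: `use_fp8` requests a float8 engine on supported
# hardware (H100/L4/L40/RTX 40xx); exact KV-cache flags may vary between releases.
from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # example checkpoint, not a claim about this org
    use_fp8=True,                     # build the engine with float8 quantization
)
```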
Current engines target the following Nvidia Tensor Core GPUs and can be found on the branch matching the targeted GPU in each repo (see the sketch below):
- [4090 (sm_89)](https://huggingface.co/collections/optimum-nvidia/rtx-4090-optimized-tensorrt-llm-models-65e5ebc1240c11001a3e666b)
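To pick up an engine built for a specific GPU, the idea is to point `from_pretrained` at the branch named after the target architecture; the repository name below is hypothetical and we assume the standard Hub `revision` argument is forwarded:

```python
# Hypothetical sketch: the repo id and branch name are illustrative only.
from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "optimum-nvidia/llama-2-7b-chat-hf",  # hypothetical prebuilt engine repo in this org
    revision="sm_89",                     # branch matching the target GPU (e.g. RTX 4090)
)
```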
Feel free to open discussions and request models to support through the community tab.
- The Optimum-Nvidia team at 🤗