DEPLOY_TEXT = f""" | |
# 🚀 Deployment Tips | |
A collection of powerful models is valuable, but ultimately, you need to be able to use them effectively. | |
This tab is dedicated to providing guidance and code snippets for performing inference with leaderboard models on Intel platforms. | |
Below is a table of open-source software options for inference, along with the supported Intel hardware platforms. | |
A 🚀 indicates that inference with the associated software package is supported on the hardware. We hope this information | |
helps you choose the best option for your specific use case. Happy building! | |
<div style="display: flex; justify-content: center;">
<table border="1">
<tr>
<th>Inference Software</th>
<th>Gaudi</th>
<th>Xeon</th>
<th>GPU Max</th>
<th>Arc GPU</th>
<th>Core Ultra</th>
</tr>
<tr>
<td>PyTorch</td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
</tr>
<tr>
<td>OpenVINO</td>
<td></td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
</tr>
<tr>
<td>Hugging Face</td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
<td>🚀</td>
</tr>
</table>
</div>
<hr>
# Intel® Gaudi® Accelerators
Gaudi is Intel's most capable deep learning chip. You can learn about Gaudi [here](https://habana.ai/products/gaudi2/).
👍 [Optimum Habana GitHub](https://github.com/huggingface/optimum-habana)
The `run_generation.py` script below can be found [here on GitHub](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation).
```bash
python run_generation.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 100 \
--do_sample \
--batch_size 2 \
--prompt "Hello world" "How are you?"
```
<hr>
# Intel® Xeon® CPUs
### Optimum Intel and Intel Extension for PyTorch (no quantization)
🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.
👍 [Optimum Intel GitHub](https://github.com/huggingface/optimum-intel)
Requires installing or updating Optimum Intel with IPEX support: `pip install --upgrade-strategy eager optimum[ipex]`
```python
from optimum.intel import IPEXModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"  # example model id; replace with the leaderboard model you want to run
model = IPEXModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("A fisherman at sea...")
print(results[0]["generated_text"])
```
### Intel® Extension for PyTorch - Mixed Precision (fp32 and bf16)
```python
import torch
import intel_extension_for_pytorch as ipex
import transformers

model_name_or_path = "gpt2"  # example model id; replace with your own
model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
dtype = torch.float  # or torch.bfloat16
model = ipex.llm.optimize(model, dtype=dtype)

# generation inference loop
inputs = tokenizer("A fisherman at sea...", return_tensors="pt")
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Intel® Extension for Transformers - INT4 Inference (CPU)
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "When winter becomes spring, the flowers..."
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit=True applies INT4 weight-only quantization at load time
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
<hr>
# Intel® Max Series GPU
### INT4 Inference (GPU) with Intel Extension for Transformers and Intel Extension for PyTorch
👍 [Intel Extension for PyTorch GitHub](https://github.com/intel/intel-extension-for-pytorch)
```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "When winter becomes spring, the flowers..."
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             device_map=device_map, load_in_4bit=True)
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)
output = model.generate(inputs)
```
<hr>
# Intel® Core Ultra (NPUs and iGPUs)
### OpenVINO Tooling with Optimum Intel
👍 [OpenVINO GitHub](https://github.com/openvinotoolkit/openvino)
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("In the spring, beautiful flowers bloom...")
```
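The example above loads a checkpoint that is already in OpenVINO format. For a model that only exists as a regular Transformers checkpoint, Optimum Intel can convert it at load time with `export=True`; a minimal sketch (the model id below is just an illustrative placeholder):
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"  # illustrative placeholder; any Transformers causal LM checkpoint works
# export=True converts the PyTorch weights to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("In the spring, beautiful flowers bloom...")
```
Calling `model.save_pretrained(...)` afterwards keeps the converted model on disk so the export does not have to be repeated on later runs.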
### Intel® NPU Acceleration Library
👍 [Intel NPU Acceleration Library GitHub](https://github.com/intel/intel-npu-acceleration-library)
```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]
generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)
print("Run inference")
_ = model.generate(**generation_kwargs)
```
<hr>
# Intel® Arc GPUs
You can learn more about Arc GPUs [here](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html).
Code snippets coming soon!
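In the meantime, here is a minimal sketch of running a Hugging Face model on an Arc GPU through PyTorch's `xpu` device, assuming the XPU build of Intel Extension for PyTorch is installed (the model id is just a placeholder):
```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device for Arc GPUs
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model id; replace with the model you want to run
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval().to("xpu")
model = ipex.optimize(model, dtype=torch.float16)

inputs = tokenizer("A fisherman at sea...", return_tensors="pt").to("xpu")
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```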
""" |