AI & ML interests

Accelerating DL

Recent Activity

optimum's activity

regisss
posted an update 13 days ago
pagezyhf
posted an update 27 days ago
pagezyhf
posted an update 29 days ago
It's the 2nd of December, here's your Cyber Monday present 🎁!

We're cutting prices on Hugging Face Inference Endpoints and Spaces!

Our friends at Google Cloud are treating us to a 40% price cut on GCP NVIDIA A100 GPUs for the next 3️⃣ months. There are further reductions of 20 to 50% across all instances.

Sounds like the time to give Inference Endpoints a try? Get started today and find the full pricing details in our documentation.
https://ui.endpoints.huggingface.co/
https://huggingface.co/pricing
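
If you'd rather spin up an endpoint from code than from the UI, here is a rough sketch using huggingface_hub; the model, region, and instance identifiers are illustrative assumptions, so take the exact GCP A100 options from the pricing page above.

from huggingface_hub import create_inference_endpoint

# Sketch: create an Inference Endpoint on a GCP GPU instance.
# The repository, region, and instance identifiers below are placeholders.
endpoint = create_inference_endpoint(
    "my-cyber-monday-endpoint",                  # endpoint name (assumed)
    repository="HuggingFaceH4/zephyr-7b-beta",   # model to deploy (assumed)
    framework="pytorch",
    task="text-generation",
    vendor="gcp",                                # Google Cloud, where the A100 discount applies
    region="us-central1",                        # assumed region
    accelerator="gpu",
    instance_type="nvidia-a100",                 # assumed instance identifier
    instance_size="x1",                          # assumed size
    type="protected",
)
endpoint.wait()   # block until the endpoint is running
print(endpoint.url)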
pagezyhf
posted an update about 1 month ago
Hello Hugging Face Community,

If you use Google Kubernetes Engine to host your ML workloads, this series of videos is a great way to kickstart your journey of deploying LLMs, in less than 10 minutes! Thank you @wietse-venema-demo!

To watch in this order:
1. Learn what Hugging Face Deep Learning Containers are
https://youtu.be/aWMp_hUUa0c?si=t-LPRkRNfD3DDNfr

2. Learn how to deploy an LLM with our Deep Learning Container using Text Generation Inference
https://youtu.be/Q3oyTOU1TMc?si=V6Dv-U1jt1SR97fj

3. Learn how to scale your inference endpoint based on traffic
https://youtu.be/QjLZ5eteDds?si=nDIAirh1r6h2dQMD
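
If you would like to poke at the result from code once video 2 is done, here is a minimal sketch of querying the TGI server running on GKE; the service address is a placeholder, and the actual deployment is covered in the videos.

from huggingface_hub import InferenceClient

# Point the client at your GKE service (placeholder address -- use the external
# IP or hostname reported by `kubectl get service`).
client = InferenceClient("http://YOUR-GKE-SERVICE-IP:8080")

# TGI exposes a text-generation API that InferenceClient understands.
answer = client.text_generation(
    "Explain what a Deep Learning Container is in one sentence.",
    max_new_tokens=64,
)
print(answer)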

If you want more of these small tutorials and have any theme in mind, let me know!
jeffboudier
posted an update about 1 month ago
pagezyhf
posted an update about 2 months ago
Hello Hugging Face Community,

I'd like to share a bit more about the Deep Learning Containers (DLCs) we built with Google Cloud to transform the way you build AI with open models on this platform!

With pre-configured, optimized environments for PyTorch Training (GPU) and Inference (CPU/GPU), Text Generation Inference (GPU), and Text Embeddings Inference (CPU/GPU), the Hugging Face DLCs offer:

⚡ Optimized performance on Google Cloud's infrastructure, with TGI, TEI, and PyTorch acceleration.
🛠️ Hassle-free environment setup, no more dependency issues.
🔄 Seamless updates to the latest stable versions.
💼 Streamlined workflow, reducing dev and maintenance overheads.
🔒 Robust security features of Google Cloud.
☁️ Fine-tuned for optimal performance, integrated with GKE and Vertex AI.
📦 Community examples for easy experimentation and implementation.
🔜 TPU support for PyTorch Training/Inference and Text Generation Inference is coming soon!

Find the documentation at https://huggingface.co/docs/google-cloud/en/index
If you need support, open a conversation on the forum: https://discuss.huggingface.co/c/google-cloud/69
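
To make that concrete, here is a rough sketch of serving an open model with the TGI DLC on Vertex AI via the google-cloud-aiplatform SDK; the project, container image URI, model, machine type, and accelerator are illustrative placeholders, so take the exact values from the documentation above.

from google.cloud import aiplatform

# Assumed project and region.
aiplatform.init(project="my-gcp-project", location="us-central1")

# Register the model with the Hugging Face TGI DLC as the serving container.
model = aiplatform.Model.upload(
    display_name="zephyr-7b-beta-tgi",
    serving_container_image_uri="<TGI-DLC-image-URI-from-the-docs>",  # placeholder
    serving_container_environment_variables={
        "MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",  # model to serve (assumed)
        "NUM_SHARD": "1",
    },
)

# Deploy it on a GPU instance (machine/accelerator types are assumptions).
endpoint = model.deploy(
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

print(endpoint.predict(instances=[{"inputs": "What are Deep Learning Containers?"}]))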
regisss
posted an update 2 months ago
Interested in performing inference with an ONNX model? ⚡️

The Optimum docs on model inference with ONNX Runtime are now much clearer and simpler!

Want to deploy your favorite model from the Hub but don't know how to export it to the ONNX format? You can do it in one line of code:
from optimum.onnxruntime import ORTModelForSequenceClassification

# Load the model from the hub and export it to the ONNX format
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
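
Once exported, the model behaves like a regular Transformers model, so you can, for instance, drop it into a pipeline; this continuation is a small illustrative snippet rather than part of the original post.

from transformers import AutoTokenizer, pipeline

# Reuse the exported ONNX model with the usual Transformers tooling.
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimum makes ONNX inference straightforward!"))

# Optionally save the ONNX model locally for reuse or for pushing to the Hub.
model.save_pretrained("onnx_distilbert/")
tokenizer.save_pretrained("onnx_distilbert/")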

Check out the whole guide 👉 https://huggingface.co/docs/optimum/onnxruntime/usage_guides/models
jeffboudier
posted an update 3 months ago
jeffboudier
posted an update 3 months ago
Inference Endpoints got a bunch of cool updates yesterday; here are my top 3.
jeffboudier
posted an update 4 months ago
Pro tip: if you're a Firefox user, you can set up Hugging Chat as an integrated AI assistant, with contextual links to summarize or simplify any text. Handy!

In this short video, I show how to set it up.
IlyasMoutawwakil
posted an update 6 months ago
Last week, Intel's new Xeon CPUs, Sapphire Rapids (SPR), landed on Inference Endpoints, and I think they have the potential to reduce the cost of your RAG pipelines 💸

Why? Because they come with Intel® AMX support, a set of instructions that accelerates BF16 and INT8 matrix multiplications on CPU ⚡

I went ahead and built a Space to showcase how to efficiently deploy embedding models on SPR for both retrieving and ranking documents, with Haystack-compatible components: https://huggingface.co/spaces/optimum-intel/haystack-e2e

Here's how it works:

- Document Store: A FAISS document store containing the seven-wonders dataset, embedded, indexed and stored on the Space's persistent storage to avoid unnecessary re-computation of embeddings.

- Retriever: It embeds the query at runtime and retrieves from the dataset N documents that are most semantically similar to the query's embedding.
We use the small variant of the BGE family here because we want a model that's fast to run on the entire dataset and has a small embedding space for fast similarity search. Specifically, we use an INT8-quantized bge-small-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.

- Ranker: It re-embeds the retrieved documents at runtime and re-ranks them based on semantic similarity to the query's embedding. We use the large variant of the BGE family here because it's optimized for accuracy, allowing us to keep the k most relevant documents that we'll use in the LLM prompt. Specifically, we use an INT8-quantized bge-large-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.

Space: https://huggingface.co/spaces/optimum-intel/haystack-e2e
Retriever IE: optimum-intel/fastrag-retriever
Ranker IE: optimum-intel/fastrag-ranker
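
For intuition, here is a minimal sketch of that retrieve-then-rerank pattern with plain faiss and sentence-transformers; it is not the Space's actual Haystack pipeline (which uses INT8-quantized models behind Inference Endpoints), and everything except the BGE model names is assumed for illustration.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Tiny stand-in corpus for the seven-wonders dataset.
documents = [
    "The Great Pyramid of Giza is the oldest of the Seven Wonders.",
    "The Hanging Gardens of Babylon may never have existed.",
    "The Colossus of Rhodes was a giant statue of the sun god Helios.",
]

retriever = SentenceTransformer("BAAI/bge-small-en-v1.5")  # fast retrieval model
ranker = SentenceTransformer("BAAI/bge-large-en-v1.5")     # accurate re-ranking model

# Index the corpus with the small model (normalized vectors -> inner product = cosine).
doc_emb = retriever.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

query = "Which wonder was a giant statue?"

# 1) Retrieve the top-N candidates with the small, fast model.
q_emb = retriever.encode([query], normalize_embeddings=True)
_, candidate_ids = index.search(q_emb, 2)
candidates = [documents[i] for i in candidate_ids[0]]

# 2) Re-embed the query and candidates with the large model and keep the top-k.
q_large = ranker.encode([query], normalize_embeddings=True)
cand_large = ranker.encode(candidates, normalize_embeddings=True)
scores = (cand_large @ q_large.T).squeeze(-1)
top_k = [candidates[i] for i in np.argsort(-scores)[:1]]
print(top_k)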