edge-llm-leaderboard / README.md

Arnav Chavan

alignment fix and desription change

c5bc8e4 17 days ago

	---
	title: Edge LLM Leaderboard
	emoji: 🌖
	colorFrom: red
	colorTo: blue
	sdk: gradio
	sdk_version: 5.8.0
	app_file: app.py
	pinned: true
	license: apache-2.0
	tags: [edge llm leaderboard, llm edge leaderboard, llm, edge, leaderboard]
	---

	# Edge LLM Leaderboard

	## 📝 About
	The Edge LLM Leaderboard gauges the practical performance and quality of edge LLMs.
	It benchmarks the throughput and memory use of Large Language Models (LLMs)
	on edge hardware, starting with a Raspberry Pi 5 (8GB), which is based on the ARM Cortex-A76 CPU.

	Anyone from the community can request a new base model or edge hardware/backend/optimization
	configuration for automated benchmarking:

	- Model evaluation requests will go live soon. In the meantime, feel free to email arnav[dot]chavan[@]nyunai[dot]com

	## ✍️ Details

	- To avoid multi-thread discrepancies, all 4 threads are used on the Pi 5.
	- Each LLM is run with a batch size of 1, a prompt of 512 tokens, and 128 generated tokens.

	All of our throughput benchmarks are run with a single tool,
	[llama-bench](https://github.com/ggerganov/llama.cpp/tree/master/examples/llama-bench),
	which is part of [llama.cpp](https://github.com/ggerganov/llama.cpp), to keep results reproducible and consistent.
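
As a rough illustration of the settings listed above, the following Python sketch assembles a llama-bench command line with 4 threads, a 512-token prompt, and 128 generated tokens. The helper name, the model path, and the assumption that a `llama-bench` binary is available on `PATH` are all hypothetical; the leaderboard's actual invocation may differ.

```python
import shlex

# Benchmark settings taken from the Details section.
N_THREADS = 4    # all 4 threads on the Pi 5
N_PROMPT = 512   # prompt length in tokens
N_GEN = 128      # number of generated tokens


def build_llama_bench_cmd(model_path, binary="llama-bench"):
    """Return the argument list for one llama-bench run (hypothetical helper)."""
    return [
        binary,
        "-m", model_path,        # GGUF model file to benchmark
        "-t", str(N_THREADS),    # number of CPU threads
        "-p", str(N_PROMPT),     # prompt-processing (prefill) test size
        "-n", str(N_GEN),        # text-generation (decode) test size
        "-o", "json",            # machine-readable output
    ]


if __name__ == "__main__":
    # The model path below is a placeholder, not a file from this repository.
    cmd = build_llama_bench_cmd("models/example-q4_0.gguf")
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where llama.cpp has been built.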

	## 🏆 Ranking Models

	We use MMLU (zero-shot) via [llama-perplexity](https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity) to evaluate model quality, and we rank models on key metrics relevant to edge applications:

	1. Prefill Latency (Time to First Token - TTFT): Measures the time from receiving the prompt to generating the first token. Low TTFT ensures a smooth user experience, especially for real-time interactions in edge use cases.

	2. Decode Latency (Generation Speed): Indicates the speed of generating subsequent tokens, critical for real-time tasks like transcription or extended dialogue sessions.

	3. Model Size: Edge devices have far less secondary storage than cloud or GPU systems, so smaller models are easier to deploy on them.

	These metrics collectively address the unique challenges of deploying LLMs on edge devices, balancing performance, responsiveness, and memory constraints.
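
To make the relationship between throughput and the two latency metrics concrete, here is a minimal sketch that converts tokens-per-second figures into seconds. It assumes TTFT can be approximated as the prompt length divided by prefill throughput, which ignores any fixed startup overhead. The function name and the example rates are made up for illustration and are not leaderboard results.

```python
def latency_from_throughput(prefill_tps, decode_tps, n_prompt=512, n_gen=128):
    """Convert throughput (tokens/s) into latency estimates in seconds.

    Assumption: TTFT is approximated as n_prompt / prefill_tps.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput values must be positive")
    ttft_s = n_prompt / prefill_tps        # time to process the whole prompt
    per_token_s = 1.0 / decode_tps         # time per generated token
    total_s = ttft_s + n_gen * per_token_s # prompt plus full generation
    return {"ttft_s": ttft_s, "per_token_s": per_token_s, "total_s": total_s}


if __name__ == "__main__":
    # Hypothetical rates: 64 tokens/s prefill, 8 tokens/s decode.
    print(latency_from_throughput(prefill_tps=64.0, decode_tps=8.0))
```

With these made-up rates, the 512-token prompt would take 8 seconds to process and each generated token would take 0.125 seconds, which shows why prefill speed dominates responsiveness for long prompts.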

	## 🏃 How to run locally

	To run the Edge LLM Leaderboard locally on your machine, follow these steps:

	### 1. Clone the Repository

	First, clone the repository to your local machine:

	```bash
	git clone https://huggingface.co/spaces/nyunai/edge-llm-leaderboard
	cd edge-llm-leaderboard
	```

	### 2. Install the Required Dependencies

	Install the necessary Python packages listed in the requirements.txt file:
	`pip install -r requirements.txt`

	### 3. Run the Application

	You can run the Gradio application in one of the following ways:
	- Option 1: Using Python
	`python app.py`
	- Option 2: Using Gradio CLI (includes hot reload)
	`gradio app.py`

	### 4. Access the Application

	Once the application is running, you can access it locally in your web browser at http://127.0.0.1:7860/
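
If nothing loads at that address, one possible cause is that the server did not start or is listening on a different port. The sketch below uses only the Python standard library to check whether anything is accepting connections on a given local port. The helper name is hypothetical, and the check does not confirm that the listener is actually this application.

```python
import socket

GRADIO_DEFAULT_PORT = 7860  # port used in the URL above


def is_port_open(port=GRADIO_DEFAULT_PORT, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if is_port_open():
        print(f"A server is listening on port {GRADIO_DEFAULT_PORT}.")
    else:
        print(f"Nothing is listening on port {GRADIO_DEFAULT_PORT}.")
```

When the check fails, the URL printed in the terminal by `python app.py` is the most reliable indication of where the app is being served.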