---
title: LLM As A Judge
emoji: π
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---
# LLM As A Judge

**LLM As A Judge** is a Gradio-based application for comparing the performance of two LLaMA models, saved in the GGUF format, on a given prompt. It generates a response from each user-specified model, scores both responses against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.
## Features

- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test the models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.
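These features map onto a small Gradio interface. The sketch below is a hypothetical reconstruction, not the actual `app.py`: the component labels, the `compare_models` placeholder, and the exact criteria list are assumptions based on this README.

```python
# Hypothetical sketch of the interface described above; the real app.py may differ.
import gradio as gr

CRITERIA = [
    "Clarity", "Completeness", "Accuracy", "Relevance",
    "User-Friendliness", "Depth", "Creativity",
]

def compare_models(repo_a, file_a, repo_b, file_b, prompt, criterion):
    # Placeholder body: the real app downloads both models, generates a
    # response from each, and asks the LoRA-4100 judge to score them.
    return f"Evaluation of the two responses on {criterion} appears here."

demo = gr.Interface(
    fn=compare_models,
    inputs=[
        gr.Textbox(label="Model A Repository"),
        gr.Textbox(label="Model A Filename"),
        gr.Textbox(label="Model B Repository"),
        gr.Textbox(label="Model B Filename"),
        gr.Textbox(label="Prompt", lines=3),
        gr.Dropdown(choices=CRITERIA, label="Evaluation Criterion"),
    ],
    outputs=gr.Markdown(label="Evaluation"),
    title="LLM As A Judge",
)

if __name__ == "__main__":
    demo.launch()
```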
## Requirements

- Supports only **LLaMA models** saved in **GGUF format**.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames.
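For illustration, the snippet below shows one way a GGUF model hosted on the Hub can be fetched by repository name and filename. It assumes `llama-cpp-python` as the GGUF runtime and introduces a helper name (`load_gguf_model`) that is not taken from this repository.

```python
# Minimal sketch, assuming llama-cpp-python as the GGUF runtime.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_gguf_model(repo_id: str, filename: str) -> Llama:
    """Download a GGUF file from the Hub by repo name + filename and load it."""
    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=model_path, n_ctx=2048)

# e.g. model_a = load_gguf_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
```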
## How It Works

1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt used to generate the responses.
3. **Select Evaluation Criterion**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate** (sketched below):
   - The app downloads and loads the specified models.
   - Both models generate a response to the given prompt.
   - The **LoRA-4100 evaluation model** rates the responses on the selected criterion.
5. **View Results**: Ratings, detailed explanations, and the declared winner (or a draw) are displayed.
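The generation step amounts to running the same prompt through both models. A minimal sketch, again assuming `llama-cpp-python`'s chat-completion API; the generation settings are assumptions, not values taken from the app:

```python
# Sketch of step 4: one response per model for the same prompt.
from llama_cpp import Llama

def generate_response(model: Llama, prompt: str, max_tokens: int = 512) -> str:
    """Generate a single chat-style response for the prompt."""
    out = model.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"]

# Run the same prompt through both user-specified models:
# response_a = generate_response(model_a, prompt)
# response_b = generate_response(model_b, prompt)
```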
## Behind the Scenes

- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to evaluate the responses objectively.
- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically, based on user input.
- **Inference**: Both user-specified models generate responses to the prompt, which are then evaluated by the LoRA-4100 model.
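The judging step itself is not spelled out in this README, so the following is only a hedged sketch of how the LoRA-4100 evaluator might be prompted with the two responses and the chosen criterion; the real prompt template, rating scale, and result parsing may differ.

```python
# Illustrative judging sketch; the prompt wording and 1-10 scale are assumptions.
from llama_cpp import Llama

def judge_responses(evaluator: Llama, prompt: str, response_a: str,
                    response_b: str, criterion: str) -> str:
    """Ask the evaluator model to rate both responses on one criterion."""
    judge_prompt = (
        f"You are an impartial judge. Evaluate the two responses to the "
        f"instruction below, focusing on {criterion.lower()}.\n\n"
        f"Instruction: {prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Rate each response from 1 to 10, explain both ratings, and declare "
        "a winner or a draw."
    )
    out = evaluator.create_chat_completion(
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1024,
    )
    return out["choices"][0]["message"]["content"]
```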
## Example

**Input:**

- **Model A Repository**: `KolumbusLindh/LoRA-4100`
- **Model A Filename**: `unsloth.F16.gguf`
- **Model B Repository**: `forestav/gguf_lora_model`
- **Model B Filename**: `finetune_v2.gguf`
- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
- **Evaluation Criterion**: Clarity

**Output:**

- Detailed evaluation results with a score for each model's response.
- Explanations of the scores based on the selected criterion.
- Declaration of the winning model, or a draw.
## Limitations

- Only works with **LLaMA models in GGUF format**.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.
## Configuration Reference

For detailed information on configuring a Hugging Face Space, see the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).