arxiv:2412.15298

A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation

Published on Dec 19, 2024

Authors:

Bhaskarjit Sarmah ,

Abstract

We argue that the Declarative Self-improving Python (DSPy) optimizers are a way to align the large language model (LLM) prompts and their evaluations to the human annotations. We present a comparative analysis of five teleprompter algorithms, namely, Cooperative Prompt Optimization (COPRO), Multi-Stage Instruction Prompt Optimization (MIPRO), BootstrapFewShot, BootstrapFewShot with Optuna, and K-Nearest Neighbor Few Shot, within the DSPy framework with respect to their ability to align with human evaluations. As a concrete example, we focus on optimizing the prompt to align hallucination detection (using LLM as a judge) to human annotated ground truth labels for a publicly available benchmark dataset. Our experiments demonstrate that optimized prompts can outperform various benchmark methods to detect hallucination, and certain telemprompters outperform the others in at least these experiments.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

No model linking this paper

Cite arxiv.org/abs/2412.15298 in a model README.md to link it from this page.

No dataset linking this paper

Cite arxiv.org/abs/2412.15298 in a dataset README.md to link it from this page.

No Space linking this paper

Cite arxiv.org/abs/2412.15298 in a Space README.md to link it from this page.

No Collection including this paper

Add this paper to a collection to link it from this page.