rahulnair23 sadhamanus committed on
Commit
9853939
•
1 Parent(s): de201d8

Fixed typos and minor edits (#1)


- Fixed typos and minor edits (d7040420ed91d2308e8911de9910e4d8b54a67ae)


Co-authored-by: Amit Dhurandhar <sadhamanus@users.noreply.huggingface.co>

Files changed (1)
  1. assets/header.md +5 -3
assets/header.md CHANGED
@@ -1,6 +1,8 @@
1
- <h1 style='text-align: center; color: black;'>πŸ₯‡ Ranking LLMs without ground truth </h1>
2
 
3
 
4
- This space demonstrates reference-free ranking of large language models describe in our ACL Findings paper [Ranking Large Language Models without Ground Truth](https://arxiv.org/abs/2402.14860). <br>
5
 
6
- Inspired by real life where both an expert and a knowledgeable person can identify a novice the main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. Iteratively performing such evaluations yields a estimated ranking that doesn't require ground truth/reference data which can be expensive to gather. The methods are a viable low-resource ranking mechanism for practical use. [Source code](https://huggingface.co/spaces/ibm/llm-rank-themselves/tree/main) is included as part of this space. Installation and usage instructions are provided below.<br>
 
 
 
1
+ <h1 style='text-align: center; color: black;'>πŸ₯‡ Ranking LLMs without Ground Truth </h1>
2
 
3
 
4
+ This space demonstrates ranking of large language models with access to just the input prompts (i.e., only the questions in Q&A tasks), as described in our 2024 ACL Findings paper [Ranking Large Language Models without Ground Truth](https://arxiv.org/abs/2402.14860). <br>
5
 
6
+ [Source code](https://huggingface.co/spaces/ibm/llm-rank-themselves/tree/main) is included as part of this space. Installation and usage instructions are provided below.
7
+
8
+ Inspired by real life, where both an expert and a knowledgeable person can identify a novice, the main idea is to consider triplets of models, where each one evaluates the other two, correctly identifying the worst model in the triplet with high probability. Iteratively performing such evaluations yields an estimated ranking that doesn't require ground-truth/reference data, which can be expensive to gather. The method is a viable low-resource ranking mechanism for practical use. <br>
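The triplet idea above can be sketched in a few lines of Python. This is only an illustrative toy, not the paper's actual estimator: the latent `quality` values and the noisy `judge_score` function are invented for the demo, standing in for real model evaluations. Each model in a triplet scores the other two, the triplet votes out its worst member, and tallying those "worst" votes over all triplets yields a ranking with no reference data.

```python
import itertools
import random

random.seed(0)

# Hypothetical setup: each model has a latent quality (unknown to the ranking
# procedure). A judge's score of a candidate is the candidate's quality plus
# noise that grows as the judge itself gets weaker.
quality = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5, "model_d": 0.2}

def judge_score(judge, candidate):
    noise = random.gauss(0, 1.0 - quality[judge])
    return quality[candidate] + noise

losses = {m: 0 for m in quality}

# For every triplet, each member judges the other two; the member collecting
# the most "scored lower" votes is declared the worst of that triplet.
for triplet in itertools.combinations(quality, 3):
    worst_votes = {m: 0 for m in triplet}
    for judge in triplet:
        a, b = [m for m in triplet if m != judge]
        loser = a if judge_score(judge, a) < judge_score(judge, b) else b
        worst_votes[loser] += 1
    worst = max(worst_votes, key=worst_votes.get)
    losses[worst] += 1

# Fewest "worst" verdicts first: the estimated best-to-worst ordering.
ranking = sorted(quality, key=lambda m: losses[m])
print(ranking)
```

With enough triplet evaluations (and judges that are right more often than not), the loss tally concentrates on the genuinely weaker models, which is the intuition the paper formalizes.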