RicardoDominguez committed · Commit 279610b · Parent(s): 1358bcc

Commit message: about

Files changed:
- README.md (+2 -2)
- src/about.py (+38 -14)
- src/envs.py (+4 -4)
README.md CHANGED

@@ -1,6 +1,6 @@
 ---
-title: …
-emoji: …
+title: CaselawQA leaderboard (WIP)
+emoji: 🏛️
 colorFrom: green
 colorTo: indigo
 sdk: gradio
src/about.py CHANGED

@@ -12,29 +12,45 @@ class Task:
 # ---------------------------------------------------
 class Tasks(Enum):
     # task_key in the json file, metric_key in the json file, name to display in the leaderboard
-    task0 = Task("…
-    task1 = Task("…
-
+    task0 = Task("caselawqa", "exact_match", "CaselawQA")
+    task1 = Task("caselawqa_tiny", "exact_match", "CaselawQA Tiny")
+    task2 = Task("caselawqa_hard", "exact_match", "CaselawQA Hard")
+
 NUM_FEWSHOT = 0 # Change with your few shot
 # ---------------------------------------------------
 
 
-
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">…
+TITLE = """<h1 align="center" id="space-title">CaselawQA leaderboard (WIP)</h1>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-…
+CaselawQA is a benchmark comprising classification tasks, drawing from the Supreme Court and Songer Court of Appeals legal databases.
+From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement.
+From a substantive legal perspective, efficient solutions to such classification problems have rich and important applications in legal research.
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
-## …
+## Introduction
+
+CaselawQA is a benchmark comprising legal classification tasks, drawing from the Supreme Court and Songer Court of Appeals legal databases.
+The majority of its 10,000 questions are multiple-choice, with 5,000 sourced from each database.
+The questions are randomly selected from the test sets of the [Lawma tasks](https://huggingface.co/datasets/ricdomolm/lawma-tasks).\
+From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement.
+From a substantive legal perspective, efficient solutions to such classification problems have rich and important applications in legal research.
+CaselawQA also includes two additional subsets: CaselawQA Tiny and CaselawQA Hard.
+CaselawQA Tiny consists of 49 Lawma tasks with fewer than 150 training examples.
+CaselawQA Hard comprises tasks where [Lawma 70B](https://huggingface.co/ricdomolm/lawma-70b) achieves less than 70% accuracy.
+
+You can find more information in the [Lawma arXiv preprint](https://arxiv.org/abs/2407.16615) and [GitHub repository](https://github.com/socialfoundations/lawma).
 
 ## Reproducibility
-To reproduce our results, here is the commands you can run:
 
+We evaluate CaselawQA using [this](https://github.com/socialfoundations/lm-evaluation-harness/tree/caselawqa) LM Eval Harness implementation:
+
+```bash
+lm_eval --model hf --model_args "pretrained=<your_model>,dtype=bfloat16" --tasks caselawqa,caselawqa_tiny,caselawqa_hard --output_path=<output_path>
+```
 """
 
 EVALUATION_QUEUE_TEXT = """

@@ -50,16 +66,13 @@ tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
 If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
 
 Note: make sure your model is public!
-Note: if your model needs `use_remote_code=True`, we do not support this option
+Note: if your model needs `use_remote_code=True`, we do not support this option.
 
 ### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
 It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
 
-### 3) …
-…
-
-### 4) Fill up your model card
-When we add extra information about models to the leaderboard, it will be automatically taken from the model card
+### 3) Fill up your model card
+When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
 
 ## In case of model failure
 If your model is displayed in the `FAILED` category, its execution stopped.

@@ -69,4 +82,15 @@ If everything is done, check you can launch the EleutherAIHarness on your model
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+```bibtex
+@misc{dominguezolmedo2024lawmapowerspecializationlegal,
+      title={Lawma: The Power of Specialization for Legal Tasks},
+      author={Ricardo Dominguez-Olmedo and Vedant Nanda and Rediet Abebe and Stefan Bechtold and Christoph Engel and Jens Frankenreiter and Krishna Gummadi and Moritz Hardt and Michael Livermore},
+      year={2024},
+      eprint={2407.16615},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2407.16615},
+}
+```
 """
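For context, the `Task` entries added above are instances of a small dataclass defined just before this hunk (the hunk header shows `class Task:`). A minimal sketch, assuming the stock Hugging Face demo-leaderboard template this Space builds on is otherwise unchanged; the field names `benchmark`, `metric`, and `col_name` come from that template, not from this diff:

```python
# Sketch of the Task dataclass the hunk context (`class Task:`) refers to.
# Field names follow the stock demo-leaderboard template; they are an
# assumption, since the dataclass itself is not part of this diff.
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str  # task key in the results JSON, e.g. "caselawqa"
    metric: str     # metric key in the results JSON, e.g. "exact_match"
    col_name: str   # column name displayed in the leaderboard UI

class Tasks(Enum):
    task0 = Task("caselawqa", "exact_match", "CaselawQA")
    task1 = Task("caselawqa_tiny", "exact_match", "CaselawQA Tiny")
    task2 = Task("caselawqa_hard", "exact_match", "CaselawQA Hard")
```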
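Step 1 of the submission checklist ends with the line shown in the second hunk's context, `tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)`: submitters should verify their model loads with vanilla AutoClasses before queuing it. A minimal sketch, reconstructed on the assumption that this part of the template is unchanged; `"your model name"` and `revision` are placeholders for the submitter's repo and commit:

```python
# Pre-submission sanity check: the model must load with standard AutoClasses.
# Reconstructed from the hunk context; replace the placeholders with your
# own model repo and revision.
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```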
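The `lm_eval` command in the Reproducibility section writes a JSON results file under `<output_path>`. A minimal sketch of how the three `exact_match` scores could be read back out; the file name and the exact metric key are assumptions, since recent harness versions suffix metrics with a filter name (e.g. `exact_match,none`):

```python
# Read lm-eval results back out (hypothetical file name; adjust to the
# results_*.json that lm_eval actually writes under <output_path>).
import json

with open("results.json") as f:
    data = json.load(f)

for task in ("caselawqa", "caselawqa_tiny", "caselawqa_hard"):
    metrics = data["results"][task]
    # Metric key may be plain "exact_match" or filter-suffixed
    # "exact_match,none", depending on the harness version.
    score = metrics.get("exact_match", metrics.get("exact_match,none"))
    print(f"{task}: {score}")
```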
src/envs.py CHANGED

@@ -6,12 +6,12 @@ from huggingface_hub import HfApi
 # ----------------------------------
 TOKEN = os.environ.get("HF_TOKEN") # A read/write token for your org
 
-OWNER = "…
+OWNER = "ricdomolm" # Change to your org - don't forget to create a results and request dataset, with the correct format!
 # ----------------------------------
 
-REPO_ID = f"{OWNER}/…
-QUEUE_REPO = f"{OWNER}/…
-RESULTS_REPO = f"{OWNER}/…
+REPO_ID = f"{OWNER}/caselawqa_leaderboard"
+QUEUE_REPO = f"{OWNER}/caselawqa_leaderboard_requests"
+RESULTS_REPO = f"{OWNER}/caselawqa_leaderboard_results"
 
 # If you setup a cache later, just change HF_HOME
 CACHE_PATH=os.getenv("HF_HOME", ".")
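For context, elsewhere in the template these constants configure the Hub client and the local cache layout. A minimal sketch assuming the stock demo-leaderboard `envs.py`; the `API`, `EVAL_REQUESTS_PATH`, and `EVAL_RESULTS_PATH` names come from that template, not from this diff:

```python
# How the renamed repos are typically consumed (assumed from the stock
# demo-leaderboard template; none of this is part of the diff above).
import os
from huggingface_hub import HfApi

TOKEN = os.environ.get("HF_TOKEN")
OWNER = "ricdomolm"
QUEUE_REPO = f"{OWNER}/caselawqa_leaderboard_requests"
RESULTS_REPO = f"{OWNER}/caselawqa_leaderboard_results"

CACHE_PATH = os.getenv("HF_HOME", ".")
EVAL_REQUESTS_PATH = os.path.join(CACHE_PATH, "eval-queue")   # local queue snapshot
EVAL_RESULTS_PATH = os.path.join(CACHE_PATH, "eval-results")  # local results snapshot

API = HfApi(token=TOKEN)  # single client reused for downloads and uploads
```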