diff --git a/.gitattributes b/.gitattributes
index c7d9f3332a950355d5a77d85000f05e6f45435ea..d2100cc24dc86a37767a31b80ceefff218b9e82c 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -32,3 +32,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+evals/evals/registry/data/formal_logic/formal_logic_expressions.jsonl filter=lfs diff=lfs merge=lfs -text
+evals/evals/registry/data/ukraine_eit/samples.jsonl filter=lfs diff=lfs merge=lfs -text
diff --git a/evals/.gitattributes b/evals/.gitattributes
new file mode 100644
index 0000000000000000000000000000000000000000..9f45a92b72389d8474e015d2b58b1b30e78f5095
--- /dev/null
+++ b/evals/.gitattributes
@@ -0,0 +1 @@
+evals/registry/data/**/*.jsonl filter=lfs diff=lfs merge=lfs -text
diff --git a/evals/.github/PULL_REQUEST_TEMPLATE.md b/evals/.github/PULL_REQUEST_TEMPLATE.md
new file mode 100644
index 0000000000000000000000000000000000000000..c3fc6fa3a96e22580de56218af7387bfeaf71a7f
--- /dev/null
+++ b/evals/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,87 @@
+# Thank you for contributing an eval! ♥️
+
+🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
+
+__PLEASE READ THIS__:
+
+In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.
+
+We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PRs with ~5-10 examples that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples.
+
+## Eval details 📑
+### Eval name
+[Insert Eval name here]
+
+### Eval description
+
+[Insert a short description of what your eval does here]
+
+### What makes this a useful eval?
+
+[Insert why this eval is worth including and any additional context]
+
+## Criteria for a good eval ✅
+
+Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
+
+Your eval should be:
+
+- [ ] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
+- [ ] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
+- [ ] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
+- [ ] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)
+
+If there is anything else that makes your eval worth including, please document it below.
+
+### Unique eval value
+
+> Insert what makes your eval high quality that was not mentioned above. (Not required)
+
+## Eval structure 🏗️
+
+Your eval should
+- [ ] Check that your data is in `evals/registry/data/{name}`
+- [ ] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
+- [ ] Ensure you have the right to use the data you submit via this eval
+
+(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
+
+## Final checklist 👀
+
+### Submission agreement
+
+By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
+
+- [ ] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
+
+### Email address validation
+
+If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
+
+- [ ] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
+
+### Limited availability acknowledgement
+
+We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
+
+- [ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
+
+### Submit eval
+
+- [ ] I have filled out all required fields in the evals PR form
+- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push
+
+Failure to fill out all required fields will result in the PR being closed.
+
+### Eval JSON data
+
+Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
+
+
+ View evals in JSON
+
+ ### Eval
+ ```jsonl
+ INSERT_EVAL_HERE
+ ```
+
diff --git a/evals/.github/bug_report.yml b/evals/.github/bug_report.yml
new file mode 100644
index 0000000000000000000000000000000000000000..404958c39a298d962e021c49a3092c51d9913f91
--- /dev/null
+++ b/evals/.github/bug_report.yml
@@ -0,0 +1,56 @@
+name: Bug report
+description: Create a report to help us improve
+labels: ["bug"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for taking the time to fill out this bug report! If you have questions about using the OpenAI Evals library, please open a [Discussion thread](https://github.com/openai/evals/discussions).
+  - type: textarea
+    id: what-happened
+    attributes:
+      label: Describe the bug
+      description: A clear and concise description of what the bug is, and any additional context.
+      placeholder: Tell us what you see!
+    validations:
+      required: true
+  - type: textarea
+    id: repro-steps
+    attributes:
+      label: To Reproduce
+      description: Steps to reproduce the behavior.
+      placeholder: |
+        1. Fetch a '...'
+        2. Update the '....'
+        3. See error
+    validations:
+      required: true
+  - type: textarea
+    id: code-snippets
+    attributes:
+      label: Code snippets
+      description: If applicable, add code snippets to help explain your problem.
+      render: Python
+    validations:
+      required: false
+  - type: input
+    id: os
+    attributes:
+      label: OS
+      placeholder: macOS
+    validations:
+      required: true
+  - type: input
+    id: language-version
+    attributes:
+      label: Python version
+      placeholder: Python v3.8.0
+    validations:
+      required: true
+  - type: input
+    id: lib-version
+    attributes:
+      label: Library version
+      placeholder: openai-evals v0.1.1
+    validations:
+      required: true
diff --git a/evals/.github/config.yml b/evals/.github/config.yml
new file mode 100644
index 0000000000000000000000000000000000000000..2afa082cf06241c8c48bb62d7e2ad335a96eef2a
--- /dev/null
+++ b/evals/.github/config.yml
@@ -0,0 +1,7 @@
+blank_issues_enabled: false
+contact_links:
+  - name: OpenAI support
+    url: https://help.openai.com/
+    about: |
+      Please only file issues here that you believe represent actual bugs or feature requests for the OpenAI Evals library.
+      If you're having general trouble with the OpenAI API, ChatGPT, etc, please visit our help center to get support.
\ No newline at end of file
diff --git a/evals/.github/feature_request.yml b/evals/.github/feature_request.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f1f2653a5c3e1bfeae22cfebed485e8ae3c37f69
--- /dev/null
+++ b/evals/.github/feature_request.yml
@@ -0,0 +1,20 @@
+name: Feature request
+description: Suggest an idea for this library
+labels: ["feature-request"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for taking the time to fill out this feature request! Please note, we are not able to accommodate all feature requests given limited bandwidth but we appreciate you taking the time to share with us how to improve the OpenAI Evals library.
+  - type: textarea
+    id: feature
+    attributes:
+      label: Describe the feature or improvement you're requesting
+      description: A clear and concise description of what you want to happen.
+    validations:
+      required: true
+  - type: textarea
+    id: context
+    attributes:
+      label: Additional context
+      description: Add any other context about the feature request here.
\ No newline at end of file
diff --git a/evals/.github/workflows/parse_yaml.py b/evals/.github/workflows/parse_yaml.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a8766eb1cfbe2b88d5fbaddffa18e3a244a403e
--- /dev/null
+++ b/evals/.github/workflows/parse_yaml.py
@@ -0,0 +1,12 @@
+import sys
+import yaml
+
+def get_first_key(file_path):
+    with open(file_path, 'r') as yaml_file:
+        content = yaml.safe_load(yaml_file)
+        first_key = next(iter(content))
+        return first_key
+
+if __name__ == "__main__":
+    yaml_file_path = sys.argv[1]
+    print(get_first_key(yaml_file_path))
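
Note on `get_first_key` above: it relies on `yaml.safe_load` returning a mapping in document order, so `next(iter(content))` yields the first top-level key, which the workflow then treats as the eval name. A minimal sketch of that behaviour follows, using a plain dict as a stand-in for the parsed YAML. The eval name `my_eval` and its fields are hypothetical, and `first_key` is an illustrative helper rather than part of this patch.

```python
def first_key(mapping):
    """Return the first top-level key of a mapping, in insertion order."""
    return next(iter(mapping))


# Hypothetical stand-in for what yaml.safe_load might return for a registry file.
parsed = {
    "my_eval": {"id": "my_eval.dev.v0", "metrics": ["accuracy"]},
    "my_eval.dev.v0": {"class": "evals.elsuite.basic.match:Match"},
}

print(first_key(parsed))
```

An empty YAML file would make `yaml.safe_load` return `None`, and `next(iter(None))` raises `TypeError`, so the script as written assumes every registry file has at least one top-level key.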
diff --git a/evals/.github/workflows/test_eval.yaml b/evals/.github/workflows/test_eval.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..947d1ddabd7b3d0e77003f873b9831921bfc6b8c
--- /dev/null
+++ b/evals/.github/workflows/test_eval.yaml
@@ -0,0 +1,55 @@
+name: Run new evals
+
+on:
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  check_files:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+          lfs: true
+
+      - name: Install Git LFS
+        run: |
+          sudo apt-get install git-lfs
+          git lfs install
+
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: 3.9
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install pyyaml
+          pip install -e .
+
+      - name: Get list of new YAML files in evals/registry/evals
+        id: get_files
+        run: |
+          # Use environment files to store the output
+          git diff --name-only --diff-filter=A ${{ github.event.pull_request.base.sha }} ${{ github.sha }} | grep '^evals/registry/evals/.*\.yaml$' | xargs > new_files
+          echo "new_files=$(cat new_files)" >> $GITHUB_ENV
+
+      - name: Run oaieval command for each new YAML file
+        run: |
+          files="${{ env.new_files }}"
+          if [ -n "$files" ]; then
+            for file in $files; do
+              echo "Processing $file"
+              first_key=$(python .github/workflows/parse_yaml.py $file)
+              echo "Eval Name: $first_key"
+              oaieval dummy-chat $first_key --max_samples 10
+              oaieval dummy-completion $first_key --max_samples 10
+            done
+          else
+            echo "No new YAML files found in evals/registry/evals"
+          fi
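
Note on the "Get list of new YAML files" step above: `git diff --name-only --diff-filter=A` lists files added by the PR, and `grep` keeps only paths matching `^evals/registry/evals/.*\.yaml$`. The same filter can be sketched in Python as below. The paths are hypothetical and `new_eval_yamls` is an illustrative name, not something the workflow defines.

```python
import re

# Same pattern the workflow passes to grep, anchored at both ends.
PATTERN = re.compile(r"^evals/registry/evals/.*\.yaml$")


def new_eval_yamls(added_paths):
    """Keep only added paths that look like registry eval YAML files."""
    return [path for path in added_paths if PATTERN.search(path)]


# Hypothetical output of `git diff --name-only --diff-filter=A`.
added = [
    "evals/registry/evals/my_eval.yaml",
    "evals/registry/data/my_eval/samples.jsonl",
    "docs/evals/registry/evals/other.yaml",
    "evals/registry/evals/notes.yml",
]

print(new_eval_yamls(added))
```

Because the pattern is anchored, data files, paths nested under another directory, and `.yml` files are all skipped; only the first hypothetical path survives.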
diff --git a/evals/.gitignore b/evals/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..e34e6b6665177d9c25db03b948485ac5c4e539d1
--- /dev/null
+++ b/evals/.gitignore
@@ -0,0 +1,3 @@
+__pycache__/
+evals.egg-info/
+.vscode/
\ No newline at end of file
diff --git a/evals/.pre-commit-config.yaml b/evals/.pre-commit-config.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..1e300e90eab0c4bddbe1b3dfab77f727db5a4ac7
--- /dev/null
+++ b/evals/.pre-commit-config.yaml
@@ -0,0 +1,29 @@
+repos:
+  - repo: https://github.com/psf/black
+    rev: 22.8.0
+    hooks:
+      - id: black
+        args: [--line-length=100, --exclude=""]
+
+  # this is not technically always safe but usually is
+  # use comments `# isort: off` and `# isort: on` to disable/re-enable isort
+  - repo: https://github.com/pycqa/isort
+    rev: 5.12.0
+    hooks:
+      - id: isort
+        args: [--line-length=100, --profile=black]
+
+  # this is slightly dangerous because python imports have side effects
+  # and this tool removes unused imports, which may be providing
+  # necessary side effects for the code to run
+  - repo: https://github.com/PyCQA/autoflake
+    rev: v1.6.1
+    hooks:
+      - id: autoflake
+        args:
+          - "--in-place"
+          - "--expand-star-imports"
+          - "--remove-duplicate-keys"
+          - "--remove-unused-variables"
+          - "--remove-all-unused-imports"
+        exclude: "evals/__init__.py"
diff --git a/evals/LICENSE b/evals/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..b3841f631d7f15f158bbb9e613227550828b5ff1
--- /dev/null
+++ b/evals/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 OpenAI
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/evals/MANIFEST.in b/evals/MANIFEST.in
new file mode 100644
index 0000000000000000000000000000000000000000..971853fe76c429c30f8dc816745ace991fbad685
--- /dev/null
+++ b/evals/MANIFEST.in
@@ -0,0 +1,3 @@
+recursive-include evals *.py
+recursive-include evals *.yaml
+recursive-include evals *.sql
diff --git a/evals/Makefile b/evals/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..a0f0126413209c6caa69eba330191f713954cad8
--- /dev/null
+++ b/evals/Makefile
@@ -0,0 +1,2 @@
+mypy:
+	mypy --config-file=mypy.ini --no-site-packages .
\ No newline at end of file
diff --git a/evals/README.md b/evals/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b6eae170489110f02369f4a7a92713fbb6e0bebd
--- /dev/null
+++ b/evals/README.md
@@ -0,0 +1,89 @@
+This is a fork of OpenAI's evals repo that allows models created outside of OpenAI to be evaluated on the same benchmarks. This provides an opportunity for apples-to-apples comparisons between AGI models of various origins, as long as their input and output specs are aligned.
+
+# Evals
+
+Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
+
+You can use Evals to create and run evaluations that:
+- use datasets to generate prompts,
+- measure the quality of completions provided by an OpenAI model, and
+- compare performance across different datasets and models.
+
+With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps **in order**:
+1. Read through this doc and follow the [setup instructions below](README.md#Setup).
+2. Learn how to run existing evals: [run-evals.md](docs/run-evals.md).
+3. Familiarize yourself with the existing eval templates: [eval-templates.md](docs/eval-templates.md).
+4. Walk through the process for building an eval: [build-eval.md](docs/build-eval.md)
+5. See an example of implementing custom eval logic: [custom-eval.md](docs/custom-eval.md).
+
+If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.
+
+____________________
+🚨 For a limited time, we will be granting GPT-4 access to those who contribute high quality evals. Please follow the instructions mentioned above and note that spam or low quality submissions will be ignored❗️
+
+Access will be granted to the email address associated with an accepted Eval. Due to high volume, we are unable to grant access to any email other than the one used for the pull request.
+____________________
+
+## Setup
+
+To run evals, you will need to set up and specify your OpenAI API key. You can generate one at . After you obtain an API key, specify it using the `OPENAI_API_KEY` environment variable. **Please be aware of the [costs](https://openai.com/pricing) associated with using the API when running evals.**
+
+**Minimal Required Version: Python 3.9**
+
+### Downloading evals
+
+Our Evals registry is stored using [Git-LFS](https://git-lfs.com/). Once you have downloaded and installed LFS, you can fetch the evals with:
+```sh
+git lfs fetch --all
+git lfs pull
+```
+
+You may just want to fetch data for a select eval. You can achieve this via:
+```sh
+git lfs fetch --include=evals/registry/data/${your eval}
+git lfs pull
+```
+
+### Making evals
+
+If you are going to be creating evals, we suggest cloning this repo directly from GitHub and installing the requirements using the following command:
+
+```sh
+pip install -e .
+```
+
+With `-e`, changes you make to your eval are reflected immediately without having to reinstall.
+
+### Running evals
+
+If you don't want to contribute new evals, but simply want to run them locally, you can install the evals package via pip:
+
+```sh
+pip install evals
+```
+
+We provide the option for you to log your eval results to a Snowflake database, if you have one or wish to set one up. For this option, you will further have to specify the `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USERNAME`, and `SNOWFLAKE_PASSWORD` environment variables.
+
+## FAQ
+
+Do you have any examples of how to build an eval from start to finish?
+
+- Yes! These are in the `examples` folder. We recommend that you also read through [build-eval.md](docs/build-eval.md) in order to gain a deeper understanding of what is happening in these examples.
+
+Do you have any examples of evals implemented in multiple different ways?
+
+- Yes! In particular, see `evals/registry/evals/coqa.yaml`. We have implemented small subsets of the [CoQA](https://stanfordnlp.github.io/coqa/) dataset for various eval templates to help illustrate the differences.
+
+When I run an eval, it sometimes hangs at the very end (after the final report). What's going on?
+
+- This is a known issue, but you should be able to interrupt it safely and the eval should finish immediately after.
+
+There's a lot of code, and I just want to spin up a quick eval. Help? OR,
+
+I am a world-class prompt engineer. I choose not to code. How can I contribute my wisdom?
+
+- If you follow an existing [eval template](docs/eval-templates.md) to build a basic or model-graded eval, you don't need to write any evaluation code at all! Just provide your data in JSON format and specify your eval parameters in YAML. [build-eval.md](docs/build-eval.md) walks you through these steps, and you can supplement these instructions with the Jupyter notebooks in the `examples` folder to help you get started quickly. Keep in mind, though, that a good eval will inevitably require careful thought and rigorous experimentation!
+
+## Disclaimer
+
+By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.
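
Note on the README's Setup section above: it names `OPENAI_API_KEY` as required and four `SNOWFLAKE_*` variables as needed only for Snowflake logging. A small pre-flight check along those lines might look like the sketch below. The helper `missing_vars` and the example values are hypothetical and are not part of the evals package.

```python
import os

REQUIRED = ["OPENAI_API_KEY"]
SNOWFLAKE = [
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_USERNAME",
    "SNOWFLAKE_PASSWORD",
]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical environment used for illustration instead of os.environ.
example_env = {"OPENAI_API_KEY": "sk-placeholder", "SNOWFLAKE_ACCOUNT": "acct"}

print(missing_vars(REQUIRED, example_env))
print(missing_vars(SNOWFLAKE, example_env))
```

Calling `missing_vars(REQUIRED)` with no second argument checks the real process environment.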
diff --git a/evals/SECURITY.md b/evals/SECURITY.md
new file mode 100644
index 0000000000000000000000000000000000000000..519fd9212ea898dbb56ab870b4e44a98eaa680bb
--- /dev/null
+++ b/evals/SECURITY.md
@@ -0,0 +1,4 @@
+# Security Policy
+For a more in-depth look at our security policy, please check out our [Coordinated Vulnerability Disclosure Policy](https://openai.com/security/disclosure/#:~:text=Disclosure%20Policy,-Security%20is%20essential&text=OpenAI%27s%20coordinated%20vulnerability%20disclosure%20policy,expect%20from%20us%20in%20return.).
+
+Our PGP key can be located [at this address.](https://cdn.openai.com/security.txt)
diff --git a/evals/docs/build-eval.md b/evals/docs/build-eval.md
new file mode 100644
index 0000000000000000000000000000000000000000..65386fd09a3ffe47096559ac2caac422cce07ee2
--- /dev/null
+++ b/evals/docs/build-eval.md
@@ -0,0 +1,85 @@
+# Building an eval
+
+This document walks through the end-to-end process for building an eval, which is a dataset and a choice of eval class. The `examples` folder contains Jupyter notebooks that follow the steps below to build several academic evals, thus helping to illustrate the overall process.
+
+The steps in this process are building your dataset, registering a new eval with your dataset, and running your eval. Crucially, we assume that you are using an [existing eval template](eval-templates.md) out of the box (if that's not the case, see [this example of building a custom eval](custom-eval.md)). If you are interested in contributing your eval publicly, we also include some criteria at the bottom for what we think makes an interesting eval.
+
+We are looking for evals in the following categories:
+
+- Over-refusals
+- Safety
+- System message steerability
+- In-the-wild hallucinations
+- Math / logical / physical reasoning
+- Real-world use case (please describe in your PR how this capability would be used in a product)
+- Other foundational capability
+
+If you have an eval that falls outside these categories but is still a diverse example, please contribute it!
+
+## Formatting your data
+
+Once you have an eval in mind that you wish to implement, you will need to convert your samples into the right JSON lines (JSONL) format. A JSONL file is a text file with exactly one JSON object per line.
+
+We include some examples of JSONL eval files in [registry/data/README.md](../evals/registry/data/README.md)
+
+Each JSON object will represent one data point in your eval. The keys you need in the JSON object depend on the eval template. All templates expect an `"input"` key which is the prompt, ideally specified in [chat format](https://platform.openai.com/docs/guides/chat/introduction) (though strings are also supported). We recommend chat format even if you are evaluating non chat models. If you are evaluating both chat and non chat models, we handle the conversion between chat formatted prompts and raw string prompts (see the conversion logic [here](../evals/prompt/base.py)).
+
+For the basic evals `Match`, `Includes`, and `FuzzyMatch`, the other required key is `"ideal"`, which is a string (or a list of strings) specifying the correct reference answer(s). For model-graded evals, the required keys vary based on the eval but are determined by the `{key}`s in the evaluation `prompt` that are not covered by the (optional) `args`.
+
+We have implemented small subsets of the [CoQA](https://stanfordnlp.github.io/coqa/) dataset for various eval templates to illustrate how the data should be formatted. See [`coqa/match.jsonl`](../evals/registry/data/coqa/match.jsonl) for an example of data that is suitable for the `Match` basic eval template and [`coqa/samples.jsonl`](../evals/registry/data/coqa/samples.jsonl) for data that is suitable for `fact` and `closedqa` model-graded evals. Note that even though these two model-graded evals expect different keys, we can include the superset of keys in our data in order to support both evals.
+
+If the dataset file is on your local machine, put the `jsonl` file in `evals/registry/data//samples.jsonl`. If it is in Cloud Object Storage, we support path-style URLs for the major clouds (for your personal use only, we will not accept PRs with cloud URLs).
+
+## Registering the eval
+
+Register the eval by adding a file to `evals/registry/evals/.yaml` using the elsuite registry format. For example, for a `Match` eval, it would be:
+```
+:
+  id: .dev.v0
+  metrics: [accuracy]
+
+.dev.v0:
+  class: evals.elsuite.basic.match:Match
+  args:
+    samples_jsonl: /samples.jsonl
+```
+
+Upon running the eval, the data will be searched for in `evals/registry/data`, e.g. if `test_match/samples.jsonl` is the provided filepath the data is expected to be in `evals/registry/data/test_match/samples.jsonl`.
+
+The naming convention for evals is in the form `..`.
+- `` is the eval name, used to group evals whose scores are comparable.
+- `` is the data split, used to further group evals that are under the same ``. E.g., "val", "test", or "dev" for testing.
+- `` is the version of the eval, which can be any descriptive text you'd like to use (though it's best if it does not contain ".").
+
+In general, running the same eval name against the same model should always give similar results so that others can reproduce it. Therefore, when you change your eval, you should bump the version.
+
+## Running the eval
+
+You can now run your eval on your data from the CLI with your choice of model:
+```
+oaieval gpt-3.5-turbo
+```
+Congratulations, you have built your eval! Keep iterating on it until you are confident in the results.
+
+## For model-graded evals: a step-by-step workflow
+
+We expect that the existing model-graded evals such as `fact`, `closedqa`, and `battle` will fit many use cases. However, other use cases may benefit from more customization, e.g., a different evaluation prompt. For these, there will be a bit more work involved, but generally still no coding required!
+
+1. If you can't use an existing model-graded eval, create a new YAML or create a new entry to an existing YAML in `evals/registry/modelgraded` to specify the [parameters](eval-templates.md#parameters-for-model-graded-evals) of your eval. See [`humor.yaml`](../evals/registry/modelgraded/humor.yaml) for an example.
+ - Note that, even if you are creating a new YAML, you may find it easiest to copy an existing YAML as a starting point. For example, model-graded evals which check a model completion against a rubric can copy `closedqa.yaml` and just edit the `args`.
+2. Next, you will create your dataset and register your eval, as described above. See [`joke_fruits_labeled.jsonl`](../evals/registry/data/test_metaeval/joke_fruits_labeled.jsonl) and [`joke-fruits`](../evals/registry/evals/test-modelgraded.yaml), for example.
+ - Note that it is recommended to specify `eval_type` at this step, when you register your eval, rather than step 1.
+3. Run your eval, e.g., `oaieval gpt-3.5-turbo joke-fruits`.
+4. (Recommended) Add a meta-eval for the model-graded eval! Each model-graded eval comes with a few knobs to tune, mainly `prompt` but also `eval_type`. In order to make sure the eval is of high quality, we recommend each model-graded eval contribution come with "choice labels", which are basically human-provided labels for which evaluation choice the model should have made. As an example (pretending that these jokes are actually funny), see the `"choice"` keys in [`joke_fruits_labeled.jsonl`](../evals/registry/data/test_metaeval/joke_fruits_labeled.jsonl), which are not used by the `joke-fruits` eval but are used by the [`joke-fruits-meta`](../evals/registry/evals/test-modelgraded.yaml) meta-eval right below it. After running the meta-eval, e.g., `oaieval gpt-3.5-turbo joke-fruits-meta`, the report will output `metascore/` accuracies, which should be close to "1.0" for a good model-graded eval.
+
+## Criteria for contributing an eval
+
+Important: if you are contributing code, make sure to run `pip install pre-commit; pre-commit install` before committing and pushing to ensure that `black`, `isort`, and `autoflake` are run.
+
+We are interested in curating a diverse and interesting set of evals on which to improve our models going forward. Here are some criteria for what we consider a good eval.
+- [ ] The eval should be thematically consistent. We'd like to see a number of prompts all revolving around the same use case, subject domain, failure mode, etc.
+- [ ] The eval should be challenging. If GPT-4 or GPT-3.5-Turbo do well on all of the prompts, this is not as interesting. Of course, the eval should also be possible given the models' limitations and constraints. Oftentimes, a good rule of thumb is whether a human (potentially a subject expert) could do well on the prompts.
+- [ ] The eval should be directionally clear. The data should include good signal around what is the right behavior. This means, for example, high-quality reference answers or an exhaustive rubric for evaluating answers.
+- [ ] The eval should be carefully crafted. Before you submit, you should think through whether you have engineered your prompts for good performance, whether you are using the best eval template, whether you have spot checked your results to ensure accuracy, etc.
+
+Once you are ready to contribute your eval publicly, submit a PR and the OpenAI team will be happy to look it over. Make sure to fill out all parts of the template that is prepopulated into the PR message. Note that submitting a PR does not guarantee that OpenAI will eventually merge it. We will run our own checks and use our best judgment when considering which evals to follow up with.
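
Note on "Formatting your data" in build-eval.md above: each line of the JSONL file is one JSON object with an `"input"` key (ideally in chat format) and, for the basic evals, an `"ideal"` key holding a string or a list of strings. The sketch below builds two such samples and serialises them one object per line. The sample contents and the helper `to_jsonl` are hypothetical illustrations, not data or code from this repository.

```python
import json

# Two hypothetical samples: "input" in chat format, "ideal" as a string or list of strings.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "2+2="},
        ],
        "ideal": "4",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "4*4="},
        ],
        "ideal": ["16", "sixteen"],
    },
]


def to_jsonl(records):
    """Serialise records as JSON lines: one compact JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


text = to_jsonl(samples)

# Round-trip check: every line parses back to the original object.
parsed = [json.loads(line) for line in text.splitlines()]
print(len(parsed), parsed == samples)
```

Writing `text` to a `samples.jsonl` file under the eval's data directory would then match the layout the registry expects.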
diff --git a/evals/docs/custom-eval.md b/evals/docs/custom-eval.md
new file mode 100644
index 0000000000000000000000000000000000000000..77c9753ecf4ed351dcf8354bfb8f850c61d0a622
--- /dev/null
+++ b/evals/docs/custom-eval.md
@@ -0,0 +1,148 @@
+# How to add a custom eval
+
+This tutorial will walk you through a simple example of writing and adding a custom eval. The example eval will test the model's ability to do basic arithmetic. We will assume that you have followed the setup instructions in the [README](../README.md) and gone through the other docs for how to run and build evals.
+
+When writing your own evals, the primary files of interest are:
+- `evals/api.py`, which provides common interfaces and utilities used by eval creators to sample from models and process the results,
+- `evals/record.py`, which defines the recorder classes which log eval results in different ways, such as to a local JSON file or to a remote Snowflake database, and
+- `evals/metrics.py`, which defines various common metrics of interest.
+
+These files provide a suite of tools for writing new evals. Once you have gone through this tutorial, you can see a more realistic example of these tools in action with the [machine translation](../evals/elsuite/translate.py) [eval example](../examples/lafand-mt.ipynb), which also implements custom eval logic in lieu of using an existing template.
+
+## Create your datasets
+
+The first step is to create the datasets for your eval. Here, we will create toy train and test sets of just two examples each. The test examples are what we will evaluate the model on, and we'll include the train examples as few-shot examples in the prompt to the model.
+
+We will use the new chat format described [here](https://platform.openai.com/docs/guides/chat/introduction). By default, we encourage all evals to be written using chat formatting if you want to evaluate our new models. Under the hood, we [convert](../evals/prompt/base.py) chat formatted data into raw strings for older non chat models.
+
+To create the toy datasets, in your terminal, type:
+```bash
+echo -e '{"problem": "2+2=", "answer": "4"}\n{"problem": "4*4=", "answer": "16"}' > /tmp/train.jsonl
+echo -e '{"problem": "48+2=", "answer": "50"}\n{"problem": "5*20=", "answer": "100"}' > /tmp/test.jsonl
+```
+
+## Create an eval
+
+The next step is to write a Python class that represents the actual evaluation. This class uses your datasets to create prompts, which are passed to the model to generate completions. Evaluation classes generally will inherit from the `evals.Eval` base class (defined in `evals/eval.py`) and will override two methods: `eval_sample` and `run`.
+
+Let's create a file called `arithmetic.py` under the `evals/elsuite` folder. We'll start by defining the eval class. Its `__init__` method will take in the arguments we need (references to the train and test sets) along with other `kwargs` that will be handled by the base class. We'll also define the `run` method which takes in a `recorder` and returns the final metrics of interest.
+
+```python
+import random
+import textwrap
+
+import evals
+import evals.metrics
+
+class Arithmetic(evals.Eval):
+    def __init__(self, train_jsonl, test_jsonl, train_samples_per_prompt=2, **kwargs):
+        super().__init__(**kwargs)
+        self.train_jsonl = train_jsonl
+        self.test_jsonl = test_jsonl
+        self.train_samples_per_prompt = train_samples_per_prompt
+
+    def run(self, recorder):
+        """
+        Called by the `oaieval` CLI to run the eval. The `eval_all_samples` method calls `eval_sample`.
+        """
+        self.train_samples = evals.get_jsonl(self.train_jsonl)
+        test_samples = evals.get_jsonl(self.test_jsonl)
+        self.eval_all_samples(recorder, test_samples)
+
+        # Record overall metrics
+        return {
+            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
+        }
+```
+
+Generally, most `run` methods will follow the same pattern shown here: loading the data, calling `eval_all_samples`, and aggregating the results (in this case, using the `get_accuracy` function in `evals/metrics.py`). `eval_all_samples` takes in both the `recorder` and the `test_samples` and, under the hood, will call the `eval_sample` method on each sample in `test_samples`. So let's write that `eval_sample` method now:
+
+```python
+    def eval_sample(self, test_sample, rng: random.Random):
+        """
+        Called by the `eval_all_samples` method to evaluate a single sample.
+
+        ARGS
+        ====
+        `test_sample`: a line from the JSONL test file
+        `rng`: should be used for any randomness that is needed during evaluation
+
+        This method does the following:
+        1. Generate a prompt that contains the task statement, a few examples, and the test question.
+        2. Check if the model generates the correct answer.
+        """
+        stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
+
+        prompt = [
+            {"role": "system", "content": "Solve the following math problems"},
+        ]
+
+        for i, sample in enumerate(stuffing + [test_sample]):
+            if i < len(stuffing):
+                prompt += [
+                    {"role": "system", "content": sample["problem"], "name": "example_user"},
+                    {"role": "system", "content": sample["answer"], "name": "example_assistant"},
+                ]
+            else:
+                prompt += [{"role": "user", "content": sample["problem"]}]
+
+        evals.check_sampled_text(self.model_spec, prompt, expected=test_sample["answer"])
+```
+You'll notice that `eval_sample` doesn't take the `recorder` as an argument. This is because `eval_all_samples` sets it to be the default recorder before calling `eval_sample`, and the recording utilities defined in `evals/record.py` use the default recorder. In this example, the `eval_sample` method passes off a lot of the heavy lifting to the `evals.check_sampled_text` utility function, which is defined in `evals/api.py`. This utility function queries the model, defined by `self.model_spec`, with the given `prompt` and checks whether the result matches the `expected` answer (or one of them, if given a list). It then records these matches (or non-matches) using the default recorder.
+
+`eval_sample` methods may vary greatly based on your use case. If you are building custom evals, it is a good idea to be familiar with the functions available to you in `evals/record.py`, `evals/metrics.py`, and especially `evals/api.py`.
+
+## Register your eval
+
+The next step is to register your eval in the registry so that it can be run using the `oaieval` CLI.
+
+Let's create a file called `arithmetic.yaml` under the `evals/registry/evals` folder and add an entry for our eval as follows:
+
+```yaml
+# Define a base eval
+arithmetic:
+  # id specifies the eval that this eval is an alias for
+  # in this case, arithmetic is an alias for arithmetic.dev.match-v1
+  # When you run `oaieval davinci arithmetic`, you are actually running `oaieval davinci arithmetic.dev.match-v1`
+  id: arithmetic.dev.match-v1
+  # The metrics that this eval records
+  # The first metric will be considered to be the primary metric
+  metrics: [accuracy]
+  description: Evaluate arithmetic ability
+# Define the eval
+arithmetic.dev.match-v1:
+  # Specify the class name as a dotted path to the module and class
+  class: evals.elsuite.arithmetic:Arithmetic
+  # Specify the arguments as a dictionary of JSONL URIs
+  # These arguments can be anything that you want to pass to the class constructor
+  args:
+    train_jsonl: /tmp/train.jsonl
+    test_jsonl: /tmp/test.jsonl
+```
+
+The `args` field should match the arguments that your eval class `__init__` method expects.
+
+## Run your eval
+
+The final step is to run your eval and view the results.
+
+```sh
+pip install . # you can omit this if you used `pip install -e .` to install
+oaieval gpt-3.5-turbo arithmetic
+```
+
+If you run with the `gpt-3.5-turbo` model, you should see an output similar to this (we have cleaned up the output here slightly for readability):
+
+```
+% oaieval gpt-3.5-turbo arithmetic
+... [registry.py:147] Loading registry from .../evals/registry/evals
+... [registry.py:147] Loading registry from .../.evals/evals
+... [oaieval.py:139] Run started:
+... [eval.py:32] Evaluating 2 samples
+... [eval.py:138] Running in threaded mode with 1 threads!
+100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 3.35it/s]
+... [record.py:320] Final report: {'accuracy': 1.0}. Logged to /tmp/evallogs/_gpt-3.5-turbo_arithmetic.jsonl
+... [oaieval.py:170] Final report:
+... [oaieval.py:172] accuracy: 1.0
+... [record.py:309] Logged 6 rows of events to /tmp/evallogs/_gpt-3.5-turbo_arithmetic.jsonl: insert_time=2.038ms
+```
diff --git a/evals/docs/eval-templates.md b/evals/docs/eval-templates.md
new file mode 100644
index 0000000000000000000000000000000000000000..ab507781ff7c619a42601822cab0a361844b197f
--- /dev/null
+++ b/evals/docs/eval-templates.md
@@ -0,0 +1,61 @@
+# Existing templates for evals
+
+In using Evals, we have discovered several "templates" that accommodate many different benchmarks. We have implemented these templates in `evals/elsuite` in order to simplify the development of new evals. We believe that, with these templates, many evals will not require any coding to implement! Instead, you'll pick one of the existing templates and simply specify the dataset and parameters.
+
+## Basic eval templates
+
+In cases where the desired model response has very little variation, such as answering multiple choice questions or simple questions with a straightforward answer, we have found the following templates to be useful.
+
+For a model completion `a` and a reference list of correct answers `B`, the following evals implement:
+- [`basic/match.py:Match`](../evals/elsuite/basic/match.py): `any([a.startswith(b) for b in B])`
+- [`basic/includes.py:Includes`](../evals/elsuite/basic/includes.py): `any([(b in a) for b in B])`
+- [`basic/fuzzy_match.py:FuzzyMatch`](../evals/elsuite/basic/fuzzy_match.py): `any([(a in b or b in a) for b in B])`
+
+Which eval template you use will depend on your use case. It is always recommended that you inspect the completions from your model, as this will help you determine how and whether to tweak your prompt (or your reference answers) and pick your eval template. Academic benchmarks oftentimes fit the mold of these basic evals, and we have implemented several end-to-end examples of academic evals as Jupyter notebooks in the `examples` folder.
+
+Sometimes, [custom eval logic](custom-eval.md) will better suit your needs. One example of this is the [machine translation](../evals/elsuite/translate.py) [eval example](../examples/lafand-mt.ipynb), in which there is a unique and clearly defined metric that we wish to use in our eval. You should use your best judgment when deciding between custom eval logic, using a basic eval template, or using model-graded evals as described next.
+
+## The model-graded eval template
+
+In cases where the desired model response can contain significant variation, such as answering an open-ended question, we have found that using the model to grade itself is a viable strategy for automated evaluation. In general, the evaluation model and the model being evaluated don't have to be the same, though we will assume that they are here for ease of explanation.
+
+[`modelgraded/classify.py:ModelBasedClassify`](../evals/elsuite/modelgraded/classify.py) implements the main logic behind our model-graded eval template. In short, we get the model's completion to the original prompt, wrap it in an evaluation prompt, and get the model's completion to the evaluation prompt, which we parse into our metrics of interest. Crucially, the evaluation prompt should prime the model to answer in such a way that is easily parsable, e.g., in multiple choice format or with a simple yes/no. We describe some example model-graded evals below, but first we specify the parameters for this eval template.
+
+### Parameters for model-graded evals
+
+Refer to the [`classify.py:ModelBasedClassify`](../evals/elsuite/modelgraded/classify.py) class to see how these parameters are used in the code.
+
+- `prompt`: The evaluation prompt which should take in the model's completion to the original prompt, potentially along with some other information, and steer the model to provide an evaluation that is easily parsable. Portions denoted by curly braces (i.e., `{key}`) are filled in either from the data `input_outputs` or the additional `args` (see below).
+- `input_outputs`: A mapping specifying which inputs to use to generate which completions. For many evals, there will only be a single input-completion pair, though there can be more, e.g., when comparing two completions against each other.
+- `choice_strings`: The choices that we expect the model completion to contain given the evaluation prompt. For example, `"ABCDE"` or `["Yes", "No", "Unsure"]`. Any other choices returned by the model are parsed into `"__invalid__"`.
+- `choice_scores` (optional): A mapping of each choice to its score, which is logged as a metric. For example, if a response of `"Yes"` (resp. `"No"`) indicates that the model's original completion was good (resp. bad), we may assign this choice a score of 1 (resp. 0).
+- `eval_type` (optional): How we expect the model to format its response to the evaluation prompt. Currently the supported options are:
+  - `"cot_classify"` ("chain-of-thought then classify", i.e., reason then answer) expects that the parsable portion of the response (i.e., the portion containing the choice) will be at the end of the completion. We recommend this as the default, as it typically provides the most accurate model-graded evaluations.
+ - `"classify_cot"` (answer then reason) expects that the model response will contain the choice first.
+ - `"classify"` expects that the model response will only contain the choice.
+
+ There are two ways to specify `eval_type`. The recommended way is in the `evals/registry/evals` YAML file. If done this way, an instruction will automatically be appended to `prompt` to steer the model towards the expected format (see `ANSWER_PROMPTS` in [the code](../evals/elsuite/modelgraded/classify.py)). Alternatively, you may specify `eval_type` in the `evals/registry/modelgraded` YAML, but you will need to include an appropriate instruction directly in the `prompt`.
+- `args` (optional): If specified, multiple evaluation calls will be made where the evaluation prompt is modified for each call with a different set of arguments.
+- `completion_sample_templates` (optional): If specified, determines how the model's output (or outputs, if `multicomp_n > 1`) will be formatted within the completion.
+
+### Example model-graded evals
+
+To instantiate model-graded evals, create a YAML file in `evals/registry/modelgraded` which specifies values for the arguments described above. We have provided a few examples, which illustrate the process for creating a model-graded eval, but which we also believe are general enough to be useful out of the box for many evals.
+
+[`fact.yaml`](../evals/registry/modelgraded/fact.yaml): a factual consistency eval which, given a completion `a` and reference answer `b`, returns:
+- `"A"` if `a` $\subseteq$ `b`, i.e., the submitted answer is a subset of the expert answer and is fully consistent with it.
+- `"B"` if `a` $\supseteq$ `b`, i.e., the submitted answer is a superset of the expert answer and is fully consistent with it.
+- `"C"` if `a` $=$ `b`, i.e., the submitted answer contains all the same details as the expert answer.
+- `"D"` if `a` $\neq$ `b`, i.e., there is a disagreement between the submitted answer and the expert answer.
+- `"E"` if `a` $\approx$ `b`, i.e., the answers differ, but these differences don't matter from the perspective of factuality.
+
+[`closedqa.yaml`](../evals/registry/modelgraded/closedqa.yaml): a question answering eval which, given a prompt containing a question and the necessary information to answer the question, checks whether the model's answer is:
+- relevant, i.e., extracted from the information provided in the prompt,
+- concise, i.e., did not contain unnecessary details or information, and
+- correct, i.e., uses the extracted information to come to the right conclusion.
+
+Note that this eval is implemented more generally as a "criteria-checking" eval which specifies the evaluation prompt as checking a given criterion and feeding in the above desiderata one by one. We believe that many other evals can be implemented by specifying a "rubric" detailing the criteria of interest and following the same prompt and yes/no choices.
+
+[`battle.yaml`](../evals/registry/modelgraded/battle.yaml): a head-to-head eval which compares two model completions for two potentially different prompts. `choice_scores` is used here to log how often the first completion is judged to be better than the second.
+
+We include additional examples which test more specific model capabilities (such as humor) and are thus less generalizable to other evals. However, these examples still serve to illustrate different ways to write evaluation prompts and set up model-graded evals. See [this section](build-eval.md#for-model-graded-evals-a-step-by-step-workflow) for more detailed steps on building model-graded evals.
diff --git a/evals/docs/run-evals.md b/evals/docs/run-evals.md
new file mode 100644
index 0000000000000000000000000000000000000000..958b466b896bca4649a5988bd57095d127fc2f50
--- /dev/null
+++ b/evals/docs/run-evals.md
@@ -0,0 +1,37 @@
+# How to run evals
+
+We provide two command line interfaces (CLIs): `oaieval` for running a single eval and `oaievalset` for running a set of evals.
+
+## Running an eval
+
+When using the `oaieval` command, you will need to provide both the model you wish to evaluate as well as the eval to run. E.g.,
+```sh
+oaieval gpt-3.5-turbo test-match
+```
+
+In this example, `gpt-3.5-turbo` is the model to evaluate, and `test-match` is the eval to run. The valid model names are those which you have access to via the API. The valid eval names are specified in the YAML files under `evals/registry/evals`, and their corresponding implementations can be found in `evals/elsuite`.
+
+These CLIs can accept various flags to modify their default behavior. For example:
+- If you wish to log to a Snowflake database (which you have already set up as described in the [README](../README.md)), add `--no-local-run`.
+- By default, logging locally or to Snowflake will write to `/tmp/evallogs`, and you can change this by setting a different `--record_path`.
+
+You can run `oaieval --help` to see a full list of CLI options.
+
+## Running an eval set
+
+```sh
+oaievalset gpt-3.5-turbo test
+```
+
+Similarly, `oaievalset` also expects a model name and an eval set name, for which the valid options are specified in the YAML files under `evals/registry/eval_sets`.
+
+By default we run with 10 threads, and each thread times out and restarts after 40 seconds. You can configure this, e.g.,
+
+```sh
+EVALS_THREADS=42 EVALS_THREAD_TIMEOUT=600 oaievalset gpt-3.5-turbo test
+```
+Running with more threads will make the eval faster, though keep in mind the costs and your [rate limits](https://platform.openai.com/docs/guides/rate-limits/overview). Running with a higher thread timeout may be necessary if you expect each sample to take a long time, e.g., the data contain long prompts that elicit long responses from the model.
+
+If you have to stop your run or your run crashes, we've got you covered! `oaievalset` records the evals that finished in `/tmp/oaievalset/{model}.{eval_set}.progress.txt`. You can simply rerun the command to pick up where you left off. If you want to run the eval set starting from the beginning, delete this progress file.
+
+Unfortunately, you can't resume a single eval from the middle. You'll have to restart from the beginning, so try to keep your individual evals quick to run.
diff --git a/evals/evals/__init__.py b/evals/evals/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..f21e6087208ed6212637ea6b13dace5382ff643f
--- /dev/null
+++ b/evals/evals/__init__.py
@@ -0,0 +1,4 @@
+from .api import check_sampled_text, completion_query, sample_freeform
+from .base import ModelSpec, ModelSpecs
+from .data import get_csv, get_json, get_jsonl, get_jsonls, get_lines, iter_jsonls
+from .eval import Eval
diff --git a/evals/evals/api.py b/evals/evals/api.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c45ffeacf0468ba79ff7f8a029c0c4839e6d940
--- /dev/null
+++ b/evals/evals/api.py
@@ -0,0 +1,263 @@
+"""
+This file provides common interfaces and utilities used by eval creators to
+sample from models and process the results.
+"""
+
+import logging
+from typing import Callable, Optional, Union
+
+from evals.base import ModelSpec
+from evals.prompt.base import (
+ ChatCompletionPrompt,
+ CompletionPrompt,
+ OpenAICreateChatPrompt,
+ OpenAICreatePrompt,
+ Prompt,
+)
+from evals.record import record_match, record_sampling
+from evals.utils.api_utils import (
+ openai_chat_completion_create_retrying,
+ openai_completion_create_retrying,
+ agi_completion_create_retrying,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def completion_query(
+    model_spec: ModelSpec,
+    prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
+    **kwargs,
+) -> tuple[dict, Union[OpenAICreatePrompt, OpenAICreateChatPrompt], dict]:
+    """
+    Query the API for a completion.
+
+    ARGS
+    ====
+    `model_spec`: `ModelSpec` containing model details to use in the query.
+        This should be the dict returned by `registry.get_model()`.
+        If `model_spec` is not provided, we use the default model that was
+        initialized at the beginning of the run.
+    `prompt`: Either a `Prompt` object or a raw prompt that will get wrapped in
+        the appropriate `Prompt` class.
+    `kwargs`: Other arguments passed to the API.
+
+    RETURNS
+    =======
+    The result of the API call.
+    The prompt that was fed into the API call, as returned by `to_openai_create_prompt()`.
+    A dict containing metadata about the query.
+    """
+    if not isinstance(prompt, Prompt):
+        assert (
+            isinstance(prompt, str)
+            or (isinstance(prompt, list) and all(isinstance(token, int) for token in prompt))
+            or (isinstance(prompt, list) and all(isinstance(token, str) for token in prompt))
+            or (isinstance(prompt, list) and all(isinstance(msg, dict) for msg in prompt))
+        ), f"Got type {type(prompt)}, with val {type(prompt[0])} for prompt, expected str or list[int] or list[str] or list[dict[str, str]]"
+
+        if model_spec.is_chat:
+            prompt = ChatCompletionPrompt(
+                raw_prompt=prompt,
+            )
+        else:
+            prompt = CompletionPrompt(
+                raw_prompt=prompt,
+            )
+
+    openai_create_prompt: Union[
+        OpenAICreatePrompt, OpenAICreateChatPrompt
+    ] = prompt.to_openai_create_prompt()
+
+    extra_args = {
+        key: model_spec.extra_options.get(key, kwargs.get(key))
+        for key in set(kwargs) | set(model_spec.extra_options)
+    }
+
+    if model_spec.is_agi:
+        result = agi_completion_create_retrying(
+            model=model_spec.model,
+            api_base=model_spec.api_base,
+            messages=openai_create_prompt,
+            **extra_args,
+        )
+    elif model_spec.is_chat:
+        result = openai_chat_completion_create_retrying(
+            model=model_spec.model,
+            api_base=model_spec.api_base,
+            api_key=model_spec.api_key,
+            messages=openai_create_prompt,
+            **extra_args,
+        )
+    else:
+        result = openai_completion_create_retrying(
+            model=model_spec.model,
+            api_base=model_spec.api_base,
+            api_key=model_spec.api_key,
+            prompt=openai_create_prompt,
+            **extra_args,
+        )
+
+    metadata = {}
+    if result:
+        metadata["completion_id"] = result.get("id", None)
+        metadata["model"] = result.get("model", None)
+
+        if model_spec.is_chat:
+            for choice in result["choices"]:
+                choice["text"] = choice["message"]["content"]
+
+    return result, openai_create_prompt, metadata
+
+
+def check_sampled_text(
+    model_spec: ModelSpec,
+    prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
+    expected: Union[str, list[str], tuple[str]],
+    *,
+    options: Optional[list[str]] = None,
+    separator: Optional[Callable[[str], bool]] = None,
+) -> Optional[str]:
+    """
+    Generates a completion using the prompt, checks whether the completion is
+    one of the expected completions, and then records the result.
+
+    ARGS
+    ====
+    `model_spec`: See `completion_query`.
+    `prompt`: See `completion_query`.
+    `options`: The list of canonical options, defaults to `expected` if None.
+        The completion will be converted to one of these options.
+    `expected`: The desired completion or the list of desired completions.
+    `separator`: A callable which checks the character sampled after the option
+        to see if it is a valid separator.
+
+    RETURNS
+    =======
+    The option that was picked, i.e., matched the completion, or None.
+    """
+    if isinstance(expected, tuple):
+        expected = list(expected)
+    elif not isinstance(expected, list):
+        expected = [expected]
+    if options is None:
+        options = expected
+
+    result, actual_prompt, metadata = completion_query(
+        prompt=prompt,
+        temperature=0.0,
+        model_spec=model_spec,
+    )
+    choice = result["choices"][0]
+
+    sampled = choice["text"].strip() if model_spec.strip_completion else choice["text"]
+
+    picked = None
+    for option in options:
+        if not sampled.startswith(option):
+            continue
+        if (
+            separator is not None
+            and len(sampled) > len(option)
+            and not separator(sampled[len(option)])
+        ):
+            continue
+        picked = option
+        break
+
+    result = {
+        "prompt": actual_prompt,
+        "sampled": sampled,
+        "options": options,
+        "picked": picked,
+    }
+    match = picked in expected
+    result["expected"] = expected
+    result["match"] = match
+    result["metadata"] = metadata
+    record_sampling(**result)
+    record_match(match, expected=expected, picked=picked, sampled=sampled)
+    return picked
+
+
+def sample_freeform(
+    model_spec: ModelSpec,
+    prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt, Prompt],
+    *,
+    temperature: float = 1.0,
+    top_p: float = 0.9,
+    max_tokens: int = 512,
+    stop: Optional[str] = None,
+    n_samples: Optional[int] = None,
+    return_logprobs: bool = False,
+    **kwargs,
+) -> Union[str, list[str], dict]:
+    """
+    Samples a freeform response from the specified model, records the sampling,
+    and returns the sampled text.
+
+    ARGS
+    ====
+    `model_spec`: See `completion_query`.
+    `prompt`: See `completion_query`.
+    `temperature`: Passed to `openai.Completion.create`.
+    `top_p`: Passed to `openai.Completion.create`.
+    `max_tokens`: Passed to `openai.Completion.create`.
+    `stop`: Passed to `openai.Completion.create`.
+    `n_samples`: The number of samples to generate (1 if None).
+    `return_logprobs`: If True, returns the tokens and corresponding logprobs
+        in addition to the sampled text.
+    `kwargs`: See `completion_query`.
+
+    RETURNS
+    =======
+    If `return_logprobs` is True, returns a dict with the sampled text, tokens,
+        and corresponding logprobs. If `n_samples` is None, the outer list is
+        removed from all values.
+    Otherwise, returns the sampled text, or a list of sampled texts if
+        `n_samples` is not None.
+    """
+    response, actual_prompt, metadata = completion_query(
+        prompt=prompt,
+        temperature=temperature,
+        top_p=top_p,
+        max_tokens=max_tokens,
+        stop=stop,
+        n=(1 if n_samples is None else n_samples),
+        model_spec=model_spec,
+        headers={},
+        **kwargs,
+    )
+    sampled = [choice["text"] for choice in response["choices"]]
+    if n_samples is None:
+        sampled = sampled[0]
+    record_sampling(prompt=actual_prompt, sampled=sampled, metadata=metadata)
+
+    if return_logprobs:
+        assert not model_spec.is_chat, "logprobs only works for non-chat models"
+        assert kwargs.get("logprobs") is not None
+
+        def _maybe_tokens(logprobs: Optional[dict]) -> Optional[list[str]]:
+            return logprobs["tokens"] if logprobs is not None else None
+
+        def _maybe_logprobs(logprobs: Optional[dict]) -> Optional[list[float]]:
+            return logprobs["token_logprobs"] if logprobs is not None else None
+
+        def _maybe_top_logprobs(logprobs: Optional[dict]) -> Optional[list[dict[str, float]]]:
+            return [dict(x) for x in logprobs["top_logprobs"]] if logprobs is not None else None
+
+        tokens = [_maybe_tokens(choice["logprobs"]) for choice in response["choices"]]
+        logprobs = [_maybe_logprobs(choice["logprobs"]) for choice in response["choices"]]
+        top_logprobs = [_maybe_top_logprobs(choice["logprobs"]) for choice in response["choices"]]
+        if n_samples is None:
+            tokens = tokens[0]
+            logprobs = logprobs[0]
+            top_logprobs = top_logprobs[0]
+        return {
+            "text": sampled,
+            "tokens": tokens,
+            "logprobs": logprobs,
+            "top_logprobs": top_logprobs,
+        }
+
+    return sampled
diff --git a/evals/evals/base.py b/evals/evals/base.py
new file mode 100644
index 0000000000000000000000000000000000000000..a596c88feded1bd1ba16314482147ac621f02a96
--- /dev/null
+++ b/evals/evals/base.py
@@ -0,0 +1,153 @@
+"""
+This file defines the base specifications for models, evals, and runs. Running
+evals and most development work should not require familiarity with this file.
+"""
+import base64
+import datetime
+import os
+from typing import TYPE_CHECKING, Any, Dict, Mapping, Optional, Sequence
+
+if TYPE_CHECKING:
+    from dataclasses import dataclass
+else:
+    from pydantic.dataclasses import dataclass
+
+
+@dataclass
+class ModelSpec:
+    """
+    Specification for a model.
+    """
+
+    name: str
+    model: Optional[str] = None
+    api_base: Optional[str] = None
+
+    is_chat: bool = False
+    is_agi: bool = False
+
+    encoding: Optional[str] = None
+    organization: Optional[str] = None
+    api_key: Optional[str] = None
+    extra_options: Optional[Mapping[str, Any]] = None
+    metadata: Optional[Mapping[str, Any]] = None
+    headers: Optional[Mapping[str, Any]] = None
+    strip_completion: bool = True
+    n_ctx: Optional[int] = None
+    format: Optional[str] = None
+    key: Optional[str] = None
+    group: Optional[str] = None
+
+    def __post_init__(self):
+        if self.extra_options is None:
+            self.extra_options = {}
+        if self.headers is None:
+            self.headers = {}
+
+        if self.model is None:
+            raise ValueError("Must specify a model")
+
+
+@dataclass
+class BaseEvalSpec:
+    """
+    Specification for a base eval.
+    """
+
+    id: Optional[str] = None
+    metrics: Optional[Sequence[str]] = None
+    description: Optional[str] = None
+    disclaimer: Optional[str] = None
+
+    """
+    True if higher values are better, False if lower values are better.
+    This should really be part of a metric, but it's easier to put it here.
+    """
+    higher_is_better: bool = True
+
+    key: Optional[str] = None
+    group: Optional[str] = None
+
+
+@dataclass
+class EvalSpec:
+    """
+    Specification for an eval.
+    """
+
+    cls: str
+    args: Optional[Dict[str, Any]] = None
+    key: Optional[str] = None
+    group: Optional[str] = None
+
+
+@dataclass
+class EvalSetSpec:
+    """
+    Specification for an eval set.
+    """
+
+    evals: Sequence[str]
+    key: Optional[str] = None
+    group: Optional[str] = None
+
+
+@dataclass
+class ModelSpecs:
+    completions_: Optional[Sequence[ModelSpec]] = None
+    embedding_: Optional[ModelSpec] = None
+    ranking_: Optional[ModelSpec] = None
+
+    @property
+    def embedding(self) -> ModelSpec:
+        if self.embedding_ is None:
+            raise ValueError("Embedding model was not specified")
+        return self.embedding_
+
+    @property
+    def ranking(self) -> ModelSpec:
+        if self.ranking_ is None:
+            raise ValueError("Ranking model was not specified")
+        return self.ranking_
+
+    @property
+    def completion(self) -> ModelSpec:
+        if self.completions_ is None:
+            raise ValueError("Completion model was not specified")
+        return self.completions_[0]
+
+    @property
+    def completions(self) -> Sequence[ModelSpec]:
+        if self.completions_ is None:
+            raise ValueError("Completion model was not specified")
+        return self.completions_
+
+    @property
+    def names(self) -> dict[str, Sequence[str]]:
+        names = {}  # avoid shadowing the built-in `dict`
+        if self.completions_ is not None:
+            names["completions"] = [model.name for model in self.completions_]
+        if self.embedding_ is not None:
+            names["embedding"] = [self.embedding_.name]
+        if self.ranking_ is not None:
+            names["ranking"] = [self.ranking_.name]
+        return names
+
+
+@dataclass
+class RunSpec:
+    model_name: str
+    model_names: dict[str, Sequence[str]]
+    eval_name: str
+    base_eval: str
+    split: str
+    run_config: Dict[str, Any]
+    created_by: str
+    run_id: Optional[str] = None
+    created_at: Optional[str] = None
+
+    def __post_init__(self):
+        now = datetime.datetime.utcnow()
+        rand_suffix = base64.b32encode(os.urandom(5)).decode("ascii")
+        self.run_id = now.strftime("%y%m%d%H%M%S") + rand_suffix
+        self.created_at = str(now)
diff --git a/evals/evals/cli/oaieval.py b/evals/evals/cli/oaieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..0105524d19db791d1ee9a2b80cb66b52952b2b81
--- /dev/null
+++ b/evals/evals/cli/oaieval.py
@@ -0,0 +1,274 @@
+"""
+This file defines the `oaieval` CLI for running evals.
+"""
+import argparse
+import logging
+import shlex
+import sys
+from functools import cached_property
+from typing import Any, Mapping, Optional
+
+import openai
+
+import evals
+import evals.api
+import evals.base
+import evals.record
+from evals.base import ModelSpec, ModelSpecs
+from evals.registry import Registry
+
+logger = logging.getLogger(__name__)
+
+
+def _purple(text):
+    return f"\033[1;35m{text}\033[0m"
+
+
+def get_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Run evals through the API")
+    parser.add_argument("model", type=str, help="Name of a completion model.")
+    parser.add_argument("eval", type=str, help="Name of an eval. See registry.")
+    parser.add_argument("--embedding_model", type=str, default="")
+    parser.add_argument("--ranking_model", type=str, default="")
+    parser.add_argument("--extra_eval_params", type=str, default="")
+    parser.add_argument("--modelspec_extra_options", type=str, default="")
+    parser.add_argument("--max_samples", type=int, default=None)
+    parser.add_argument("--cache", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--visible", action=argparse.BooleanOptionalAction, default=None)
+    parser.add_argument("--seed", type=int, default=20220722)
+    parser.add_argument("--user", type=str, default="")
+    parser.add_argument("--record_path", type=str, default=None)
+    parser.add_argument(
+        "--log_to_file", type=str, default=None, help="Log to a file instead of stdout"
+    )
+    parser.add_argument("--debug", action=argparse.BooleanOptionalAction, default=False)
+    parser.add_argument("--local-run", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--dry-run", action=argparse.BooleanOptionalAction, default=False)
+    parser.add_argument("--dry-run-logging", action=argparse.BooleanOptionalAction, default=True)
+    return parser
+
+
+def parse_extra_eval_params(param_str: Optional[str]) -> Mapping[str, Any]:
+ """Parse a string of the form "key1=value1,key2=value2" into a dict."""
+ if not param_str:
+ return {}
+
+ def to_number(x):
+ try:
+ return int(x)
+ except ValueError:
+ pass
+ try:
+ return float(x)
+ except ValueError:
+ pass
+ return x
+
+ str_dict = dict(kv.split("=", 1) for kv in param_str.split(","))
+ return {k: to_number(v) for k, v in str_dict.items()}
+
+
+def n_ctx_from_model_name(model_name: str) -> Optional[int]:
+ """Returns n_ctx for a given API model name. Model list last updated 2023-03-14."""
+ # note that for most models, the max tokens is n_ctx + 1
+ DICT_OF_N_CTX_BY_MODEL_NAME_PREFIX: dict[str, int] = {
+ "dummy-": 2048,
+ "gpt-3.5-turbo-": 4096,
+ "gpt-4-32k-": 32768,
+ "gpt-4-": 8192,
+ "agi-": 128,
+ }
+ DICT_OF_N_CTX_BY_MODEL_NAME: dict[str, int] = {
+ "ada": 2048,
+ "text-ada-001": 2048,
+ "babbage": 2048,
+ "text-babbage-001": 2048,
+ "curie": 2048,
+ "text-curie-001": 2048,
+ "davinci": 2048,
+ "text-davinci-001": 2048,
+ "code-davinci-002": 8000,
+ "text-davinci-002": 4096,
+ "text-davinci-003": 4096,
+ "gpt-3.5-turbo": 4096,
+ "gpt-3.5-turbo-0301": 4096,
+ "gpt-4": 8192,
+ "gpt-4-0314": 8192,
+ "gpt-4-32k": 32768,
+ "gpt-4-32k-0314": 32768,
+ "agi-7B": 128,
+ "agi-13B": 128,
+ "agi-17B": 128,
+ "agi-30B": 128,
+ "agi-65B": 128,
+ }
+ # first, look for a prefix match
+ for model_prefix, n_ctx in DICT_OF_N_CTX_BY_MODEL_NAME_PREFIX.items():
+ if model_name.startswith(model_prefix):
+ return n_ctx
+ # otherwise, look for an exact match and return None if not found
+ return DICT_OF_N_CTX_BY_MODEL_NAME.get(model_name, None)
+
+
+class ModelResolver:
+ # This is a temporary method to identify which models are chat models.
+ # Eventually, the OpenAI API should expose this information directly.
+ CHAT_MODELS = {
+ "gpt-3.5-turbo",
+ "gpt-3.5-turbo-0301",
+ "gpt-4",
+ "gpt-4-0314",
+ "gpt-4-32k",
+ "gpt-4-32k-0314",
+ "dummy-chat",
+ "agi-7B",
+ "agi-13B",
+ "agi-17B",
+ "agi-30B",
+ "agi-65B",
+ }
+
+ AGI_MODELS = {
+ "agi-7B",
+ "agi-13B",
+ "agi-17B",
+ "agi-30B",
+ "agi-65B",
+ }
+
+ AGI_MODEL_IDS = list(AGI_MODELS)
+
+ DUMMY_MODELS = {
+ "dummy-chat",
+ "dummy-completion",
+ }
+
+ def resolve(self, name: str) -> ModelSpec:
+ if name in self.DUMMY_MODELS:
+ result = ModelSpec(name=name, model=name, is_chat=(name in self.CHAT_MODELS))
+ return result
+
+ if name in self.api_model_ids:
+ result = ModelSpec(
+ name=name,
+ model=name,
+ is_chat=(name in self.CHAT_MODELS),
+ is_agi=(name in self.AGI_MODELS),
+ n_ctx=n_ctx_from_model_name(name),
+ )
+ return result
+
+ raise ValueError(f"Couldn't find model: {name}")
+
+ @cached_property
+ def api_model_ids(self):
+ return [m["id"] for m in openai.Model.list()["data"]] + self.AGI_MODEL_IDS
+
+
+def run(args, model_resolver: ModelResolver, registry: Optional[Registry] = None):
+ if args.debug:
+ logging.getLogger().setLevel(logging.DEBUG)
+
+ visible = args.visible if args.visible is not None else (args.max_samples is None)
+
+ if args.max_samples is not None:
+ evals.eval.set_max_samples(args.max_samples)
+
+ registry = registry or Registry()
+ eval_spec = registry.get_eval(args.eval)
+ assert (
+ eval_spec is not None
+ ), f"Eval {args.eval} not found. Available: {sorted(registry._evals.keys())}"
+
+ def get_model(name: str) -> ModelSpec:
+ return model_resolver.resolve(name)
+
+ completion_model_specs = [get_model(model) for model in args.model.split(",")]
+
+ for spec in completion_model_specs:
+ spec.extra_options = parse_extra_eval_params(args.modelspec_extra_options)
+
+ model_specs = ModelSpecs(
+ completions_=completion_model_specs,
+ embedding_=get_model(args.embedding_model) if args.embedding_model else None,
+ ranking_=get_model(args.ranking_model) if args.ranking_model else None,
+ )
+
+ run_config = {
+ "model_specs": model_specs,
+ "eval_spec": eval_spec,
+ "seed": args.seed,
+ "max_samples": args.max_samples,
+ "command": " ".join(map(shlex.quote, sys.argv)),
+ "initial_settings": {
+ "visible": visible,
+ },
+ }
+
+ model_name = model_specs.completions_[0].name if len(model_specs.completions_) > 0 else "n/a"
+ eval_name = eval_spec.key
+ run_spec = evals.base.RunSpec(
+ model_name=model_name,
+ model_names=model_specs.names,
+ eval_name=eval_name,
+ base_eval=eval_name.split(".")[0],
+ split=eval_name.split(".")[1],
+ run_config=run_config,
+ created_by=args.user,
+ )
+ if args.record_path is None:
+ record_path = f"/tmp/evallogs/{run_spec.run_id}_{args.model}_{args.eval}.jsonl"
+ else:
+ record_path = args.record_path
+ if args.dry_run:
+ recorder = evals.record.DummyRecorder(run_spec=run_spec, log=args.dry_run_logging)
+ elif args.local_run:
+ recorder = evals.record.LocalRecorder(record_path, run_spec=run_spec)
+ else:
+ recorder = evals.record.Recorder(record_path, run_spec=run_spec)
+
+ api_extra_options = {}
+ if not args.cache:
+ api_extra_options["cache_level"] = 0
+
+ run_url = f"{run_spec.run_id}"
+ logger.info(_purple(f"Run started: {run_url}"))
+
+ extra_eval_params = parse_extra_eval_params(args.extra_eval_params)
+
+ eval_class = registry.get_class(eval_spec)
+ eval = eval_class(
+ model_specs=model_specs,
+ seed=args.seed,
+ name=eval_name,
+ registry=registry,
+ **extra_eval_params,
+ )
+ result = eval.run(recorder)
+ recorder.record_final_report(result)
+
+ if not (args.dry_run or args.local_run):
+ logger.info(_purple(f"Run completed: {run_url}"))
+
+ logger.info("Final report:")
+ for key, value in result.items():
+ logger.info(f"{key}: {value}")
+ return run_spec.run_id
+
+
+def main():
+ parser = get_parser()
+ args = parser.parse_args(sys.argv[1:])
+ logging.basicConfig(
+ format="[%(asctime)s] [%(filename)s:%(lineno)d] %(message)s",
+ level=logging.INFO,
+ filename=args.log_to_file if args.log_to_file else None,
+ )
+ logging.getLogger("openai").setLevel(logging.WARN)
+ if hasattr(openai.error, "set_display_cause"):
+ openai.error.set_display_cause()
+ run(args, model_resolver=ModelResolver())
+
+
+if __name__ == "__main__":
+ main()
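Note on `n_ctx_from_model_name`: the prefix lookup returns the first prefix that matches in dictionary insertion order, so a longer prefix such as `gpt-4-32k-` is only reachable if it is listed before the shorter `gpt-4-` it extends. The sketch below shows an order-independent alternative that picks the longest matching prefix. The table `N_CTX_BY_PREFIX` and the helper `longest_prefix_n_ctx` are hypothetical names introduced for illustration and are not part of this change.

```python
from typing import Optional

# Hypothetical table for illustration; values copied from the prefixes in the diff.
N_CTX_BY_PREFIX: dict[str, int] = {
    "gpt-3.5-turbo-": 4096,
    "gpt-4-": 8192,
    "gpt-4-32k-": 32768,
}


def longest_prefix_n_ctx(model_name: str) -> Optional[int]:
    """Return n_ctx for the longest matching prefix, or None if nothing matches."""
    matches = [prefix for prefix in N_CTX_BY_PREFIX if model_name.startswith(prefix)]
    if not matches:
        return None
    # The longest prefix is the most specific one, regardless of dict order.
    return N_CTX_BY_PREFIX[max(matches, key=len)]
```

With this approach the order of entries in the table no longer affects the result, at the cost of scanning every prefix on each lookup.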
diff --git a/evals/evals/cli/oaievalset.py b/evals/evals/cli/oaievalset.py
new file mode 100644
index 0000000000000000000000000000000000000000..72af659998d75142c401d500b160ddb47be6a0f2
--- /dev/null
+++ b/evals/evals/cli/oaievalset.py
@@ -0,0 +1,105 @@
+"""
+This file defines the `oaievalset` CLI for running eval sets.
+"""
+import argparse
+import json
+import subprocess
+from pathlib import Path
+from typing import Optional
+
+from evals.registry import Registry
+
+Task = list[str]
+
+
+class Progress:
+ def __init__(self, file: str) -> None:
+ self.file = Path(file)
+ self.completed: list[Task] = []
+
+ def load(self) -> bool:
+ if not self.file.exists():
+ return False
+
+ with self.file.open() as f:
+ for line in f:
+ self.completed.append(json.loads(line))
+ return len(self.completed) > 0
+
+ def add(self, item: Task) -> None:
+ self.completed.append(item)
+ self.save()
+
+ def save(self) -> None:
+ self.file.parent.mkdir(parents=True, exist_ok=True)
+ with self.file.open("w") as f:
+ for item in self.completed:
+ f.write(json.dumps(item) + "\n")
+ print(highlight(f"Saved progress to {self.file}"))
+
+
+def highlight(text: str) -> str:
+ return f"\033[1;32m>>> {text}\033[0m"
+
+
+def get_parser() -> argparse.ArgumentParser:
+ parser = argparse.ArgumentParser(description="Run eval sets through the API")
+ parser.add_argument("model", type=str, help="Name of a completion model.")
+ parser.add_argument("eval_set", type=str, help="Name of eval set. See registry.")
+ parser.add_argument(
+ "--resume",
+ action=argparse.BooleanOptionalAction,
+ default=True,
+ help="Resume from last checkpoint.",
+ )
+ parser.add_argument(
+ "--exit-on-error",
+ action=argparse.BooleanOptionalAction,
+ default=True,
+ help="Exit if any oaieval command fails.",
+ )
+ return parser
+
+
+def run(args, unknown_args, registry: Optional[Registry] = None) -> None:
+ registry = registry or Registry()
+ commands: list[Task] = []
+ eval_set = registry.get_eval_set(args.eval_set)
+ for eval in registry.get_evals(eval_set.evals):
+ command = ["oaieval", args.model, eval.key] + unknown_args
+ if command in commands:
+ continue
+ commands.append(command)
+ num_evals = len(commands)
+
+ progress = Progress(f"/tmp/oaievalset/{args.model}.{args.eval_set}.progress.txt")
+ if args.resume and progress.load():
+ print(f"Loaded progress from {progress.file}")
+ print(f"{len(progress.completed)}/{len(commands)} evals already completed:")
+ for item in progress.completed:
+ print(" " + " ".join(item))
+
+ commands = [c for c in commands if c not in progress.completed]
+ command_strs = [" ".join(cmd) for cmd in commands]
+ print("Going to run the following commands:")
+ for command_str in command_strs:
+ print(" " + command_str)
+
+ num_already_completed = num_evals - len(commands)
+ for idx, command in enumerate(commands):
+ real_idx = idx + num_already_completed
+ print(highlight("Running command: " + " ".join(command) + f" ({real_idx+1}/{num_evals})"))
+ subprocess.run(command, stdout=subprocess.PIPE, check=args.exit_on_error)
+ progress.add(command)
+
+ print(highlight("All done!"))
+
+
+def main() -> None:
+ parser = get_parser()
+ args, unknown_args = parser.parse_known_args()
+ run(args, unknown_args)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/evals/evals/data.py b/evals/evals/data.py
new file mode 100644
index 0000000000000000000000000000000000000000..e0085bae8b9aaa2216e1bbe7f9cdd85530acf166
--- /dev/null
+++ b/evals/evals/data.py
@@ -0,0 +1,189 @@
+"""
+This file defines utilities for working with data and files of various types.
+"""
+import csv
+import dataclasses
+import gzip
+import itertools
+import json
+import logging
+import os
+import urllib.parse
+from collections.abc import Iterator
+from functools import partial
+from typing import Any, Sequence, Union
+
+import blobfile as bf
+import lz4.frame
+import pydantic
+import pyzstd
+
+logger = logging.getLogger(__name__)
+
+
+def gzip_open(filename: str, mode: str = "rb", openhook: Any = open) -> gzip.GzipFile:
+ """Wrap the given openhook in gzip."""
+ if mode and "b" not in mode:
+ mode += "b"
+
+ return gzip.GzipFile(fileobj=openhook(filename, mode), mode=mode)
+
+
+def lz4_open(filename: str, mode: str = "rb", openhook: Any = open) -> lz4.frame.LZ4FrameFile:
+ if mode and "b" not in mode:
+ mode += "b"
+
+ return lz4.frame.LZ4FrameFile(openhook(filename, mode), mode=mode)
+
+
+def zstd_open(filename: str, mode: str = "rb", openhook: Any = open) -> pyzstd.ZstdFile:
+ if mode and "b" not in mode:
+ mode += "b"
+
+ return pyzstd.ZstdFile(openhook(filename, mode), mode=mode)
+
+
+def open_by_file_pattern(filename: str, mode: str = "r", **kwargs: Any) -> Any:
+ """Open a local or GCS file for reading or writing, always through blobfile.
+ Filenames ending in .gz, .lz4, or .zst are compressed/decompressed on the fly
+ (GCS paths and compression are compatible). Other paths with no URL scheme or
+ a file scheme are resolved relative to the registry/data directory."""
+ open_fn = partial(bf.BlobFile, **kwargs)
+ try:
+ if filename.endswith(".gz"):
+ return gzip_open(filename, openhook=open_fn, mode=mode)
+ elif filename.endswith(".lz4"):
+ return lz4_open(filename, openhook=open_fn, mode=mode)
+ elif filename.endswith(".zst"):
+ return zstd_open(filename, openhook=open_fn, mode=mode)
+ else:
+ scheme = urllib.parse.urlparse(filename).scheme
+ if scheme == "" or scheme == "file":
+ return open_fn(
+ os.path.join(
+ os.path.dirname(os.path.abspath(__file__)), "registry", "data", filename
+ ),
+ mode=mode,
+ )
+ else:
+ return open_fn(filename, mode=mode)
+ except Exception as e:
+ raise RuntimeError(f"Failed to open: {filename}") from e
+
+
+def _get_jsonl_file(path):
+ logger.info(f"Fetching {path}")
+ with open_by_file_pattern(path, mode="r") as f:
+ return list(map(json.loads, f.readlines()))
+
+
+def _get_json_file(path):
+ logger.info(f"Fetching {path}")
+ with open_by_file_pattern(path, mode="r") as f:
+ return json.loads(f.read())
+
+
+def _stream_jsonl_file(path) -> Iterator:
+ logger.info(f"Streaming {path}")
+ with bf.BlobFile(path, "r", streaming=True) as f:
+ for line in f:
+ yield json.loads(line)
+
+
+def get_lines(path) -> list[str]:
+ """
+ Get a list of lines from a file.
+ """
+ with open_by_file_pattern(path, mode="r") as f:
+ return f.readlines()
+
+
+def get_jsonl(path: str) -> list[dict]:
+ """
+ Extract json lines from the given path.
+ If the path is a directory, look in subpaths recursively.
+
+ Return all lines from all jsonl files as a single list.
+ """
+ if bf.isdir(path):
+ result = []
+ for filename in bf.listdir(path):
+ if filename.endswith(".jsonl"):
+ result += get_jsonl(os.path.join(path, filename))
+ return result
+ return _get_jsonl_file(path)
+
+
+def get_jsonls(paths: Sequence[str], line_limit=None) -> list[dict]:
+ return list(iter_jsonls(paths, line_limit))
+
+
+def get_json(path) -> dict:
+ if bf.isdir(path):
+ raise ValueError("Path is a directory, only files are supported")
+ return _get_json_file(path)
+
+
+def iter_jsonls(paths: Union[str, list[str]], line_limit=None) -> Iterator[dict]:
+ """
+ For each path in the input, iterate over the jsonl files in that path.
+ Look in subdirectories recursively.
+
+ Use an iterator to conserve memory.
+ """
+ if isinstance(paths, str):
+ paths = [paths]
+
+ def _iter():
+ for path in paths:
+ if bf.isdir(path):
+ for filename in bf.listdir(path):
+ if filename.endswith(".jsonl"):
+ yield from iter_jsonls([os.path.join(path, filename)])
+ else:
+ yield from _stream_jsonl_file(path)
+
+ return itertools.islice(_iter(), line_limit)
+
+
+def get_csv(path, fieldnames=None):
+ with bf.BlobFile(path, "r", cache_dir="/tmp/bf_cache", streaming=False) as f:
+ reader = csv.DictReader(f, fieldnames=fieldnames)
+ return list(reader)
+
+
+def _to_py_types(o: Any) -> Any:
+ if isinstance(o, dict):
+ return {k: _to_py_types(v) for k, v in o.items()}
+ if isinstance(o, list):
+ return [_to_py_types(v) for v in o]
+
+ if dataclasses.is_dataclass(o):
+ return _to_py_types(dataclasses.asdict(o))
+
+ # pydantic data classes
+ if isinstance(o, pydantic.BaseModel):
+ return json.loads(o.json())
+
+ return o
+
+
+class EnhancedJSONEncoder(json.JSONEncoder):
+ def default(self, o: Any) -> Any:
+ return _to_py_types(o)
+
+
+def jsondumps(o: Any, ensure_ascii: bool = False, **kwargs: Any) -> str:
+ return json.dumps(o, cls=EnhancedJSONEncoder, ensure_ascii=ensure_ascii, **kwargs)
+
+
+def jsondump(o: Any, fp: Any, ensure_ascii: bool = False, **kwargs: Any) -> None:
+ json.dump(o, fp, cls=EnhancedJSONEncoder, ensure_ascii=ensure_ascii, **kwargs)
+
+
+def jsonloads(s: str, **kwargs: Any) -> Any:
+ return json.loads(s, **kwargs)
+
+
+def jsonload(fp: Any, **kwargs: Any) -> Any:
+ return json.load(fp, **kwargs)
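Note on `EnhancedJSONEncoder`: the `json` module calls `default()` only for objects it cannot serialize on its own, which is the hook `_to_py_types` plugs into. The sketch below reproduces that idea with the standard library only, covering dataclasses but not pydantic models. `to_py_types`, `DataclassJSONEncoder`, and `Sample` are hypothetical names for illustration. Unlike the code in the diff, this sketch falls back to `super().default()` so that unsupported objects raise `TypeError`.

```python
import dataclasses
import json
from typing import Any


def to_py_types(o: Any) -> Any:
    """Recursively convert dataclass instances (and containers of them) to plain types."""
    if isinstance(o, dict):
        return {k: to_py_types(v) for k, v in o.items()}
    if isinstance(o, list):
        return [to_py_types(v) for v in o]
    if dataclasses.is_dataclass(o) and not isinstance(o, type):
        return to_py_types(dataclasses.asdict(o))
    return o


class DataclassJSONEncoder(json.JSONEncoder):
    def default(self, o: Any) -> Any:
        # json only calls default() for objects it cannot serialize itself.
        converted = to_py_types(o)
        if converted is o:
            # Nothing was converted, so let the base class raise TypeError.
            return super().default(o)
        return converted


@dataclasses.dataclass
class Sample:
    input: str
    ideal: list[str]
```

Passing `cls=DataclassJSONEncoder` to `json.dumps` is then enough to serialize a `Sample`, or a dict or list containing one.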
diff --git a/evals/evals/elsuite/basic/fuzzy_match.py b/evals/evals/elsuite/basic/fuzzy_match.py
new file mode 100644
index 0000000000000000000000000000000000000000..fdee1092a61e7b0e54118cb75dac26023574f10d
--- /dev/null
+++ b/evals/evals/elsuite/basic/fuzzy_match.py
@@ -0,0 +1,49 @@
+import evals
+import numpy as np
+from evals.elsuite import utils
+from evals.record import RecorderBase
+
+
+class FuzzyMatch(evals.Eval):
+ def __init__(
+ self,
+ model_specs: evals.ModelSpecs,
+ samples_jsonl: str,
+ *args,
+ max_tokens: int = 500,
+ **kwargs,
+ ):
+ super().__init__(model_specs, *args, **kwargs)
+ self.max_tokens = max_tokens
+ self.samples_jsonl = samples_jsonl
+
+ def eval_sample(self, test_sample, rng):
+ prompt, correct_answers = test_sample["input"], test_sample["ideal"]
+ generated_answer = evals.sample_freeform(
+ self.model_spec,
+ prompt,
+ temperature=0.0,
+ max_tokens=16,
+ )
+ matches = [
+ utils.fuzzy_match(generated_answer, correct_answer)
+ for correct_answer in correct_answers
+ ]
+ evals.record.record_match(
+ True in matches,
+ expected=correct_answers,
+ picked=[generated_answer for i in range(len(correct_answers)) if matches[i]],
+ )
+ evals.record.record_metrics(
+ accuracy=float(True in matches),
+ f1_score=utils.f1_score(generated_answer, correct_answers),
+ )
+
+ def run(self, recorder: RecorderBase):
+ samples = evals.get_jsonl(self.samples_jsonl)
+ self.eval_all_samples(recorder, samples)
+
+ return {
+ "accuracy": np.mean(recorder.get_scores("accuracy")),
+ "f1_score": np.mean(recorder.get_scores("f1_score")),
+ }
diff --git a/evals/evals/elsuite/basic/includes.py b/evals/evals/elsuite/basic/includes.py
new file mode 100644
index 0000000000000000000000000000000000000000..af16600628178525311f2b1aa1afb13a0c57eeee
--- /dev/null
+++ b/evals/evals/elsuite/basic/includes.py
@@ -0,0 +1,38 @@
+from typing import Any
+
+import evals
+import evals.elsuite.utils
+import evals.metrics
+import numpy as np
+
+
+class Includes(evals.Eval):
+ def __init__(
+ self,
+ model_specs: evals.ModelSpecs,
+ samples_jsonl: str,
+ *args,
+ max_tokens: int = 500,
+ **kwargs,
+ ):
+ super().__init__(model_specs, *args, **kwargs)
+ self.max_tokens = max_tokens
+ self.samples_jsonl = samples_jsonl
+
+ def eval_sample(self, sample: Any, *_):
+ sampled = evals.sample_freeform(
+ self.model_spec, sample["input"], max_tokens=self.max_tokens
+ )
+ includes_answer = any(
+ evals.elsuite.utils.get_answer(sampled, ref) for ref in sample["ideal"]
+ )
+ evals.record.record_metrics(accuracy=float(includes_answer))
+ return includes_answer
+
+ def run(self, recorder):
+ samples = evals.get_jsonl(self.samples_jsonl)
+ self.eval_all_samples(recorder, samples)
+ events = recorder.get_scores("accuracy")
+ return {
+ "accuracy": np.mean(events),
+ }
diff --git a/evals/evals/elsuite/basic/match.py b/evals/evals/elsuite/basic/match.py
new file mode 100644
index 0000000000000000000000000000000000000000..ecd5092ac62eec3ecf177eb34e209c2f2dd27709
--- /dev/null
+++ b/evals/evals/elsuite/basic/match.py
@@ -0,0 +1,45 @@
+from typing import Any
+
+import evals
+import evals.metrics
+from evals.prompt.base import is_chat_prompt
+
+
+class Match(evals.Eval):
+ def __init__(
+ self,
+ model_specs: evals.ModelSpecs,
+ samples_jsonl: str,
+ *args,
+ max_tokens: int = 500,
+ num_few_shot: int = 0,
+ few_shot_jsonl: str = None,
+ **kwargs,
+ ):
+ super().__init__(model_specs, *args, **kwargs)
+ self.max_tokens = max_tokens
+ self.samples_jsonl = samples_jsonl
+ self.num_few_shot = num_few_shot
+ if self.num_few_shot > 0:
+ assert few_shot_jsonl is not None, "few shot requires few shot sample dataset"
+ self.few_shot_jsonl = few_shot_jsonl
+ self.few_shot = evals.get_jsonl(self.few_shot_jsonl)
+
+ def eval_sample(self, sample: Any, *_):
+ prompt = sample["input"]
+ if self.num_few_shot > 0:
+ assert is_chat_prompt(sample["input"]), "few shot requires chat prompt"
+ prompt = sample["input"][:-1]
+ for s in self.few_shot[: self.num_few_shot]:
+ prompt += s["sample"]
+ prompt += sample["input"][-1:]
+
+ return evals.check_sampled_text(self.model_spec, prompt, expected=sample["ideal"])
+
+ def run(self, recorder):
+ samples = evals.get_jsonl(self.samples_jsonl)
+ self.eval_all_samples(recorder, samples)
+ events = recorder.get_events("match")
+ return {
+ "accuracy": evals.metrics.get_accuracy(events),
+ }
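Note on few-shot prompting in `Match.eval_sample`: the few-shot messages are spliced between the leading messages of the chat prompt and its final message, so the actual question stays last. A minimal standalone sketch of that assembly follows, assuming each few-shot record carries a list of chat messages under the `sample` key as the code above implies. The helper name `build_few_shot_prompt` is hypothetical.

```python
from typing import Any

ChatPrompt = list[dict[str, Any]]


def build_few_shot_prompt(
    sample_input: ChatPrompt,
    few_shot: list[dict[str, Any]],
    num_few_shot: int,
) -> ChatPrompt:
    """Insert few-shot messages between the leading messages and the final one."""
    prompt = list(sample_input[:-1])  # copy so the caller's list is not mutated
    for shot in few_shot[:num_few_shot]:
        prompt += shot["sample"]  # each record holds a list of chat messages
    prompt += sample_input[-1:]  # the actual question always comes last
    return prompt
```

For a two-message input (system, then user) and one few-shot record with a user/assistant pair, the result has four messages in the order system, example user, example assistant, user.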
diff --git a/evals/evals/elsuite/modelgraded/classify.py b/evals/evals/elsuite/modelgraded/classify.py
new file mode 100644
index 0000000000000000000000000000000000000000..3e6d29fd712196ad70d09b4812d5be43445113a5
--- /dev/null
+++ b/evals/evals/elsuite/modelgraded/classify.py
@@ -0,0 +1,356 @@
+"""
+Generic eval that uses a prompt + classification.
+"""
+import copy
+import itertools
+import logging
+import string
+from collections import Counter
+from random import Random
+from typing import Callable, Iterable, Optional, Union
+
+import openai
+
+import evals
+import evals.record
+from evals.base import ModelSpec
+from evals.elsuite.utils import PromptFn, format_necessary, scrub_formatting_from_prompt
+
+INVALID_STR = "__invalid__"
+CHOICE_KEY = "choice"
+MATCH_FNS = {
+ "include": lambda x, y: float(x in y),
+ "exact": lambda x, y: float(x == y),
+ "endswith": lambda x, y: x.endswith(y),
+ "starts_or_endswith": lambda x, y: x.startswith(y) or x.endswith(y),
+}
+
+ANSWER_PROMPTS = {
+ # e.g. "Yes"
+ "classify": "Answer the question by printing only a single choice from {choices} (without quotes or punctuation) corresponding to the correct answer with no other text.".strip(),
+ # e.g. "Yes\n The reasons are: ..."
+ "classify_cot": "First, answer by printing a single choice from {choices} (without quotes or punctuation) corresponding to the correct answer. Then, from the next line, explain your reasonings step by step.".strip(),
+ # e.g. "Let's think step by step. ...\nYes"
+ "cot_classify": """
+First, write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Then print only a single choice from {choices} (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the answer by itself on a new line.
+
+Reasoning:""".strip(),
+ "cot_classify_jp": """
+まず、一歩一歩あなたの推論を書き出してください。単に正しい答えを最初に述べることを避けてください。次に、{choices}(引用符や句読点なし)から正しい答えに対応する1つの選択肢を単独の行に書きだしてください。最後に、答えだけを新しい行に繰り返してください。
+
+推論:
+ """.strip(),
+}
+
+
+def choice_to_str(choice_strings: Iterable[str]) -> str:
+ """Return a string of choices, e.g. '"Yes" or "No" or "Maybe"'."""
+ return " or ".join(f'"{choice}"' for choice in choice_strings)
+
+
+def get_choice(text: str, eval_type: str, match_fn: Callable, choice_strings: Iterable[str]) -> str:
+ """Map the answer text to one of choice_strings. Return '__invalid__' if no match."""
+ lines = text.strip().split("\n")
+ if eval_type.startswith("cot_classify"):
+ lines = lines[::-1] # reverse lines
+ for line in lines:
+ line = line.strip()
+ line = "".join(c for c in line if c not in string.punctuation)
+ if not line:
+ continue
+ for choice in choice_strings:
+ if match_fn(line, choice):
+ return choice
+ return INVALID_STR
+
+
+def expand_args_dict(args_dict):
+ """Expand a dict of dicts, with namings.
+
+ args_dict = {
+ "a": {"a1": 1, "a2": 2},
+ "b": {"b1": 3, "b2": 4},
+ }
+ expand_args_dict(args_dict) = {
+ "a=a1:b=b1": {"a": ("a1", 1), "b": ("b1", 3)},
+ "a=a1:b=b2": {"a": ("a1", 1), "b": ("b2", 4)},
+ ...}
+ """
+ args_dict = {k: list(v.items()) for k, v in args_dict.items()}
+ keys = list(args_dict.keys())
+ values = list(args_dict.values())
+ new_values = [dict(zip(keys, v)) for v in itertools.product(*values)]
+ new_names = [":".join([f"{k}={v[0]}" for k, v in sorted(d.items())]) for d in new_values]
+ return dict(zip(new_names, new_values))
+
+
+class ModelBasedClassify(evals.Eval):
+ invalid_request_during_completion = 0
+ invalid_request_during_evaluation = 0
+
+ def __init__(
+ self,
+ model_specs: evals.ModelSpecs,
+ samples_jsonl: str,
+ modelgraded_spec: str,
+ *args,
+ match_fn: str = "starts_or_endswith",
+ max_tokens: int = 1024,
+ multicomp_n: Union[int, str] = 1,
+ multicomp_temperature: float = 0.4,
+ samples_renamings: Optional[dict[str, str]] = None,
+ eval_type: Optional[str] = None,
+ metaeval: bool = False,
+ modelgraded_spec_args: Optional[dict[str, dict[str, str]]] = None,
+ **kwargs,
+ ):
+ super().__init__(model_specs, *args, **kwargs)
+ n_models = len(self.model_specs.completions)
+ self.max_tokens = max_tokens
+ self.samples_jsonl = samples_jsonl
+ self.match_fn = MATCH_FNS[match_fn]
+ self.metaeval = metaeval
+ if multicomp_n == "from_models":
+ assert n_models > 1, "multicomp_n='from_models' but only 1 model is specified."
+ self.multicomp_n = n_models
+ else:
+ assert isinstance(
+ multicomp_n, int
+ ), f"multicomp_n={multicomp_n} must be an int or 'from_models'."
+ self.multicomp_n = multicomp_n
+ self.multicomp_temperature = multicomp_temperature
+ self.samples_renamings = samples_renamings or {}
+
+ # check if multiple models are specified
+ if len(self.model_specs.completions) > 1:
+ assert (
+ self.multicomp_n == n_models
+ ), f"multicomp_n={self.multicomp_n} must be equal to the number of models={len(self.model_specs.completions)} if multiple models are specified."
+
+ if self.model_spec.name == "dummy-completion" or self.model_spec.name == "dummy-chat":
+ self.eval_modelspec = self.model_spec
+ else:
+ self.eval_modelspec = ModelSpec(
+ name="gpt-3.5-turbo", model="gpt-3.5-turbo", is_chat=True
+ )
+
+ # import prompt and set attributes
+ modelgraded_specs = self.registry.get_modelgraded_spec(modelgraded_spec)
+ modelgraded_specs = copy.deepcopy(modelgraded_specs) # since pop() is used
+
+ # 'choice_strings' is a list of strings that specifies the possible choices
+ self.choice_strings = modelgraded_specs.pop("choice_strings")
+ if self.choice_strings == "from_n":
+ self.choice_strings = [str(i + 1) for i in range(self.multicomp_n)]
+ elif self.choice_strings == "from_n_abc":
+ self.choice_strings = [string.ascii_lowercase[i % 26] for i in range(self.multicomp_n)]
+ elif self.choice_strings == "from_n_ABC":
+ self.choice_strings = [string.ascii_uppercase[i % 26] for i in range(self.multicomp_n)]
+ # make sure each choice doesn't contain any punctuation
+ for s in self.choice_strings:
+ assert not any(c in s for c in string.punctuation), f"{s} contains punctuation"
+ # (optional) 'choice_scores' is a dict that specifies the score for each choice string
+ # if 'choice_scores' is specified, 'scores/' are computed and added to metrics
+ self.choice_scores = modelgraded_specs.pop("choice_scores", {})
+ if self.choice_scores == "from_strings":
+ self.choice_scores = {c: float(c) for c in self.choice_strings}
+ assert all(
+ isinstance(v, (int, float)) for v in self.choice_scores.values()
+ ), f"choice_scores must be a dict of floats, not {self.choice_scores}"
+
+ # (optional) 'eval_type' is a string that specifies the type of classification algorithm
+ # - "classify": only answer
+ # - "cot_classify": reason then answer (chain-of-thought) <- most recommended
+ # - "classify_cot": answer then reason (explanation)
+ # if 'eval_type' is not supplied from modelgraded_specs, then it must be supplied as an argument.
+ # - Importantly, it also assumes the answer prompt needs to be appended to the prompt.
+ self.eval_type = modelgraded_specs.pop("eval_type", None)
+ if not self.eval_type:
+ append_answer_prompt = True # append answer prompt to prompt
+ assert eval_type, "eval_type must be specified, in modelgraded_spec or as an argument"
+ self.eval_type = eval_type
+ else:
+ assert (
+ not eval_type
+ ), "eval_type must be unspecified, if it is specified in modelgraded_spec"
+ append_answer_prompt = False
+
+ # 'prompt' is a string that specifies the model-graded evaluation
+ prompt = modelgraded_specs.pop("prompt")
+ assert isinstance(prompt, str), f"prompt must be a string, not {type(prompt)}"
+ if append_answer_prompt:
+ prompt += "\n\n" + ANSWER_PROMPTS[self.eval_type].format(
+ choices=choice_to_str(self.choice_strings)
+ )
+ self.prompt = [{"role": "user", "content": prompt}]
+
+ # 'input_outputs' is a dict that specifies the input and output keys in the sample
+ # output key is the model's raw response to input key. These are used for filling 'prompt' template.
+ self.input_outputs = modelgraded_specs.pop("input_outputs")
+ assert isinstance(
+ self.input_outputs, dict
+ ), f"input_outputs must be a dict, not {type(self.input_outputs)}"
+
+ # (optional) 'args' is a dict of dicts that specifies additional arguments for 'prompt'
+ # each value in 'args_dict' essentially defines a separate modelgraded classification eval and has own metrics!
+ # if 'modelgraded_spec_args' is specified in eval YAML, it is merged with 'args_dict'
+ self.args_dict = modelgraded_specs.pop("args", {})
+ self.args_dict.update(modelgraded_spec_args or {})
+ if self.args_dict:
+ self.expanded_args_dict = expand_args_dict(self.args_dict)
+ else:
+ self.expanded_args_dict = {}
+
+ # (optional) 'completion_sample_templates'
+ # each key must be one of 'input_outputs'.values(). If 'multicomp_n' > 1, this template is filled 'multicomp_n' times
+ # and the concatenated result is passed to 'prompt' template.
+ self.completion_sample_templates = modelgraded_specs.pop("completion_sample_templates", {})
+ assert all(
+ k in self.input_outputs.values() for k in self.completion_sample_templates
+ ), f"all {self.completion_sample_templates.keys()} must be in {self.input_outputs.values()}"
+ if self.multicomp_n > 1:
+ assert (
+ self.completion_sample_templates
+ ), "completion_sample_templates must be specified if multicomp_n > 1"
+
+ # since we accept optional args, we need to check that all args are used
+ for key in ("key", "group"):
+ modelgraded_specs.pop(key, None)
+ assert not modelgraded_specs, f"Unused args: {modelgraded_specs}. Typo in YAML?"
+
+ def eval_sample(self, test_sample: dict, rng: Random) -> None:
+ """Evaluate a single sample.
+
+ Recorded metrics are always: one of the self.choice_strings, or "__invalid__".
+ """
+ if self.samples_renamings:
+ test_sample = {self.samples_renamings.get(k, k): v for k, v in test_sample.items()}
+ if self.multicomp_n > 1:
+ test_sample["n"] = self.multicomp_n
+ completions = {}
+ if self.metaeval:
+ # assert outputs exist in the data
+ for v in self.input_outputs.values():
+ assert v in test_sample, f"Missing output '{v}' in sample {test_sample.keys()}"
+ completions[v] = test_sample[v]
+ # remove outputs from the data
+ test_sample = {
+ k: v for k, v in test_sample.items() if k not in list(self.input_outputs.values())
+ }
+ for k in self.input_outputs:
+ test_sample[k] = scrub_formatting_from_prompt(test_sample[k])
+
+ if not self.metaeval:
+ try:
+ for k, v in self.input_outputs.items():
+ if self.multicomp_n > 1 and v in self.completion_sample_templates:
+ completion = ""
+ completion_i_template = self.completion_sample_templates[v]
+ for i in range(self.multicomp_n):
+ if len(self.model_specs.completions) > 1:
+ # use a separate model for each completion
+ model_spec = self.model_specs.completions[i]
+ else:
+ # use the single model for all completions
+ model_spec = self.model_spec
+ get_input_completion = PromptFn(
+ test_sample[k],
+ model_spec=model_spec,
+ max_tokens=self.max_tokens,
+ temperature=self.multicomp_temperature,
+ )
+ completion_i, _ = get_input_completion()
+ completion += format_necessary(
+ completion_i_template,
+ i=i + 1,
+ i_abc=string.ascii_lowercase[i % 26],
+ i_ABC=string.ascii_uppercase[i % 26],
+ output=completion_i,
+ n=self.multicomp_n,
+ )
+ else:
+ get_input_completion = PromptFn(
+ test_sample[k],
+ model_spec=self.model_spec,
+ max_tokens=self.max_tokens,
+ )
+ completion, _ = get_input_completion()
+ completions[v] = completion
+ except openai.error.InvalidRequestError:
+ self.invalid_request_during_completion += 1
+ return
+
+ try:
+ metrics = {}
+ evaluate = PromptFn(
+ self.prompt,
+ model_spec=self.eval_modelspec,
+ max_tokens=self.max_tokens,
+ )
+ eval_kwargs = dict(**completions, **test_sample)
+ if self.expanded_args_dict and len(self.expanded_args_dict) > 1:
+ args_dict = self.expanded_args_dict
+ elif self.expanded_args_dict and len(self.expanded_args_dict) == 1:
+ # if there is only one combination, don't bother with the metric name
+ args_dict = {CHOICE_KEY: v for v in self.expanded_args_dict.values()}
+ else:
+ args_dict = {CHOICE_KEY: {}}
+ for metric, args in args_dict.items():
+ args = {k: v[1] for k, v in args.items()}
+ evaluation, _ = evaluate(**args, **eval_kwargs)
+ choice = get_choice(evaluation, self.eval_type, self.match_fn, self.choice_strings)
+ if choice == INVALID_STR:
+ logging.warning(
+ f"Choices {self.choice_strings} not parsable for {self.eval_type}: {evaluation}"
+ )
+ metrics[metric] = choice
+ if self.metaeval:
+ assert (
+ metric in test_sample
+ ), f"Missing label for metric '{metric}' in sample {test_sample.keys()}"
+ metrics[metric + "_metascore"] = choice == test_sample[metric]
+
+ except openai.error.InvalidRequestError:
+ self.invalid_request_during_evaluation += 1
+ return
+
+ evals.record.record_metrics(**metrics)
+
+ return choice
+
+ def run(self, recorder):
+ samples = evals.get_jsonl(self.samples_jsonl)
+
+ self.eval_all_samples(recorder, samples)
+ all_sample_metrics = recorder.get_metrics()
+
+ record_metrics = {}
+ if self.expanded_args_dict and len(self.expanded_args_dict) > 1:
+ metrics = sorted(self.expanded_args_dict)
+ else:
+ metrics = [CHOICE_KEY]
+ for metric in metrics:
+ chosen = [m[metric] for m in all_sample_metrics if metric in m]
+ # if there is a best choice, compute the score
+ if self.choice_scores:
+ # assumption: each INVALID_STR contributes the lowest score
+ lowest_score = min(self.choice_scores.values())
+ scores = [
+ self.choice_scores[choice] if choice != INVALID_STR else lowest_score
+ for choice in chosen
+ ]
+ record_metrics[f"score/{metric}"] = sum(scores) / len(all_sample_metrics)
+ # compute the counts and ratios
+ counts = dict(Counter(chosen))
+ missing_samples = len(all_sample_metrics) - len(chosen)
+ if missing_samples:
+ counts["__missing_samples__"] = missing_samples
+ record_metrics.update({f"counts/{metric}/{k}": v for k, v in counts.items()})
+ if self.metaeval:
+ metascores = [m[metric + "_metascore"] for m in all_sample_metrics if metric in m]
+ record_metrics[f"metascore/{metric}"] = sum(metascores) / len(all_sample_metrics)
+
+ record_metrics["invalid_request_during_completion"] = self.invalid_request_during_completion
+ record_metrics["invalid_request_during_evaluation"] = self.invalid_request_during_evaluation
+
+ return record_metrics
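Review note: the aggregation in `run` above can be restated as a small standalone sketch. It is not the class itself. `INVALID_STR` is given a placeholder value here (the real constant is defined elsewhere in the repo), and `aggregate` is a hypothetical helper name. The point it illustrates is that unparsable choices receive the lowest score and that the denominator is the number of recorded samples, so samples missing the metric pull the score down.

```python
from collections import Counter

# Placeholder value: an assumption for illustration, not taken from the diff.
INVALID_STR = "__invalid__"


def aggregate(chosen, choice_scores, num_samples):
    """Return (mean score, counts) the way `run` computes them for one metric."""
    lowest = min(choice_scores.values())
    # Each unparsable choice contributes the lowest available score.
    scores = [choice_scores[c] if c != INVALID_STR else lowest for c in chosen]
    counts = dict(Counter(chosen))
    missing = num_samples - len(chosen)
    if missing:
        counts["__missing_samples__"] = missing
    # Divide by all recorded samples, not only those that carry this metric.
    return sum(scores) / num_samples, counts


score, counts = aggregate(
    ["Y", "N", INVALID_STR, "Y"], {"Y": 1.0, "N": 0.0}, num_samples=5
)
```

With four choices recorded out of five samples, the two `"Y"` answers give a score of 2/5 and the fifth sample shows up under `__missing_samples__`.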
diff --git a/evals/evals/elsuite/translate.py b/evals/evals/elsuite/translate.py
new file mode 100644
index 0000000000000000000000000000000000000000..42cf8c77844d5a21abf9314537e93212f29a5e7d
--- /dev/null
+++ b/evals/evals/elsuite/translate.py
@@ -0,0 +1,75 @@
+from typing import Any
+
+from sacrebleu.metrics.bleu import BLEU
+
+import evals
+import evals.metrics
+from evals.prompt.base import is_chat_prompt
+
+
+class Translate(evals.Eval):
+ def __init__(
+ self,
+ model_specs: evals.ModelSpecs,
+ samples_jsonl: str,
+ *args,
+ max_tokens: int = 500,
+ num_few_shot: int = 0,
+ few_shot_jsonl: str = None,
+ **kwargs,
+ ):
+ super().__init__(model_specs, *args, **kwargs)
+ self.max_tokens = max_tokens
+ self.samples_jsonl = samples_jsonl
+
+ self.num_few_shot = num_few_shot
+ if self.num_few_shot > 0:
+ assert few_shot_jsonl is not None, "few shot requires few shot sample dataset"
+ self.few_shot_jsonl = few_shot_jsonl
+ self.few_shot = evals.get_jsonl(self.few_shot_jsonl)
+
+ self.bleu = BLEU(effective_order=True)
+
+ def eval_sample(self, sample: Any, *_):
+ prompt = sample["input"]
+ expected = sample["ideal"]
+ if self.num_few_shot > 0:
+ assert is_chat_prompt(sample["input"]), "few shot requires chat prompt"
+ prompt = sample["input"][:-1]
+ for s in self.few_shot[: self.num_few_shot]:
+ prompt += s["sample"]
+ prompt += sample["input"][-1:]
+
+ if isinstance(expected, tuple):
+ expected = list(expected)
+ elif not isinstance(expected, list):
+ expected = [expected]
+
+ sampled = evals.sample_freeform(self.model_spec, prompt, max_tokens=self.max_tokens)
+
+ score = None
+ if expected is not None:
+ score = self.bleu.sentence_score(sampled, expected).score
+ evals.record.record_metrics(sacrebleu_sentence_score=score)
+
+ match = score > 30
+
+ if score is not None:
+ evals.record.record_match(
+ match, expected=expected, sampled=sampled, sacrebleu_sentence_score=score
+ )
+ return match
+
+ def run(self, recorder):
+ samples = evals.get_jsonl(self.samples_jsonl)
+ self.eval_all_samples(recorder, samples)
+ events = recorder.get_events("match")
+
+ sampled = list(map(lambda e: e.data["sampled"], events))
+ expected = list(map(lambda e: e.data["expected"], events))
+ sacrebleu_score = BLEU().corpus_score(sampled, [expected]).score
+
+ return {
+ "accuracy": evals.metrics.get_accuracy(events),
+ "sacrebleu_score": sacrebleu_score,
+ }
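Review note: the few-shot branch of `Translate.eval_sample` splices examples between the leading messages and the final message of the chat prompt. A minimal sketch of that splice, assuming each few-shot record stores a list of chat messages under the `"sample"` key as the code above implies; `build_few_shot_prompt` is a hypothetical name used only for illustration.

```python
from typing import Dict, List

ChatPrompt = List[Dict[str, str]]


def build_few_shot_prompt(
    sample_input: ChatPrompt, few_shot: List[dict], num_few_shot: int
) -> ChatPrompt:
    """Insert the first `num_few_shot` examples before the final message."""
    # Slicing copies the list, so the original sample is left untouched.
    prompt = sample_input[:-1]
    for shot in few_shot[:num_few_shot]:
        prompt += shot["sample"]
    prompt += sample_input[-1:]
    return prompt


sample_input = [
    {"role": "system", "content": "Translate English to French."},
    {"role": "user", "content": "Good night"},
]
few_shot = [
    {"sample": [
        {"role": "system", "name": "example_user", "content": "Hello"},
        {"role": "system", "name": "example_assistant", "content": "Bonjour"},
    ]},
    {"sample": [
        {"role": "system", "name": "example_user", "content": "Thanks"},
        {"role": "system", "name": "example_assistant", "content": "Merci"},
    ]},
]
prompt = build_few_shot_prompt(sample_input, few_shot, num_few_shot=1)
```

Only the first example is spliced in, and the actual question stays last so the model answers it rather than an example.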
diff --git a/evals/evals/elsuite/utils.py b/evals/evals/elsuite/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..615b51ac1c45e1ec6641903cf4b4af7e4e284396
--- /dev/null
+++ b/evals/evals/elsuite/utils.py
@@ -0,0 +1,140 @@
+import copy
+import re
+import string
+from collections import Counter, defaultdict
+
+from evals.api import sample_freeform
+from evals.prompt.base import chat_prompt_to_text_prompt, is_chat_prompt
+
+
+def get_answer(text, answer_prompt):
+ idx = text.rfind(answer_prompt)
+ if idx == -1:
+ return None
+ return text[idx + len(answer_prompt) :]
+
+
+def get_consensus(answers):
+ counts = defaultdict(int)
+ for answer in answers:
+ counts[answer] += 1
+ counts[None] = 0
+ return max(counts, key=counts.get)
+
+
+def normalize(s: str) -> str:
+ """Lowercase the first line of the text and remove punctuation, articles and extra whitespace."""
+ s = s.split("\n")[0]
+ s = s.lower()
+ exclude = set(string.punctuation)
+ s = "".join(char for char in s if char not in exclude)
+ s = re.sub(r"\b(a|an|the)\b", " ", s)
+ s = " ".join(s.split())
+ return s
+
+
+def fuzzy_match(s1: str, s2: str) -> bool:
+ s1 = normalize(s1)
+ s2 = normalize(s2)
+
+ if s1 == "" or s2 == "":
+ return s1 == s2
+
+ return s1 in s2 or s2 in s1
+
+
+def get_scores_from_text(text: str) -> dict:
+ pattern = r"## (.+?)\n.+?(\d)/5"
+ matches = re.findall(pattern, text, re.DOTALL)
+ return {k: int(v) for k, v in dict(matches).items()}
+
+
+def get_yesno_from_text(text: str) -> dict:
+ pattern = r"## (.+?)\n.+?([yn])"
+ matches = re.findall(pattern, text, re.DOTALL)
+ return dict(matches)
+
+
+def get_letter_from_data(data: str) -> str:
+ last_y = (data.rfind("y"), "y")
+ last_n = (data.rfind("n"), "n")
+ char = max(last_y, last_n)[1]
+ return char
+
+
+def f1_score(prediction: str, answers: list[str]) -> float:
+ def _f1_score(prediction: str, ground_truth: str):
+ prediction_tokens = normalize(prediction).split()
+ ground_truth_tokens = normalize(ground_truth).split()
+ common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
+ num_same = sum(common.values())
+ if num_same == 0:
+ return 0
+ precision = 1.0 * num_same / len(prediction_tokens)
+ recall = 1.0 * num_same / len(ground_truth_tokens)
+ f1 = (2 * precision * recall) / (precision + recall)
+ return f1
+
+ return max([_f1_score(prediction, answer) for answer in answers])
+
+
+def scrub_formatting_from_prompt(prompt):
+ scrubbed_prompt = copy.copy(prompt)
+
+ if is_chat_prompt(prompt):
+ for i, msg in enumerate(scrubbed_prompt):
+ if "content" in msg:
+ scrubbed_prompt[i]["content"] = msg["content"].replace("{", "{{").replace("}", "}}")
+ else:
+ scrubbed_prompt = scrubbed_prompt.replace("{", "{{").replace("}", "}}")
+ return scrubbed_prompt
+
+
+def format_necessary(template: str, **kwargs: dict[str, str]) -> str:
+ """Format a template string with only necessary kwargs."""
+ keys = [k[1] for k in string.Formatter().parse(template) if k[1]]
+ assert all(k in kwargs for k in keys), f"Required: {keys}, got: {sorted(kwargs)}"
+ cur_keys = {k: kwargs[k] for k in keys}
+ return template.format(**cur_keys)
+
+
+class PromptFn:
+ """Wrap calls to model with prompt"""
+
+ def __init__(self, prompt, model_spec, max_tokens, temperature=0, completion_kwargs=None):
+ self.prompt = prompt
+ self.max_tokens = max_tokens
+ self.model_spec = model_spec
+ self.temperature = temperature
+ self.completion_kwargs = completion_kwargs or {}
+
+ def __call__(self, **kwargs):
+ # if any input kwargs is chat prompt, convert to text prompt
+ kwargs = {
+ k: chat_prompt_to_text_prompt(v, render_for_completion=False)
+ if is_chat_prompt(v)
+ else v
+ for k, v in kwargs.items()
+ }
+ if is_chat_prompt(self.prompt):
+ prompt = []
+ for msg in self.prompt:
+ formatted_msg = copy.copy(msg)
+ if "content" in formatted_msg:
+ formatted_msg["content"] = format_necessary(formatted_msg["content"], **kwargs)
+ prompt.append(formatted_msg)
+ else:
+ # Prompt is a string
+ prompt = format_necessary(self.prompt, **kwargs)
+
+ completion = sample_freeform(
+ self.model_spec,
+ prompt,
+ max_tokens=self.max_tokens,
+ temperature=self.temperature,
+ top_p=1,
+ frequency_penalty=0,
+ presence_penalty=0,
+ **self.completion_kwargs,
+ )
+ return completion, prompt
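Review note: `scrub_formatting_from_prompt` and `format_necessary` work as a pair. Braces in sample text are doubled so that a later `str.format` pass treats them as literals, and the formatter only consumes the fields a template actually names. The sketch below is a standalone restatement using only the standard library. `scrub` is a hypothetical simplified name covering just the string case, and it raises `KeyError` rather than asserting.

```python
import string


def format_necessary(template: str, **kwargs: str) -> str:
    """Format `template` using only the fields it actually references."""
    keys = [field for _, field, _, _ in string.Formatter().parse(template) if field]
    missing = [k for k in keys if k not in kwargs]
    if missing:
        raise KeyError(f"Required: {keys}, missing: {missing}")
    return template.format(**{k: kwargs[k] for k in keys})


def scrub(text: str) -> str:
    """Escape braces so the text survives a later str.format call unchanged."""
    return text.replace("{", "{{").replace("}", "}}")


# Extra keyword arguments are ignored instead of raising.
filled = format_necessary(
    "Q: {question}\nA: {answer}", question="2+2?", answer="4", unused="ignored"
)

# Scrubbed text contains no format fields, so formatting restores the original.
escaped = scrub('return {"ok": true}')
round_trip = format_necessary(escaped)
```

Without the scrub step, a sample containing JSON or code with braces would be parsed as a format field and fail when the grading prompt is rendered.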
diff --git a/evals/evals/eval.py b/evals/evals/eval.py
new file mode 100644
index 0000000000000000000000000000000000000000..845123e0cf5c95d3d0bf829a7311fc995c642a29
--- /dev/null
+++ b/evals/evals/eval.py
@@ -0,0 +1,155 @@
+"""
+This file defines the base class for evals.
+"""
+import abc
+import asyncio
+import concurrent.futures
+import logging
+import os
+import random
+from multiprocessing.pool import ThreadPool
+from typing import Any, Awaitable, Callable, Dict, List, Optional, Tuple
+
+from tqdm import tqdm
+
+from .base import ModelSpec, ModelSpecs
+from .record import RecorderBase
+from .registry import Registry
+
+logger = logging.getLogger(__name__)
+
+
+SHUFFLE_SEED = 123
+_MAX_SAMPLES = None
+
+
+def _index_samples(samples: List[Any]) -> List[Tuple[Any, int]]:
+ """Shuffle `samples` and pair each sample with its index."""
+ indices = list(range(len(samples)))
+ random.Random(SHUFFLE_SEED).shuffle(indices)
+ if _MAX_SAMPLES is not None:
+ indices = indices[:_MAX_SAMPLES]
+ logger.info(f"Evaluating {len(indices)} samples")
+ work_items = [(samples[i], i) for i in indices]
+ return work_items
+
+
+def set_max_samples(max_samples: int):
+ global _MAX_SAMPLES
+ _MAX_SAMPLES = max_samples
+
+
+class Eval(abc.ABC):
+ """
+ Evaluation classes generally should override two methods:
+ `eval_sample`: Takes in a test sample and a random number generator and
+ records the metrics of interest.
+ `run`: Takes in a recorder and runs the evaluation. Generally, most `run`
+ methods will follow this same pattern: loading the data, calling
+ `eval_all_samples`, and aggregating the recorded results.
+ """
+
+ def __init__(
+ self,
+ model_specs: ModelSpecs,
+ seed: int = 20220722,
+ name: str = "no_name_eval.default",
+ registry: Optional[Registry] = None,
+ ):
+ splits = name.split(".")
+ if len(splits) < 2:
+ raise ValueError(f"Eval name must at least have <base_eval>.<split>. Got name {name}")
+
+ self.model_specs = model_specs
+ self.seed = seed
+ self.name = name
+ self.registry = registry or Registry()
+
+ def eval_sample(self, sample: Any, rng: random.Random):
+ raise NotImplementedError()
+
+ @classmethod
+ def create_and_run(cls, model_specs: ModelSpecs, *args, **kwargs) -> Dict[str, float]:
+ logging.info(f"Running {cls.__name__} with {model_specs}, args: {args}, kwargs: {kwargs}")
+ return cls(model_specs).run(*args, **kwargs)
+
+ @property
+ def model_spec(self) -> ModelSpec:
+ """Helper for more ergonomic access to a single model."""
+ return self.model_specs.completion
+
+ @abc.abstractmethod
+ def run(self, recorder: RecorderBase) -> Dict[str, float]:
+ """Run the evaluation with the corresponding recorder."""
+ raise NotImplementedError()
+
+ async def async_eval_all_samples(
+ self,
+ eval_fn: Callable[[Tuple[Any, int]], Awaitable[Tuple[int, Any]]],
+ samples: List[Any],
+ concurrency: int = 32,
+ show_progress: bool = True,
+ ):
+ work_items = _index_samples(samples)
+ semaphore = asyncio.Semaphore(concurrency)
+
+ async def eval_fn_with_semaphore(args):
+ async with semaphore:
+ return await eval_fn(args)
+
+ futures = [asyncio.ensure_future(eval_fn_with_semaphore(args)) for args in work_items]
+
+ for future in tqdm(
+ asyncio.as_completed(futures), total=len(work_items), disable=not show_progress
+ ):
+ await future
+
+ def eval_all_samples(
+ self,
+ recorder: RecorderBase,
+ samples,
+ show_progress=True,
+ ):
+ """
+ Evaluate all provided samples in parallel.
+ """
+ work_items = _index_samples(samples)
+ threads = int(os.environ.get("EVALS_THREADS", "10"))
+ show_progress = os.environ.get("EVALS_SHOW_EVAL_PROGRESS", str(show_progress)).lower() not in {"", "0", "false", "no"}
+ timeout = float(os.environ.get("EVALS_THREAD_TIMEOUT", "40"))
+
+ def eval_sample(args):
+ """
+ Evaluate a single sample.
+ """
+ sample, idx = args
+ base_name, split = self.name.split(".")[0:2]
+ sample_id = f"{base_name}.{split}.{idx}"
+ with recorder.as_default_recorder(sample_id):
+ recorder.record_raw(sample)
+ seed = f"{sample_id}:{self.seed}".encode("utf-8")
+ rng = random.Random(seed)
+ return idx, self.eval_sample(sample, rng)
+
+ def worker_thread(args):
+ """
+ Worker thread for evaluating a single sample.
+ """
+ while True:
+ executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
+ future = executor.submit(eval_sample, args=args)
+ try:
+ result = future.result(timeout=timeout)
+ return result
+ except concurrent.futures.TimeoutError:  # timed out; the loop retries the sample
+ executor.shutdown(wait=False)
+
+ with ThreadPool(threads) as pool:
+ if os.environ.get("EVALS_SEQUENTIAL", "0") in {"1", "true", "yes"}:
+ logger.info("Running in sequential mode!")
+ iter = map(eval_sample, work_items)
+ else:
+ logger.info(f"Running in threaded mode with {threads} threads!")
+ iter = pool.imap_unordered(worker_thread, work_items)
+ idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
+ return [r for _, r in sorted(idx_and_result)]
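Review note: two pieces of `eval.py` make runs reproducible. The sample order is shuffled with a fixed seed, and each sample gets its own RNG seeded from its sample id. A minimal standalone sketch of both follows. `index_samples` takes `max_samples` as a parameter instead of the module-level global, and `sample_rng` is a hypothetical helper name.

```python
import random

SHUFFLE_SEED = 123


def index_samples(samples, max_samples=None):
    """Shuffle deterministically and pair each sample with its original index."""
    indices = list(range(len(samples)))
    random.Random(SHUFFLE_SEED).shuffle(indices)
    if max_samples is not None:
        indices = indices[:max_samples]
    return [(samples[i], i) for i in indices]


def sample_rng(eval_name: str, idx: int, seed: int = 20220722) -> random.Random:
    """Build the per-sample RNG from `<base_eval>.<split>.<idx>:<seed>`."""
    base_name, split = eval_name.split(".")[0:2]
    sample_id = f"{base_name}.{split}.{idx}"
    return random.Random(f"{sample_id}:{seed}".encode("utf-8"))


first = index_samples(list("abcde"))
second = index_samples(list("abcde"))
subset = index_samples(list("abcde"), max_samples=2)
draw_a = sample_rng("my_eval.dev.v0", 3).random()
draw_b = sample_rng("my_eval.dev.v0", 3).random()
```

Because the RNG depends only on the sample id and the eval seed, a sample draws the same random numbers regardless of which thread evaluates it or in what order.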
diff --git a/evals/evals/formatting.py b/evals/evals/formatting.py
new file mode 100644
index 0000000000000000000000000000000000000000..ec2a2843f388106a58fbbd51e4430ebb1113dec8
--- /dev/null
+++ b/evals/evals/formatting.py
@@ -0,0 +1,34 @@
+"""
+This file defines utilities for adding multiple choice questions to prompts.
+"""
+import random
+from typing import Optional
+
+
+def make_abc(answers, *, correct_idx=0, shuffle=True, rng: Optional[random.Random] = None):
+ """
+ ARGS
+ ====
+ `answers`: A sequence of strings, each of which is an answer choice.
+ `correct_idx`: The integer index of the correct answer.
+ `shuffle`: If True, shuffle the answer choices in the returned string.
+ `rng`: If `shuffle` is True, this is the random number generator to use.
+
+ RETURNS
+ =======
+ A tuple of (options, correct_answer) where `options` is a string of
+ newline-separated answer choices (e.g., "A) blah") and `correct_answer` is
+ the correct answer as a single character (e.g., "A").
+ """
+
+ p = list(range(len(answers)))
+ if shuffle:
+ if rng is None:
+ raise ValueError("shuffle=True requires rng")
+ rng.shuffle(p)
+ options = ""
+ for i, j in enumerate(p):
+ if i > 0:
+ options += "\n"
+ options += chr(ord("A") + i) + ") " + answers[j]
+ return options, chr(ord("A") + p.index(correct_idx))
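Review note: `make_abc` returns the letter of the correct answer after shuffling, which is why it looks up `correct_idx` in the permutation rather than returning a fixed letter. The sketch below is a compact standalone copy of the same logic, included only so the example runs without the package. The answer strings are made up.

```python
import random


def make_abc(answers, *, correct_idx=0, shuffle=True, rng=None):
    """Return (newline-separated options, letter of the correct option)."""
    order = list(range(len(answers)))
    if shuffle:
        if rng is None:
            raise ValueError("shuffle=True requires rng")
        rng.shuffle(order)
    options = "\n".join(
        f"{chr(ord('A') + i)}) {answers[j]}" for i, j in enumerate(order)
    )
    # Position of the correct answer within the (possibly shuffled) order.
    return options, chr(ord("A") + order.index(correct_idx))


answers = ["Paris", "Rome", "Oslo"]
fixed_options, fixed_answer = make_abc(answers, correct_idx=1, shuffle=False)
shuffled_options, shuffled_answer = make_abc(
    answers, correct_idx=1, rng=random.Random(0)
)
```

Passing the per-sample `rng` from `eval_sample` keeps the option order stable across runs while still varying it between samples.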
diff --git a/evals/evals/metrics.py b/evals/evals/metrics.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f9389a3557d1ef6d65c399f8aeb201e7aa8059e
--- /dev/null
+++ b/evals/evals/metrics.py
@@ -0,0 +1,76 @@
+"""
+This file defines various common metrics of interest.
+"""
+import random
+from typing import Optional, Sequence, Set
+
+import numpy as np
+
+from evals.record import Event
+
+
+def get_accuracy(events: Sequence[Event]) -> float:
+ num_correct = 0
+ num_total = 0
+ for event in events:
+ num_total += 1
+ num_correct += int(event.data["correct"])
+ if num_total == 0:
+ return float("nan")
+ else:
+ return num_correct / num_total
+
+
+def get_bootstrap_accuracy_std(events: Sequence[Event], num_samples: int = 1000):
+ vals = [m.data["correct"] for m in events]
+ return np.std([np.mean(random.sample(vals, len(vals) // 2)) for _ in range(num_samples)])
+
+
+def get_confusion_matrix(
+ matches: Sequence[Event], class_labels: Optional[Set] = None
+) -> np.ndarray:
+ labels = set()
+ for match in matches:
+ labels.add(match.data["expected"])
+ if class_labels is None:
+ labels = {label: i for i, label in enumerate(sorted(labels))}
+ else:
+ assert labels.issubset(class_labels)
+ labels = {label: i for i, label in enumerate(class_labels)}
+ result = np.zeros((len(labels), len(labels) + 1), dtype=int)
+ for match in matches:
+ i = labels[match.data["expected"]]
+ j = labels.get(match.data["picked"], len(labels))
+ result[i, j] += 1
+ return result
+
+
+def compute_matthew_corr(confusion_matrix):
+ assert confusion_matrix.shape == (2, 3), f"Got shape: {confusion_matrix.shape}"
+ r = confusion_matrix[:, :2].copy()  # copy so the caller's matrix is not mutated
+ r[:, 0] += confusion_matrix[:, 2]
+ return (r[1, 1] * r[0, 0] - r[1, 0] * r[0, 1]) / np.sqrt(
+ r[1, :].sum() * r[0, :].sum() * r[:, 0].sum() * r[:, 1].sum()
+ )
+
+
+def compute_precision(confusion_matrix, idx=0):
+ return confusion_matrix[idx, idx] / confusion_matrix[:, idx].sum()
+
+
+def compute_recall(confusion_matrix, idx=0):
+ return confusion_matrix[idx, idx] / confusion_matrix[idx, :].sum()
+
+
+def compute_f_score(confusion_matrix, idx=0, beta=1.0):
+ precision = compute_precision(confusion_matrix, idx=idx)
+ recall = compute_recall(confusion_matrix, idx=idx)
+ return (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
+
+
+def compute_averaged_f_score(confusion_matrix, beta=1.0, average="macro"):
+ assert average in ["macro"]
+ f_scores = []
+ for i in range(confusion_matrix.shape[0]):
+ f_scores.append(compute_f_score(confusion_matrix, idx=i, beta=beta))
+ return np.array(f_scores).mean()
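Review note: the confusion matrix built by `get_confusion_matrix` has one row per expected label and one extra trailing column that collects picks outside the label set. The sketch below reproduces that layout with plain lists (no NumPy) on made-up data, then derives precision and recall the same way `compute_precision` and `compute_recall` index the matrix. The helper names are hypothetical.

```python
def confusion_matrix(pairs):
    """Rows: expected label. Columns: picked label, plus one 'other' column."""
    labels = {label: i for i, label in enumerate(sorted({e for e, _ in pairs}))}
    matrix = [[0] * (len(labels) + 1) for _ in labels]
    for expected, picked in pairs:
        # Picks that are not a known label fall into the last column.
        matrix[labels[expected]][labels.get(picked, len(labels))] += 1
    return matrix, labels


def precision(matrix, idx):
    return matrix[idx][idx] / sum(row[idx] for row in matrix)


def recall(matrix, idx):
    return matrix[idx][idx] / sum(matrix[idx])


pairs = [
    ("yes", "yes"),
    ("yes", "no"),
    ("no", "no"),
    ("no", "maybe"),  # "maybe" is not an expected label
    ("yes", "yes"),
]
matrix, labels = confusion_matrix(pairs)
yes = labels["yes"]
yes_precision = precision(matrix, yes)
yes_recall = recall(matrix, yes)
```

Here the matrix is `[[1, 0, 1], [1, 2, 0]]`: every `"yes"` pick was correct (precision 1.0), but one of three expected `"yes"` samples was missed (recall 2/3).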
diff --git a/evals/evals/prompt/base.py b/evals/evals/prompt/base.py
new file mode 100644
index 0000000000000000000000000000000000000000..71946a2e4f80cb02b22224f8c27edc5199180ff7
--- /dev/null
+++ b/evals/evals/prompt/base.py
@@ -0,0 +1,118 @@
+"""
+This file defines the classes for how to manage prompts for different types of
+models, i.e., "chat models" vs. "non chat models".
+"""
+import logging
+import threading
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from typing import Dict, List, Union
+
+logger = logging.getLogger(__name__)
+ENCODER_LOCK = threading.Lock()
+
+# This is an approximation to the type accepted as the `prompt` field to `openai.Completion.create` calls
+OpenAICreatePrompt = Union[str, list[str], list[int], list[list[int]]]
+
+# This is the type accepted as the `prompt` field to `openai.ChatCompletion.create` calls
+OpenAIChatMessage = Dict[str, str] # A message is a dictionary with "role" and "content" keys
+OpenAICreateChatPrompt = List[OpenAIChatMessage] # A chat log is a list of messages
+
+
+def chat_prompt_to_text_prompt(
+ prompt: OpenAICreateChatPrompt, render_for_completion: bool = True
+) -> str:
+ """
+ Render a chat prompt as a text prompt. User and assistant messages are separated by newlines
+ and prefixed with "User: " and "Assistant: ", respectively, unless there is only one message.
+ System messages have no prefix.
+ """
+ assert is_chat_prompt(prompt), f"Expected a chat prompt, got {prompt}"
+ chat_to_prefixes = {
+ # roles
+ "system": "",
+ # names
+ "example_user": "User: ",
+ "example_assistant": "Assistant: ",
+ }
+
+ # For a single message, be it system, user, or assistant, just return the message
+ if len(prompt) == 1:
+ return prompt[0]["content"]
+
+ text = ""
+ for msg in prompt:
+ role = msg["name"] if "name" in msg else msg["role"]
+ prefix = chat_to_prefixes.get(role, role.capitalize() + ": ")
+ content = msg["content"]
+ text += f"{prefix}{content}\n"
+ if render_for_completion:
+ text += "Assistant: "
+ return text.lstrip()
+
+
+def text_prompt_to_chat_prompt(prompt: str, role: str = "system") -> OpenAICreateChatPrompt:
+ assert isinstance(prompt, str), f"Expected a text prompt, got {prompt}"
+ return [
+ {"role": role, "content": prompt},
+ ]
+
+
+@dataclass
+class Prompt(ABC):
+ """
+ A `Prompt` encapsulates everything required to present the `raw_prompt` in different formats,
+ e.g., a normal unadorned format vs. a chat format.
+ """
+
+ @abstractmethod
+ def to_openai_create_prompt(self):
+ """
+ Return the actual data to be passed as the `prompt` field to either `openai.ChatCompletion.create`,
+ if the model is a chat model, or `openai.Completion.create` otherwise.
+ See the above types to see what each API call is able to handle.
+ """
+
+
+def is_chat_prompt(prompt: Prompt) -> bool:
+ return isinstance(prompt, list) and all(isinstance(msg, dict) for msg in prompt)
+
+
+@dataclass
+class CompletionPrompt(Prompt):
+ """
+ A `Prompt` object that wraps prompts to be compatible with non chat models, which use `openai.Completion.create`.
+ """
+
+ raw_prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt]
+
+ def _render_chat_prompt_as_text(self, prompt: OpenAICreateChatPrompt) -> OpenAICreatePrompt:
+ return chat_prompt_to_text_prompt(prompt)
+
+ def to_openai_create_prompt(self) -> OpenAICreatePrompt:
+ if is_chat_prompt(self.raw_prompt):
+ return self._render_chat_prompt_as_text(self.raw_prompt)
+ return self.raw_prompt
+
+
+@dataclass
+class ChatCompletionPrompt(Prompt):
+ """
+ A `Prompt` object that wraps prompts to be compatible with chat models, which use `openai.ChatCompletion.create`.
+
+ The format expected by chat models is a list of messages, where each message is a dict with "role" and "content" keys.
+ """
+
+ raw_prompt: Union[OpenAICreatePrompt, OpenAICreateChatPrompt]
+
+ def _render_text_as_chat_prompt(self, prompt: str) -> OpenAICreateChatPrompt:
+ """
+ Render a text string as a chat prompt. The default option we adopt here is to simply take the full prompt
+ and treat it as a system message.
+ """
+ return text_prompt_to_chat_prompt(prompt)
+
+ def to_openai_create_prompt(self) -> OpenAICreateChatPrompt:
+ if is_chat_prompt(self.raw_prompt):
+ return self.raw_prompt
+ return self._render_text_as_chat_prompt(self.raw_prompt)
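Review note: `chat_prompt_to_text_prompt` is easiest to follow on a concrete prompt. The sketch below is a standalone restatement of the rendering rules from the diff under the hypothetical name `render_chat_as_text`. The messages are made up.

```python
def render_chat_as_text(prompt, render_for_completion=True):
    """Flatten a list of chat messages into one text prompt."""
    prefixes = {
        "system": "",
        "example_user": "User: ",
        "example_assistant": "Assistant: ",
    }
    # A single message is returned as-is, with no prefix.
    if len(prompt) == 1:
        return prompt[0]["content"]
    text = ""
    for msg in prompt:
        # A `name` (used for few-shot examples) takes precedence over the role.
        role = msg["name"] if "name" in msg else msg["role"]
        prefix = prefixes.get(role, role.capitalize() + ": ")
        text += f"{prefix}{msg['content']}\n"
    if render_for_completion:
        text += "Assistant: "
    return text.lstrip()


chat = [
    {"role": "system", "content": "Translate to French."},
    {"role": "system", "name": "example_user", "content": "Hello"},
    {"role": "system", "name": "example_assistant", "content": "Bonjour"},
    {"role": "user", "content": "Goodbye"},
]
rendered = render_chat_as_text(chat)
single = render_chat_as_text([{"role": "user", "content": "Hi"}])
```

The trailing `"Assistant: "` cues a completion model to answer in the assistant's voice; `PromptFn` passes `render_for_completion=False` because there the text is embedded inside another prompt.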
diff --git a/evals/evals/record.py b/evals/evals/record.py
new file mode 100644
index 0000000000000000000000000000000000000000..d02ab20d5fbeb1762c53a7fc308ae214057f9b6b
--- /dev/null
+++ b/evals/evals/record.py
@@ -0,0 +1,480 @@
+"""
+This file defines the recorder classes which log eval results in different ways,
+such as to a local JSON file or to a remote Snowflake database.
+
+If you would like to implement a custom recorder, you can see how the
+`LocalRecorder` and `Recorder` classes inherit from the `RecorderBase` class and
+override certain methods.
+"""
+import atexit
+import contextlib
+import dataclasses
+import logging
+import threading
+import time
+from contextvars import ContextVar
+from datetime import datetime, timezone
+from typing import Any, List, Optional, Sequence
+
+import blobfile as bf
+
+import evals
+from evals.base import RunSpec
+from evals.data import jsondumps
+from evals.utils.misc import t
+from evals.utils.snowflake import SnowflakeConnection
+
+logger = logging.getLogger(__name__)
+
+MIN_FLUSH_EVENTS = 100
+MAX_SNOWFLAKE_BYTES = 16 * 10**6
+MIN_FLUSH_SECONDS = 10
+
+_default_recorder: ContextVar[Optional["RecorderBase"]] = ContextVar(
+ "default_recorder", default=None
+)
+
+
+def default_recorder() -> Optional["RecorderBase"]:
+ return _default_recorder.get()
+
+
+@dataclasses.dataclass
+class Event:
+ run_id: str
+ event_id: int
+ sample_id: Optional[str]
+ type: str
+ data: dict
+ created_by: str
+ created_at: str
+
+
+class RecorderBase:
+ """
+ The standard events for which recording methods are provided are:
+ - `match`: A match or non match, as specified by the `correct` bool, between
+ the `expected` and `picked` results.
+ - `embedding`: An embedding of the `prompt` of type `embedding_type`.
+ - `sampling`: What was `sampled` from the model given the input `prompt`.
+ - `cond_logp`: The conditional log probability, as `logp`, of the
+ `completion` from the model given the input `prompt`.
+ - `pick_option`: The option `picked` by the model out of the valid `options`
+ given the input `prompt`.
+ - `raw_sample`: A raw sample specified by the `data`.
+ - `metrics`: A set of metrics specified by the `kwargs`.
+ - `error`: An `error` along with an accompanying `msg`.
+ - `extra`: Any extra `data` of interest to be recorded.
+ For these events, helper methods are defined at the bottom of this file.
+ More generally, you can record any event by calling `record_event` with the
+ event `type` and `data`.
+ Finally, you can also record a final report using `record_final_report`.
+ """
+
+ def __init__(
+ self,
+ run_spec: evals.base.RunSpec,
+ ) -> None:
+ self._sample_id: ContextVar[Optional[str]] = ContextVar("_sample_id", default=None)
+ self.run_spec = run_spec
+ self._events: List[Event] = []
+ self._last_flush_time = time.time()
+ self._flushes_done = 0
+ self._written_events = 0
+ self._flushes_started = 0
+ self._event_lock = threading.Lock()
+ atexit.register(self.flush_events)
+
+ @contextlib.contextmanager
+ def as_default_recorder(self, sample_id: str):
+ sample_id_token = self._sample_id.set(sample_id)
+ default_recorder_token = _default_recorder.set(self)
+ yield
+ _default_recorder.reset(default_recorder_token)
+ self._sample_id.reset(sample_id_token)
+
+ def current_sample_id(self) -> Optional[str]:
+ return self._sample_id.get()
+
+ def get_events(self, type: str) -> Sequence[Event]:
+ with self._event_lock:
+ return [event for event in self._events if event.type == type]
+
+ def get_metrics(self):
+ return list(map(lambda x: x.data, self.get_events("metrics")))
+
+ def get_scores(self, key: str):
+ return list(map(lambda e: e.data[key], self.get_events("metrics")))
+
+ def _create_event(self, type, data=None, sample_id=None):
+ if sample_id is None:
+ sample_id = self.current_sample_id()
+ if sample_id is None:
+ raise ValueError("No sample_id set! Either pass it in or use as_default_recorder!")
+
+ return Event(
+ run_id=self.run_spec.run_id,
+ event_id=len(self._events),
+ type=type,
+ sample_id=sample_id,
+ data=data,
+ created_by=self.run_spec.created_by,
+ created_at=str(datetime.now(timezone.utc)),
+ )
+
+ def _flush_events_internal(self, events_to_write: Sequence[Event]):
+ pass
+
+ def flush_events(self):
+ with self._event_lock:
+ if len(self._events) == self._written_events:
+ return
+ events_to_write = self._events[self._written_events :]
+ self._written_events = len(self._events)
+ self._flushes_started += 1
+ self._flush_events_internal(events_to_write)
+
+ def record_event(self, type, data=None, sample_id=None):
+ if sample_id is None:
+ sample_id = self.current_sample_id()
+ if sample_id is None:
+ raise ValueError("No sample_id set! Either pass it in or use as_default_recorder!")
+
+ with self._event_lock:
+ event = Event(
+ run_id=self.run_spec.run_id,
+ event_id=len(self._events),
+ type=type,
+ sample_id=sample_id,
+ data=data,
+ created_by=self.run_spec.created_by,
+ created_at=str(datetime.now(timezone.utc)),
+ )
+ self._events.append(event)
+ if (
+ self._flushes_done < self._flushes_started
+ or len(self._events) < self._written_events + MIN_FLUSH_EVENTS
+ or time.time() < self._last_flush_time + MIN_FLUSH_SECONDS
+ ):
+ return
+ events_to_write = self._events[self._written_events :]
+ self._written_events = len(self._events)
+ self._flushes_started += 1
+ self._flush_events_internal(events_to_write)
+
+ def record_match(self, correct: bool, *, expected=None, picked=None, sample_id=None, **extra):
+ assert isinstance(
+ correct, bool
+ ), f"correct must be a bool, but was a {type(correct)}: {correct}"
+
+ if isinstance(expected, list) and len(expected) == 1:
+ expected = expected[0]
+ data = {
+ "correct": bool(correct),
+ "expected": expected,
+ "picked": picked,
+ **extra,
+ }
+ self.record_event("match", data, sample_id=sample_id)
+
+ def record_embedding(self, prompt, embedding_type, sample_id=None, **extra):
+ data = {
+ "prompt": prompt,
+ "embedding_type": embedding_type,
+ **extra,
+ }
+ self.record_event("embedding", data, sample_id=sample_id)
+
+ def record_sampling(self, prompt, sampled, sample_id=None, **extra):
+ data = {
+ "prompt": prompt,
+ "sampled": sampled,
+ **extra,
+ }
+ self.record_event("sampling", data, sample_id=sample_id)
+
+ def record_cond_logp(self, prompt, completion, logp, sample_id=None, **extra):
+ data = {
+ "prompt": prompt,
+ "completion": completion,
+ "logp": logp,
+ **extra,
+ }
+ self.record_event("cond_logp", data, sample_id=sample_id)
+
+ def record_pick_option(self, prompt, options, picked, sample_id=None, **extra):
+ data = {
+ "prompt": prompt,
+ "options": options,
+ "picked": picked,
+ **extra,
+ }
+ self.record_event("pick_option", data, sample_id=sample_id)
+
+ def record_raw(self, data):
+ self.record_event("raw_sample", data)
+
+ def record_metrics(self, **kwargs):
+ self.record_event("metrics", kwargs)
+
+ def record_error(self, msg: str, error: Exception, **kwargs):
+ data = {
+ "type": type(error).__name__,
+ "message": str(error),
+ }
+ data.update(kwargs)
+ self.record_event("error", data)
+
+ def record_extra(self, data, sample_id=None):
+ self.record_event("extra", data, sample_id=sample_id)
+
+ def record_final_report(self, final_report: Any):
+ logging.info(f"Final report: {final_report}. Not writing anywhere.")
+
+
+def _green(str):
+ return f"\033[1;32m{str}\033[0m"
+
+
+def _red(str):
+ return f"\033[1;31m{str}\033[0m"
+
+
+class DummyRecorder(RecorderBase):
+ """
+ A "recorder" which only logs certain events to the console.
+ Can be used by passing `--dry-run` when invoking `oaieval`.
+ """
+
+ def __init__(self, run_spec: RunSpec, log: bool = True):
+ super().__init__(run_spec)
+ self.log = log
+
+ def record_event(self, type, data, sample_id=None):
+ from evals.registry import registry
+
+ if self.run_spec is None:
+ return
+
+ base_eval_spec = registry.get_base_eval(self.run_spec.base_eval)
+ if base_eval_spec and len(base_eval_spec.metrics) >= 1:
+ primary_metric = base_eval_spec.metrics[0]
+ else:
+ primary_metric = "accuracy"
+
+ with self._event_lock:
+ event = self._create_event(type, data)
+ self._events.append(event)
+
+ msg = f"Not recording event: {event}"
+
+ if type == "match":
+ accuracy_good = (
+ primary_metric == "accuracy" or primary_metric.startswith("pass@")
+ ) and (data.get("correct", False) or data.get("accuracy", 0) > 0.5)
+ f1_score_good = primary_metric == "f1_score" and data.get("f1_score", 0) > 0.5
+ if accuracy_good or f1_score_good:
+ msg = _green(msg)
+ else:
+ msg = _red(msg)
+
+ if self.log:
+ logging.info(msg)
+
+
+class LocalRecorder(RecorderBase):
+ """
+ A recorder which logs events to the specified JSON file.
+ This is the default recorder used by `oaieval`.
+ """
+
+ def __init__(self, log_path: Optional[str], run_spec: RunSpec):
+ super().__init__(run_spec)
+ self.event_file_path = log_path
+ if log_path is not None:
+ with bf.BlobFile(log_path, "wb") as f:
+ f.write((jsondumps({"spec": dataclasses.asdict(run_spec)}) + "\n").encode("utf-8"))
+
+ def _flush_events_internal(self, events_to_write: Sequence[Event]):
+ start = time.time()
+ try:
+ lines = [jsondumps(event) + "\n" for event in events_to_write]
+ except TypeError as e:
+ logger.error(f"Failed to serialize events: {events_to_write}")
+ raise e
+
+ with bf.BlobFile(self.event_file_path, "ab") as f:
+ f.write(b"".join([l.encode("utf-8") for l in lines]))
+
+ logger.info(
+ f"Logged {len(lines)} rows of events to {self.event_file_path}: insert_time={t(time.time()-start)}"
+ )
+
+ self._last_flush_time = time.time()
+ self._flushes_done += 1
+
+ def record_final_report(self, final_report: Any):
+ with bf.BlobFile(self.event_file_path, "ab") as f:
+ f.write((jsondumps({"final_report": final_report}) + "\n").encode("utf-8"))
+
+ logging.info(f"Final report: {final_report}. Logged to {self.event_file_path}")
+
+
+class Recorder(RecorderBase):
+    """
+    A recorder which logs events to Snowflake.
+    Can be used by passing `--no-local-run` when invoking `oaieval`.
+    """
+
+    def __init__(
+        self,
+        log_path: Optional[str],
+        run_spec: evals.base.RunSpec,
+        snowflake_connection: Optional[SnowflakeConnection] = None,
+    ) -> None:
+        super().__init__(run_spec)
+        self.event_file_path = log_path
+        self._writing_lock = threading.Lock()
+
+        if snowflake_connection is None:
+            snowflake_connection = SnowflakeConnection()
+        self._conn = snowflake_connection
+
+        if log_path is not None:
+            with bf.BlobFile(log_path, "wb") as f:
+                f.write((jsondumps({"spec": dataclasses.asdict(run_spec)}) + "\n").encode("utf-8"))
+
+        query = """
+            INSERT ALL INTO runs (run_id, model_name, eval_name, base_eval, split, run_config, settings, created_by, created_at)
+            VALUES (%(run_id)s, %(model_name)s, %(eval_name)s, %(base_eval)s, %(split)s, run_config, settings, %(created_by)s, %(created_at)s)
+            SELECT PARSE_JSON(%(run_config)s) AS run_config, PARSE_JSON(%(settings)s) AS settings
+        """
+        self._conn.robust_query(
+            command=query,
+            params={
+                "run_id": run_spec.run_id,
+                "model_name": jsondumps(run_spec.model_names),
+                "eval_name": run_spec.eval_name,
+                "base_eval": run_spec.base_eval,
+                "split": run_spec.split,
+                "run_config": jsondumps(run_spec.run_config),
+                "settings": jsondumps(run_spec.run_config.get("initial_settings", {})),
+                "created_by": run_spec.created_by,
+                "created_at": run_spec.created_at,
+            },
+        )
+        atexit.register(self.flush_events)
+
+    def _flush_events_internal(self, events_to_write: Sequence[Event]):
+        with self._writing_lock:
+            try:
+                lines = [jsondumps(event) + "\n" for event in events_to_write]
+            except TypeError as e:
+                logger.error(f"Failed to serialize events: {events_to_write}")
+                raise e
+            idx_l = 0
+            while idx_l < len(events_to_write):
+                total_bytes = 0
+                idx_r = idx_l
+                while (
+                    idx_r < len(events_to_write)
+                    and total_bytes + len(lines[idx_r]) < MAX_SNOWFLAKE_BYTES
+                ):
+                    total_bytes += len(lines[idx_r])
+                    idx_r += 1
+                assert idx_r > idx_l
+                start = time.time()
+                buffer = [
+                    (
+                        event.run_id,
+                        event.event_id,
+                        event.sample_id,
+                        event.type,
+                        jsondumps(event.data),
+                        event.created_by,
+                        event.created_at,
+                    )
+                    for event in events_to_write[idx_l:idx_r]
+                ]
+                query = """
+                    INSERT INTO events (run_id, event_id, sample_id, type, data, created_by, created_at)
+                    SELECT Column1 AS run_id, Column2 as event_id, Column3 AS sample_id, Column4 AS type, PARSE_JSON(Column5) AS data, Column6 AS created_by, Column7 AS created_at
+                    FROM VALUES(%s, %s, %s, %s, %s, %s, %s)
+                """
+                self._conn.robust_query(command=query, seqparams=buffer, many=True)
+                logger.info(
+                    f"Logged {len(buffer)} rows of events to Snowflake: insert_time={t(time.time()-start)}"
+                )
+                idx_l = idx_r
+
+            with bf.BlobFile(self.event_file_path, "ab") as f:
+                f.write(b"".join([line.encode("utf-8") for line in lines]))
+            self._last_flush_time = time.time()
+            self._flushes_done += 1
+
+    def record_final_report(self, final_report: Any):
+        with self._writing_lock:
+            with bf.BlobFile(self.event_file_path, "ab") as f:
+                f.write((jsondumps({"final_report": final_report}) + "\n").encode("utf-8"))
+            query = """
+                UPDATE runs
+                SET final_report = PARSE_JSON(%(final_report)s)
+                WHERE run_id = %(run_id)s
+            """
+            self._conn.robust_query(
+                command=query,
+                params={
+                    "run_id": self.run_spec.run_id,
+                    "final_report": jsondumps(final_report),
+                },
+            )
+
+    def record_event(self, type, data=None, sample_id=None):
+        # try to serialize data so we fail early!
+        _ = jsondumps(data)
+        return super().record_event(type, data, sample_id)
+
+
+#########################################################################
+### Helper methods which use the thread local global default recorder ###
+#########################################################################
+
+
+def current_sample_id() -> str:
+    return default_recorder().current_sample_id
+
+
+def record_match(correct: bool, *, expected=None, picked=None, **extra):
+    return default_recorder().record_match(correct, expected=expected, picked=picked, **extra)
+
+
+def record_embedding(prompt, embedding_type, **extra):
+    return default_recorder().record_embedding(prompt, embedding_type, **extra)
+
+
+def record_sampling(prompt, sampled, **extra):
+    return default_recorder().record_sampling(prompt, sampled, **extra)
+
+
+def record_cond_logp(prompt, completion, logp, **extra):
+    return default_recorder().record_cond_logp(prompt, completion, logp, **extra)
+
+
+def record_pick_option(prompt, options, picked, **extra):
+    return default_recorder().record_pick_option(prompt, options, picked, **extra)
+
+
+def record_raw(data):
+    return default_recorder().record_raw(data)
+
+
+def record_metrics(**extra):
+    return default_recorder().record_metrics(**extra)
+
+
+def record_error(msg: str, error: Optional[Exception] = None, **extra):
+    return default_recorder().record_error(msg, error, **extra)
+
+
+def record_extra(data):
+    return default_recorder().record_extra(data)
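Review note: the inner `while` loop in `Recorder._flush_events_internal` groups serialized events into consecutive batches whose combined size stays below `MAX_SNOWFLAKE_BYTES`. The standalone sketch below illustrates that idea only. The name `chunk_by_size` and the `max_bytes` value are hypothetical and not part of this diff. It also deviates from the diff in two labeled ways: it measures UTF-8 byte length rather than `len()` of the string, and it always places at least one line in a batch instead of relying on `assert idx_r > idx_l`.

```python
from typing import Iterator, List, Sequence


def chunk_by_size(lines: Sequence[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield consecutive batches whose total UTF-8 size stays below max_bytes.

    Hypothetical helper for illustration: a single line that exceeds the
    budget on its own is emitted alone rather than tripping an assertion.
    """
    idx_l = 0
    while idx_l < len(lines):
        total_bytes = 0
        idx_r = idx_l
        while idx_r < len(lines):
            size = len(lines[idx_r].encode("utf-8"))
            # Stop growing the batch once the budget would be reached,
            # but always accept the first line of a batch.
            if idx_r > idx_l and total_bytes + size >= max_bytes:
                break
            total_bytes += size
            idx_r += 1
        yield list(lines[idx_l:idx_r])
        idx_l = idx_r


# Made-up input: four serialized lines and a tiny budget of 10 bytes.
batches = list(chunk_by_size(["aaaa\n", "bb\n", "cccccccccc\n", "d\n"], max_bytes=10))
```

With this input the first two lines share a batch, the oversized third line travels alone, and the last line forms its own batch. Whether an oversized event should be sent alone or rejected is a design decision this sketch does not settle.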
diff --git a/evals/evals/registry.py b/evals/evals/registry.py
new file mode 100644
index 0000000000000000000000000000000000000000..b80d936d035e860b5b3ee815dd5d0bbc799c6eec
--- /dev/null
+++ b/evals/evals/registry.py
@@ -0,0 +1,174 @@
+"""
+Functions to handle registration of evals. To add a new eval to the registry,
+add an entry in one of the YAML files in the `../registry` dir.
+By convention, every eval name should start with {base_eval}.{split}.
+"""
+
+import difflib
+import functools
+import logging
+import os
+import re
+from functools import partial
+from pathlib import Path
+from typing import Any, Iterator, Sequence, Type, Union
+
+import yaml
+
+from evals.base import BaseEvalSpec, EvalSetSpec, EvalSpec
+from evals.utils.misc import make_object
+
+logger = logging.getLogger(__name__)
+
+DEFAULT_PATHS = [Path(__file__).parents[0].resolve() / "registry", Path.home() / ".evals"]
+
+
+class Registry:
+    def __init__(self, registry_paths: Sequence[Union[str, Path]] = DEFAULT_PATHS):
+        self._registry_paths = [Path(p) if isinstance(p, str) else p for p in registry_paths]
+
+    def make_callable(self, spec):
+        return partial(make_object(spec.cls).create_and_run, **(spec.args or {}))
+
+    def get_class(self, spec: dict) -> Any:
+        return make_object(spec.cls, **(spec.args if spec.args else {}))
+
+    def _dereference(self, name: str, d: dict, object: str, type: Type) -> dict:
+        if name not in d:
+            return None
+
+        def get_alias():
+            if isinstance(d[name], str):
+                return d[name]
+            if isinstance(d[name], dict) and "id" in d[name]:
+                return d[name]["id"]
+            return None
+
+        logger.debug(f"Looking for {name}")
+        while True:
+            alias = get_alias()
+
+            if alias is None:
+                break
+            name = alias
+
+        spec = d[name]
+
+        try:
+            return type(**spec)
+        except TypeError as e:
+            raise TypeError(f"Error while processing {object} {name}: {e}") from e
+
+    def get_modelgraded_spec(self, name: str) -> dict[str, Any]:
+        assert name in self._modelgraded_specs, (
+            f"Modelgraded spec {name} not found. "
+            f"Closest matches: {difflib.get_close_matches(name, self._modelgraded_specs.keys(), n=5)}"
+        )
+        return self._modelgraded_specs[name]
+
+    def get_eval(self, name: str) -> EvalSpec:
+        return self._dereference(name, self._evals, "eval", EvalSpec)
+
+    def get_eval_set(self, name: str) -> EvalSetSpec:
+        return self._dereference(name, self._eval_sets, "eval set", EvalSetSpec)
+
+    def get_evals(self, patterns: Sequence[str]) -> Iterator[EvalSpec]:
+        # valid patterns: hello, hello.dev*, hello.dev.*-v1
+        def get_regexp(pattern):
+            pattern = pattern.replace(".", "\\.")
+            pattern = pattern.replace("*", ".*")
+            return re.compile(f"^{pattern}$")
+
+        regexps = list(map(get_regexp, patterns))
+        for name in self._evals:
+            # if any regexp matches the name, yield the corresponding eval spec
+            if any(map(lambda regexp: regexp.match(name), regexps)):
+                yield self.get_eval(name)
+
+    def get_base_evals(self) -> list[BaseEvalSpec]:
+        base_evals = []
+        for name, spec in self._evals.items():
+            if name.count(".") == 0:
+                base_evals.append(self.get_base_eval(name))
+        return base_evals
+
+    def get_base_eval(self, name: str) -> BaseEvalSpec:
+        if name not in self._evals:
+            return None
+
+        spec_or_alias = self._evals[name]
+        if isinstance(spec_or_alias, dict):
+            spec = spec_or_alias
+            try:
+                return BaseEvalSpec(**spec)
+            except TypeError as e:
+                raise TypeError(f"Error while processing base eval {name}: {e}") from e
+
+        alias = spec_or_alias
+        return BaseEvalSpec(id=alias)
+
+    def _process_file(self, registry, path):
+        with open(path, "r") as f:
+            d = yaml.safe_load(f)
+
+        if d is None:
+            # no entries in the file
+            return
+
+        for name, spec in d.items():
+            assert name not in registry, f"duplicate entry: {name} from {path}"
+            if isinstance(spec, dict):
+                if "key" in spec:
+                    raise ValueError(
+                        f"key is a reserved keyword, but was used in {name} from {path}"
+                    )
+                if "group" in spec:
+                    raise ValueError(
+                        f"group is a reserved keyword, but was used in {name} from {path}"
+                    )
+                if "cls" in spec:
+                    raise ValueError(
+                        f"cls is a reserved keyword, but was used in {name} from {path}"
+                    )
+
+                spec["key"] = name
+                spec["group"] = str(os.path.basename(path).split(".")[0])
+                if "class" in spec:
+                    spec["cls"] = spec["class"]
+                    del spec["class"]
+            registry[name] = spec
+
+    def _process_directory(self, registry, path):
+        files = Path(path).glob("*.yaml")
+        for file in files:
+            self._process_file(registry, file)
+
+    def _load_registry(self, paths):
+        """Load registry from a list of paths.
+
+        Each path or yaml specifies a dictionary of name -> spec.
+        """
+        registry = {}
+        for path in paths:
+            logging.info(f"Loading registry from {path}")
+            if os.path.exists(path):
+                if os.path.isdir(path):
+                    self._process_directory(registry, path)
+                else:
+                    self._process_file(registry, path)
+        return registry
+
+    @functools.cached_property
+    def _eval_sets(self):
+        return self._load_registry([p / "eval_sets" for p in self._registry_paths])
+
+    @functools.cached_property
+    def _evals(self):
+        return self._load_registry([p / "evals" for p in self._registry_paths])
+
+    @functools.cached_property
+    def _modelgraded_specs(self):
+        return self._load_registry([p / "modelgraded" for p in self._registry_paths])
+
+
+registry = Registry()
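Review note: `Registry.get_evals` turns each glob-style pattern into an anchored regular expression by escaping `.` and expanding `*` to `.*`. The sketch below reproduces just that translation outside the class so the matching behaviour can be checked in isolation. The function name `match_eval_names` and the eval names in the example are made up for illustration; other regex metacharacters are left unescaped, as in the diff.

```python
import re
from typing import Iterable, List


def match_eval_names(names: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the names that fully match at least one glob-style pattern."""

    def to_regexp(pattern: str) -> re.Pattern:
        # Same two substitutions as get_regexp in the diff:
        # a literal dot is escaped, and '*' becomes "any run of characters".
        pattern = pattern.replace(".", "\\.")
        pattern = pattern.replace("*", ".*")
        return re.compile(f"^{pattern}$")

    regexps = [to_regexp(p) for p in patterns]
    return [name for name in names if any(r.match(name) for r in regexps)]


# Hypothetical registry keys, chosen to mirror the patterns in the code comment.
names = ["hello", "hello.dev.v0", "hello.dev.match-v1", "goodbye.dev.v0"]

exact = match_eval_names(names, ["hello"])
prefix = match_eval_names(names, ["hello.dev*"])
suffix = match_eval_names(names, ["hello.dev.*-v1"])
```

Because the expression is anchored with `^` and `$`, the bare pattern `hello` selects only the entry named exactly `hello`, not every name that begins with it.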
diff --git a/evals/evals/registry/data/README.md b/evals/evals/registry/data/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9c1afe839cfde4a5eb1d9464f478dbd8f842f863
--- /dev/null
+++ b/evals/evals/registry/data/README.md
@@ -0,0 +1,44 @@
+### Registry Data
+
+The JSONL files need to be pulled via `git-lfs` or downloaded to view.
+
+Here are some example JSONLs for reference and how they are used in evals. See our [eval templates docs](../../../docs/eval-templates.md) for more details.
+
+`test_match/samples.jsonl`: In the associated eval from [`test-basic.yaml`](../evals/test-basic.yaml), this data is used with the `Match` class, which checks whether a completion starts with the value of the "ideal" key.
+```json
+{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "Once upon a "}], "ideal": "time"}
+{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "The first US president was "}], "ideal": "George Washington"}
+{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "OpenAI was founded in 20"}], "ideal": "15"}
+```
+Another example of a Match eval is:
+```json
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Spell this sentence backwards, character by character: We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests."}], "ideal": ".stseuqer etairporppani tcejer dna ,sesimerp tcerrocni egnellahc ,sekatsim sti timda ,snoitseuq puwollof rewsna ot TPGtahC rof elbissop ti sekam tamrof eugolaid ehT .yaw lanoitasrevnoc a ni stcaretni hcihw TPGtahC dellac ledom a deniart ev’eW"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Spell this sentence backwards, character by character: Latencies will vary over time so we recommend benchmarking prior to making deployment decisions"}], "ideal": "snoisiced tnemyolped gnikam ot roirp gnikramhcneb dnemmocer ew os emit revo yrav lliw seicnetaL"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Spell this sentence backwards, character by character: Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity."}], "ideal": ".ytinamuh fo lla stifeneb—snamuh naht retrams yllareneg era taht smetsys IA—ecnegilletni lareneg laicifitra taht erusne ot si noissim ruO"}
+{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Spell this sentence backwards, character by character: There are several things we think are important to do now to prepare for AGI."}], "ideal": ".IGA rof eraperp ot won od ot tnatropmi era kniht ew sgniht lareves era erehT"}
+```
+
+`test_fuzzy_match/samples.jsonl`: In the associated eval from [`test-basic.yaml`](../evals/test-basic.yaml), this data is used with the `FuzzyMatch` class, which checks whether a completion includes a normalized version of the "ideal" value, or vice versa.
+```json
+{"input": [{"role": "system", "content": "Answer the following questions as concisely as possible."}, {"role": "system", "content": "What's the capital of France?", "name": "example_user"}, {"role": "system", "content": "Paris", "name": "example_assistant"}, {"role": "system", "content": "What's 2+2?", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}, {"role": "user", "content": "Who is the girl who plays eleven in stranger things?"}], "ideal": ["Millie Bobby Brown"]}
+{"input": [{"role": "system", "content": "Answer the following questions as concisely as possible."}, {"role": "system", "content": "What's the capital of France?", "name": "example_user"}, {"role": "system", "content": "Paris", "name": "example_assistant"}, {"role": "system", "content": "What's 2+2?", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}, {"role": "user", "content": "What season did derek die in grey's?"}], "ideal": ["Season 11", "11"]}
+{"input": [{"role": "system", "content": "Answer the following questions as concisely as possible."}, {"role": "system", "content": "What's the capital of France?", "name": "example_user"}, {"role": "system", "content": "Paris", "name": "example_assistant"}, {"role": "system", "content": "What's 2+2?", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}, {"role": "user", "content": "Who played the girl elf in the hobbit?"}], "ideal": ["Evangeline Lilly"]}
+```
+
+`logic/samples.jsonl`: In the associated eval from [`logic.yaml`](../evals/logic.yaml), this data is used with the `ModelBasedClassify` class and the [`fact`](../modelgraded/fact.yaml) model-graded YAML, which compares the factual content of the completion against a ground truth.
+```json
+{"input":[{"role":"system","content":"Solve the following logical puzzle. Carefully think step by step, and show your reasoning. If there is not enough information to solve the puzzle, conclude with 'There is not enough information.' There are five students, Anna, Bob and Cynthia, Dan and Eliza. They all tell the truth. Anna is taller than Cynthia. Bob says he's taller than Anna if and only if Eliza is the shortest. Cynthia is taller than Dan. Eliza is shorter than Dan. Who's the tallest in the group? Let's think step by step:"}],"ideal":"Anna > Cynthia > Dan > Eliza. But, based on Bob's statement, there are still two possibilities: 1. Bob is taller than Eliza, making Eliza the shortest, making Bob taller than Anna, making Bob the tallest. 2. Bob is shorter than Eliza: this would still be valid, as Eliza wouldn't be the shortest and therefore Bob isn't taller than Anna. And Anna would be the tallest. So there's not enough information"}
+{"input":[{"role":"system","content":"Laura thinks that Jessica thinks that Angie is only 23 years old. Angie thinks Josie knows where Laura's mother is. Jessica thinks Laura was once an engineer. Josie thinks Laura is friendly. Based on the text, what thoughts do we know that Laura, Jessica, Angie, and Josie have?"}],"ideal":"Laura thinks: Jessica thinks Angie is only 23 years old. Jessica thinks: Laura was once an engineer. Angie thinks: Josie knows where Laura's mother is. Josie thinks: Laura is friendly."}
+{"input":[{"role":"system","content":"At a party, there are 100 people. Some always lie and some always tell the truth. They all know which one of them is a truth-teller and which one is a liar. After the party, you ask each person how many truth-tellers they shook hands with. Each person gives a different answer, ranging from 0 to 99. How many were truth-tellers and how many were liars?"}],"ideal":"There is 1 truth-teller and 99 liars at the party."}
+{"input":[{"role":"system","content":"Two people want to cross a river. The only way to get across is with a boat that they find on one side; but that boat can only take one person at a time. The boat cannot return on its own, and there are no ropes to haul it back, yet both persons manage to cross using the boat. How did they do it?"}],"ideal":"The people are on different sides of the river, so the person on the same side as the boat originally can cross first to bring the boat to the side with the other person, then that person can cross."}
+{"input":[{"role":"system","content":"There are two men. One of them is wearing a red shirt, and the other is wearing a blue shirt. The two men are named Andrew and Bob, but we do not know which is Andrew and which is Bob. The guy in the blue shirt says, 'I am Andrew.' The guy in the red shirt says, 'I am Bob.' If we know that at least one of them lied, then what color shirt is Andrew wearing?"}],"ideal":"Andrew is wearing the red shirt."}
+{"input":[{"role":"system","content":"Which word does NOT belong with the others? A. index B. glossary C. chapter D. book"}],"ideal":"D. book"}
+{"input":[{"role":"system","content":"The day before yesterday, Chris was 7 years old. Next year he'll turn 10. How is this possible?"}],"ideal":"Assuming today is January 1st of any given year: Two days ago, on December 30th, Chris was 7 years old. On December 31st, Chris celebrated his 8th birthday. On December 31st of this year, Chris will celebrate his 9th birthday. By the end of this year, on December 31st, Chris will be 10 years old. Therefore, on January 1st of the following year, Chris will be 10 years old."}
+{"input":[{"role":"system","content":"Inhabitants of an island lie consistently on Tuesdays, Thursdays, and Saturdays, and they tell the truth on the other four days of the week. You have forgotten what day of the week it is, so you ask a passerby. 'Saturday,' he answers. 'And what day will it be tomorrow?' you inquire. 'Wednesday,' he replies. Can you tell what day it is today?"}],"ideal":"Based on answer 1, today cannot be M, W, F, Su, or Sa (lying day). Based on answer 2, today cannot be M, W, F, Su, or Tu (lying day). So, today must be Thursday."}
+{"input":[{"role":"system","content":"You are on an island populated by two tribes. Members of one tribe consistently lie. Members of the other tribe always tell the truth. Tribe members can recognize one another, but you can't tell them apart. You meet two people, C and D on the island. C says, 'Exactly one of us is from the liars tribe.' Which tribe is D from?"}],"ideal":"D is from the Liars tribe."}
+{"input":[{"role":"system","content":"There are five people in a room. Each person will either always tell the truth or always tell a lie. Each person is asked the following question: How many liars are among you? The answers are: \"one\", \"two\", \"three\", \"four\", \"five\". How many liars are in the room?"}],"ideal":"There are four liars."}
+```
+
+### Dataset attributions
+
+This work includes data from the Illinois Intentional Tort Qualitative Dataset, which was compiled by the Qualitative Reasoning Group at Northwestern University. The dataset is freely available under the Creative Commons Attribution 4.0 license from https://www.qrg.northwestern.edu/Resources/caselawcorpus.html
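Review note: the README describes two checks in words: `Match` passes when the completion starts with the "ideal" value, and `FuzzyMatch` passes when the normalized completion contains a normalized "ideal" value or the other way around. The sketch below is one possible reading of those two sentences, not the library's implementation. The function names are hypothetical, and the normalization (lowercase, drop ASCII punctuation, collapse whitespace) is an assumption, since the README does not define it.

```python
import string
from typing import List, Union

Ideal = Union[str, List[str]]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def _as_list(ideal: Ideal) -> List[str]:
    # "ideal" is a single string in some samples and a list of strings in others.
    return [ideal] if isinstance(ideal, str) else list(ideal)


def starts_with_ideal(completion: str, ideal: Ideal) -> bool:
    """Match-style check: the completion begins with one of the ideal answers."""
    return any(completion.startswith(i) for i in _as_list(ideal))


def fuzzy_match(completion: str, ideal: Ideal) -> bool:
    """FuzzyMatch-style check: containment in either direction after normalization."""
    c = normalize(completion)
    if not c:
        # Design choice of this sketch: an empty completion never matches,
        # even though the empty string is a substring of everything.
        return False
    return any(normalize(i) in c or c in normalize(i) for i in _as_list(ideal))


match_ok = starts_with_ideal("time.", "time")
match_bad = starts_with_ideal("A time", "time")
fuzzy_ok = fuzzy_match("It was Season 11.", ["Season 11", "11"])
fuzzy_bad = fuzzy_match("Season 10", ["Season 11"])
```

The two-way containment is what lets a terse completion such as `11` pass against the ideal `Season 11`, while the prefix check stays strict about leading text.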
diff --git a/evals/evals/registry/data/aba_mrpc_true_false/samples.jsonl b/evals/evals/registry/data/aba_mrpc_true_false/samples.jsonl
new file mode 100644
index 0000000000000000000000000000000000000000..9fcef1aaeee86b1e138ce52bdf47a030591171bb
--- /dev/null
+++ b/evals/evals/registry/data/aba_mrpc_true_false/samples.jsonl
@@ -0,0 +1,110 @@
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer with general experience is not considered competent to handle a case involving a specialized field of law."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "Newly admitted lawyers cannot be as competent as practitioners with long experience."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "In an emergency, a lawyer is allowed to provide assistance in a matter even if they do not have the ordinarily required skill."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer must always prepare for legal problems with which they are not familiar by seeking special training."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer cannot accept representation if they do not currently possess the required level of competence."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "An attorney may consult with lawyers outside of the attorney's firm on the client's case without telling the client."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "An attorney may consult with lawyers inside of the attorney's firm on the client's case without telling the client."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "When lawyers from more than one law firm are providing legal services to a client on a particular matter, they do not need to consult with each other and the client about the scope of their respective representations."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "Maintaining competence does not require a lawyer to keep abreast of changes in the law and its practice."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer cannot ethically represent a client if they disagree with the client's political, economic, social or moral views related to the matter."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may limit the scope of representation if it is reasonable and the client gives informed consent."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "In a criminal case, the lawyer has the final say in decisions regarding the client's plea, whether to waive jury trial, and whether the client will testify."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer can counsel a client to engage in conduct that the lawyer knows is criminal or fraudulent."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer can counsel a client regarding the consequences of conduct that the lawyer knows is criminal or fraudulent."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer's representation of a client does not imply endorsement of the client's views or activities."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is allowed to draft and deliver documents that they know are fraudulent in order to assist their client."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "When a lawyer discovers that their client's ongoing conduct is criminal or fraudulent, the lawyer must report the client to the authorities."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may provide limited representation to a client as long as the limitation is reasonable and the client is properly informed."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is prohibited from discussing the legal consequences of a proposed course of conduct with a client if that conduct is clearly criminal."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "If a lawyer knows that a client expects assistance not permitted by the Rules of Professional Conduct, the lawyer must immediately withdraw from the representation."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is permitted to charge a contingent fee in a domestic relations matter, where the payment is contingent upon the securing of a divorce or the amount of alimony or support, when the client is unable to pay otherwise."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer must communicate the scope of representation and the basis or rate of the fee and expenses to the client in writing before the representation commences in all cases."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may charge a client an unreasonable fee or an unreasonable amount for expenses, as long as the client agrees to it."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is permitted to enter into a fee arrangement that might induce the lawyer to curtail services for the client or perform them in a way contrary to the client's interest."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A contingent fee agreement must be orally agreed upon between the lawyer and client, stating the method by which the fee is to be determined."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may accept property in payment for services, such as an ownership interest in an enterprise, without any limitations or restrictions."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may divide a fee with another lawyer who is not in the same firm, even if the client does not agree to the arrangement."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is required to return any unearned portion of a fee if the client fires them without warning."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may charge a contingent fee for representing a defendant in a criminal case."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "When determining the reasonableness of a fee, factors such as the time and labor required, the novelty and difficulty of the questions involved, and the skill requisite to perform the legal service properly should be considered."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is allowed to reveal information relating to the representation of a client if they believe it is necessary to prevent the client from committing a crime."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is allowed to reveal information relating to the representation of a client if they believe it is necessary to prevent the client from committing a crime that may result in financial injury to another person."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is required to reveal information relating to the representation of a client if it is necessary to prevent the client from committing a crime that may result in financial injury to another person."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may reveal information relating to the representation of a client to detect and resolve conflicts of interest arising from changes in the composition or ownership of a firm, even if the revealed information would compromise the attorney-client privilege."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may disclose information relating to the representation of a client if the disclosure is impliedly authorized to carry out the representation."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is prohibited from revealing any information relating to the representation of a client, regardless of the circumstances."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may disclose information relating to the representation of a client to secure legal advice about the lawyer's compliance with the Rules of Professional Conduct."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is allowed to reveal information relating to the representation of a client to establish a defense in a controversy between the lawyer and the client."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer must disclose unprotected information relating to the representation of a client, even if the disclosure could reasonably lead to the discovery of protected information by a third party."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is allowed to disclose information relating to a client's representation to detect and resolve conflicts of interest without the client's consent."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer must take reasonable efforts to prevent unauthorized access to or disclosure of client information, even if the lawyer is not at fault for the unauthorized access or disclosure."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is not required to take any special security measures when transmitting client information if the method of communication affords a reasonable expectation of privacy."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "The duty of confidentiality ends when the client-lawyer relationship is terminated."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer must always appeal an adverse ruling regarding disclosure of client information, regardless of the client's wishes."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is permitted to disclose client information if a court or other tribunal orders the disclosure, even without the client's informed consent."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may represent a client with a concurrent conflict of interest without obtaining informed consent from each affected client, as long as the lawyer reasonably believes they can provide competent and diligent representation."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is allowed to represent a client if the representation involves asserting a claim by one client against another client represented by the same lawyer in the same litigation."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may represent clients with economically adverse interests in unrelated matters without obtaining their informed consent."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer's duty of loyalty and independent judgment can be materially limited by responsibilities to a former client."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is permitted to represent a client in a transaction where the lawyer's own conduct is in question, as long as the lawyer can provide competent and diligent representation."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may engage in sexual relationships with a client if the relationship predates the formation of the client-lawyer relationship."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer who is closely related by blood or marriage to another lawyer may represent a client in a matter where the other lawyer represents another party, without obtaining informed consent from each client."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may accept payment from a source other than the client if the client is informed and consents, and the arrangement does not compromise the lawyer's duty of loyalty or independent judgment."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A parent who pays for the lawyer's services on behalf of a child may direct or control some of the legal strategy decisions made."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "In all cases, clients can consent to a representation where a conflict of interest exists."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "Representation is prohibited if the lawyer cannot reasonably conclude that they will be able to provide competent and diligent representation."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "Informed consent requires the client to be aware of the possible effects on loyalty, confidentiality, and the attorney-client privilege when multiple clients are represented in a single matter."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "A client who has given consent to a conflict cannot revoke the consent or terminate the lawyer's representation at any time."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "A client who has given consent to a conflict cannot revoke the consent or terminate the lawyer's representation if the lawyer reasonably and honestly believes the timing would harm the client's interests."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "General and open-ended advance consent to future conflicts is considered effective."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "A lawyer may not take inconsistent legal positions in different tribunals at different times on behalf of different clients."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the information provided about conflict of interest."}, {"role": "user", "content": "A lawyer is required to obtain informed consent from a client, confirmed in writing, when there is a potential conflict of interest."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is prohibited from representing multiple parties to a negotiation if their interests are fundamentally antagonistic to each other."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer can represent multiple clients with generally aligned interests even if there are some differences in interest among them."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer must always maintain impartiality between commonly represented clients."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "As between commonly represented clients, the attorney-client privilege does not attach."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer representing an organization also represents all of its affiliated organizations, such as parent and subsidiary companies."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer who is a member of a corporation's board of directors must resign from the board or cease acting as the corporation's lawyer when a conflict of interest arises."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "In common representation, if one client asks the lawyer not to disclose information relevant to the representation to the other client, the lawyer must withdraw from representing both clients."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer representing multiple clients in the same matter should consider the potential additional cost, embarrassment, and recrimination if the common representation fails."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "When seeking to establish or adjust a relationship between clients, the lawyer's role is that of partisanship normally expected in other circumstances."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "In a law firm, a lawyer's disqualification due to a personal interest will result in the disqualification of all other lawyers in the firm."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer who moves from one firm to another can be screened from participation in a matter, and the new firm can represent a client with adverse interests without obtaining the former client's informed consent."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A law firm is prohibited from representing a client with interests adverse to those of a client represented by a formerly associated lawyer."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A disqualification prescribed by Rule 1.10 may be waived by the affected client under the conditions stated in Rule 1.7."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer's disqualification based on prior work as a law student will result in the disqualification of all other lawyers in the firm."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "An attorney is allowed to touch and move contraband on behalf of the client."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "In ex parte proceedings, an attorney is not required to reveal information that may be harmful to their client's case."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A prosecutor is obligated to timely disclose favorable evidence to the defense, even if it is inadmissible or has no impact on the outcome of the case."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "It is acceptable for an attorney to communicate directly with a person who is represented by counsel on a specific matter without the consent of their counsel."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A prosecutor has a duty to protect the accused's right to counsel."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer can make false statements of fact to adversaries and third parties."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "An attorney must not act with the sole purpose of delaying, burdening, or embarrassing other parties while obtaining evidence."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "Lawyers are allowed to make out-of-court statements that are completely true, but that they reasonably should know have a substantial likelihood of materially prejudicing the case."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "It is acceptable for a prosecutor to make true comments that have a substantial likelihood of heightening public condemnation of the accused."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer must self-report when they know they have violated the rules of professional conduct."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "Attorneys have an affirmative duty to expedite cases and should not delay cases for their own personal gain or convenience."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A licensed attorney may practice law in a jurisdiction where they are not licensed in an emergency situation if it can avoid a substantial injustice."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "An attorney must report any attorney or judge's violation of the Rules if it raises a substantial question as to their honesty, trustworthiness, or fitness as a lawyer."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A subordinate lawyer who follows an order to take an action in violation of the Rules is not subject to discipline if the ethical responsibility is debatable."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "An attorney has no duty to follow valid procedural rules and court orders that they reasonably believe are defective or invalid."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "Attorneys must not engage in conduct involving dishonesty, fraud, deceit, or misrepresentation, even in their private business or personal life that is unrelated to the practice of law."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "It is permissible for an attorney to talk to members of the jury before or during a trial, so long as the discussion is not about the trial."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "An attorney may be disciplined for failing to prevent ethical violations of other members of their law firm."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer may sell certain cases of their law practice and retain others, depending on how valuable they are."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer who sells their entire practice may subsequently work as in-house counsel for a business."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "During the sale of a law practice, the seller must obtain client consent before sharing detailed information about a client's case with the potential buyer."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "If a client does not respond within 60 days after receiving notice of the proposed sale of their lawyer's practice, their consent to the transfer of their files is presumed."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "If a client does not respond within 90 days after receiving notice of the proposed sale of their lawyer's practice, their consent to the transfer of their files is presumed."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "It is a violation of the ABA Model Rules of Professional Conduct for a lawyer to sell only a specific area of their law practice."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer who sells an area of their practice must cease accepting any matters in that area, including as counsel or co-counsel, after the sale."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer selling their practice is allowed to share client confidences with the potential buyer without client consent during preliminary negotiations."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer is required to render pro bono legal services each year."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer should aspire to render at least 10 hours of pro bono legal services each year."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer should aspire to render at least 25 hours of pro bono legal services each year."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer should aspire to render at least 50 hours of pro bono legal services each year."}], "ideal": "True"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer should aspire to render at least 100 hours of pro bono legal services each year."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer shall not represent anyone in connection with a matter in which the lawyer participated personally and substantially as a judge or other adjudicative officer or law clerk to such a person or as an arbitrator, mediator or other third-party neutral, unless 3 years have passed after the lawyer's last point of involvement."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer shall not represent anyone in connection with a matter in which the lawyer participated personally and substantially as a judge or other adjudicative officer or law clerk to such a person or as an arbitrator, mediator or other third-party neutral, unless 5 years have passed after the lawyer's last point of involvement."}], "ideal": "False"}
+{"input": [{"role": "system", "content": "You are LawStudentGPT. Answer the following True/False question according to the ABA Model Rules of Professional Conduct."}, {"role": "user", "content": "A lawyer shall not represent anyone in connection with a matter in which the lawyer participated personally and substantially as a judge or other adjudicative officer or law clerk to such a person or as an arbitrator, mediator or other third-party neutral, unless 10 years have passed after the lawyer's last point of involvement."}], "ideal": "False"}
\ No newline at end of file
diff --git a/evals/evals/registry/data/actors-sequence/samples.jsonl b/evals/evals/registry/data/actors-sequence/samples.jsonl
new file mode 100644
index 0000000000000000000000000000000000000000..f18f6ee1e1cb30a47eea3486e730ab4f55416b56
--- /dev/null
+++ b/evals/evals/registry/data/actors-sequence/samples.jsonl
@@ -0,0 +1,100 @@
+{"input": [{"role": "system", "content": "This is an exchange between Merlin, Arthur and Lancelot. Merlin outputs short tests. First output is: '1+2', second is: '3+4', ... Arthur outputs the sum of last two digits that he had seen. Lancelot outputs an increasing integers number, starting from 0. The sequence in which the actors act: MAMALAMA. Output that exchange using the following syntax (including the trailing comma): Actor: