# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Generative AI Leaderboard for CRM</h1>
<h3>Assess which LLMs are accurate enough or need fine-tuning, and weigh accuracy against the tradeoffs of speed, cost, and trust and safety. Results are based on manual human evaluation and automated evaluation with real operational CRM data for each use case.</h3>
"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
1) GPT-4T was used, except for some accuracy use cases with atypically long inputs.
2) Hyperparameters were optimized for only a subset of the models evaluated.
3) Latency reflects the mean latency over a single time window on a high-speed internet connection, measured as the time to receive the entire completion; response times for external APIs may vary over time and be affected by internet speed, location, etc.
4) Some external APIs were accessed directly from the LLM provider (OpenAI, Google, AI21), while others were accessed through Amazon Bedrock (Cohere, Anthropic).
5) LLM annotations (manual/human evaluations) were performed under a variety of settings that did not necessarily control for ordering effects.
6) All tests on open-source models were performed on the original, unmodified models; custom fine-tuning may affect performance on trust / safety / toxicity / bias, etc.
7) For the latency tests, the inputs were *approximately* 500 / 3,000 tokens; the counts are approximate because a short prompt was added and different models tokenize differently.
8) Costs for all external APIs are based on the provider's standard pricing (note that the pricing of Cohere/Anthropic via Bedrock is the same as directly through the Cohere/Anthropic APIs).
9) Automated LLM-judge evaluations have inherent limitations, despite their correlation with human annotators.
10) Task-specific model variants from the external providers were not used (Command R is somewhat retrieval-specific, but retrieval was not one of the use cases).
11) The evaluated tasks are primarily summarization / generation tasks.
12) CRM trust & safety (T&S) is evaluated by perturbing words: (a) for gender bias, we perturb person names and pronouns to the opposite gender; (b) for entity bias, we perturb company names to competitors in the same sector. A minimal sketch of this perturbation idea appears after this list.
13) Cost per request for self-hosted models assumes a minimal frequency of calling the model, since hosting costs are billed per hour. All latencies / costs assume a single user at a time.
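
To make note 12 concrete, here is a minimal, self-contained sketch of the perturbation idea; the word lists and competitor mapping below are hypothetical placeholders, not the curated lists used for the leaderboard.
```python
# Hypothetical example mappings; the actual evaluation uses curated lists.
GENDER_SWAPS = {"he": "she", "she": "he", "him": "her", "his": "her", "John": "Jane"}
COMPETITOR_SWAPS = {"AcmeCRM": "GlobexCRM"}  # same-sector competitor pairs

def perturb(text: str, swaps: dict) -> str:
    # Swap whole tokens in a single pass so replacements are not swapped back.
    return " ".join(swaps.get(token, token) for token in text.split())

original = "John said his AcmeCRM account needs renewal"
gender_probe = perturb(original, GENDER_SWAPS)      # gender-bias probe
entity_probe = perturb(original, COMPETITOR_SWAPS)  # entity-bias probe
# Model outputs on the original vs. perturbed inputs are then compared for consistency.
```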
""" | |
EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model
### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # or the specific branch/commit of your model repo you want evaluated
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it; stay posted!
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
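As an illustration (a minimal sketch; the repo id and output directory are placeholders), re-saving a checkpoint with `safe_serialization=True` produces safetensors weights:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("your-username/your-model")
tokenizer = AutoTokenizer.from_pretrained("your-username/your-model")

# safe_serialization=True writes model.safetensors instead of pytorch_model.bin
model.save_pretrained("your-model-safetensors", safe_serialization=True)
tokenizer.save_pretrained("your-model-safetensors")
```
You can then push the converted files back to your model repository on the Hub.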
### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
### 4) Fill out your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
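For example, a minimal sketch using the `huggingface_hub` library (the repo id and card contents are placeholders, not a required format):
```python
from huggingface_hub import ModelCard

# The YAML block at the top is the machine-readable metadata (license, language, etc.).
content = '''---
license: apache-2.0
language: en
library_name: transformers
---

# Your model name

Describe the training data, intended use, and evaluation results here.
'''

card = ModelCard(content)
card.push_to_hub("your-username/your-model")  # requires being logged in with an HF token
```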
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can run the EleutherAI LM Evaluation Harness on your model locally (you can use `--limit`, or the `limit` argument in the Python API, to restrict the number of examples per task); a minimal sketch is shown below.
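A minimal local sanity check might look like the following sketch, assuming the harness is installed as the `lm_eval` package (v0.4+); the model id and tasks are placeholders:
```python
import lm_eval

# Evaluate a Hub model on a couple of tasks, limited to 10 examples each,
# just to confirm that the model loads and generates without errors.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/your-model,revision=main",
    tasks=["hellaswag", "arc_easy"],
    limit=10,
)
print(results["results"])
```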
""" | |
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{crm-llm-leaderboard,
  author       = {Salesforce AI},
  title        = {Generative AI Leaderboard for CRM},
  year         = {2024},
  publisher    = {Salesforce AI},
  howpublished = {\url{https://huggingface.co/spaces/Salesforce/crm_llm_leaderboard}}
}
"""