pavlichenko committed on
Commit fcf7e29
1 Parent(s): a084e60

Update app.py

Files changed (1): app.py (+18 -11)
app.py CHANGED

@@ -2,6 +2,7 @@ import streamlit as st
 import requests
 from collections import defaultdict
 import pandas as pd
+import plotly.graph_objects as go
 
 
 header = """Toloka compared and ranked LLM output in multiple categories, using Guanaco 13B as the baseline.
@@ -25,19 +26,23 @@ The alternative is to use open-source prompts, but they are not reliable enough
 
 To mitigate these issues, we collected organic prompts sent to ChatGPT (some were submitted by Toloka employees, and some we found on the internet, but all of them were from real conversations with ChatGPT). These prompts are the key to accurate evaluation — **we can be certain that the prompts represent real-world use cases, and they were not used in any LLM training sets.** We store the dataset securely and reserve it solely for use in this particular evaluation.
 
-After collecting the prompts, we manually classified them by category and got the following distribution:
+After collecting the prompts, we manually classified them by category and got the following distribution:"""
 
-* Brainstorming: 15.48%
-* Chat: 1.59%
-* Classification: 0.2%
-* Closed QA: 3.77%
-* Extraction: 0.6%
-* Generation: 38.29%
-* Open QA: 32.94%
-* Rewrite: 5.16%
-* Summarization: 1.98%
+# * Brainstorming: 15.48%
+# * Chat: 1.59%
+# * Classification: 0.2%
+# * Closed QA: 3.77%
+# * Extraction: 0.6%
+# * Generation: 38.29%
+# * Open QA: 32.94%
+# * Rewrite: 5.16%
+# * Summarization: 1.98%
 
-We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as [HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval).
+fig = go.Figure(
+    data=[go.Bar(y=[38.29, 32.94, 15.48, 5.16, 3.77, 1.98, 1.59, 0.6, 0.2], x=["Generation", "Open QA", "Brainstorming", "Rewrite", "Closed QA", "Summarization", "Chat", "Extraction", "Classification"])],
+)
+
+description2 = """We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as [HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval).
 
 
 #### 🧠 Stage 2: Human evaluation
@@ -142,6 +147,8 @@ st.dataframe(
     }
 )
 st.markdown(description)
+st.plotly_chart(fig, theme="streamlit")
+st.markdown(description2)
 st.link_button('🚀 Evaluate my model', url='https://toloka.ai/talk-to-us/')
 prompt_examples = """
 ### 🔍 Prompt Examples
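Taken together, the commit closes the markdown string right after "distribution:", builds a Plotly bar chart of the prompt-category shares, and renders it between the two blocks of prose via `st.plotly_chart`. The sketch below is a minimal, self-contained reproduction of that pattern, not the Space's actual app.py: the numbers, category names, and the `theme="streamlit"` argument come from the diff above, while the `categories`/`shares` variable names, the shortened text, and the axis label are illustrative additions.

```python
# Minimal sketch of the pattern this commit introduces: markdown, then a
# Plotly bar chart of the prompt-category distribution, then more markdown.
# Percentages and category names are taken from the diff; variable names,
# the shortened strings, and the axis title are illustrative only.
import plotly.graph_objects as go
import streamlit as st

categories = ["Generation", "Open QA", "Brainstorming", "Rewrite", "Closed QA",
              "Summarization", "Chat", "Extraction", "Classification"]
shares = [38.29, 32.94, 15.48, 5.16, 3.77, 1.98, 1.59, 0.6, 0.2]

description = "After collecting the prompts, we manually classified them by category:"
description2 = "We intentionally excluded prompts about coding."

fig = go.Figure(data=[go.Bar(x=categories, y=shares)])
fig.update_layout(yaxis_title="Share of prompts, %")  # optional axis label

st.markdown(description)                 # prose before the chart
st.plotly_chart(fig, theme="streamlit")  # chart styled to match the Streamlit app
st.markdown(description2)                # prose after the chart
```

Splitting the text into two strings is what lets the chart sit in the middle of the prose rather than after it; rendering order in Streamlit simply follows the order of the `st.*` calls.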