Commit 9990990
1 Parent(s): 7bb4253
Updating explanations
Dockerfile
CHANGED
@@ -6,8 +6,17 @@ ENV PATH="/home/user/.local/bin:$PATH"
 
 WORKDIR /app
 
+# Install python packages
 COPY --chown=user ./requirements.txt requirements.txt
 RUN pip install --no-cache-dir --upgrade -r requirements.txt
 
+## Download all npm dependencies and compile frontend
+#RUN reflex export --frontend-only --no-zip && mv .web/_static/* /srv/ && rm -rf .web
+#
+## Needed until Reflex properly passes SIGTERM on backend.
+#STOPSIGNAL SIGKILL
+
+ENV PATH="/app/prompt_order_experiment:$PATH"
+
 COPY --chown=user . /app
-ENTRYPOINT ["reflex", "run", "--env", "
+ENTRYPOINT ["reflex", "run", "--env", "dev", "--loglevel", "debug" ]
prompt_order_exeriment/pages/__init__.py
ADDED
@@ -0,0 +1,3 @@
+from .index import page as index
+from .overview import page as overview
+from .results import page as results
prompt_order_exeriment/pages/index.py
ADDED
@@ -0,0 +1,79 @@
+import reflex as rx
+
+from ..sidebar import sidebar
+
+p1 = '''
+# Prompt Order Experiment
+This experiment aims to explore various scenarios for **prompt fine-tuning** using structured generation. We'll test how the order of elements in a prompt affects model performance. The elements we consider are:
+- **(Q)**: Question
+- **(AC)**: Answer Choices
+- **(R)**: Reasoning
+- **(FA)**: Final Answer
+
+## Scenarios
+We will evaluate the following prompt orders:
+
+### **Scenario 1: Q - AC - R - FA** (Mistral and GPT-3.5)
+
+This is the most natural order. The model generates reasoning before the final answer, providing the most information prior to making a selection. This order leverages decoding mechanics effectively.
+
+This is our user message; it contains the question and the answer choices.
+
+<details>
+<summary>Click to show prompt!</summary>
+
+```text
+<seq>[INST] Answer the Question and include your Reasoning and the Final Answer in a json like: {"Reasoning: "...", "Final Answer": "x"} where x is a letter that corresponds to the answer choice which is a letter between a and h.
+Question: What is satellite technology used for predicting?
+Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius[/INST]
+```
+
+This is our assistant message. We force a JSON response (spacing added here for readability) and put the reasoning first. Fine-tuning on JSON should improve our structured generation results, since the model gets used to responding in that "space".
+```text
+{
+"Reasoning": "a) Seconds and minutes: This option is incorrect because satellite technology is not used for predicting time intervals. Satellite technology is used for various purposes such as communication, navigation, and weather forecasting, but it is not used for predicting time intervals.\n\nb) The strength and magnitude of an earthquake: This option is incorrect because satellite technology is not used for predicting earthquakes. Earthquake prediction is a complex process that involves seismology and other scientific methods, but satellite technology is not one of them.\n\nd) 70-75 degrees Fahrenheit: This option is incorrect because satellite technology is not used for predicting specific temperature ranges. While satellite technology can provide temperature data, it is not used for predicting specific temperature ranges.\n\ne) Rapid changes occur: This option is too vague and does not provide enough information to determine whether it is correct or not. Satellite technology can be used to monitor changes in various environmental factors, but it is not used specifically for predicting rapid changes.\n\nf) Dead-ends and false starts: This option is incorrect because it is not related to satellite technology or any type of prediction.\n\ng) Snow, ice, and rock: This option is incorrect because it is too specific and does not cover the broad range of predictions that satellite technology can be used for. While satellite technology can be used to monitor snow, ice, and rock formations, it is not used exclusively for this purpose.\n\nh) Around 5 to 27 degrees Celsius: This option is incorrect because it is too specific and does not cover the broad range of temperature predictions that satellite technology can be used for. While satellite technology can provide temperature data, it is not used exclusively for predicting temperatures within a specific range.\n\nTherefore, the correct answer is c) what it's like outside each day, as satellite technology is commonly used for weather forecasting and predicting daily weather conditions.",
+"Final Answer": "c"
+}</seq>
+```
+</details>
+
+### **Scenario 2: Q - AC - FA - R** (Mistral and GPT-3.5)
+
+An awkward order that places the reasoning after the final answer. It is faster, but it assumes the model can "know" its reasoning internally before generating it. This approach saves tokens, but we are skeptical of it, so it is worth testing.
+
+<details>
+<summary>Click to show prompt!</summary>
+```text
+<seq>[INST] Answer the Question and include your Final Answer and the Reasoning in a json like: {"Final Answer": "x", "Reasoning: "..."} where x is a letter that corresponds to the answer choice which is a letter between a and h.
+Question: What is satellite technology used for predicting?
+Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius[/INST]
+```
+
+```text
+{
+'Final Answer': 'c',
+'Reasoning': "a) Seconds and minutes: This option is incorrect because satellite technology is not used for predicting time intervals. Satellite technology is used for various purposes such as communication, navigation, and weather forecasting, but it is not used for predicting time intervals.\n\nb) The strength and magnitude of an earthquake: This option is incorrect because satellite technology is not used for predicting earthquakes. Earthquake prediction is a complex process that involves seismology and other scientific methods, but satellite technology is not one of them.\n\nd) 70-75 degrees Fahrenheit: This option is incorrect because satellite technology is not used for predicting specific temperature ranges. While satellite technology can provide temperature data, it is not used for predicting specific temperature ranges.\n\ne) Rapid changes occur: This option is too vague and does not provide enough information to determine whether it is correct or not. Satellite technology can be used to monitor changes in various environmental factors, but it is not used specifically for predicting rapid changes.\n\nf) Dead-ends and false starts: This option is incorrect because it is not related to satellite technology or any type of prediction.\n\ng) Snow, ice, and rock: This option is incorrect because it is too specific and does not cover the broad range of predictions that satellite technology can be used for. While satellite technology can be used to monitor snow, ice, and rock formations, it is not used exclusively for this purpose.\n\nh) Around 5 to 27 degrees Celsius: This option is incorrect because it is too specific and does not cover the broad range of temperature predictions that satellite technology can be used for. While satellite technology can provide temperature data, it is not used exclusively for predicting temperatures within a specific range.\n\nTherefore, the correct answer is c) what it's like outside each day, as satellite technology is commonly used for weather forecasting and predicting daily weather conditions."
+}</seq>
+```
+</details>
+
+### **Scenario 3: Q - AC - FA**
+
+This serves as a fine-tuning control. No reasoning is provided in the output.
+
+### **Scenario 4: Base**
+
+An un-fine-tuned control for comparison purposes.
+
+### Structured Generation
+Structured generation ensures consistent response formats, which is crucial for reliable fine-tuning. Initial experiments struggled with response consistency, and structured generation can solve this.
+'''
+
+
+def page():
+    return rx.hstack(
+        sidebar(),
+        rx.vstack(
+            rx.markdown(p1),
+        )
+    )
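Review note: the two fine-tuning scenarios described in `p1` differ only in the key order of the assistant JSON. The following is a minimal sketch of how such targets could be serialized, not the code used to build the dataset. The helper names `build_user_prompt` and `build_target` are hypothetical, and the instruction wording is illustrative rather than the exact training prompt.

```python
import json


def build_user_prompt(question: str, choices: list[str], reasoning_first: bool) -> str:
    """Format the user turn (illustrative wording, not the exact training prompt)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    keys = ["Reasoning", "Final Answer"] if reasoning_first else ["Final Answer", "Reasoning"]
    choice_text = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return (
        f"Answer the Question and include your {keys[0]} and the {keys[1]} in a JSON object.\n"
        f"Question: {question}\n"
        f"Answer Choices: {choice_text}"
    )


def build_target(reasoning: str, final_answer: str, reasoning_first: bool) -> str:
    """Serialize the assistant turn; dict insertion order determines the key order."""
    if reasoning_first:
        payload = {"Reasoning": reasoning, "Final Answer": final_answer}
    else:
        payload = {"Final Answer": final_answer, "Reasoning": reasoning}
    return json.dumps(payload)


# Hypothetical example values
rfa = build_target("Satellites support weather forecasting.", "c", reasoning_first=True)
far = build_target("Satellites support weather forecasting.", "c", reasoning_first=False)
print(rfa)
print(far)
```

Because `json.dumps` preserves dictionary insertion order, the same reasoning text and answer yield both the R - FA and the FA - R targets, which keeps the two scenarios comparable.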
prompt_order_exeriment/pages/overview.py
ADDED
@@ -0,0 +1,46 @@
+import reflex as rx
+
+from ..sidebar import sidebar
+
+p2 = '''
+# Steps
+### Dataset Selection
+We begin with the [layoric/labeled-multiple-choice-explained](https://huggingface.co/datasets/layoric/labeled-multiple-choice-explained) dataset, which includes reasoning provided by GPT-3.5-turbo. These reasoning explanations serve as a starting point but may differ from Mistral's reasoning style.
+
+0. *[00-poe-generate-mistral-reasoning.ipynb](https://huggingface.co/derek-thomas/prompt-order-experiment/blob/main/00-poe-generate-mistral-reasoning.ipynb)*: To align with Mistral, we need to create a refined dataset: [derek-thomas/labeled-multiple-choice-explained-mistral-reasoning](https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-mistral-reasoning).
+1. *[01-poe-dataset-creation.ipynb](https://huggingface.co/derek-thomas/prompt-order-experiment/blob/main/01-poe-dataset-creation.ipynb)*: Then we need to create our prompt experiments.
+2. *[02-autotrain.ipynb](https://huggingface.co/derek-thomas/prompt-order-experiment/blob/main/02-autotrain.ipynb)*: We generate autotrain jobs on spaces to train our models.
+3. *[03-poe-token-count-exploration.ipynb](https://huggingface.co/derek-thomas/prompt-order-experiment/blob/main/03-poe-token-count-exploration.ipynb)*: We do some quick analysis so we can optimize our TGI settings.
+4. *[04-poe-eval.ipynb](https://huggingface.co/derek-thomas/prompt-order-experiment/blob/main/04-poe-eval.ipynb)*: We finally evaluate our trained models.
+
+**The flowchart is _Clickable_**
+'''
+
+
+def mermaid_svg():
+    with open('assets/prompt-order-experiment.svg', 'r') as file:
+        svg_content = file.read()
+
+    return rx.html(
+        f'<div style="width: 300%; height: auto;">{svg_content}</div>'
+    )
+
+
+def overview():
+    return rx.hstack(
+        sidebar(),
+        rx.vstack(
+            rx.markdown(p2),
+            mermaid_svg(),
+        )
+    )
+
+
+def page():
+    return rx.hstack(
+        sidebar(),
+        rx.vstack(
+            rx.markdown(p2),
+            mermaid_svg(),
+        )
+    )
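Review note: `mermaid_svg` opens `assets/prompt-order-experiment.svg` relative to the process working directory, so it depends on where the app is started. Below is a small sketch of the same wrapping step with the path passed in explicitly. The function name `svg_wrapper_html` and the demo file are hypothetical, and the sketch only builds the HTML string; it does not call Reflex.

```python
import tempfile
from pathlib import Path


def svg_wrapper_html(svg_path: Path, width: str = "300%") -> str:
    """Read an SVG file and wrap its markup in a sized <div>, as mermaid_svg does."""
    svg_content = svg_path.read_text(encoding="utf-8")
    return f'<div style="width: {width}; height: auto;">{svg_content}</div>'


# Demo with a throwaway SVG file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.svg"
    demo.write_text("<svg></svg>", encoding="utf-8")
    html = svg_wrapper_html(demo)

print(html)
```

Passing a `Path` makes it possible to resolve the asset against a known base directory instead of the current working directory, should that ever become a problem in the container.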
prompt_order_exeriment/pages/results.py
ADDED
@@ -0,0 +1,116 @@
+import reflex as rx
+import pandas as pd
+import plotly.express as px
+from reflex_ag_grid import ag_grid
+from sklearn.metrics import accuracy_score
+from ..sidebar import sidebar
+from datasets import load_dataset
+
+chart_md = """
+Make sure you explore what happened between:
+- Base Model -> Final Answer
+- Base Model -> Reasoning (both models) -> Final Answer
+- Base Model -> Final Answer -> Reasoning (both models)
+"""
+
+# Load the HF dataset
+dataset = load_dataset("derek-thomas/labeled-multiple-choice-explained-mistral-results")
+
+# Convert the dataset to a Pandas DataFrame
+df = dataset['train'].to_pandas()
+
+# Columns to analyze
+cols_to_analyze = [
+    "predictions_base",
+    "predictions_FA",
+    "predictions_RFA_mistral",
+    "predictions_FAR_mistral",
+    "predictions_RFA_gpt3_5",
+    "predictions_FAR_gpt3_5",
+]
+
+# Mapping for renaming models
+model_names = {
+    "predictions_base": "Base Model",
+    "predictions_FA": "Final Answer",
+    "predictions_RFA_mistral": "Reasoning (Mistral) -> Final Answer",
+    "predictions_FAR_mistral": "Final Answer -> Reasoning (Mistral)",
+    "predictions_RFA_gpt3_5": "Reasoning (GPT-3.5) -> Final Answer",
+    "predictions_FAR_gpt3_5": "Final Answer -> Reasoning (GPT-3.5)",
+}
+
+# Compute metrics for each model
+metrics_data = []
+for col in cols_to_analyze:
+    accuracy = round(accuracy_score(df["answer_key"], df[col]) * 100, 2)
+    metrics_data.append({"Prediction Type": model_names[col], "Accuracy (%)": accuracy})
+
+# Create a DataFrame for metrics
+metrics_df = pd.DataFrame(metrics_data)
+
+# Column definitions for the metrics table
+metrics_column_defs = [
+    ag_grid.column_def(field="Prediction Type", header_name="Prediction Type", width=250),
+    ag_grid.column_def(field="Accuracy (%)", header_name="Accuracy (%)"),
+]
+
+
+# Function to generate the topic performance star chart
+def topic_star_chart():
+    # Calculate per-topic accuracy
+    topic_accuracy = []
+    for topic in df["topic"].unique():
+        topic_data = df[df["topic"] == topic]
+        for col in cols_to_analyze:
+            accuracy = round((topic_data[col] == topic_data["answer_key"]).mean() * 100, 2)
+            topic_accuracy.append({"Topic": topic, "Prediction Type": model_names[col], "Accuracy (%)": accuracy})
+
+    # Create DataFrame for visualization
+    topic_df = pd.DataFrame(topic_accuracy)
+
+    # Find the top 10 topics by number of rows
+    topic_counts = df["topic"].value_counts().head(10).index
+    filtered_topic_df = topic_df[topic_df["Topic"].isin(topic_counts)]
+
+    # Create star chart (radar chart)
+    fig = px.line_polar(
+        filtered_topic_df,
+        r="Accuracy (%)",
+        theta="Topic",
+        color="Prediction Type",
+        title="Top 10 Topics: Per-Topic Performance Star Chart",
+        line_close=True,  # Close the lines to form a star shape
+    )
+    fig.update_layout(width=900, height=900)
+
+    return fig
+
+
+def page():
+    return rx.hstack(
+        sidebar(),
+        rx.vstack(
+            rx.heading("Results", size="9", margin="20px 0"),
+            rx.markdown("Here we have a sortable table of our experiments and the results"),
+            ag_grid(
+                id="ag_grid_metrics",
+                row_data=metrics_df.to_dict("records"),
+                column_defs=metrics_column_defs,
+                width="60%",
+                margin="20px auto",  # Center the table
+                size_columns_to_fit=True
+            ),
+            rx.markdown("\n---\n"),
+            rx.divider(),
+            rx.heading("Performance Star Chart", size="8", margin="20px 0"),
+            rx.text(
+                "The chart below shows how each model performed across the top 10 topics by row count. "
+                "Each line represents a model, and the radial axis represents accuracy.",
+                font_size="md",
+                padding="10px",
+            ),
+            rx.markdown(chart_md),
+            rx.plotly(data=topic_star_chart()),  # Render the radar chart
+            padding="20px",
+        )
+    )
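Review note: the metrics in `results.py` are plain exact-match accuracy, computed once over all rows and once per topic. The sketch below reproduces that arithmetic in pure Python so it can be checked by hand. The rows are hypothetical toy data, not values from the results dataset, and only two of the six prediction columns are shown.

```python
from collections import defaultdict

# Hypothetical toy rows; the real app loads these columns from the results dataset.
rows = [
    {"topic": "weather", "answer_key": "c", "predictions_base": "a", "predictions_FA": "c"},
    {"topic": "weather", "answer_key": "b", "predictions_base": "b", "predictions_FA": "b"},
    {"topic": "biology", "answer_key": "d", "predictions_base": "a", "predictions_FA": "d"},
    {"topic": "biology", "answer_key": "a", "predictions_base": "c", "predictions_FA": "b"},
]
cols = ["predictions_base", "predictions_FA"]


def accuracy(subset, col):
    """Percentage of rows whose prediction equals the answer key, rounded to 2 places."""
    hits = sum(1 for r in subset if r[col] == r["answer_key"])
    return round(hits / len(subset) * 100, 2)


# Overall accuracy per prediction column (what the metrics table shows)
overall = {col: accuracy(rows, col) for col in cols}

# Per-topic accuracy (what the star chart plots)
by_topic = defaultdict(list)
for r in rows:
    by_topic[r["topic"]].append(r)
per_topic = {(t, col): accuracy(sub, col) for t, sub in by_topic.items() for col in cols}

print(overall)
print(per_topic)
```

With real data, topics that have very few rows produce noisy per-topic percentages, which is one reason to restrict the chart to the ten largest topics as the commit does.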
prompt_order_exeriment/prompt_order_exeriment.py
CHANGED
@@ -1,220 +1,9 @@
-import pandas as pd
-import plotly.express as px
 import reflex as rx
-from datasets import load_dataset
-from reflex_ag_grid import ag_grid
-from sklearn.metrics import accuracy_score
-
-from .sidebar import sidebar
-
-# Load the HF dataset
-dataset = load_dataset("derek-thomas/labeled-multiple-choice-explained-mistral-results")
-
-# Convert the dataset to a Pandas DataFrame
-df = dataset['train'].to_pandas()
-
-# Columns to analyze
-cols_to_analyze = [
-    "predictions_base",
-    "predictions_FA",
-    "predictions_RFA_mistral",
-    "predictions_FAR_mistral",
-    "predictions_RFA_gpt3_5",
-    "predictions_FAR_gpt3_5",
-]
-
-# Mapping for renaming models
-model_names = {
-    "predictions_base": "Base Model",
-    "predictions_FA": "Final Answer",
-    "predictions_RFA_mistral": "Reasoning (Mistral) -> Final Answer)",
-    "predictions_FAR_mistral": "Final Answer -> Reasoning (Mistral)",
-    "predictions_RFA_gpt3_5": "Reasoning (GPT-3.5 ) -> Final Answer",
-    "predictions_FAR_gpt3_5": "Final Answer -> Reasoning(GPT-3.5)",
-}
-
-# Compute metrics for each model
-metrics_data = []
-for col in cols_to_analyze:
-    accuracy = round(accuracy_score(df["answer_key"], df[col]) * 100, 2)
-    metrics_data.append({"Prediction Type": model_names[col], "Accuracy (%)": accuracy})
-
-# Create a DataFrame for metrics
-metrics_df = pd.DataFrame(metrics_data)
-
-# Column definitions for the metrics table
-metrics_column_defs = [
-    ag_grid.column_def(field="Prediction Type", header_name="Prediction Type", width=250),
-    ag_grid.column_def(field="Accuracy (%)", header_name="Accuracy (%)"),
-]
-
-
-# Function to generate the topic performance star chart
-def topic_star_chart():
-    # Calculate per-topic accuracy
-    topic_accuracy = []
-    for topic in df["topic"].unique():
-        topic_data = df[df["topic"] == topic]
-        for col in cols_to_analyze:
-            accuracy = round((topic_data[col] == topic_data["answer_key"]).mean() * 100, 2)
-            topic_accuracy.append({"Topic": topic, "Prediction Type": model_names[col], "Accuracy (%)": accuracy})
-
-    # Create DataFrame for visualization
-    topic_df = pd.DataFrame(topic_accuracy)
-
-    # Find the top 10 topics by number of rows
-    topic_counts = df["topic"].value_counts().head(10).index
-    filtered_topic_df = topic_df[topic_df["Topic"].isin(topic_counts)]
-
-    # Create star chart (radar chart)
-    fig = px.line_polar(
-        filtered_topic_df,
-        r="Accuracy (%)",
-        theta="Topic",
-        color="Prediction Type",
-        title="Top 10 Topics: Per-Topic Performance Star Chart",
-        line_close=True,  # Close the lines to form a star shape
-    )
-    fig.update_layout(width=900, height=900)
-
-    return fig
-
-
-p1 = '''
-# Prompt Order Experiment
-## 🎯 Goal
-This experiment aims to explore various scenarios for **prompt fine-tuning** using structured generation. We'll test how the order of elements in a prompt affects model performance. The elements we consider are:
-- **(Q)**: Question
-- **(AC)**: Answer Choices
-- **(R)**: Reasoning
-- **(FA)**: Final Answer
-
-### Scenarios
-We will evaluate the following prompt orders:
-
-1. **Scenario 1: `Q - AC - R - FA`** (Mistral and GPT3.5)
-- This is the most natural order. The model generates reasoning before the final answer, providing the most information prior to making a selection. This order leverages decoding mechanics effectively.
-
-This is our user message, we can see the question and answer choices
-<details>
-<summary>Click to show prompt!</summary>
-```
-\<s>[INST] Answer the Question and include your Reasoning and the Final Answer in a json like: {"Reasoning: "...", "Final Answer": "x"} where x is a letter that corresponds to the answer choice which is a letter between a and h.
-Question: What is satellite technology used for predicting?
-Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius[/INST]
-```
-
-This is our assistant message, you can see that we are forcing a JSON (note I added spacing for visual purposes), and we are putting the reasoning first.
-```
-{
-"Reasoning": "a) Seconds and minutes: This option is incorrect because satellite technology is not used for predicting time intervals. Satellite technology is used for various purposes such as communication, navigation, and weather forecasting, but it is not used for predicting time intervals.\n\nb) The strength and magnitude of an earthquake: This option is incorrect because satellite technology is not used for predicting earthquakes. Earthquake prediction is a complex process that involves seismology and other scientific methods, but satellite technology is not one of them.\n\nd) 70-75 degrees Fahrenheit: This option is incorrect because satellite technology is not used for predicting specific temperature ranges. While satellite technology can provide temperature data, it is not used for predicting specific temperature ranges.\n\ne) Rapid changes occur: This option is too vague and does not provide enough information to determine whether it is correct or not. Satellite technology can be used to monitor changes in various environmental factors, but it is not used specifically for predicting rapid changes.\n\nf) Dead-ends and false starts: This option is incorrect because it is not related to satellite technology or any type of prediction.\n\ng) Snow, ice, and rock: This option is incorrect because it is too specific and does not cover the broad range of predictions that satellite technology can be used for. While satellite technology can be used to monitor snow, ice, and rock formations, it is not used exclusively for this purpose.\n\nh) Around 5 to 27 degrees Celsius: This option is incorrect because it is too specific and does not cover the broad range of temperature predictions that satellite technology can be used for. While satellite technology can provide temperature data, it is not used exclusively for predicting temperatures within a specific range.\n\nTherefore, the correct answer is c) what it's like outside each day, as satellite technology is commonly used for weather forecasting and predicting daily weather conditions.",
-"Final Answer": "c"
-}</s>
-```
-</details>
-
-2. **Scenario 2: `Q - AC - FA - R`** (Mistral and GPT3.5)
-- An awkward order, placing reasoning after the final answer. While it is faster, it assumes the model can "know" reasoning internally before generating it. This approach saves tokens but is a skeptical case worth testing.
-
-<details>
-<summary>Click to show prompt!</summary>
-```
-\<s>[INST] Answer the Question and include your Final Answer and the Reasoning in a json like: {"Final Answer": "x", "Reasoning: "..."} where x is a letter that corresponds to the answer choice which is a letter between a and h.
-Question: What is satellite technology used for predicting?
-Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius[/INST]
-```
-
-```
-{
-'Final Answer': 'c',
-'Reasoning': "a) Seconds and minutes: This option is incorrect because satellite technology is not used for predicting time intervals. Satellite technology is used for various purposes such as communication, navigation, and weather forecasting, but it is not used for predicting time intervals.\n\nb) The strength and magnitude of an earthquake: This option is incorrect because satellite technology is not used for predicting earthquakes. Earthquake prediction is a complex process that involves seismology and other scientific methods, but satellite technology is not one of them.\n\nd) 70-75 degrees Fahrenheit: This option is incorrect because satellite technology is not used for predicting specific temperature ranges. While satellite technology can provide temperature data, it is not used for predicting specific temperature ranges.\n\ne) Rapid changes occur: This option is too vague and does not provide enough information to determine whether it is correct or not. Satellite technology can be used to monitor changes in various environmental factors, but it is not used specifically for predicting rapid changes.\n\nf) Dead-ends and false starts: This option is incorrect because it is not related to satellite technology or any type of prediction.\n\ng) Snow, ice, and rock: This option is incorrect because it is too specific and does not cover the broad range of predictions that satellite technology can be used for. While satellite technology can be used to monitor snow, ice, and rock formations, it is not used exclusively for this purpose.\n\nh) Around 5 to 27 degrees Celsius: This option is incorrect because it is too specific and does not cover the broad range of temperature predictions that satellite technology can be used for. While satellite technology can provide temperature data, it is not used exclusively for predicting temperatures within a specific range.\n\nTherefore, the correct answer is c) what it's like outside each day, as satellite technology is commonly used for weather forecasting and predicting daily weather conditions."
-}</s>
-```
-</details>
-
-3. **Scenario 3: `Q - AC - FA`**
-- This serves as a fine-tuning control. No reasoning is provided in the output.
-
-4. **Scenario 4: Base**
-- An un-fine-tuned control for comparison purposes.
-
-### Structured Generation
-Structured generation ensures consistent response formats, which is crucial for reliable fine-tuning. Initial experiments faced difficulties with response consistency and structured generation can solve this.
-
-'''
-
-p2 = '''
-# Steps
-### Dataset Selection
-We begin with the [layoric/labeled-multiple-choice-explained](https://huggingface.co/datasets/layoric/labeled-multiple-choice-explained) dataset, which includes reasoning provided by GPT-3.5-turbo. reasoning explanations serve as a starting point but may differ from Mistral's reasoning style.
-
-0. 00-poe-generate-mistral-reasoning.ipynb: To align with Mistral, we need to create a refined dataset: [derek-thomas/labeled-multiple-choice-explained-mistral-reasoning](https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-mistral-reasoning) with [poe-generate-mistral-reasoning.ipynb](./poe-generate-mistral-reasoning.ipynb).
-1. 01-poe-dataset-creation.ipynb: Then we need to create our prompt experiments
-2. 02-autotrain.ipynb: We generate autotrain jobs on spaces to train our models
-3. 03-poe-token-count-exploration.ipynb: We do some quick analysis so we can optimize our TGI settings
-4. 04-poe-eval.ipynb: We finally evaluate our trained models
-
-'''
-
-
-def mermaid_svg():
-    with open('assets/prompt-order-experiment.svg', 'r') as file:
-        svg_content = file.read()
-
-    return rx.html(
-        f'<div style="width: 300%; height: auto;">{svg_content}</div>'
-    )
-
-
-def overview():
-    return rx.hstack(
-        sidebar(),
-        rx.vstack(
-            rx.markdown(p2),
-            mermaid_svg(),
-        )
-    )
-
-
-def results():
-    return rx.hstack(
-        sidebar(),
-        rx.vstack(
-            rx.heading("Results", size="8", margin="20px 0"),
-            # rx.markdown(diagram),
-            ag_grid(
-                id="ag_grid_metrics",
-                row_data=metrics_df.to_dict("records"),
-                column_defs=metrics_column_defs,
-                width="60%",
-                margin="20px auto",  # Center the table
-            ),
-            rx.divider(),
-            rx.heading("Top 10 Topics: Per-Topic Performance Star Chart", size="8", margin="20px 0"),
-            rx.text(
-                "The chart below shows how each model performed across the top 10 topics by row count. "
-                "Each line represents a model, and the chart allows you to compare their performance across topics.",
-                font_size="md",
-                padding="10px",
-            ),
-            rx.plotly(data=topic_star_chart()),  # Render the radar chart
-            padding="20px",
-
-        )
-    )
-
-
-# Main page of the Reflex app
-def index():
-    return rx.hstack(
-        sidebar(),
-        rx.vstack(
-            rx.markdown(p1),
-        ))
 
+from .pages import index, overview, results
 
 # Initialize the Reflex app with polished layout
 app = rx.App()
-app.add_page(overview, title='Overview', route='/overview') #
-app.add_page(results, title='Results', route='/results') #
-app.add_page(index) #
+app.add_page(overview, title='Overview', route='/overview')  # Overview page
+app.add_page(results, title='Results', route='/results')  # Results page
+app.add_page(index, title='Index', route='/')  # Index page