andymbryant committed on
Commit
97c1a8e
1 Parent(s): 8083da8

finished brainstorming

requirements.txt CHANGED
@@ -1,4 +1,6 @@
  langchain
  openai
+ ipykernel
+ pandas
+ tabulate
  gradio
- # ipykernel
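
The pandas and tabulate additions are runtime dependencies of the notebooks below, which render dataframe samples with `df.head().to_markdown()`; pandas delegates markdown rendering to tabulate, so that call raises an ImportError when tabulate is absent. A minimal sketch of the dependency (column names borrowed from the sample data):

import pandas as pd  # DataFrame.to_markdown() requires the tabulate package at runtime

df = pd.DataFrame({"case_id": ["CR-1095"], "court_fee": [100]})
print(df.head().to_markdown())  # emits the pipe-table format embedded in the prompts below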
 
src/.DS_Store ADDED
Binary file (6.15 kB).
 
src/notebooks/brainstorm.ipynb CHANGED
The diff for this file is too large to render.
 
src/notebooks/brainstorm2.ipynb ADDED
@@ -0,0 +1,307 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import pandas as pd\n",
+ "import langchain\n",
+ "from langchain.agents import OpenAIFunctionsAgent, AgentExecutor\n",
+ "from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
+ "from langchain.tools import PythonAstREPLTool\n",
+ "from langchain.chat_models import ChatOpenAI\n",
+ "from pydantic import BaseModel, Field\n",
+ "from langchain.memory import ConversationBufferMemory"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "langchain.debug = True\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"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\"\n",
+ "data_dir_path = os.path.join(os.getcwd())\n",
+ "pd.set_option('display.max_rows', 20)\n",
+ "pd.set_option('display.max_columns', 20)\n",
+ "\n",
+ "NUM_ROWS_IN_HEAD = 5\n",
+ "\n",
+ "# {dataframe_heads_str}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "PROMPT_TEMPLATE = \"\"\"You are DataMapperGPT. Your job is to work with a human, who is a data engineer, to compare multiple source dataframes and map their structures to the schema of the target dataframe.\n",
+ "The ultimate goal is to generate a mapping from the source dataframes to the target dataframe.\n",
+ "\n",
+ "This is the result of running `df.head().to_markdown()` on each of the dataframes:\n",
+ "\n",
+ "{dataframe_heads_str}\n",
+ "You can use these samples to draw conclusions about the structure of the data. Do not get more than 5 rows at a time.\n",
+ "\n",
+ "Please work step by step through this process. You can make intermediate queries, validate your logic, and then move on to the next step.\n",
+ "\n",
+ "Be precise, analytical, thorough.\n",
+ "\n",
+ "Here is a history of the conversation with the user so far:\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class PythonInputs(BaseModel):\n",
+ "    query: str = Field(description=\"code snippet to run\")\n",
+ "\n",
+ "format_df_for_prompt = lambda df: f'<df>\\n{df.head(NUM_ROWS_IN_HEAD).to_markdown()}\\n</df>'\n",
+ "\n",
+ "entries_a_df = pd.read_csv(os.path.join(data_dir_path, 'legal_entries_a.csv'))\n",
+ "entries_b_df = pd.read_csv(os.path.join(data_dir_path, 'legal_entries_b.csv'))\n",
+ "template_df = pd.read_csv(os.path.join(data_dir_path, 'legal_template.csv'))\n",
+ "\n",
+ "df_name_to_df_map = {\"source_df_1\": entries_a_df, \"source_df_2\": entries_b_df, \"template_df\": template_df}\n",
+ "\n",
+ "dataframe_heads_str_list: list[str] = []\n",
+ "for df_name, df in df_name_to_df_map.items():\n",
+ "    dataframe_heads_str_list.append(f'<{df_name}>\\n{df.head(NUM_ROWS_IN_HEAD).to_markdown()}\\n</{df_name}>')\n",
+ "\n",
+ "prompt_template = PROMPT_TEMPLATE.format(dataframe_heads_str=\"\\n\\n\".join(dataframe_heads_str_list))\n",
+ "\n",
+ "prompt = ChatPromptTemplate.from_messages([\n",
+ "    (\"system\", prompt_template),\n",
+ "    MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n",
+ "    (\"human\", \"{input}\")\n",
+ "])\n",
+ "memory = ConversationBufferMemory(memory_key=\"chat_history\", return_messages=True)\n",
+ "\n",
+ "repl = PythonAstREPLTool(locals=df_name_to_df_map, name=\"python_repl\",\n",
+ "    description=\"Runs code and returns the output of the final line\",\n",
+ "    args_schema=PythonInputs)\n",
+ "tools = [repl]\n",
+ "agent = OpenAIFunctionsAgent(llm=ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\"), prompt=prompt, tools=tools, memory=memory, handle_parsing_errors=True)\n",
+ "agent_executor = AgentExecutor(agent=agent, tools=tools, max_iterations=5, early_stopping_method=\"generate\", handle_parsing_errors=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:AgentExecutor] Entering Chain run with input:\n",
+ "\u001b[0m{\n",
+ "  \"input\": \"What are the key differences between the dataframe schemas?\",\n",
+ "  \"chat_history\": []\n",
+ "}\n",
+ "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 2:llm:ChatOpenAI] Entering LLM run with input:\n",
+ "\u001b[0m{\n",
+ "  \"prompts\": [\n",
+ " \"System: You are DataMapperGPT. Your job is to work with a human, who is a data engineer, to compare multiple source dataframes and map their structures to the schema of the target dataframe.\\nThe ultimate goal is to generate a mapping from the source dataframes to the target dataframe.\\n\\nThis is the result of running `df.head().to_markdown()` on each of the dataframes:\\n\\n<source_df_1>\\n| | case_date | lastname | firstname | case_type | case_id | court_fee | jurisdiction |\\n|---:|:------------|:-----------|:------------|:------------|:----------|------------:|:---------------|\\n| 0 | 2023-05-12 | Kim | Miguel | Civil | CR-1095 | 100 | BOSTON |\\n| 1 | 2023-04-20 | Lee | John | Criminl | CR-8597 | 150 | houston |\\n| 2 | 2023-02-10 | Smith | Dmitri | Criminal | CR-6833 | 200 | chicago |\\n| 3 | 2023-03-16 | Patel | Dmitri | Criminal | CR-2899 | 100 | BOSTON |\\n| 4 | 2023-06-15 | Ivanov | Jane | Family | CR-5997 | 200 | houston |\\n</source_df_1>\\n\\n<source_df_2>\\n| | Date_of_Case | Fee | FullName | CaseNumber | CaseKind | Location |\\n|---:|:---------------|------:|:-------------|:-------------|:-----------|:-----------|\\n| 0 | 2023/05/12 | 100 | Miguel Kim | CASE-8206 | Civil | BOST |\\n| 1 | 2023/04/20 | 150 | John Lee | CASE-4328 | Criminl | HOUST |\\n| 2 | 2023/02/10 | 200 | Dmitri Smith | CASE-1915 | Criminal | CHIC |\\n| 3 | 2023/03/16 | 100 | Dmitri Patel | CASE-4283 | Criminal | BOSTO |\\n| 4 | 2023/06/15 | 200 | Jane Ivanov | CASE-7732 | Family | HOUST |\\n</source_df_2>\\n\\n<template_df>\\n| | CaseDate | FullName | CaseType | CaseID | Fee | Jurisdiction |\\n|---:|:-----------|:-------------|:-----------|:----------|------:|:---------------|\\n| 0 | 2023-05-12 | Miguel Kim | Civil | CASE-6761 | 100 | Boston |\\n| 1 | 2023-04-20 | John Lee | Criminl | CASE-6089 | 150 | Houston |\\n| 2 | 2023-02-10 | Dmitri Smith | Criminal | CASE-9565 | 200 | Chicago |\\n| 3 | 2023-03-16 | Dmitri Patel | Criminal | CASE-6222 | 100 | Boston |\\n| 4 | 2023-06-15 | Jane Ivanov | Family | CASE-2702 | 200 | Houston |\\n</template_df>\\nYou can use these samples to draw conclusions about the structure of the data. Do not get more than 5 rows at a time.\\n\\nPlease work step by step through this process. You can make intermediate queries, validate your logic, and then move on to the next step.\\n\\nBe precise, analytical, thorough.\\n\\nHere is a history of the conversation with the user so far:\\n\\nHuman: What are the key differences between the dataframe schemas?\"\n",
+ "  ]\n",
+ "}\n",
+ "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 2:llm:ChatOpenAI] [16.60s] Exiting LLM run with output:\n",
+ "\u001b[0m{\n",
+ "  \"generations\": [\n",
+ "    [\n",
+ "      {\n",
+ "        \"text\": \"\",\n",
+ "        \"generation_info\": {\n",
+ "          \"finish_reason\": \"function_call\"\n",
+ "        },\n",
+ "        \"message\": {\n",
+ "          \"lc\": 1,\n",
+ "          \"type\": \"constructor\",\n",
+ "          \"id\": [\n",
+ "            \"langchain\",\n",
+ "            \"schema\",\n",
+ "            \"messages\",\n",
+ "            \"AIMessage\"\n",
+ "          ],\n",
+ "          \"kwargs\": {\n",
+ "            \"content\": \"\",\n",
+ "            \"additional_kwargs\": {\n",
+ "              \"function_call\": {\n",
+ "                \"name\": \"python_repl\",\n",
+ " \"arguments\": \"{\\n \\\"query\\\": \\\"import pandas as pd\\\\n\\\\nsource_df_1 = pd.DataFrame({'case_date': ['2023-05-12', '2023-04-20', '2023-02-10', '2023-03-16', '2023-06-15'], 'lastname': ['Kim', 'Lee', 'Smith', 'Patel', 'Ivanov'], 'firstname': ['Miguel', 'John', 'Dmitri', 'Dmitri', 'Jane'], 'case_type': ['Civil', 'Criminl', 'Criminal', 'Criminal', 'Family'], 'case_id': ['CR-1095', 'CR-8597', 'CR-6833', 'CR-2899', 'CR-5997'], 'court_fee': [100, 150, 200, 100, 200], 'jurisdiction': ['BOSTON', 'houston', 'chicago', 'BOSTON', 'houston']})\\\\n\\\\nsource_df_2 = pd.DataFrame({'Date_of_Case': ['2023/05/12', '2023/04/20', '2023/02/10', '2023/03/16', '2023/06/15'], 'Fee': [100, 150, 200, 100, 200], 'FullName': ['Miguel Kim', 'John Lee', 'Dmitri Smith', 'Dmitri Patel', 'Jane Ivanov'], 'CaseNumber': ['CASE-8206', 'CASE-4328', 'CASE-1915', 'CASE-4283', 'CASE-7732'], 'CaseKind': ['Civil', 'Criminl', 'Criminal', 'Criminal', 'Family'], 'Location': ['BOST', 'HOUST', 'CHIC', 'BOSTO', 'HOUST']})\\\\n\\\\ntemplate_df = pd.DataFrame({'CaseDate': ['2023-05-12', '2023-04-20', '2023-02-10', '2023-03-16', '2023-06-15'], 'FullName': ['Miguel Kim', 'John Lee', 'Dmitri Smith', 'Dmitri Patel', 'Jane Ivanov'], 'CaseType': ['Civil', 'Criminl', 'Criminal', 'Criminal', 'Family'], 'CaseID': ['CASE-6761', 'CASE-6089', 'CASE-9565', 'CASE-6222', 'CASE-2702'], 'Fee': [100, 150, 200, 100, 200], 'Jurisdiction': ['Boston', 'Houston', 'Chicago', 'Boston', 'Houston']})\\\\n\\\\nsource_df_1.head().to_markdown()\\\"\\n}\"\n",
+ "              }\n",
+ "            }\n",
+ "          }\n",
+ "        }\n",
+ "      }\n",
+ "    ]\n",
+ "  ],\n",
+ "  \"llm_output\": {\n",
+ "    \"token_usage\": {\n",
+ "      \"prompt_tokens\": 932,\n",
+ "      \"completion_tokens\": 599,\n",
+ "      \"total_tokens\": 1531\n",
+ "    },\n",
+ "    \"model_name\": \"gpt-3.5-turbo-0613\"\n",
+ "  },\n",
+ "  \"run\": null\n",
+ "}\n",
+ "\u001b[32;1m\u001b[1;3m[tool/start]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 3:tool:python_repl] Entering Tool run with input:\n",
+ "\u001b[0m\"{'query': \"import pandas as pd\\n\\nsource_df_1 = pd.DataFrame({'case_date': ['2023-05-12', '2023-04-20', '2023-02-10', '2023-03-16', '2023-06-15'], 'lastname': ['Kim', 'Lee', 'Smith', 'Patel', 'Ivanov'], 'firstname': ['Miguel', 'John', 'Dmitri', 'Dmitri', 'Jane'], 'case_type': ['Civil', 'Criminl', 'Criminal', 'Criminal', 'Family'], 'case_id': ['CR-1095', 'CR-8597', 'CR-6833', 'CR-2899', 'CR-5997'], 'court_fee': [100, 150, 200, 100, 200], 'jurisdiction': ['BOSTON', 'houston', 'chicago', 'BOSTON', 'houston']})\\n\\nsource_df_2 = pd.DataFrame({'Date_of_Case': ['2023/05/12', '2023/04/20', '2023/02/10', '2023/03/16', '2023/06/15'], 'Fee': [100, 150, 200, 100, 200], 'FullName': ['Miguel Kim', 'John Lee', 'Dmitri Smith', 'Dmitri Patel', 'Jane Ivanov'], 'CaseNumber': ['CASE-8206', 'CASE-4328', 'CASE-1915', 'CASE-4283', 'CASE-7732'], 'CaseKind': ['Civil', 'Criminl', 'Criminal', 'Criminal', 'Family'], 'Location': ['BOST', 'HOUST', 'CHIC', 'BOSTO', 'HOUST']})\\n\\ntemplate_df = pd.DataFrame({'CaseDate': ['2023-05-12', '2023-04-20', '2023-02-10', '2023-03-16', '2023-06-15'], 'FullName': ['Miguel Kim', 'John Lee', 'Dmitri Smith', 'Dmitri Patel', 'Jane Ivanov'], 'CaseType': ['Civil', 'Criminl', 'Criminal', 'Criminal', 'Family'], 'CaseID': ['CASE-6761', 'CASE-6089', 'CASE-9565', 'CASE-6222', 'CASE-2702'], 'Fee': [100, 150, 200, 100, 200], 'Jurisdiction': ['Boston', 'Houston', 'Chicago', 'Boston', 'Houston']})\\n\\nsource_df_1.head().to_markdown()\"}\"\n",
+ "\u001b[36;1m\u001b[1;3m[tool/end]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 3:tool:python_repl] [7ms] Exiting Tool run with output:\n",
+ "\u001b[0m\"|    | case_date   | lastname   | firstname   | case_type   | case_id   |   court_fee | jurisdiction   |\n",
+ "|---:|:------------|:-----------|:------------|:------------|:----------|------------:|:---------------|\n",
+ "|  0 | 2023-05-12  | Kim        | Miguel      | Civil       | CR-1095   |         100 | BOSTON         |\n",
+ "|  1 | 2023-04-20  | Lee        | John        | Criminl     | CR-8597   |         150 | houston        |\n",
+ "|  2 | 2023-02-10  | Smith      | Dmitri      | Criminal    | CR-6833   |         200 | chicago        |\n",
+ "|  3 | 2023-03-16  | Patel      | Dmitri      | Criminal    | CR-2899   |         100 | BOSTON         |\n",
+ "|  4 | 2023-06-15  | Ivanov     | Jane        | Family      | CR-5997   |         200 | houston        |\"\n",
+ "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 4:llm:ChatOpenAI] Entering LLM run with input:\n",
+ "\u001b[0m{\n",
+ "  \"prompts\": [\n",
+ " \"System: You are DataMapperGPT. Your job is to work with a human, who is a data engineer, to compare multiple source dataframes and map their structures to the schema of the target dataframe.\\nThe ultimate goal is to generate a mapping from the source dataframes to the target dataframe.\\n\\nThis is the result of running `df.head().to_markdown()` on each of the dataframes:\\n\\n<source_df_1>\\n| | case_date | lastname | firstname | case_type | case_id | court_fee | jurisdiction |\\n|---:|:------------|:-----------|:------------|:------------|:----------|------------:|:---------------|\\n| 0 | 2023-05-12 | Kim | Miguel | Civil | CR-1095 | 100 | BOSTON |\\n| 1 | 2023-04-20 | Lee | John | Criminl | CR-8597 | 150 | houston |\\n| 2 | 2023-02-10 | Smith | Dmitri | Criminal | CR-6833 | 200 | chicago |\\n| 3 | 2023-03-16 | Patel | Dmitri | Criminal | CR-2899 | 100 | BOSTON |\\n| 4 | 2023-06-15 | Ivanov | Jane | Family | CR-5997 | 200 | houston |\\n</source_df_1>\\n\\n<source_df_2>\\n| | Date_of_Case | Fee | FullName | CaseNumber | CaseKind | Location |\\n|---:|:---------------|------:|:-------------|:-------------|:-----------|:-----------|\\n| 0 | 2023/05/12 | 100 | Miguel Kim | CASE-8206 | Civil | BOST |\\n| 1 | 2023/04/20 | 150 | John Lee | CASE-4328 | Criminl | HOUST |\\n| 2 | 2023/02/10 | 200 | Dmitri Smith | CASE-1915 | Criminal | CHIC |\\n| 3 | 2023/03/16 | 100 | Dmitri Patel | CASE-4283 | Criminal | BOSTO |\\n| 4 | 2023/06/15 | 200 | Jane Ivanov | CASE-7732 | Family | HOUST |\\n</source_df_2>\\n\\n<template_df>\\n| | CaseDate | FullName | CaseType | CaseID | Fee | Jurisdiction |\\n|---:|:-----------|:-------------|:-----------|:----------|------:|:---------------|\\n| 0 | 2023-05-12 | Miguel Kim | Civil | CASE-6761 | 100 | Boston |\\n| 1 | 2023-04-20 | John Lee | Criminl | CASE-6089 | 150 | Houston |\\n| 2 | 2023-02-10 | Dmitri Smith | Criminal | CASE-9565 | 200 | Chicago |\\n| 3 | 2023-03-16 | Dmitri Patel | Criminal | CASE-6222 | 100 | Boston |\\n| 4 | 2023-06-15 | Jane Ivanov | Family | CASE-2702 | 200 | Houston |\\n</template_df>\\nYou can use these samples to draw conclusions about the structure of the data. Do not get more than 5 rows at a time.\\n\\nPlease work step by step through this process. 
You can make intermediate queries, validate your logic, and then move on to the next step.\\n\\nBe precise, analytical, thorough.\\n\\nHere is a history of the conversation with the user so far:\\n\\nAI: {'name': 'python_repl', 'arguments': '{\\\\n \\\"query\\\": \\\"import pandas as pd\\\\\\\\n\\\\\\\\nsource_df_1 = pd.DataFrame({\\\\'case_date\\\\': [\\\\'2023-05-12\\\\', \\\\'2023-04-20\\\\', \\\\'2023-02-10\\\\', \\\\'2023-03-16\\\\', \\\\'2023-06-15\\\\'], \\\\'lastname\\\\': [\\\\'Kim\\\\', \\\\'Lee\\\\', \\\\'Smith\\\\', \\\\'Patel\\\\', \\\\'Ivanov\\\\'], \\\\'firstname\\\\': [\\\\'Miguel\\\\', \\\\'John\\\\', \\\\'Dmitri\\\\', \\\\'Dmitri\\\\', \\\\'Jane\\\\'], \\\\'case_type\\\\': [\\\\'Civil\\\\', \\\\'Criminl\\\\', \\\\'Criminal\\\\', \\\\'Criminal\\\\', \\\\'Family\\\\'], \\\\'case_id\\\\': [\\\\'CR-1095\\\\', \\\\'CR-8597\\\\', \\\\'CR-6833\\\\', \\\\'CR-2899\\\\', \\\\'CR-5997\\\\'], \\\\'court_fee\\\\': [100, 150, 200, 100, 200], \\\\'jurisdiction\\\\': [\\\\'BOSTON\\\\', \\\\'houston\\\\', \\\\'chicago\\\\', \\\\'BOSTON\\\\', \\\\'houston\\\\']})\\\\\\\\n\\\\\\\\nsource_df_2 = pd.DataFrame({\\\\'Date_of_Case\\\\': [\\\\'2023/05/12\\\\', \\\\'2023/04/20\\\\', \\\\'2023/02/10\\\\', \\\\'2023/03/16\\\\', \\\\'2023/06/15\\\\'], \\\\'Fee\\\\': [100, 150, 200, 100, 200], \\\\'FullName\\\\': [\\\\'Miguel Kim\\\\', \\\\'John Lee\\\\', \\\\'Dmitri Smith\\\\', \\\\'Dmitri Patel\\\\', \\\\'Jane Ivanov\\\\'], \\\\'CaseNumber\\\\': [\\\\'CASE-8206\\\\', \\\\'CASE-4328\\\\', \\\\'CASE-1915\\\\', \\\\'CASE-4283\\\\', \\\\'CASE-7732\\\\'], \\\\'CaseKind\\\\': [\\\\'Civil\\\\', \\\\'Criminl\\\\', \\\\'Criminal\\\\', \\\\'Criminal\\\\', \\\\'Family\\\\'], \\\\'Location\\\\': [\\\\'BOST\\\\', \\\\'HOUST\\\\', \\\\'CHIC\\\\', \\\\'BOSTO\\\\', \\\\'HOUST\\\\']})\\\\\\\\n\\\\\\\\ntemplate_df = pd.DataFrame({\\\\'CaseDate\\\\': [\\\\'2023-05-12\\\\', \\\\'2023-04-20\\\\', \\\\'2023-02-10\\\\', \\\\'2023-03-16\\\\', \\\\'2023-06-15\\\\'], \\\\'FullName\\\\': [\\\\'Miguel Kim\\\\', \\\\'John Lee\\\\', \\\\'Dmitri Smith\\\\', \\\\'Dmitri Patel\\\\', \\\\'Jane Ivanov\\\\'], \\\\'CaseType\\\\': [\\\\'Civil\\\\', \\\\'Criminl\\\\', \\\\'Criminal\\\\', \\\\'Criminal\\\\', \\\\'Family\\\\'], \\\\'CaseID\\\\': [\\\\'CASE-6761\\\\', \\\\'CASE-6089\\\\', \\\\'CASE-9565\\\\', \\\\'CASE-6222\\\\', \\\\'CASE-2702\\\\'], \\\\'Fee\\\\': [100, 150, 200, 100, 200], \\\\'Jurisdiction\\\\': [\\\\'Boston\\\\', \\\\'Houston\\\\', \\\\'Chicago\\\\', \\\\'Boston\\\\', \\\\'Houston\\\\']})\\\\\\\\n\\\\\\\\nsource_df_1.head().to_markdown()\\\"\\\\n}'}\\nFunction: | | case_date | lastname | firstname | case_type | case_id | court_fee | jurisdiction |\\n|---:|:------------|:-----------|:------------|:------------|:----------|------------:|:---------------|\\n| 0 | 2023-05-12 | Kim | Miguel | Civil | CR-1095 | 100 | BOSTON |\\n| 1 | 2023-04-20 | Lee | John | Criminl | CR-8597 | 150 | houston |\\n| 2 | 2023-02-10 | Smith | Dmitri | Criminal | CR-6833 | 200 | chicago |\\n| 3 | 2023-03-16 | Patel | Dmitri | Criminal | CR-2899 | 100 | BOSTON |\\n| 4 | 2023-06-15 | Ivanov | Jane | Family | CR-5997 | 200 | houston |\\nHuman: What are the key differences between the dataframe schemas?\"\n",
+ "  ]\n",
+ "}\n",
+ "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 4:llm:ChatOpenAI] [1.18s] Exiting LLM run with output:\n",
+ "\u001b[0m{\n",
+ "  \"generations\": [\n",
+ "    [\n",
+ "      {\n",
+ "        \"text\": \"\",\n",
+ "        \"generation_info\": {\n",
+ "          \"finish_reason\": \"function_call\"\n",
+ "        },\n",
+ "        \"message\": {\n",
+ "          \"lc\": 1,\n",
+ "          \"type\": \"constructor\",\n",
+ "          \"id\": [\n",
+ "            \"langchain\",\n",
+ "            \"schema\",\n",
+ "            \"messages\",\n",
+ "            \"AIMessage\"\n",
+ "          ],\n",
+ "          \"kwargs\": {\n",
+ "            \"content\": \"\",\n",
+ "            \"additional_kwargs\": {\n",
+ "              \"function_call\": {\n",
+ "                \"name\": \"python_repl\",\n",
+ "                \"arguments\": \"{\\n  \\\"query\\\": \\\"set(source_df_1.columns) - set(template_df.columns)\\\"\\n}\"\n",
+ "              }\n",
+ "            }\n",
+ "          }\n",
+ "        }\n",
+ "      }\n",
+ "    ]\n",
+ "  ],\n",
+ "  \"llm_output\": {\n",
+ "    \"token_usage\": {\n",
+ "      \"prompt_tokens\": 1784,\n",
+ "      \"completion_tokens\": 27,\n",
+ "      \"total_tokens\": 1811\n",
+ "    },\n",
+ "    \"model_name\": \"gpt-3.5-turbo-0613\"\n",
+ "  },\n",
+ "  \"run\": null\n",
+ "}\n",
+ "\u001b[32;1m\u001b[1;3m[tool/start]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 5:tool:python_repl] Entering Tool run with input:\n",
+ "\u001b[0m\"{'query': 'set(source_df_1.columns) - set(template_df.columns)'}\"\n",
+ "\u001b[36;1m\u001b[1;3m[tool/end]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 5:tool:python_repl] [0ms] Exiting Tool run with output:\n",
+ "\u001b[0m\"{'case_id', 'firstname', 'court_fee', 'case_type', 'lastname', 'case_date', 'jurisdiction'}\"\n",
+ "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 6:llm:ChatOpenAI] Entering LLM run with input:\n",
+ "\u001b[0m{\n",
+ "  \"prompts\": [\n",
+ " \"System: You are DataMapperGPT. Your job is to work with a human, who is a data engineer, to compare multiple source dataframes and map their structures to the schema of the target dataframe.\\nThe ultimate goal is to generate a mapping from the source dataframes to the target dataframe.\\n\\nThis is the result of running `df.head().to_markdown()` on each of the dataframes:\\n\\n<source_df_1>\\n| | case_date | lastname | firstname | case_type | case_id | court_fee | jurisdiction |\\n|---:|:------------|:-----------|:------------|:------------|:----------|------------:|:---------------|\\n| 0 | 2023-05-12 | Kim | Miguel | Civil | CR-1095 | 100 | BOSTON |\\n| 1 | 2023-04-20 | Lee | John | Criminl | CR-8597 | 150 | houston |\\n| 2 | 2023-02-10 | Smith | Dmitri | Criminal | CR-6833 | 200 | chicago |\\n| 3 | 2023-03-16 | Patel | Dmitri | Criminal | CR-2899 | 100 | BOSTON |\\n| 4 | 2023-06-15 | Ivanov | Jane | Family | CR-5997 | 200 | houston |\\n</source_df_1>\\n\\n<source_df_2>\\n| | Date_of_Case | Fee | FullName | CaseNumber | CaseKind | Location |\\n|---:|:---------------|------:|:-------------|:-------------|:-----------|:-----------|\\n| 0 | 2023/05/12 | 100 | Miguel Kim | CASE-8206 | Civil | BOST |\\n| 1 | 2023/04/20 | 150 | John Lee | CASE-4328 | Criminl | HOUST |\\n| 2 | 2023/02/10 | 200 | Dmitri Smith | CASE-1915 | Criminal | CHIC |\\n| 3 | 2023/03/16 | 100 | Dmitri Patel | CASE-4283 | Criminal | BOSTO |\\n| 4 | 2023/06/15 | 200 | Jane Ivanov | CASE-7732 | Family | HOUST |\\n</source_df_2>\\n\\n<template_df>\\n| | CaseDate | FullName | CaseType | CaseID | Fee | Jurisdiction |\\n|---:|:-----------|:-------------|:-----------|:----------|------:|:---------------|\\n| 0 | 2023-05-12 | Miguel Kim | Civil | CASE-6761 | 100 | Boston |\\n| 1 | 2023-04-20 | John Lee | Criminl | CASE-6089 | 150 | Houston |\\n| 2 | 2023-02-10 | Dmitri Smith | Criminal | CASE-9565 | 200 | Chicago |\\n| 3 | 2023-03-16 | Dmitri Patel | Criminal | CASE-6222 | 100 | Boston |\\n| 4 | 2023-06-15 | Jane Ivanov | Family | CASE-2702 | 200 | Houston |\\n</template_df>\\nYou can use these samples to draw conclusions about the structure of the data. Do not get more than 5 rows at a time.\\n\\nPlease work step by step through this process. 
You can make intermediate queries, validate your logic, and then move on to the next step.\\n\\nBe precise, analytical, thorough.\\n\\nHere is a history of the conversation with the user so far:\\n\\nAI: {'name': 'python_repl', 'arguments': '{\\\\n \\\"query\\\": \\\"import pandas as pd\\\\\\\\n\\\\\\\\nsource_df_1 = pd.DataFrame({\\\\'case_date\\\\': [\\\\'2023-05-12\\\\', \\\\'2023-04-20\\\\', \\\\'2023-02-10\\\\', \\\\'2023-03-16\\\\', \\\\'2023-06-15\\\\'], \\\\'lastname\\\\': [\\\\'Kim\\\\', \\\\'Lee\\\\', \\\\'Smith\\\\', \\\\'Patel\\\\', \\\\'Ivanov\\\\'], \\\\'firstname\\\\': [\\\\'Miguel\\\\', \\\\'John\\\\', \\\\'Dmitri\\\\', \\\\'Dmitri\\\\', \\\\'Jane\\\\'], \\\\'case_type\\\\': [\\\\'Civil\\\\', \\\\'Criminl\\\\', \\\\'Criminal\\\\', \\\\'Criminal\\\\', \\\\'Family\\\\'], \\\\'case_id\\\\': [\\\\'CR-1095\\\\', \\\\'CR-8597\\\\', \\\\'CR-6833\\\\', \\\\'CR-2899\\\\', \\\\'CR-5997\\\\'], \\\\'court_fee\\\\': [100, 150, 200, 100, 200], \\\\'jurisdiction\\\\': [\\\\'BOSTON\\\\', \\\\'houston\\\\', \\\\'chicago\\\\', \\\\'BOSTON\\\\', \\\\'houston\\\\']})\\\\\\\\n\\\\\\\\nsource_df_2 = pd.DataFrame({\\\\'Date_of_Case\\\\': [\\\\'2023/05/12\\\\', \\\\'2023/04/20\\\\', \\\\'2023/02/10\\\\', \\\\'2023/03/16\\\\', \\\\'2023/06/15\\\\'], \\\\'Fee\\\\': [100, 150, 200, 100, 200], \\\\'FullName\\\\': [\\\\'Miguel Kim\\\\', \\\\'John Lee\\\\', \\\\'Dmitri Smith\\\\', \\\\'Dmitri Patel\\\\', \\\\'Jane Ivanov\\\\'], \\\\'CaseNumber\\\\': [\\\\'CASE-8206\\\\', \\\\'CASE-4328\\\\', \\\\'CASE-1915\\\\', \\\\'CASE-4283\\\\', \\\\'CASE-7732\\\\'], \\\\'CaseKind\\\\': [\\\\'Civil\\\\', \\\\'Criminl\\\\', \\\\'Criminal\\\\', \\\\'Criminal\\\\', \\\\'Family\\\\'], \\\\'Location\\\\': [\\\\'BOST\\\\', \\\\'HOUST\\\\', \\\\'CHIC\\\\', \\\\'BOSTO\\\\', \\\\'HOUST\\\\']})\\\\\\\\n\\\\\\\\ntemplate_df = pd.DataFrame({\\\\'CaseDate\\\\': [\\\\'2023-05-12\\\\', \\\\'2023-04-20\\\\', \\\\'2023-02-10\\\\', \\\\'2023-03-16\\\\', \\\\'2023-06-15\\\\'], \\\\'FullName\\\\': [\\\\'Miguel Kim\\\\', \\\\'John Lee\\\\', \\\\'Dmitri Smith\\\\', \\\\'Dmitri Patel\\\\', \\\\'Jane Ivanov\\\\'], \\\\'CaseType\\\\': [\\\\'Civil\\\\', \\\\'Criminl\\\\', \\\\'Criminal\\\\', \\\\'Criminal\\\\', \\\\'Family\\\\'], \\\\'CaseID\\\\': [\\\\'CASE-6761\\\\', \\\\'CASE-6089\\\\', \\\\'CASE-9565\\\\', \\\\'CASE-6222\\\\', \\\\'CASE-2702\\\\'], \\\\'Fee\\\\': [100, 150, 200, 100, 200], \\\\'Jurisdiction\\\\': [\\\\'Boston\\\\', \\\\'Houston\\\\', \\\\'Chicago\\\\', \\\\'Boston\\\\', \\\\'Houston\\\\']})\\\\\\\\n\\\\\\\\nsource_df_1.head().to_markdown()\\\"\\\\n}'}\\nFunction: | | case_date | lastname | firstname | case_type | case_id | court_fee | jurisdiction |\\n|---:|:------------|:-----------|:------------|:------------|:----------|------------:|:---------------|\\n| 0 | 2023-05-12 | Kim | Miguel | Civil | CR-1095 | 100 | BOSTON |\\n| 1 | 2023-04-20 | Lee | John | Criminl | CR-8597 | 150 | houston |\\n| 2 | 2023-02-10 | Smith | Dmitri | Criminal | CR-6833 | 200 | chicago |\\n| 3 | 2023-03-16 | Patel | Dmitri | Criminal | CR-2899 | 100 | BOSTON |\\n| 4 | 2023-06-15 | Ivanov | Jane | Family | CR-5997 | 200 | houston |\\nAI: {'name': 'python_repl', 'arguments': '{\\\\n \\\"query\\\": \\\"set(source_df_1.columns) - set(template_df.columns)\\\"\\\\n}'}\\nFunction: {'case_id', 'firstname', 'court_fee', 'case_type', 'lastname', 'case_date', 'jurisdiction'}\\nHuman: What are the key differences between the dataframe schemas?\"\n",
+ "  ]\n",
+ "}\n",
+ "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[1:chain:AgentExecutor > 6:llm:ChatOpenAI] [8.40s] Exiting LLM run with output:\n",
+ "\u001b[0m{\n",
+ "  \"generations\": [\n",
+ "    [\n",
+ "      {\n",
+ " \"text\": \"The key differences between the dataframe schemas are as follows:\\n\\n1. The column names in `source_df_1` are different from the column names in `template_df`. The column names in `source_df_1` are: `case_date`, `lastname`, `firstname`, `case_type`, `case_id`, `court_fee`, and `jurisdiction`. The corresponding column names in `template_df` are: `CaseDate`, `FullName`, `CaseType`, `CaseID`, `Fee`, and `Jurisdiction`.\\n\\n2. The order of the columns is different between `source_df_1` and `template_df`.\\n\\n3. The values in the `case_date` column of `source_df_1` are in the format 'YYYY-MM-DD', while the values in the `CaseDate` column of `template_df` are in the format 'YYYY-MM-DD'.\\n\\n4. The values in the `court_fee` column of `source_df_1` are integers, while the values in the `Fee` column of `template_df` are also integers.\\n\\n5. The values in the `jurisdiction` column of `source_df_1` are in uppercase, while the values in the `Jurisdiction` column of `template_df` are in title case.\\n\\nThese are the key differences between the dataframe schemas.\",\n",
+ "        \"generation_info\": {\n",
+ "          \"finish_reason\": \"stop\"\n",
+ "        },\n",
+ "        \"message\": {\n",
+ "          \"lc\": 1,\n",
+ "          \"type\": \"constructor\",\n",
+ "          \"id\": [\n",
+ "            \"langchain\",\n",
+ "            \"schema\",\n",
+ "            \"messages\",\n",
+ "            \"AIMessage\"\n",
+ "          ],\n",
+ "          \"kwargs\": {\n",
+ " \"content\": \"The key differences between the dataframe schemas are as follows:\\n\\n1. The column names in `source_df_1` are different from the column names in `template_df`. The column names in `source_df_1` are: `case_date`, `lastname`, `firstname`, `case_type`, `case_id`, `court_fee`, and `jurisdiction`. The corresponding column names in `template_df` are: `CaseDate`, `FullName`, `CaseType`, `CaseID`, `Fee`, and `Jurisdiction`.\\n\\n2. The order of the columns is different between `source_df_1` and `template_df`.\\n\\n3. The values in the `case_date` column of `source_df_1` are in the format 'YYYY-MM-DD', while the values in the `CaseDate` column of `template_df` are in the format 'YYYY-MM-DD'.\\n\\n4. The values in the `court_fee` column of `source_df_1` are integers, while the values in the `Fee` column of `template_df` are also integers.\\n\\n5. The values in the `jurisdiction` column of `source_df_1` are in uppercase, while the values in the `Jurisdiction` column of `template_df` are in title case.\\n\\nThese are the key differences between the dataframe schemas.\",\n",
+ "            \"additional_kwargs\": {}\n",
+ "          }\n",
+ "        }\n",
+ "      }\n",
+ "    ]\n",
+ "  ],\n",
+ "  \"llm_output\": {\n",
+ "    \"token_usage\": {\n",
+ "      \"prompt_tokens\": 1846,\n",
+ "      \"completion_tokens\": 271,\n",
+ "      \"total_tokens\": 2117\n",
+ "    },\n",
+ "    \"model_name\": \"gpt-3.5-turbo-0613\"\n",
+ "  },\n",
+ "  \"run\": null\n",
+ "}\n",
+ "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:AgentExecutor] [26.19s] Exiting Chain run with output:\n",
+ "\u001b[0m{\n",
+ " \"output\": \"The key differences between the dataframe schemas are as follows:\\n\\n1. The column names in `source_df_1` are different from the column names in `template_df`. The column names in `source_df_1` are: `case_date`, `lastname`, `firstname`, `case_type`, `case_id`, `court_fee`, and `jurisdiction`. The corresponding column names in `template_df` are: `CaseDate`, `FullName`, `CaseType`, `CaseID`, `Fee`, and `Jurisdiction`.\\n\\n2. The order of the columns is different between `source_df_1` and `template_df`.\\n\\n3. The values in the `case_date` column of `source_df_1` are in the format 'YYYY-MM-DD', while the values in the `CaseDate` column of `template_df` are in the format 'YYYY-MM-DD'.\\n\\n4. The values in the `court_fee` column of `source_df_1` are integers, while the values in the `Fee` column of `template_df` are also integers.\\n\\n5. The values in the `jurisdiction` column of `source_df_1` are in uppercase, while the values in the `Jurisdiction` column of `template_df` are in title case.\\n\\nThese are the key differences between the dataframe schemas.\"\n",
+ "}\n"
+ ]
+ }
+ ],
+ "source": [
+ "question = \"What are the key differences between the dataframe schemas?\"\n",
+ "res = agent_executor.run(input=question, chat_history=memory.chat_memory.messages)\n",
+ "memory.chat_memory.add_user_message(question)\n",
+ "memory.chat_memory.add_ai_message(res)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "get_differences_between_dataframes\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.6"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
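
The final cell above references get_differences_between_dataframes, which is never defined anywhere in this commit. A plausible plain-pandas sketch of such a helper (hypothetical: the name comes from the dangling cell, and the return shape is an assumption):

import pandas as pd

def get_differences_between_dataframes(df_a: pd.DataFrame, df_b: pd.DataFrame) -> dict:
    # Hypothetical helper: summarize the schema differences the agent was asked about.
    only_in_a = sorted(set(df_a.columns) - set(df_b.columns))
    only_in_b = sorted(set(df_b.columns) - set(df_a.columns))
    shared = set(df_a.columns) & set(df_b.columns)
    # Shared columns whose dtypes disagree between the two frames.
    dtype_mismatches = {
        col: (str(df_a[col].dtype), str(df_b[col].dtype))
        for col in shared
        if df_a[col].dtype != df_b[col].dtype
    }
    return {"only_in_a": only_in_a, "only_in_b": only_in_b, "dtype_mismatches": dtype_mismatches}
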
src/notebooks/brainstorm3.ipynb ADDED
The diff for this file is too large to render.
 
src/notebooks/brainstorm4.ipynb ADDED
@@ -0,0 +1,508 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import pandas as pd\n",
+ "import gradio as gr\n",
+ "from pydantic import BaseModel, Field\n",
+ "\n",
+ "import langchain\n",
+ "from langchain.output_parsers import PydanticOutputParser\n",
+ "from langchain.prompts import ChatPromptTemplate\n",
+ "from langchain.tools import PythonAstREPLTool\n",
+ "from langchain.chat_models import ChatOpenAI\n",
+ "from langchain.schema.output_parser import StrOutputParser"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "langchain.debug = False\n",
+ "# Throwaway key with strict usage limit\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"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\"\n",
+ "pd.set_option('display.max_columns', 20)\n",
+ "pd.set_option('display.max_rows', 20)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_dir_path = os.path.join(os.getcwd(), 'data')\n",
+ "NUM_ROWS_TO_RETURN = 5\n",
+ "\n",
+ "table_1_df = pd.read_csv(os.path.join(data_dir_path, 'legal_entries_a.csv'))\n",
+ "table_2_df = pd.read_csv(os.path.join(data_dir_path, 'legal_entries_b.csv'))\n",
+ "template_df = pd.read_csv(os.path.join(data_dir_path, 'legal_template.csv'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "transform_model = ChatOpenAI(\n",
+ "    model_name='gpt-4',\n",
+ "    temperature=0,\n",
+ ")\n",
+ "\n",
+ "natural_language_model = ChatOpenAI(\n",
+ "    model_name='gpt-4',\n",
+ "    temperature=0.1,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# TODO: add validation to models, coupled with retry mechanism in chain\n",
+ "class TableMappingEntry(BaseModel):\n",
+ "    '''A single row in a table mapping. Describes how a single column in a source table maps to a single column in a target table, including any necessary transformations, and their explanations.'''\n",
+ "    source_column_name: str = Field(..., description=\"Name of the column in the source table.\")\n",
+ "    target_column_name: str = Field(..., description=\"Name of the column in the target table, to which the source column maps.\")\n",
+ "    value_transformations: str = Field(..., description=\"Transformations needed to make the source values match the target values. If unnecessary, write 'NO_TRANSFORM'.\")\n",
+ "    explanation: str = Field(..., description=\"One-sentence explanation of this row (source-target mapping/transformation). Include any information that might be relevant to a software engineer building an ETL pipeline with this document.\")\n",
+ "\n",
+ "class TableMapping(BaseModel):\n",
+ "    '''A list of table mappings collectively describe how a source table should be transformed to match the schema of a target table.'''\n",
+ "    table_mappings: list[TableMappingEntry] = Field(..., description=\"A list of table mappings.\")\n",
+ "\n",
+ "analyst_prompt_str = '''\n",
+ "You are a Data Scientist, who specializes in generating schema mappings for use by Software Engineers in ETL pipelines.\n",
+ "\n",
+ "Head of `source_csv`:\n",
+ "\n",
+ "{source_1_csv_str}\n",
+ "\n",
+ "Head of `target_csv`:\n",
+ "\n",
+ "{target_csv_str}\n",
+ "\n",
+ "Your job is to generate a thorough, precise summary of how `source_csv` should be transformed to adhere exactly to the `target_csv` schema.\n",
+ "\n",
+ "For each column in the `source_csv`, you must communicate which column in the `target_csv` it maps to, and how the values in the `source_csv` column should be transformed to match those in the `target_csv`.\n",
+ "You can assume the rows are aligned: that is, the first row in `source_csv` corresponds to the first row in `target_csv`, and so on.\n",
+ "\n",
+ "Remember:\n",
+ "1. When deciding which column in `target_csv` a source column maps to, consider the semantic meaning of the columns, not just the character similarity.\n",
+ "\n",
+ "Example mappings:\n",
+ "- 'MunICipality' in `source_csv` should map to 'City' in `target_csv`.\n",
+ "- 'fullname' in `source_csv` should map to both 'FirstName' and 'LastName' in `target_csv`. You must explain this transformation, as well, including the target sequencing of first and last name.\n",
+ "\n",
+ "Example transformations:\n",
+ "- If date in `source_csv` is `2020-01-01` and date in `target_csv` is `01/01/2020`, explain exactly how this should be transformed and the reasoning behind it.\n",
+ "- If city in `source_csv` is `New York` and city in `target_csv` is `NEW YORK` or `NYC`, explain exactly how this should be transformed and the reasoning behind it.\n",
+ "\n",
+ "Lastly, point out any other oddities, such as duplicate columns, erroneous columns, etc.\n",
+ "\n",
+ "{format_instructions}\n",
+ "\n",
+ "Remember:\n",
+ "- Be concise: you are speaking to engineers, not customers.\n",
+ "- Be precise: all of these values are case sensitive. Consider casing for city names, exact prefixes for identifiers, ordering of people's names, etc.\n",
+ "- DO NOT include commas, quotes, or any other characters that might interfere with JSON serialization or CSV generation\n",
+ "\n",
+ "Your response:\n",
+ "'''\n",
+ "\n",
+ "def get_data_str_from_df_for_prompt(df, use_head=True, num_rows_to_return=NUM_ROWS_TO_RETURN):\n",
+ "    data = df.head(num_rows_to_return) if use_head else df.tail(num_rows_to_return)\n",
+ "    return f'<df>\\n{data.to_markdown()}\\n</df>'\n",
+ "\n",
+ "table_mapping_parser = PydanticOutputParser(pydantic_object=TableMapping)\n",
+ "analyst_prompt = ChatPromptTemplate.from_template(\n",
+ "    template=analyst_prompt_str,\n",
+ "    partial_variables={'format_instructions': table_mapping_parser.get_format_instructions()},\n",
+ ")\n",
+ "\n",
+ "mapping_chain = analyst_prompt | transform_model | table_mapping_parser\n",
+ "table_mapping: TableMapping = mapping_chain.invoke({\"source_1_csv_str\": get_data_str_from_df_for_prompt(table_1_df), \"target_csv_str\": get_data_str_from_df_for_prompt(template_df)})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# spec writer\n",
+ "spec_writer_prompt_str = '''\n",
+ "You are an expert product manager and technical writer for a software company, who generates clean, concise, precise specification documents for your employees.\n",
+ "Your job is to write a plaintext spec for a python script for a software engineer to develop a component within an ETL pipeline.\n",
+ "\n",
+ "This document must include 100% of the information your employee needs to write a successful script to transform source_df to target_df.\n",
+ "However, DO NOT include the original table_mapping. Your job is to translate everything into natural language.\n",
+ "\n",
+ "Here is a stringified pydantic object that describes the mapping and the transformation steps:\n",
+ "\n",
+ "{table_mapping}\n",
+ "\n",
+ "You must translate this into clean, concise, and complete instructions for your employee.\n",
+ "\n",
+ "This document should be formatted like a technical document in plaintext. Do not include code or data.\n",
+ "\n",
+ "This document must include:\n",
+ "- Overview\n",
+ "- Input (source_df), Output (target_df)\n",
+ "- Exact column mapping\n",
+ "- Exact transformation steps for each column\n",
+ "- Precise instructions for what this script should do\n",
+ "- Script input: Pandas Dataframe named `source_df`.\n",
+ "- Script output: Pandas Dataframe named `target_df`.\n",
+ "- Do not modify the source_df. Create a new dataframe named target_df.\n",
+ "- This script should never include the source data. It should only include the transformations required to create the target_df.\n",
+ "- Return the target_df.\n",
+ "\n",
+ "You will never see this employee. They cannot contact you. You will never see their code. You must include 100% of the information they need to write a successful script.\n",
+ "Remember:\n",
+ "- Clean: No extra information, no formatting aside from plaintext\n",
+ "- Concise: Your employees benefit from brevity\n",
+ "- Precise: your words must be unambiguous, exact, and fully represent a perfect translation of the table_mapping object.\n",
+ "\n",
+ "Your response:\n",
+ "'''\n",
+ "spec_writer_prompt = ChatPromptTemplate.from_template(spec_writer_prompt_str)\n",
+ "\n",
+ "spec_writer_chain = spec_writer_prompt | natural_language_model | StrOutputParser()\n",
+ "spec_str = spec_writer_chain.invoke({\"table_mapping\": str(table_mapping)})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "engineer_prompt_str = '''\n",
+ "You are a Senior Software Engineer, who specializes in writing Python code for ETL pipelines.\n",
+ "Your Product Manager has written a spec for a new transformation script. You must follow this document exactly, write python code that implements the spec, validate that code, and then return it.\n",
+ "Your output should only be python code in Markdown format, eg:\n",
+ "    ```python\n",
+ "    ....\n",
+ "    ```\n",
+ "Do not return any additional text / explanation. This code will be executed by a robot without human intervention.\n",
+ "\n",
+ "Here is the technical specification for your code:\n",
+ "\n",
+ "{spec_str}\n",
+ "\n",
+ "Remember: return only clean python code in markdown format. The python interpreter running this code will already have `source_df` as a local variable.\n",
+ "\n",
+ "You must return `target_df` at the end.\n",
+ "'''\n",
+ "engineer_prompt = ChatPromptTemplate.from_template(engineer_prompt_str)\n",
+ "\n",
+ "# engineer_chain = engineer_prompt | transform_model | StrOutputParser() | PythonAstREPLTool(locals={'source_df': table_1_df}).run\n",
+ "# table_1_df_transformed = engineer_chain.invoke({\"spec_str\": spec_str})\n",
+ "engineer_chain = engineer_prompt | transform_model | StrOutputParser()\n",
+ "transform_code = engineer_chain.invoke({\"spec_str\": spec_str})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Running on local URL:  http://127.0.0.1:7874\n",
+ "\n",
+ "To create a public link, set `share=True` in `launch()`.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div><iframe src=\"http://127.0.0.1:7874/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def show_mapping(file):\n",
+ "    # TODO: add code\n",
+ "    return pd.DataFrame(table_mapping.dict()['table_mappings'])\n",
+ "demo = gr.Interface(fn=show_mapping, inputs=[\"file\"], outputs='dataframe')\n",
+ "demo.launch()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Running on local URL:  http://127.0.0.1:7885\n",
+ "\n",
+ "Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB\n",
+ "\n",
+ "To create a public link, set `share=True` in `launch()`.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div><iframe src=\"http://127.0.0.1:7885/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def _sanitize_python_output(text: str):\n",
+ "    _, after = text.split(\"```python\")\n",
+ "    return after.split(\"```\")[0]\n",
+ "\n",
+ "def show_code(button):\n",
+ "    # TODO: add code\n",
+ "    return _sanitize_python_output(transform_code)\n",
+ "check_mapping_text = 'How does that mapping look? \\n\\nFeel free to update it: your changes will be incorporated! \\n\\nWhen you are ready, click the Submit button below, and the mapping code will be generated for your approval.'\n",
+ "demo = gr.Interface(fn=show_code, inputs=[gr.Textbox(value=check_mapping_text, interactive=False)], outputs=[gr.Code(language=\"python\")])\n",
+ "demo.launch()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/4236222443.py:4: GradioDeprecationWarning: `layout` parameter is deprecated, and it has no effect\n",
+ "  demo = gr.Interface(\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Running on local URL:  http://127.0.0.1:7892\n",
+ "\n",
+ "To create a public link, set `share=True` in `launch()`.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div><iframe src=\"http://127.0.0.1:7892/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def get_transformed_table(button):\n",
+ "    return template_df, PythonAstREPLTool(locals={'source_df': table_1_df}).run(transform_code)\n",
+ "check_mapping_text = 'How does that code look? \\n\\nWhen you are ready, click the Submit button and the source file will be transformed.'\n",
+ "demo = gr.Interface(\n",
+ "    fn=get_transformed_table,\n",
+ "    inputs=[gr.Textbox(value=check_mapping_text, interactive=False)],\n",
+ "    outputs=[gr.Dataframe(label='Template Table (target)'), gr.Dataframe(label='Table 1 (transformed)')],\n",
+ "    layout=\"column\",\n",
+ "    examples=[[1]],\n",
+ ")\n",
+ "demo.launch()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 89,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:18: GradioDeprecationWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components\n",
+ "  gr.inputs.File(label=\"Template\", type=\"file\", file_count='single')\n",
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:18: GradioDeprecationWarning: `optional` parameter is deprecated, and it has no effect\n",
+ "  gr.inputs.File(label=\"Template\", type=\"file\", file_count='single')\n",
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:18: GradioDeprecationWarning: `keep_filename` parameter is deprecated, and it has no effect\n",
+ "  gr.inputs.File(label=\"Template\", type=\"file\", file_count='single')\n",
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:19: GradioDeprecationWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components\n",
+ "  gr.inputs.File(label=\"Source\", type=\"file\", file_count='single')\n",
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:19: GradioDeprecationWarning: `optional` parameter is deprecated, and it has no effect\n",
+ "  gr.inputs.File(label=\"Source\", type=\"file\", file_count='single')\n",
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:19: GradioDeprecationWarning: `keep_filename` parameter is deprecated, and it has no effect\n",
+ "  gr.inputs.File(label=\"Source\", type=\"file\", file_count='single')\n",
+ "/Users/andybryant/Desktop/projects/zero-mapper/venv/lib/python3.9/site-packages/gradio/utils.py:841: UserWarning: Expected 1 arguments for function <function generate_code at 0x12cb559d0>, received 0.\n",
+ "  warnings.warn(\n",
+ "/Users/andybryant/Desktop/projects/zero-mapper/venv/lib/python3.9/site-packages/gradio/utils.py:845: UserWarning: Expected at least 1 arguments for function <function generate_code at 0x12cb559d0>, received 0.\n",
+ "  warnings.warn(\n",
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:39: GradioUnusedKwargWarning: You have unused kwarg parameters in Button, please remove them: {'trigger': 'transform_source'}\n",
+ "  gr.Button(value=\"Transform Source\", variant=\"primary\", trigger=\"transform_source\")\n",
+ "/var/folders/lx/3ksh07r96gn2v7b8mb__3mpc0000gn/T/ipykernel_94012/2180252060.py:40: GradioUnusedKwargWarning: You have unused kwarg parameters in Button, please remove them: {'trigger': 'save_code'}\n",
+ "  gr.Button(value=\"Save Code\", variant=\"secondary\", trigger=\"save_code\")\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Running on local URL:  http://127.0.0.1:7934\n",
+ "\n",
+ "To create a public link, set `share=True` in `launch()`.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div><iframe src=\"http://127.0.0.1:7934/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 89,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def _sanitize_python_output(text: str):\n",
+ "    _, after = text.split(\"```python\")\n",
+ "    return after.split(\"```\")[0]\n",
+ "\n",
+ "def do_stuff(val):\n",
+ "    print(val)\n",
+ "\n",
+ "def generate_code(val):\n",
+ "    return '# check this out'\n",
+ "\n",
+ "def save_csv_file(df, filename):\n",
+ "    df.to_csv(os.path.join(data_dir_path, 'output', filename) + '.csv')\n",
+ "\n",
+ "with gr.Blocks() as demo:\n",
+ "    with gr.Column():\n",
+ "        gr.Markdown(\"## To begin, upload a Template CSV and a Source CSV file.\")\n",
+ "        with gr.Row():\n",
+ "            gr.inputs.File(label=\"Template\", type=\"file\", file_count='single')\n",
+ "            gr.inputs.File(label=\"Source\", type=\"file\", file_count='single')\n",
+ "\n",
+ "    with gr.Column():\n",
+ "        gr.Markdown(\"## Mapping from Source to Template\")\n",
+ "        with gr.Row():\n",
+ "            table_mapping_df = pd.DataFrame(table_mapping.dict()['table_mappings'])\n",
+ "            gr.DataFrame(value=table_mapping_df)\n",
+ "            save_mapping_btn = gr.Button(value=\"Save Mapping\", variant=\"secondary\")\n",
+ "            save_mapping_btn.click(fn=lambda: save_csv_file(table_mapping_df, 'table_mapping'))\n",
+ "\n",
+ "        with gr.Row():\n",
+ "            test = gr.Markdown()\n",
+ "            generate_code_btn = gr.Button(value=\"Generate Code from Mapping\", variant=\"primary\")\n",
+ "            generate_code_btn.click(fn=generate_code, outputs=test)\n",
+ "\n",
+ "    with gr.Column():\n",
+ "        gr.Markdown(\"## Here is the code that will be used to transform the source file into the template schema:\")\n",
+ "        gr.Code(language=\"python\", value=_sanitize_python_output(transform_code))\n",
+ "\n",
+ "        with gr.Row():\n",
+ "            gr.Button(value=\"Transform Source\", variant=\"primary\", trigger=\"transform_source\")\n",
+ "            gr.Button(value=\"Save Code\", variant=\"secondary\", trigger=\"save_code\")\n",
+ "\n",
+ "        with gr.Row():\n",
+ "            with gr.Column():\n",
+ "                gr.Dataframe(label='Target (template)', type='pandas', value=template_df)\n",
+ "            with gr.Column():\n",
+ "                gr.Dataframe(label='Source (transformed)', type='pandas', value=PythonAstREPLTool(locals={'source_df': table_1_df}).run(transform_code))\n",
+ "\n",
+ "demo.launch()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.6"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
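
One step in brainstorm4 worth isolating is how the model-generated transform is executed: the LLM reply is stripped down to the body of its ```python fence, then run with `source_df` injected as a REPL local, so the script's final `target_df` expression becomes the return value. A condensed sketch of that step, using the same langchain tool the notebook imports (a sketch, assuming the notebook's langchain version):

from langchain.tools import PythonAstREPLTool

def run_generated_transform(llm_reply: str, source_df):
    # Same idea as _sanitize_python_output: keep only the fenced code body.
    code = llm_reply.split("```python")[1].split("```")[0]
    # The REPL tool evaluates the snippet and returns the value of its final expression.
    repl = PythonAstREPLTool(locals={"source_df": source_df})
    return repl.run(code)
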
src/notebooks/data/legal_entries_a.csv ADDED
@@ -0,0 +1,101 @@
+ case_date,lastname,firstname,case_type,case_id,court_fee,jurisdiction,judge_last_name
+ 2023-01-16,Okafor,Jane,Civil,CR-6190,250,BOSTON,Connor
+ 2023-08-10,Malcolm,Elena,Civil,CR-3092,100,chicago,James
+ 2023-06-14,Nasser,Alan,Civil,CR-5947,150,BOSTON,Skaarsgard
+ 2023-07-17,Smith,Miguel,Family,CR-7727,250,LOS angeles,Brown
+ 2023-07-25,Kim,John,Criminal,CR-4120,150,BOSTON,Skaarsgard
+ 2023-07-14,Brown,John,Civil,CR-8850,100,LOS angeles,Brown
+ 2023-01-19,Nasser,Dmitri,Criminal,CR-2308,100,chicago,Connor
+ 2023-02-26,Rodriguez,Alan,Criminal,CR-4477,100,chicago,Morgan
+ 2023-02-10,Brown,Alice,Criminal,CR-9490,200,chicago,Morgan
+ 2023-09-12,Smith,Nadia,Family,CR-4111,100,LOS angeles,Skaarsgard
+ 2023-02-25,Kim,Chen,Criminal,CR-9421,150,BOSTON,Oleg
+ 2023-09-15,Kim,John,Family,CR-3270,200,houston,Morgan
+ 2023-07-22,Patel,Nadia,Family,CR-1501,200,houston,Skaarsgard
+ 2023-01-27,Lee,Lakshmi,Family,CR-8321,150,houston,Brown
+ 2023-01-14,Brown,John,Family,CR-2748,100,LOS angeles,James
+ 2023-07-13,Malcolm,Miguel,Family,CR-3163,100,LOS angeles,Skaarsgard
+ 2023-02-26,Smith,Alice,Civil,CR-4296,150,BOSTON,James
+ 2023-09-25,Patel,Terrance,Criminal,CR-2230,200,houston,Morgan
+ 2023-02-13,Ivanov,Alan,Family,CR-9353,100,new York,Morgan
+ 2023-04-18,Chatterjee,Alice,Civil,CR-8786,100,chicago,Skaarsgard
+ 2023-09-11,Brown,Jane,Criminal,CR-6001,100,LOS angeles,Connor
+ 2023-02-16,Okafor,Jane,Criminal,CR-9434,250,BOSTON,Oleg
+ 2023-07-22,Chatterjee,Dmitri,Criminal,CR-1042,100,BOSTON,Brown
+ 2023-08-28,Smith,Miguel,Family,CR-1427,150,LOS angeles,Brown
+ 2023-06-14,Johnson,Miguel,Civil,CR-7553,200,chicago,Skaarsgard
+ 2023-02-24,Ivanov,Chen,Civil,CR-2242,250,LOS angeles,Connor
+ 2023-06-23,Rodriguez,Terrance,Criminal,CR-6940,250,houston,Connor
+ 2023-01-10,Johnson,Elena,Civil,CR-4064,150,houston,Oleg
+ 2023-01-15,Patel,Chen,Civil,CR-3129,100,new York,Morgan
+ 2023-08-16,Malcolm,Oluwaseun,Civil,CR-2758,150,BOSTON,Connor
+ 2023-02-24,Ivanov,Lakshmi,Criminal,CR-9562,250,BOSTON,Brown
+ 2023-05-15,Okafor,Terrance,Criminal,CR-2292,250,BOSTON,Morgan
+ 2023-06-26,Patel,Jane,Criminal,CR-7889,250,LOS angeles,Brown
+ 2023-02-14,Rodriguez,John,Family,CR-5178,150,houston,Morgan
+ 2023-05-15,Patel,Terrance,Civil,CR-5004,150,houston,Morgan
+ 2023-03-19,Johnson,Alice,Family,CR-2883,200,new York,Morgan
+ 2023-02-12,Rodriguez,Alan,Family,CR-4416,200,BOSTON,Brown
+ 2023-07-25,Malcolm,Chen,Civil,CR-9332,200,houston,Morgan
+ 2023-09-15,Chatterjee,Miguel,Civil,CR-7699,250,BOSTON,Skaarsgard
+ 2023-03-13,Lee,Nadia,Civil,CR-7258,100,new York,James
+ 2023-05-27,Brown,Nadia,Civil,CR-7490,200,houston,Morgan
+ 2023-02-22,Johnson,Alice,Civil,CR-8231,100,chicago,Connor
+ 2023-03-18,Malcolm,Nadia,Criminal,CR-2720,100,new York,Oleg
+ 2023-06-11,Brown,Nadia,Criminal,CR-4277,100,BOSTON,Connor
+ 2023-02-22,Okafor,Oluwaseun,Criminal,CR-9738,100,new York,Skaarsgard
+ 2023-08-19,Patel,Jane,Civil,CR-2452,250,BOSTON,Skaarsgard
+ 2023-09-27,Lee,Alan,Family,CR-1899,100,new York,Skaarsgard
+ 2023-04-21,Malcolm,Dmitri,Family,CR-8404,150,LOS angeles,Brown
+ 2023-03-10,Chatterjee,Alice,Family,CR-4240,100,LOS angeles,Morgan
+ 2023-05-13,Kim,Elena,Family,CR-6153,250,chicago,Brown
+ 2023-09-10,Patel,Alan,Criminal,CR-3485,200,chicago,Oleg
+ 2023-08-18,Kim,Lakshmi,Criminal,CR-5520,200,LOS angeles,James
+ 2023-02-21,Patel,Alan,Criminal,CR-9879,250,LOS angeles,Brown
+ 2023-05-12,Brown,Jane,Criminal,CR-5259,200,new York,James
+ 2023-01-20,Patel,Oluwaseun,Criminal,CR-8333,100,BOSTON,James
+ 2023-01-23,Nasser,Chen,Civil,CR-2711,200,LOS angeles,Morgan
+ 2023-03-12,Brown,Miguel,Family,CR-5100,100,LOS angeles,Morgan
+ 2023-01-15,Rodriguez,Terrance,Criminal,CR-4849,100,LOS angeles,Morgan
+ 2023-05-17,Lee,Jane,Criminal,CR-8058,150,new York,Skaarsgard
+ 2023-04-18,Okafor,Chen,Civil,CR-9076,100,new York,Brown
+ 2023-02-22,Chatterjee,Lakshmi,Criminal,CR-5230,200,BOSTON,Morgan
63
+ 2023-08-18,Brown,John,Criminal,CR-7094,200,LOS angeles,Connor
64
+ 2023-08-17,Lee,Oluwaseun,Civil,CR-8915,150,BOSTON,Oleg
65
+ 2023-08-18,Malcolm,Alan,Family,CR-9030,100,chicago,Brown
66
+ 2023-02-13,Malcolm,Chen,Criminal,CR-1482,150,houston,Morgan
67
+ 2023-02-16,Brown,John,Criminal,CR-3535,100,BOSTON,Morgan
68
+ 2023-08-20,Johnson,Chen,Criminal,CR-2029,250,houston,James
69
+ 2023-01-10,Kim,Alan,Civil,CR-1812,250,houston,Oleg
70
+ 2023-02-18,Chatterjee,Alice,Civil,CR-5295,150,chicago,Oleg
71
+ 2023-08-25,Lee,Miguel,Criminal,CR-6850,150,LOS angeles,Brown
72
+ 2023-05-12,Malcolm,Alan,Criminal,CR-7973,150,BOSTON,Connor
73
+ 2023-05-19,Johnson,Chen,Family,CR-5221,200,houston,Skaarsgard
74
+ 2023-06-17,Okafor,John,Criminal,CR-4117,250,BOSTON,Oleg
75
+ 2023-03-18,Patel,Elena,Family,CR-2368,100,houston,Skaarsgard
76
+ 2023-06-22,Rodriguez,Lakshmi,Family,CR-8384,200,new York,Oleg
77
+ 2023-07-14,Smith,Miguel,Civil,CR-4476,100,new York,Morgan
78
+ 2023-03-26,Brown,Chen,Civil,CR-4545,100,houston,Morgan
79
+ 2023-06-22,Chatterjee,Dmitri,Civil,CR-4421,250,houston,Skaarsgard
80
+ 2023-03-20,Patel,Miguel,Criminal,CR-6559,150,new York,Morgan
81
+ 2023-07-11,Kim,Oluwaseun,Civil,CR-1803,250,BOSTON,Brown
82
+ 2023-03-13,Okafor,Elena,Civil,CR-8622,150,new York,Brown
83
+ 2023-05-27,Lee,Alice,Criminal,CR-9488,200,LOS angeles,Connor
84
+ 2023-05-14,Patel,Alice,Civil,CR-4581,150,chicago,Brown
85
+ 2023-06-27,Malcolm,Terrance,Criminal,CR-2388,250,chicago,James
86
+ 2023-02-13,Ivanov,Terrance,Criminal,CR-6529,100,LOS angeles,Connor
87
+ 2023-01-21,Patel,Terrance,Family,CR-4443,150,BOSTON,Oleg
88
+ 2023-09-22,Malcolm,John,Civil,CR-8721,200,new York,Skaarsgard
89
+ 2023-02-12,Malcolm,Miguel,Family,CR-3780,250,new York,Oleg
90
+ 2023-04-26,Kim,Alan,Criminal,CR-2663,250,new York,Connor
91
+ 2023-03-16,Ivanov,Lakshmi,Criminal,CR-8702,150,LOS angeles,Morgan
92
+ 2023-07-22,Ivanov,Jane,Criminal,CR-1232,100,BOSTON,Connor
93
+ 2023-05-28,Okafor,Nadia,Family,CR-5215,150,houston,Brown
94
+ 2023-03-14,Okafor,Oluwaseun,Criminal,CR-6631,250,BOSTON,Skaarsgard
95
+ 2023-06-17,Nasser,Alan,Civil,CR-1405,150,BOSTON,Morgan
96
+ 2023-08-13,Kim,Oluwaseun,Civil,CR-8816,100,LOS angeles,James
97
+ 2023-07-20,Brown,Oluwaseun,Family,CR-2665,150,new York,Morgan
98
+ 2023-05-16,Patel,Alan,Family,CR-2874,100,new York,Connor
99
+ 2023-07-15,Chatterjee,Nadia,Family,CR-2037,100,houston,James
100
+ 2023-04-18,Johnson,Dmitri,Criminal,CR-5402,200,houston,Morgan
101
+ 2023-08-14,Johnson,Chen,Civil,CR-3569,250,BOSTON,Morgan
src/notebooks/data/legal_entries_b.csv ADDED
@@ -0,0 +1,101 @@
1
+ Date_of_Case,Fee,FullName,CaseNumber,CaseKind,Location,Weather
2
+ 2023/01/16,250.0,Jane Okafor,case--6190,Civil,BOSTO,snowy
3
+ 2023/08/10,100.0,Elena Malcolm,case--3092,Civil,CHIC,sunny
4
+ 2023/06/14,150.0,Alan Nasser,case--5947,Civil,BOSTO,rainy
5
+ 2023/07/17,250.0,Miguel Smith,case--7727,Family,LOSAN,snowy
6
+ 2023/07/25,150.0,John Kim,case--4120,Criminal,BOSTO,rainy
7
+ 2023/07/14,100.0,John Brown,case--8850,Civil,LOSAN,snowy
8
+ 2023/01/19,100.0,Dmitri Nasser,case--2308,Criminal,CHIC,sunny
9
+ 2023/02/26,100.0,Alan Rodriguez,case--4477,Criminal,CHIC,cloudy
10
+ 2023/02/10,200.0,Alice Brown,case--9490,Criminal,CHICA,snowy
11
+ 2023/09/12,100.0,Nadia Smith,case--4111,Family,LOSA,sunny
12
+ 2023/02/25,150.0,Chen Kim,case--9421,Criminal,BOST,sunny
13
+ 2023/09/15,200.0,John Kim,case--3270,Family,HOUST,sunny
14
+ 2023/07/22,200.0,Nadia Patel,case--1501,Family,HOUS,cloudy
15
+ 2023/01/27,150.0,Lakshmi Lee,case--8321,Family,HOUST,snowy
16
+ 2023/01/14,100.0,John Brown,case--2748,Family,LOSAN,snowy
17
+ 2023/07/13,100.0,Miguel Malcolm,case--3163,Family,LOSA,rainy
18
+ 2023/02/26,150.0,Alice Smith,case--4296,Civil,BOSTO,cloudy
19
+ 2023/09/25,200.0,Terrance Patel,case--2230,Criminal,HOUS,snowy
20
+ 2023/02/13,100.0,Alan Ivanov,case--9353,Family,NEWY,sunny
21
+ 2023/04/18,100.0,Alice Chatterjee,case--8786,Civil,CHICA,rainy
22
+ 2023/09/11,100.0,Jane Brown,case--6001,Criminal,LOSA,snowy
23
+ 2023/02/16,250.0,Jane Okafor,case--9434,Criminal,BOST,snowy
24
+ 2023/07/22,100.0,Dmitri Chatterjee,case--1042,Criminal,BOST,rainy
25
+ 2023/08/28,150.0,Miguel Smith,case--1427,Family,LOSA,cloudy
26
+ 2023/06/14,200.0,Miguel Johnson,case--7553,Civil,CHIC,sunny
27
+ 2023/02/24,250.0,Chen Ivanov,case--2242,Civil,LOSA,rainy
28
+ 2023/06/23,250.0,Terrance Rodriguez,case--6940,Criminal,HOUST,rainy
29
+ 2023/01/10,150.0,Elena Johnson,case--4064,Civil,HOUS,rainy
30
+ 2023/01/15,100.0,Chen Patel,case--3129,Civil,NEWY,rainy
31
+ 2023/08/16,150.0,Oluwaseun Malcolm,case--2758,Civil,BOSTO,snowy
32
+ 2023/02/24,250.0,Lakshmi Ivanov,case--9562,Criminal,BOSTO,sunny
33
+ 2023/05/15,250.0,Terrance Okafor,case--2292,Criminal,BOST,rainy
34
+ 2023/06/26,250.0,Jane Patel,case--7889,Criminal,LOSAN,cloudy
35
+ 2023/02/14,150.0,John Rodriguez,case--5178,Family,HOUS,sunny
36
+ 2023/05/15,150.0,Terrance Patel,case--5004,Civil,HOUST,snowy
37
+ 2023/03/19,200.0,Alice Johnson,case--2883,Family,NEWYO,snowy
38
+ 2023/02/12,200.0,Alan Rodriguez,case--4416,Family,BOSTO,rainy
39
+ 2023/07/25,200.0,Chen Malcolm,case--9332,Civil,HOUS,snowy
40
+ 2023/09/15,250.0,Miguel Chatterjee,case--7699,Civil,BOST,rainy
41
+ 2023/03/13,100.0,Nadia Lee,case--7258,Civil,NEWYO,snowy
42
+ 2023/05/27,200.0,Nadia Brown,case--7490,Civil,HOUS,snowy
43
+ 2023/02/22,100.0,Alice Johnson,case--8231,Civil,CHIC,cloudy
44
+ 2023/03/18,100.0,Nadia Malcolm,case--2720,Criminal,NEWY,cloudy
45
+ 2023/06/11,100.0,Nadia Brown,case--4277,Criminal,BOST,snowy
46
+ 2023/02/22,100.0,Oluwaseun Okafor,case--9738,Criminal,NEWYO,snowy
47
+ 2023/08/19,250.0,Jane Patel,case--2452,Civil,BOSTO,snowy
48
+ 2023/09/27,100.0,Alan Lee,case--1899,Family,NEWY,rainy
49
+ 2023/04/21,150.0,Dmitri Malcolm,case--8404,Family,LOSAN,rainy
50
+ 2023/03/10,100.0,Alice Chatterjee,case--4240,Family,LOSA,snowy
51
+ 2023/05/13,250.0,Elena Kim,case--6153,Family,CHIC,rainy
52
+ 2023/09/10,200.0,Alan Patel,case--3485,Criminal,CHIC,cloudy
53
+ 2023/08/18,200.0,Lakshmi Kim,case--5520,Criminal,LOSAN,sunny
54
+ 2023/02/21,250.0,Alan Patel,case--9879,Criminal,LOSA,sunny
55
+ 2023/05/12,200.0,Jane Brown,case--5259,Criminal,NEWYO,rainy
56
+ 2023/01/20,100.0,Oluwaseun Patel,case--8333,Criminal,BOSTO,cloudy
57
+ 2023/01/23,200.0,Chen Nasser,case--2711,Civil,LOSAN,sunny
58
+ 2023/03/12,100.0,Miguel Brown,case--5100,Family,LOSAN,sunny
59
+ 2023/01/15,100.0,Terrance Rodriguez,case--4849,Criminal,LOSAN,rainy
60
+ 2023/05/17,150.0,Jane Lee,case--8058,Criminal,NEWY,cloudy
61
+ 2023/04/18,100.0,Chen Okafor,case--9076,Civil,NEWYO,sunny
62
+ 2023/02/22,200.0,Lakshmi Chatterjee,case--5230,Criminal,BOST,rainy
63
+ 2023/08/18,200.0,John Brown,case--7094,Criminal,LOSA,cloudy
64
+ 2023/08/17,150.0,Oluwaseun Lee,case--8915,Civil,BOSTO,sunny
65
+ 2023/08/18,100.0,Alan Malcolm,case--9030,Family,CHIC,sunny
66
+ 2023/02/13,150.0,Chen Malcolm,case--1482,Criminal,HOUS,cloudy
67
+ 2023/02/16,100.0,John Brown,case--3535,Criminal,BOST,rainy
68
+ 2023/08/20,250.0,Chen Johnson,case--2029,Criminal,HOUST,sunny
69
+ 2023/01/10,250.0,Alan Kim,case--1812,Civil,HOUST,sunny
70
+ 2023/02/18,150.0,Alice Chatterjee,case--5295,Civil,CHICA,snowy
71
+ 2023/08/25,150.0,Miguel Lee,case--6850,Criminal,LOSA,sunny
72
+ 2023/05/12,150.0,Alan Malcolm,case--7973,Criminal,BOST,cloudy
73
+ 2023/05/19,200.0,Chen Johnson,case--5221,Family,HOUS,snowy
74
+ 2023/06/17,250.0,John Okafor,case--4117,Criminal,BOSTO,sunny
75
+ 2023/03/18,100.0,Elena Patel,case--2368,Family,HOUST,rainy
76
+ 2023/06/22,200.0,Lakshmi Rodriguez,case--8384,Family,NEWY,cloudy
77
+ 2023/07/14,100.0,Miguel Smith,case--4476,Civil,NEWYO,cloudy
78
+ 2023/03/26,100.0,Chen Brown,case--4545,Civil,HOUST,snowy
79
+ 2023/06/22,250.0,Dmitri Chatterjee,case--4421,Civil,HOUS,snowy
80
+ 2023/03/20,150.0,Miguel Patel,case--6559,Criminal,NEWYO,snowy
81
+ 2023/07/11,250.0,Oluwaseun Kim,case--1803,Civil,BOSTO,sunny
82
+ 2023/03/13,150.0,Elena Okafor,case--8622,Civil,NEWYO,cloudy
83
+ 2023/05/27,200.0,Alice Lee,case--9488,Criminal,LOSAN,cloudy
84
+ 2023/05/14,150.0,Alice Patel,case--4581,Civil,CHICA,sunny
85
+ 2023/06/27,250.0,Terrance Malcolm,case--2388,Criminal,CHIC,sunny
86
+ 2023/02/13,100.0,Terrance Ivanov,case--6529,Criminal,LOSAN,snowy
87
+ 2023/01/21,150.0,Terrance Patel,case--4443,Family,BOST,sunny
88
+ 2023/09/22,200.0,John Malcolm,case--8721,Civil,NEWYO,snowy
89
+ 2023/02/12,250.0,Miguel Malcolm,case--3780,Family,NEWYO,cloudy
90
+ 2023/04/26,250.0,Alan Kim,case--2663,Criminal,NEWY,rainy
91
+ 2023/03/16,150.0,Lakshmi Ivanov,case--8702,Criminal,LOSA,snowy
92
+ 2023/07/22,100.0,Jane Ivanov,case--1232,Criminal,BOSTO,rainy
93
+ 2023/05/28,150.0,Nadia Okafor,case--5215,Family,HOUS,cloudy
94
+ 2023/03/14,250.0,Oluwaseun Okafor,case--6631,Criminal,BOST,rainy
95
+ 2023/06/17,150.0,Alan Nasser,case--1405,Civil,BOST,snowy
96
+ 2023/08/13,100.0,Oluwaseun Kim,case--8816,Civil,LOSAN,cloudy
97
+ 2023/07/20,150.0,Oluwaseun Brown,case--2665,Family,NEWYO,sunny
98
+ 2023/05/16,100.0,Alan Patel,case--2874,Family,NEWYO,sunny
99
+ 2023/07/15,100.0,Nadia Chatterjee,case--2037,Family,HOUST,rainy
100
+ 2023/04/18,200.0,Dmitri Johnson,case--5402,Criminal,HOUS,snowy
101
+ 2023/08/14,250.0,Chen Johnson,case--3569,Civil,BOST,sunny
src/notebooks/data/legal_template.csv ADDED
@@ -0,0 +1,101 @@
1
+ CaseDate,FullName,CaseType,CaseID,Fee,Jurisdiction
2
+ 2023-01-16,Jane Okafor,Civil,CASE-6190,250,Boston
3
+ 2023-08-10,Elena Malcolm,Civil,CASE-3092,100,Chicago
4
+ 2023-06-14,Alan Nasser,Civil,CASE-5947,150,Boston
5
+ 2023-07-17,Miguel Smith,Family,CASE-7727,250,Los Angeles
6
+ 2023-07-25,John Kim,Criminal,CASE-4120,150,Boston
7
+ 2023-07-14,John Brown,Civil,CASE-8850,100,Los Angeles
8
+ 2023-01-19,Dmitri Nasser,Criminal,CASE-2308,100,Chicago
9
+ 2023-02-26,Alan Rodriguez,Criminal,CASE-4477,100,Chicago
10
+ 2023-02-10,Alice Brown,Criminal,CASE-9490,200,Chicago
11
+ 2023-09-12,Nadia Smith,Family,CASE-4111,100,Los Angeles
12
+ 2023-02-25,Chen Kim,Criminal,CASE-9421,150,Boston
13
+ 2023-09-15,John Kim,Family,CASE-3270,200,Houston
14
+ 2023-07-22,Nadia Patel,Family,CASE-1501,200,Houston
15
+ 2023-01-27,Lakshmi Lee,Family,CASE-8321,150,Houston
16
+ 2023-01-14,John Brown,Family,CASE-2748,100,Los Angeles
17
+ 2023-07-13,Miguel Malcolm,Family,CASE-3163,100,Los Angeles
18
+ 2023-02-26,Alice Smith,Civil,CASE-4296,150,Boston
19
+ 2023-09-25,Terrance Patel,Criminal,CASE-2230,200,Houston
20
+ 2023-02-13,Alan Ivanov,Family,CASE-9353,100,New York
21
+ 2023-04-18,Alice Chatterjee,Civil,CASE-8786,100,Chicago
22
+ 2023-09-11,Jane Brown,Criminal,CASE-6001,100,Los Angeles
23
+ 2023-02-16,Jane Okafor,Criminal,CASE-9434,250,Boston
24
+ 2023-07-22,Dmitri Chatterjee,Criminal,CASE-1042,100,Boston
25
+ 2023-08-28,Miguel Smith,Family,CASE-1427,150,Los Angeles
26
+ 2023-06-14,Miguel Johnson,Civil,CASE-7553,200,Chicago
27
+ 2023-02-24,Chen Ivanov,Civil,CASE-2242,250,Los Angeles
28
+ 2023-06-23,Terrance Rodriguez,Criminal,CASE-6940,250,Houston
29
+ 2023-01-10,Elena Johnson,Civil,CASE-4064,150,Houston
30
+ 2023-01-15,Chen Patel,Civil,CASE-3129,100,New York
31
+ 2023-08-16,Oluwaseun Malcolm,Civil,CASE-2758,150,Boston
32
+ 2023-02-24,Lakshmi Ivanov,Criminal,CASE-9562,250,Boston
33
+ 2023-05-15,Terrance Okafor,Criminal,CASE-2292,250,Boston
34
+ 2023-06-26,Jane Patel,Criminal,CASE-7889,250,Los Angeles
35
+ 2023-02-14,John Rodriguez,Family,CASE-5178,150,Houston
36
+ 2023-05-15,Terrance Patel,Civil,CASE-5004,150,Houston
37
+ 2023-03-19,Alice Johnson,Family,CASE-2883,200,New York
38
+ 2023-02-12,Alan Rodriguez,Family,CASE-4416,200,Boston
39
+ 2023-07-25,Chen Malcolm,Civil,CASE-9332,200,Houston
40
+ 2023-09-15,Miguel Chatterjee,Civil,CASE-7699,250,Boston
41
+ 2023-03-13,Nadia Lee,Civil,CASE-7258,100,New York
42
+ 2023-05-27,Nadia Brown,Civil,CASE-7490,200,Houston
43
+ 2023-02-22,Alice Johnson,Civil,CASE-8231,100,Chicago
44
+ 2023-03-18,Nadia Malcolm,Criminal,CASE-2720,100,New York
45
+ 2023-06-11,Nadia Brown,Criminal,CASE-4277,100,Boston
46
+ 2023-02-22,Oluwaseun Okafor,Criminal,CASE-9738,100,New York
47
+ 2023-08-19,Jane Patel,Civil,CASE-2452,250,Boston
48
+ 2023-09-27,Alan Lee,Family,CASE-1899,100,New York
49
+ 2023-04-21,Dmitri Malcolm,Family,CASE-8404,150,Los Angeles
50
+ 2023-03-10,Alice Chatterjee,Family,CASE-4240,100,Los Angeles
51
+ 2023-05-13,Elena Kim,Family,CASE-6153,250,Chicago
52
+ 2023-09-10,Alan Patel,Criminal,CASE-3485,200,Chicago
53
+ 2023-08-18,Lakshmi Kim,Criminal,CASE-5520,200,Los Angeles
54
+ 2023-02-21,Alan Patel,Criminal,CASE-9879,250,Los Angeles
55
+ 2023-05-12,Jane Brown,Criminal,CASE-5259,200,New York
56
+ 2023-01-20,Oluwaseun Patel,Criminal,CASE-8333,100,Boston
57
+ 2023-01-23,Chen Nasser,Civil,CASE-2711,200,Los Angeles
58
+ 2023-03-12,Miguel Brown,Family,CASE-5100,100,Los Angeles
59
+ 2023-01-15,Terrance Rodriguez,Criminal,CASE-4849,100,Los Angeles
60
+ 2023-05-17,Jane Lee,Criminal,CASE-8058,150,New York
61
+ 2023-04-18,Chen Okafor,Civil,CASE-9076,100,New York
62
+ 2023-02-22,Lakshmi Chatterjee,Criminal,CASE-5230,200,Boston
63
+ 2023-08-18,John Brown,Criminal,CASE-7094,200,Los Angeles
64
+ 2023-08-17,Oluwaseun Lee,Civil,CASE-8915,150,Boston
65
+ 2023-08-18,Alan Malcolm,Family,CASE-9030,100,Chicago
66
+ 2023-02-13,Chen Malcolm,Criminal,CASE-1482,150,Houston
67
+ 2023-02-16,John Brown,Criminal,CASE-3535,100,Boston
68
+ 2023-08-20,Chen Johnson,Criminal,CASE-2029,250,Houston
69
+ 2023-01-10,Alan Kim,Civil,CASE-1812,250,Houston
70
+ 2023-02-18,Alice Chatterjee,Civil,CASE-5295,150,Chicago
71
+ 2023-08-25,Miguel Lee,Criminal,CASE-6850,150,Los Angeles
72
+ 2023-05-12,Alan Malcolm,Criminal,CASE-7973,150,Boston
73
+ 2023-05-19,Chen Johnson,Family,CASE-5221,200,Houston
74
+ 2023-06-17,John Okafor,Criminal,CASE-4117,250,Boston
75
+ 2023-03-18,Elena Patel,Family,CASE-2368,100,Houston
76
+ 2023-06-22,Lakshmi Rodriguez,Family,CASE-8384,200,New York
77
+ 2023-07-14,Miguel Smith,Civil,CASE-4476,100,New York
78
+ 2023-03-26,Chen Brown,Civil,CASE-4545,100,Houston
79
+ 2023-06-22,Dmitri Chatterjee,Civil,CASE-4421,250,Houston
80
+ 2023-03-20,Miguel Patel,Criminal,CASE-6559,150,New York
81
+ 2023-07-11,Oluwaseun Kim,Civil,CASE-1803,250,Boston
82
+ 2023-03-13,Elena Okafor,Civil,CASE-8622,150,New York
83
+ 2023-05-27,Alice Lee,Criminal,CASE-9488,200,Los Angeles
84
+ 2023-05-14,Alice Patel,Civil,CASE-4581,150,Chicago
85
+ 2023-06-27,Terrance Malcolm,Criminal,CASE-2388,250,Chicago
86
+ 2023-02-13,Terrance Ivanov,Criminal,CASE-6529,100,Los Angeles
87
+ 2023-01-21,Terrance Patel,Family,CASE-4443,150,Boston
88
+ 2023-09-22,John Malcolm,Civil,CASE-8721,200,New York
89
+ 2023-02-12,Miguel Malcolm,Family,CASE-3780,250,New York
90
+ 2023-04-26,Alan Kim,Criminal,CASE-2663,250,New York
91
+ 2023-03-16,Lakshmi Ivanov,Criminal,CASE-8702,150,Los Angeles
92
+ 2023-07-22,Jane Ivanov,Criminal,CASE-1232,100,Boston
93
+ 2023-05-28,Nadia Okafor,Family,CASE-5215,150,Houston
94
+ 2023-03-14,Oluwaseun Okafor,Criminal,CASE-6631,250,Boston
95
+ 2023-06-17,Alan Nasser,Civil,CASE-1405,150,Boston
96
+ 2023-08-13,Oluwaseun Kim,Civil,CASE-8816,100,Los Angeles
97
+ 2023-07-20,Oluwaseun Brown,Family,CASE-2665,150,New York
98
+ 2023-05-16,Alan Patel,Family,CASE-2874,100,New York
99
+ 2023-07-15,Nadia Chatterjee,Family,CASE-2037,100,Houston
100
+ 2023-04-18,Dmitri Johnson,Criminal,CASE-5402,200,Houston
101
+ 2023-08-14,Chen Johnson,Civil,CASE-3569,250,Boston
src/notebooks/data/output/table_mapping ADDED
@@ -0,0 +1,9 @@
1
+ ,source_column_name,target_column_name,value_transformations,explanation
2
+ 0,case_date,CaseDate,NO_TRANSFORM,The 'case_date' column in the source table maps directly to the 'CaseDate' column in the target table with no transformation needed.
3
+ 1,lastname,FullName,"Concatenate 'lastname' and 'firstname' from source table, with a space in between, and reverse the order.",The 'lastname' and 'firstname' columns in the source table are combined and reversed to form the 'FullName' column in the target table.
4
+ 2,firstname,FullName,"Concatenate 'lastname' and 'firstname' from source table, with a space in between, and reverse the order.",The 'lastname' and 'firstname' columns in the source table are combined and reversed to form the 'FullName' column in the target table.
5
+ 3,case_type,CaseType,"Correct spelling errors in source table ('Familly' to 'Family', 'Criminl' to 'Criminal', 'Civl' to 'Civil').","The 'case_type' column in the source table maps to the 'CaseType' column in the target table, with spelling corrections needed."
6
+ 4,case_id,CaseID,Replace 'CR-' prefix in source table with 'CASE-' prefix.,"The 'case_id' column in the source table maps to the 'CaseID' column in the target table, with a prefix change needed."
7
+ 5,court_fee,Fee,NO_TRANSFORM,The 'court_fee' column in the source table maps directly to the 'Fee' column in the target table with no transformation needed.
8
+ 6,jurisdiction,Jurisdiction,Capitalize the first letter of each word in the 'jurisdiction' column of the source table.,"The 'jurisdiction' column in the source table maps to the 'Jurisdiction' column in the target table, with capitalization needed."
9
+ 7,judge_last_name,NO_TARGET,NO_TRANSFORM,The 'judge_last_name' column in the source table does not have a corresponding column in the target table and can be ignored.
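For reference, a hand-written pandas sketch of the transform this mapping describes (the function name and regex-free replaces are assumptions, not the actual generated code):

    import pandas as pd

    def apply_table_mapping(source_df: pd.DataFrame) -> pd.DataFrame:
        target_df = pd.DataFrame()
        target_df['CaseDate'] = source_df['case_date']  # NO_TRANSFORM
        # firstname first, lastname second, joined with a space
        target_df['FullName'] = source_df['firstname'] + ' ' + source_df['lastname']
        target_df['CaseType'] = source_df['case_type'].replace(
            {'Familly': 'Family', 'Criminl': 'Criminal', 'Civl': 'Civil'})
        target_df['CaseID'] = source_df['case_id'].str.replace('CR-', 'CASE-', regex=False)
        target_df['Fee'] = source_df['court_fee']  # NO_TRANSFORM
        target_df['Jurisdiction'] = source_df['jurisdiction'].str.title()
        # judge_last_name maps to NO_TARGET and is dropped
        return target_df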
src/notebooks/data/output/table_mapping.csv ADDED
@@ -0,0 +1,9 @@
1
+ ,source_column_name,target_column_name,value_transformations,explanation
2
+ 0,case_date,CaseDate,NO_TRANSFORM,The 'case_date' column in the source table maps directly to the 'CaseDate' column in the target table with no transformation needed.
3
+ 1,lastname,FullName,"Concatenate 'lastname' and 'firstname' from source table, with a space in between, and reverse the order.",The 'lastname' and 'firstname' columns in the source table are combined and reversed to form the 'FullName' column in the target table.
4
+ 2,firstname,FullName,"Concatenate 'lastname' and 'firstname' from source table, with a space in between, and reverse the order.",The 'lastname' and 'firstname' columns in the source table are combined and reversed to form the 'FullName' column in the target table.
5
+ 3,case_type,CaseType,"Correct spelling errors in source table ('Familly' to 'Family', 'Criminl' to 'Criminal', 'Civl' to 'Civil').","The 'case_type' column in the source table maps to the 'CaseType' column in the target table, with spelling corrections needed."
6
+ 4,case_id,CaseID,Replace 'CR-' prefix in source table with 'CASE-' prefix.,"The 'case_id' column in the source table maps to the 'CaseID' column in the target table, with a prefix change needed."
7
+ 5,court_fee,Fee,NO_TRANSFORM,The 'court_fee' column in the source table maps directly to the 'Fee' column in the target table with no transformation needed.
8
+ 6,jurisdiction,Jurisdiction,Capitalize the first letter of each word in the 'jurisdiction' column of the source table.,"The 'jurisdiction' column in the source table maps to the 'Jurisdiction' column in the target table, with capitalization needed."
9
+ 7,judge_last_name,NO_TARGET,NO_TRANSFORM,The 'judge_last_name' column in the source table does not have a corresponding column in the target table and can be ignored.
src/notebooks/generate.ipynb ADDED
@@ -0,0 +1,120 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "import os\n",
10
+ "import random\n",
11
+ "import pandas as pd"
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "code",
16
+ "execution_count": 3,
17
+ "metadata": {},
18
+ "outputs": [],
19
+ "source": [
20
+ "num_records = 100\n",
21
+ "first_names = [\"Alan\", \"Miguel\", \"Lakshmi\", \"Chen\", \"Oluwaseun\", \"Dmitri\", \"Nadia\", \"John\", \"Jane\", \"Alice\", \"Terrance\", \"Elena\"]\n",
22
+ "last_names = [\"Patel\", \"Rodriguez\", \"Kim\", \"Okafor\", \"Nasser\", \"Ivanov\", \"Smith\", \"Brown\", \"Johnson\", \"Lee\", \"Malcolm\", \"Chatterjee\"]\n",
23
+ "# new last names\n",
24
+ "judge_last_names = [\"James\", \"Skaarsgard\", \"Oleg\", \"Morgan\", \"Brown\", \"Connor\"]\n",
25
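+ "# misspellings are intentional: they seed the dirty values the mapper must correct\n",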
+ "case_types = [\"Criminal\", \"Civil\", \"Familly\", \"Criminl\", \"Civl\", \"Family\"]\n",
26
+ "cities = ['new York', 'LOS angeles', 'chicago', 'houston', 'BOSTON']\n",
27
+ "weather_types = ['sunny', 'cloudy', 'rainy', 'snowy']\n",
28
+ "get_random_date = lambda: f'2023-0{random.randint(1,9)}-{random.randint(10,28)}'\n",
29
+ "get_random_case_number = lambda: f'{random.choice([\"CR\", \"CASE\"])}-{random.randint(1000,9999)}'\n",
30
+ "get_random_fee = lambda: random.choice([100, 150, 200, 250])\n",
31
+ "\n",
32
+ "get_random_int = lambda: random.randint(1000, 9999)\n",
33
+ "\n",
34
+ "# Define column names with standard casing\n",
35
+ "case_date_col = 'case_date'\n",
36
+ "lastname_col = 'lastname'\n",
37
+ "firstname_col = 'firstname'\n",
38
+ "case_type_col = 'case_type'\n",
39
+ "case_id_col = 'case_id'\n",
40
+ "court_fee_col = 'court_fee'\n",
41
+ "jurisdiction_col = 'jurisdiction'\n",
42
+ "case_id_prefix = 'CR'\n",
43
+ "\n",
44
+ "# Use inconsistent casing for values\n",
45
+ "legal_entries_a_data = {\n",
46
+ " case_date_col: [get_random_date() for _ in range(num_records)],\n",
47
+ " lastname_col: [random.choice(last_names) for _ in range(num_records)],\n",
48
+ " firstname_col: [random.choice(first_names) for _ in range(num_records)],\n",
49
+ " case_type_col: [random.choice(case_types) for _ in range(num_records)],\n",
50
+ " case_id_col: [f'{case_id_prefix}-{get_random_int()}' for _ in range(num_records)],\n",
51
+ " court_fee_col: [get_random_fee() for _ in range(num_records)],\n",
52
+ " jurisdiction_col: [random.choice(cities) for _ in range(num_records)],\n",
53
+ " 'judge_last_name': [random.choice(judge_last_names) for _ in range(num_records)]\n",
54
+ "}\n",
55
+ "\n",
56
+ "legal_entries_a = pd.DataFrame(legal_entries_a_data)\n",
57
+ "\n",
58
+ "# Apply the transform_row function for legal_entries_A to create legal_entries_B\n",
59
+ "def transform_row(row):\n",
60
+ " return pd.Series({\n",
61
+ " 'Date_of_Case' : row[case_date_col].replace(\"-\", \"/\"),\n",
62
+ " 'Fee' : float(row[court_fee_col]),\n",
63
+ " 'FullName' : f\"{row[firstname_col]} {row[lastname_col]}\".title(),\n",
64
+ " 'CaseNumber' : f'{row[case_id_col].replace(case_id_prefix, \"case-\")}',\n",
65
+ " 'CaseKind' : row[case_type_col].capitalize(),\n",
67
+ " 'Location' : row[jurisdiction_col].replace(\" \", \"\")[:random.randint(4,5)].upper(),\n",
68
+ " 'Weather': random.choice(weather_types)\n",
69
+ " })\n",
70
+ "\n",
71
+ "def transform_to_template(row):\n",
72
+ " return pd.Series({\n",
73
+ " 'CaseDate': row[case_date_col],\n",
74
+ " 'FullName': f\"{row[firstname_col]} {row[lastname_col]}\",\n",
75
+ " 'CaseType': \"Family\" if \"Fam\" in row[case_type_col] else \"Civil\" if 'Civ' in row[case_type_col] else row[case_type_col].capitalize(),\n",
76
+ " 'CaseID': f'{row[case_id_col].replace(case_id_prefix, \"CASE\")}',\n",
77
+ " 'Fee': row[court_fee_col],\n",
78
+ " 'Jurisdiction': row[jurisdiction_col].title()\n",
79
+ " })\n",
80
+ "\n",
81
+ "legal_entries_b = legal_entries_a.apply(transform_row, axis=1)\n",
82
+ "legal_template = legal_entries_a.apply(transform_to_template, axis=1)\n",
83
+ "\n",
84
+ "data_dir_path = os.path.join(os.getcwd(), \"data\")\n",
85
+ "legal_entries_a.to_csv(os.path.join(data_dir_path, \"legal_entries_a.csv\"), index=False)\n",
86
+ "legal_entries_b.to_csv(os.path.join(data_dir_path, \"legal_entries_b.csv\"), index=False)\n",
87
+ "legal_template.to_csv(os.path.join(data_dir_path, \"legal_template.csv\"), index=False)"
88
+ ]
89
+ },
90
+ {
91
+ "cell_type": "code",
92
+ "execution_count": null,
93
+ "metadata": {},
94
+ "outputs": [],
95
+ "source": []
96
+ }
97
+ ],
98
+ "metadata": {
99
+ "kernelspec": {
100
+ "display_name": "venv",
101
+ "language": "python",
102
+ "name": "python3"
103
+ },
104
+ "language_info": {
105
+ "codemirror_mode": {
106
+ "name": "ipython",
107
+ "version": 3
108
+ },
109
+ "file_extension": ".py",
110
+ "mimetype": "text/x-python",
111
+ "name": "python",
112
+ "nbconvert_exporter": "python",
113
+ "pygments_lexer": "ipython3",
114
+ "version": "3.9.6"
115
+ },
116
+ "orig_nbformat": 4
117
+ },
118
+ "nbformat": 4,
119
+ "nbformat_minor": 2
120
+ }
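(A quick sanity check on the generated files, assuming they are read back from the data directory; this snippet is illustrative, not part of the notebook.)

    import pandas as pd

    a = pd.read_csv('data/legal_entries_a.csv')
    t = pd.read_csv('data/legal_template.csv')

    assert len(a) == len(t) == 100
    # 'CR-1234' in the source becomes 'CASE-1234' in the template
    assert (a['case_id'].str.replace('CR-', 'CASE-', regex=False) == t['CaseID']).all()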
src/notebooks/table_mapping ADDED
@@ -0,0 +1,9 @@
1
+ ,source_column_name,target_column_name,value_transformations,explanation
2
+ 0,case_date,CaseDate,NO_TRANSFORM,The 'case_date' column in the source table maps directly to the 'CaseDate' column in the target table with no transformation needed.
3
+ 1,lastname,FullName,"Concatenate 'lastname' and 'firstname' from source table, with a space in between, and reverse the order.",The 'lastname' and 'firstname' columns in the source table are combined and reversed to form the 'FullName' column in the target table.
4
+ 2,firstname,FullName,"Concatenate 'lastname' and 'firstname' from source table, with a space in between, and reverse the order.",The 'lastname' and 'firstname' columns in the source table are combined and reversed to form the 'FullName' column in the target table.
5
+ 3,case_type,CaseType,"Correct spelling errors in source table ('Familly' to 'Family', 'Criminl' to 'Criminal', 'Civl' to 'Civil').","The 'case_type' column in the source table maps to the 'CaseType' column in the target table, with spelling corrections needed."
6
+ 4,case_id,CaseID,Replace 'CR-' prefix in source table with 'CASE-' prefix.,"The 'case_id' column in the source table maps to the 'CaseID' column in the target table, with a prefix change needed."
7
+ 5,court_fee,Fee,NO_TRANSFORM,The 'court_fee' column in the source table maps directly to the 'Fee' column in the target table with no transformation needed.
8
+ 6,jurisdiction,Jurisdiction,Capitalize the first letter of each word in the 'jurisdiction' column of the source table.,"The 'jurisdiction' column in the source table maps to the 'Jurisdiction' column in the target table, with capitalization needed."
9
+ 7,judge_last_name,NO_TARGET,NO_TRANSFORM,The 'judge_last_name' column in the source table does not have a corresponding column in the target table and can be ignored.