davanstrien HF staff committed on
Commit
d287b55
1 Parent(s): 4dc4bf5

Update README.md with notebooks for creating synthetic data for training sentence similarity models

notebooks/README.md ADDED
@@ -0,0 +1,8 @@
1
+ # Table of Contents
2
+
3
+ ## Creating data for training sentence similarity models
4
+
5
+ These notebooks demonstrate how to create synthetic data for training sentence similarity models.
6
+
7
+ - [01_dataset_preparation](notebooks/01_dataset_preperation.ipynb) covers the initial processing steps needed to prepare a dataset for synthetic data creation. This notebook uses [LlamaIndex](https://docs.llamaindex.ai/en/stable/) to chunk texts into sections that will serve as inputs for creating a synthetic dataset.
8
+ - [02_synthetic_data_creation.ipynb](notebooks/02_synthetic_data_creation.ipynb) covers synthetic data creation for training sentence similarity models. The notebook uses `Outlines` to generate structured data and `vLLM` to run the LLM.
notebooks/sentence-similarity-datasets-creation/01_dataset_preperation.ipynb ADDED
@@ -0,0 +1,946 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "eca241b7",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Fine-tuning a custom Sentence Transformers model using synthetic data\n",
9
+ "\n",
10
+ "This notebook shows at a high level how we can define a pipeline for generating synthetic datasets for training/fine-tuning Sentence Transformers models for a custom domain using an LLM to help you generate relevant data.\n",
11
+ "\n",
12
+ "## Why fine-tune?\n",
13
+ "\n",
14
+ "There are already many good open source embedding models you can use but you may:\n",
15
+ "\n",
16
+ "- work in a specific domain where existing embedding models might not perform well\n",
17
+ "- have a specific concept of similarity you want to capture\n",
18
+ "- want to optimize for a particular task\n",
19
+ "\n",
20
+ "In all of these cases, even a little fine-tuning might help.\n",
21
+ "\n",
22
+ "## How to get custom data?\n",
23
+ "\n",
24
+ "One of the main barriers to fine-tuning a custom model has been the cost and effort involved in creating the datasets needed for training. Recently, LLMs have increasingly been used to generate synthetic datasets. In this series of notebooks, we'll see how we can use an LLM to create training datasets for fine-tuning a sentence similarity model.\n",
25
+ "\n",
26
+ "Before we start creating our synthetic dataset, we'll do some initial exploration and preparation of the source dataset.\n"
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "markdown",
31
+ "id": "6177d35d",
32
+ "metadata": {},
33
+ "source": [
34
+ "<div style=\"border-left: 4px solid #00A000; background-color: #F0FFF0; padding: 10px; margin: 10px 0;\">\n",
35
+ " <strong>Tip:</strong> We focus on a particular dataset in this case but you should be able to fairly easily adapt the notebook to use any other dataset on the Hugging Face Hub. \n",
36
+ "</div>"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "markdown",
41
+ "id": "35e484b3",
42
+ "metadata": {},
43
+ "source": [
44
+ "If you are running this notebook in Colab, you can uncomment and run the following command to install the necessary libraries. If you are running in the Synthetic datasets workshop Space, everything is already installed."
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "code",
49
+ "execution_count": null,
50
+ "id": "ed4c4b75",
51
+ "metadata": {},
52
+ "outputs": [],
53
+ "source": [
54
+ "#%pip install \"datasets>=2.18.0\" llama_index rich"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "markdown",
59
+ "id": "dc1663d4",
60
+ "metadata": {},
61
+ "source": [
62
+ "## 01. Preparing the data\n",
63
+ "\n",
64
+ "In this notebook, we'll focus on exploring the dataset and preparing it for generating our synthetic data. Depending on how well you know your dataset already you might spend less time on this step. However, it's always good to have a look at the data before starting to generate synthetic data since the approach you'll take might depend on the data you have."
65
+ ]
66
+ },
67
+ {
68
+ "cell_type": "code",
69
+ "execution_count": 1,
70
+ "id": "d60d4d45-7eed-46cd-8404-5b645357daca",
71
+ "metadata": {
72
+ "tags": []
73
+ },
74
+ "outputs": [],
75
+ "source": [
76
+ "import random\n",
77
+ "import uuid\n",
78
+ "from multiprocessing import cpu_count\n",
79
+ "from typing import Any, Dict, Optional\n",
80
+ "\n",
81
+ "from datasets import load_dataset\n",
82
+ "from huggingface_hub import login\n",
83
+ "from llama_index.core import Document\n",
84
+ "from llama_index.core.node_parser import SentenceSplitter\n",
85
+ "from rich import print as rich_print"
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "execution_count": 2,
91
+ "id": "bce86058",
92
+ "metadata": {},
93
+ "outputs": [],
94
+ "source": [
95
+ "NUM_PROC = cpu_count()"
96
+ ]
97
+ },
98
+ {
99
+ "cell_type": "markdown",
100
+ "id": "ca75b4b7",
101
+ "metadata": {},
102
+ "source": [
103
+ "## Authenticate with the Hub"
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "code",
108
+ "execution_count": 3,
109
+ "id": "787899d1",
110
+ "metadata": {},
111
+ "outputs": [
112
+ {
113
+ "data": {
114
+ "application/vnd.jupyter.widget-view+json": {
115
+ "model_id": "106ca7ce1b054e5ca761def20767c718",
116
+ "version_major": 2,
117
+ "version_minor": 0
118
+ },
119
+ "text/plain": [
120
+ "VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
121
+ ]
122
+ },
123
+ "metadata": {},
124
+ "output_type": "display_data"
125
+ }
126
+ ],
127
+ "source": [
128
+ "login()"
129
+ ]
130
+ },
131
+ {
132
+ "cell_type": "markdown",
133
+ "id": "5c9ae8bc",
134
+ "metadata": {},
135
+ "source": [
136
+ "## The dataset \n",
137
+ "\n",
138
+ "For this example, we'll use [`dreamproit/bill_summary_us`](https://huggingface.co/datasets/dreamproit/bill_summary_us). This dataset \"collects the text of bills, some metadata, as well as the CRS (Congressional Research Service) summaries\". The dataset is originally focused on helping to develop models for summarization but we'll use it to generate synthetic data for training a sentence similarity model.\n",
139
+ "\n",
140
+ "For datasets like this, we might see some benefits in fine-tuning a custom sentence similarity model. A standard sentence similarity model may do a good job of finding similar sentences but it might not be able to capture the specific similarity we're interested in. For example, it might be able to distinguish the text in US bills compared to recipes but if you want to find similar bills based on the topics they cover, you might need a more domain-specific model. Alongside being able to work with a specific domain, you may also want to have more control over what type of similarity you want a model to capture. Do you want the model to capture semantic similarity, topic similarity, or something else? If we create our own dataset and fine-tune a model we'll have more control over this."
141
+ ]
142
+ },
143
+ {
144
+ "cell_type": "markdown",
145
+ "id": "8b16380e",
146
+ "metadata": {},
147
+ "source": [
148
+ "Let's start by loading the data and having a look at it."
149
+ ]
150
+ },
151
+ {
152
+ "cell_type": "code",
153
+ "execution_count": 4,
154
+ "id": "15c06307-8c33-418a-8c5d-5490ca6b9b85",
155
+ "metadata": {
156
+ "tags": []
157
+ },
158
+ "outputs": [
159
+ {
160
+ "data": {
161
+ "text/plain": [
162
+ "Dataset({\n",
163
+ " features: ['id', 'congress', 'bill_type', 'bill_number', 'bill_version', 'sections', 'sections_length', 'text', 'text_length', 'summary', 'summary_length', 'title'],\n",
164
+ " num_rows: 125246\n",
165
+ "})"
166
+ ]
167
+ },
168
+ "execution_count": 4,
169
+ "metadata": {},
170
+ "output_type": "execute_result"
171
+ }
172
+ ],
173
+ "source": [
174
+ "ds = load_dataset(\"dreamproit/bill_summary_us\", split=\"train\")\n",
175
+ "ds"
176
+ ]
177
+ },
178
+ {
179
+ "cell_type": "markdown",
180
+ "id": "4c5cb13d",
181
+ "metadata": {},
182
+ "source": [
183
+ "Let's see a few examples of the data we have in the dataset."
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "code",
188
+ "execution_count": 5,
189
+ "id": "2bea8515",
190
+ "metadata": {},
191
+ "outputs": [
192
+ {
193
+ "data": {
194
+ "text/plain": [
195
+ "{'id': ['108hconres408ih', '108hconres449ih'],\n",
196
+ " 'congress': [108, 108],\n",
197
+ " 'bill_type': ['hconres', 'hconres'],\n",
198
+ " 'bill_number': [408, 449],\n",
199
+ " 'bill_version': ['ih', 'ih'],\n",
200
+ " 'sections': [[{'text': 'That Congress— (1) congratulates the University of Denver men’s hockey team for winning the 2004 NCAA men’s hockey national championship; (2) recognizes the achievements of all the team’s players, coaches, and support staff and invites them to the United States Capitol Building to be honored; (3) requests that the President recognize the achievements of the University of Denver men’s hockey team and invite the team members to the White House for an appropriate ceremony honoring a national championship team; and (4) directs the Clerk of the House of Representatives to make available to the University of Denver enrolled copies of this resolution for appropriate display and to transmit an enrolled copy of this resolution to each coach and member of the 2004 NCAA men’s hockey national championship team.',\n",
201
+ " 'id': 'H1B567796F9E54E5292B7F63447088900',\n",
202
+ " 'header': None},\n",
203
+ " {'text': '', 'id': 'HF099927D1BD242418B7FBD40C05801B9', 'header': None}],\n",
204
+ " [{'text': 'That Congress — (1) honors the life and accomplishments of Ray Charles Robinson; (2) recognizes Ray Charles for his invaluable contributions to the Nation, the State of Georgia, and the American musical heritage; and (3) extends condolences to the family of Ray Charles on the death of a remarkable man.',\n",
205
+ " 'id': 'HEF171422A60C4F7A8C672FF7928C0715',\n",
206
+ " 'header': None}]],\n",
207
+ " 'sections_length': [2, 1],\n",
208
+ " 'text': ['That Congress— (1) congratulates the University of Denver men’s hockey team for winning the 2004 NCAA men’s hockey national championship; (2) recognizes the achievements of all the team’s players, coaches, and support staff and invites them to the United States Capitol Building to be honored; (3) requests that the President recognize the achievements of the University of Denver men’s hockey team and invite the team members to the White House for an appropriate ceremony honoring a national championship team; and (4) directs the Clerk of the House of Representatives to make available to the University of Denver enrolled copies of this resolution for appropriate display and to transmit an enrolled copy of this resolution to each coach and member of the 2004 NCAA men’s hockey national championship team. ',\n",
209
+ " 'That Congress — (1) honors the life and accomplishments of Ray Charles Robinson; (2) recognizes Ray Charles for his invaluable contributions to the Nation, the State of Georgia, and the American musical heritage; and (3) extends condolences to the family of Ray Charles on the death of a remarkable man.'],\n",
210
+ " 'text_length': [811, 303],\n",
211
+ " 'summary': [\"(This measure has not been amended since it was introduced. The summary of that version is repeated here.)\\n\\nCongratulates the University of Denver men's hockey team for winning the 2004 NCAA men's hockey national championship, recognizes the achievements of all the team's players, coaches, and support staff, and invites them to the U.S. Capitol Building to be honored.\\n\\nRequests that the President recognize the achievements of the University of Denver men's hockey team and invite the team members to the White House for an appropriate ceremony honoring a national championship team.\",\n",
212
+ " '(This measure has not been amended since it was introduced. The summary of that version is repeated here.)\\n\\nHonors the life and accomplishments of Ray Charles Robinson. Recognizes his invaluable contributions to the Nation, the State of Georgia, and the American musical heritage. Extends condolences to his family on his death.'],\n",
213
+ " 'summary_length': [586, 328],\n",
214
+ " 'title': [\"Congratulating the University of Denver men's hockey team for winning the 2004 NCAA men's hockey national championship, and for other purposes.\",\n",
215
+ " 'Honoring the life and accomplishments of Ray Charles, recognizing his contributions to the Nation, and extending condolences to his family on his death.']}"
216
+ ]
217
+ },
218
+ "execution_count": 5,
219
+ "metadata": {},
220
+ "output_type": "execute_result"
221
+ }
222
+ ],
223
+ "source": [
224
+ "ds[:2]"
225
+ ]
226
+ },
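Note that slicing the dataset, as in `ds[:2]`, returns a column-oriented batch: a dict mapping each feature name to a list of values, rather than a list of rows. When it is easier to reason about one bill at a time, the batch can be transposed. The sketch below uses only the standard library; `batch_to_rows` is a hypothetical helper (not part of `datasets`), and the batch is a trimmed-down copy of the output shown above.

```python
from typing import Any, Dict, List


def batch_to_rows(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a column-oriented batch (dict of lists) into a list of row dicts."""
    columns = list(batch.keys())
    # zip(*values) walks the columns in lockstep, yielding one tuple per row.
    return [dict(zip(columns, values)) for values in zip(*batch.values())]


# A trimmed-down batch mirroring the shape of the `ds[:2]` output.
batch = {
    "id": ["108hconres408ih", "108hconres449ih"],
    "bill_type": ["hconres", "hconres"],
    "text_length": [811, 303],
}

rows = batch_to_rows(batch)
for row in rows:
    print(row["id"], row["text_length"])
```

Indexing with a single integer, as the notebook does later with `ds[200]`, already yields one row as a dict, so this transposition is only needed for slices.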
227
+ {
228
+ "cell_type": "markdown",
229
+ "id": "d501618e",
230
+ "metadata": {},
231
+ "source": [
232
+ "Since we're going to be working to create an embedding dataset for the texts in the `text` column, let's take a closer look at what this looks like."
233
+ ]
234
+ },
235
+ {
236
+ "cell_type": "code",
237
+ "execution_count": 6,
238
+ "id": "41bd5eb2",
239
+ "metadata": {},
240
+ "outputs": [
241
+ {
242
+ "data": {
243
+ "text/html": [
244
+ "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
245
+ " <span style=\"color: #008000; text-decoration-color: #008000\">'That Congress— (1) recognizes and celebrates the abolition of slavery more than 150 years ago in the Latin </span>\n",
246
+ "<span style=\"color: #008000; text-decoration-color: #008000\">American countries of Mexico, Chile, Uruguay, Bolivia, Colombia, Ecuador, Argentina, Peru, and Venezuela; (2) </span>\n",
247
+ "<span style=\"color: #008000; text-decoration-color: #008000\">recognizes the social, political, and cultural contributions of enslaved blacks and their descendants in Latin </span>\n",
248
+ "<span style=\"color: #008000; text-decoration-color: #008000\">America; (3) acknowledges the impact of slavery and the existence of racial discrimination that have led to </span>\n",
249
+ "<span style=\"color: #008000; text-decoration-color: #008000\">disparate social conditions and lack of civil liberties in Latin America; (4) urges the United States Government to</span>\n",
250
+ "<span style=\"color: #008000; text-decoration-color: #008000\">work with the governments of Latin American countries to promote the visibility of the descendants of enslaved </span>\n",
251
+ "<span style=\"color: #008000; text-decoration-color: #008000\">blacks in such countries and to recognize the importance of supporting international and regional efforts to </span>\n",
252
+ "<span style=\"color: #008000; text-decoration-color: #008000\">eliminate racial and ethnic discrimination, such as the International Convention on the Elimination of All Forms of</span>\n",
253
+ "<span style=\"color: #008000; text-decoration-color: #008000\">Racial Discrimination (signed at New York on December 21, 1965); and (5) urges the countries of Latin America to </span>\n",
254
+ "<span style=\"color: #008000; text-decoration-color: #008000\">work with the United States and the international community to assist in addressing poverty and other targets in </span>\n",
255
+ "<span style=\"color: #008000; text-decoration-color: #008000\">accordance with the United Nations Millennium Development Goals (as contained in United Nations General Assembly </span>\n",
256
+ "<span style=\"color: #008000; text-decoration-color: #008000\">Resolution 55/2 (September 2000)).'</span>,\n",
257
+ " <span style=\"color: #008000; text-decoration-color: #008000\">'That it is the sense of Congress that— (1) the United States should support the principles of democracy and </span>\n",
258
+ "<span style=\"color: #008000; text-decoration-color: #008000\">constitutional rule in the Republic of Haiti, under which President Jean-Bertrand Aristide was elected, and oppose </span>\n",
259
+ "<span style=\"color: #008000; text-decoration-color: #008000\">any and all attempts to remove President Aristide from office prior to the completion of his term under the </span>\n",
260
+ "<span style=\"color: #008000; text-decoration-color: #008000\">Constitution of Haiti; (2) the United States should condemn the violent activities of groups of thugs, former </span>\n",
261
+ "<span style=\"color: #008000; text-decoration-color: #008000\">members of Haiti’s disbanded army, and paramilitary organizations in Haiti; and (3) the United States, working with</span>\n",
262
+ "<span style=\"color: #008000; text-decoration-color: #008000\">the United Nations, the Organization of American States (OAS), and other countries, should immediately provide </span>\n",
263
+ "<span style=\"color: #008000; text-decoration-color: #008000\">assistance to Haiti to strengthen, reinforce, and professionalize the Haitian police force in order to enable the </span>\n",
264
+ "<span style=\"color: #008000; text-decoration-color: #008000\">Haitian police force to restore law and order and preserve democracy in Haiti.'</span>\n",
265
+ "<span style=\"font-weight: bold\">]</span>\n",
266
+ "</pre>\n"
267
+ ],
268
+ "text/plain": [
269
+ "\u001b[1m[\u001b[0m\n",
270
+ " \u001b[32m'That Congress— \u001b[0m\u001b[32m(\u001b[0m\u001b[32m1\u001b[0m\u001b[32m)\u001b[0m\u001b[32m recognizes and celebrates the abolition of slavery more than 150 years ago in the Latin \u001b[0m\n",
271
+ "\u001b[32mAmerican countries of Mexico, Chile, Uruguay, Bolivia, Colombia, Ecuador, Argentina, Peru, and Venezuela; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m2\u001b[0m\u001b[32m)\u001b[0m\u001b[32m \u001b[0m\n",
272
+ "\u001b[32mrecognizes the social, political, and cultural contributions of enslaved blacks and their descendants in Latin \u001b[0m\n",
273
+ "\u001b[32mAmerica; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m3\u001b[0m\u001b[32m)\u001b[0m\u001b[32m acknowledges the impact of slavery and the existence of racial discrimination that have led to \u001b[0m\n",
274
+ "\u001b[32mdisparate social conditions and lack of civil liberties in Latin America; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m4\u001b[0m\u001b[32m)\u001b[0m\u001b[32m urges the United States Government to\u001b[0m\n",
275
+ "\u001b[32mwork with the governments of Latin American countries to promote the visibility of the descendants of enslaved \u001b[0m\n",
276
+ "\u001b[32mblacks in such countries and to recognize the importance of supporting international and regional efforts to \u001b[0m\n",
277
+ "\u001b[32meliminate racial and ethnic discrimination, such as the International Convention on the Elimination of All Forms of\u001b[0m\n",
278
+ "\u001b[32mRacial Discrimination \u001b[0m\u001b[32m(\u001b[0m\u001b[32msigned at New York on December 21, 1965\u001b[0m\u001b[32m)\u001b[0m\u001b[32m; and \u001b[0m\u001b[32m(\u001b[0m\u001b[32m5\u001b[0m\u001b[32m)\u001b[0m\u001b[32m urges the countries of Latin America to \u001b[0m\n",
279
+ "\u001b[32mwork with the United States and the international community to assist in addressing poverty and other targets in \u001b[0m\n",
280
+ "\u001b[32maccordance with the United Nations Millennium Development Goals \u001b[0m\u001b[32m(\u001b[0m\u001b[32mas contained in United Nations General Assembly \u001b[0m\n",
281
+ "\u001b[32mResolution 55/2 \u001b[0m\u001b[32m(\u001b[0m\u001b[32mSeptember 2000\u001b[0m\u001b[32m)\u001b[0m\u001b[32m)\u001b[0m\u001b[32m.'\u001b[0m,\n",
282
+ " \u001b[32m'That it is the sense of Congress that— \u001b[0m\u001b[32m(\u001b[0m\u001b[32m1\u001b[0m\u001b[32m)\u001b[0m\u001b[32m the United States should support the principles of democracy and \u001b[0m\n",
283
+ "\u001b[32mconstitutional rule in the Republic of Haiti, under which President Jean-Bertrand Aristide was elected, and oppose \u001b[0m\n",
284
+ "\u001b[32many and all attempts to remove President Aristide from office prior to the completion of his term under the \u001b[0m\n",
285
+ "\u001b[32mConstitution of Haiti; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m2\u001b[0m\u001b[32m)\u001b[0m\u001b[32m the United States should condemn the violent activities of groups of thugs, former \u001b[0m\n",
286
+ "\u001b[32mmembers of Haiti’s disbanded army, and paramilitary organizations in Haiti; and \u001b[0m\u001b[32m(\u001b[0m\u001b[32m3\u001b[0m\u001b[32m)\u001b[0m\u001b[32m the United States, working with\u001b[0m\n",
287
+ "\u001b[32mthe United Nations, the Organization of American States \u001b[0m\u001b[32m(\u001b[0m\u001b[32mOAS\u001b[0m\u001b[32m)\u001b[0m\u001b[32m, and other countries, should immediately provide \u001b[0m\n",
288
+ "\u001b[32massistance to Haiti to strengthen, reinforce, and professionalize the Haitian police force in order to enable the \u001b[0m\n",
289
+ "\u001b[32mHaitian police force to restore law and order and preserve democracy in Haiti.'\u001b[0m\n",
290
+ "\u001b[1m]\u001b[0m\n"
291
+ ]
292
+ },
293
+ "metadata": {},
294
+ "output_type": "display_data"
295
+ }
296
+ ],
297
+ "source": [
298
+ "rich_print(ds[4:6]['text'])"
299
+ ]
300
+ },
301
+ {
302
+ "cell_type": "markdown",
303
+ "id": "df7129ec",
304
+ "metadata": {},
305
+ "source": [
306
+ "These texts are relatively short, but other examples in this dataset are much longer. Most datasets that haven't already been preprocessed will need some work to split the texts into smaller segments. "
307
+ ]
308
+ },
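Before picking a chunk size, it can help to quantify how skewed the text lengths are. The following is a minimal sketch using only the standard library. The first two values are taken from the `text_length` output shown earlier; the remaining values are made up for illustration, and `summarize_lengths` is a hypothetical helper rather than anything provided by `datasets`.

```python
import statistics
from typing import Dict, List


def summarize_lengths(lengths: List[int]) -> Dict[str, float]:
    """Return a five-number summary of a list of text lengths."""
    ordered = sorted(lengths)
    # statistics.quantiles needs at least two data points.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "q1": q1,
        "median": statistics.median(ordered),
        "q3": q3,
        "max": ordered[-1],
    }


# 811 and 303 come from the dataset preview; the other values are invented.
text_lengths = [811, 303, 1250, 4800, 96000, 2200, 560]

summary = summarize_lengths(text_lengths)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A large gap between the median and the maximum is a sign that a few very long documents will dominate unless the texts are split. In the notebook, the `text_length` column could be passed in instead of the toy list, assuming it can be converted to a plain list of integers.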
309
+ {
310
+ "cell_type": "markdown",
311
+ "id": "1c8a813d",
312
+ "metadata": {},
313
+ "source": [
314
+ "## Chunking our text\n",
315
+ "\n",
316
+ "We'll need to split our text into smaller chunks to be able to use it for training a sentence similarity model. There are two main reasons for this:\n",
317
+ "\n",
318
+ "- Sentence Transformers models have a maximum input length (in tokens) that they can process. This number depends on the model you're using. \n",
319
+ "- Longer sections of text are more likely to be about multiple topics which can make it harder for the model to learn a specific type of similarity.\n",
320
+ "\n",
321
+ "Whilst the maximum input length for many open source embedding models has grown recently, we may still want to split our text into smaller chunks to ensure we have logical units of text to work with.\n",
322
+ "\n",
323
+ "### How to decide on the right chunk size\n",
324
+ "\n",
325
+ "Deciding on the right chunk size can be a bit of a balancing act and can depend on the specific dataset you're working with and the end application for your embedding model. One of the main applications of a custom sentence similarity model is to help improve the performance of a Retrieval Augmented Generation (RAG) application. In this case, you might want to split your text into chunks that are similar in length to the passages you'll be working with in your RAG application. \n",
326
+ "\n",
327
+ "\n",
328
+ "### Splitting with LlamaIndex\n",
329
+ "\n",
330
+ "Many libraries developed for RAG applications can also help us split our text into chunks. One of these is `LlamaIndex`, which we'll use in this notebook.\n",
331
+ "\n",
332
+ "LlamaIndex offers many different approaches for splitting texts (see [node_parsers](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/)). In this notebook we'll use the relatively simple `SentenceSplitter`, which splits text into chunks while trying to keep sentences intact:\n",
333
+ "\n",
334
+ ">In general, this class tries to keep sentences and paragraphs together. Therefore compared to the original TokenTextSplitter, there are less likely to be hanging sentences or parts of sentences at the end of the node chunk.\n",
335
+ "\n",
336
+ "If your data is in a format like HTML or Markdown, other parsers are likely to be worth exploring. There is also a `SemanticSplitterNodeParser` which \"splits a document into Nodes, with each node being a group of semantically related sentences\". This could be worth exploring, but it is more computationally expensive and, depending on the text you are working with, might not lead to much better results.\n",
337
+ "\n",
338
+ "### What size should we split our text into?\n",
339
+ "\n",
340
+ "If we look at the docstring for `SentenceSplitter`, we can see that the default value for `chunk_size` is `1024` and the default `chunk_overlap` is `200`. We might want to adjust these values to see what size makes sense for our data. \n"
341
+ ]
342
+ },
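To build some intuition for how a chunk size and an overlap interact, here is a deliberately simplified, standard-library sketch of sentence-aware chunking. It is not LlamaIndex's algorithm: it counts whitespace-separated words instead of tokens, measures overlap in sentences rather than tokens, and splits sentences with a naive regular expression that will stumble on abbreviations such as "Sec. 1." in legal text. The names `split_sentences` and `chunk_sentences` are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_words(sentences: List[str]) -> int:
    return sum(len(s.split()) for s in sentences)


def chunk_sentences(text: str, chunk_size: int = 50, chunk_overlap: int = 1) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `chunk_size` words.

    `chunk_overlap` is the number of trailing sentences repeated at the start
    of the next chunk. A single sentence longer than `chunk_size` still
    becomes its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for sentence in split_sentences(text):
        if current and count_words(current + [sentence]) > chunk_size:
            chunks.append(" ".join(current))
            carried = current[-chunk_overlap:] if chunk_overlap > 0 else []
            # Drop the overlap if it would push the next chunk past the budget.
            if count_words(carried + [sentence]) > chunk_size:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Congress finds the following. "
    "Organizational camps provide a valuable service. "
    "The Forest Service manages national forests. "
    "Fees should reflect fair market value."
)

chunks = chunk_sentences(text, chunk_size=12, chunk_overlap=1)
for i, chunk in enumerate(chunks):
    print(i, len(chunk.split()), chunk)
```

With a budget of 12 words and an overlap of one sentence, each chunk repeats the last sentence of the previous one, which is the same trade-off `chunk_overlap` controls in `SentenceSplitter`: more shared context between neighbouring chunks at the cost of more duplicated text.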
343
+ {
344
+ "cell_type": "code",
345
+ "execution_count": 7,
346
+ "id": "cba67505",
347
+ "metadata": {},
348
+ "outputs": [
349
+ {
350
+ "name": "stdout",
351
+ "output_type": "stream",
352
+ "text": [
353
+ "\u001b[0;31mInit signature:\u001b[0m\n",
354
+ "\u001b[0mSentenceSplitter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
355
+ "\u001b[0;34m\u001b[0m \u001b[0mseparator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m' '\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
356
+ "\u001b[0;34m\u001b[0m \u001b[0mchunk_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1024\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
357
+ "\u001b[0;34m\u001b[0m \u001b[0mchunk_overlap\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m200\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
358
+ "\u001b[0;34m\u001b[0m \u001b[0mtokenizer\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mCallable\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
359
+ "\u001b[0;34m\u001b[0m \u001b[0mparagraph_separator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'\\n\\n\\n'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
360
+ "\u001b[0;34m\u001b[0m \u001b[0mchunking_tokenizer_fn\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mCallable\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mList\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
361
+ "\u001b[0;34m\u001b[0m \u001b[0msecondary_chunking_regex\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'[^,.;。?!]+[,.;。?!]?'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
362
+ "\u001b[0;34m\u001b[0m \u001b[0mcallback_manager\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mllama_index\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcallbacks\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbase\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mCallbackManager\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
363
+ "\u001b[0;34m\u001b[0m \u001b[0minclude_metadata\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
364
+ "\u001b[0;34m\u001b[0m \u001b[0minclude_prev_next_rel\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
365
+ "\u001b[0;34m\u001b[0m \u001b[0mid_func\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mCallable\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mllama_index\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDocument\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
366
+ "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
367
+ "\u001b[0;31mDocstring:\u001b[0m \n",
368
+ "Parse text with a preference for complete sentences.\n",
369
+ "\n",
370
+ "In general, this class tries to keep sentences and paragraphs together. Therefore\n",
371
+ "compared to the original TokenTextSplitter, there are less likely to be\n",
372
+ "hanging sentences or parts of sentences at the end of the node chunk.\n",
373
+ "\u001b[0;31mInit docstring:\u001b[0m Initialize with parameters.\n",
374
+ "\u001b[0;31mFile:\u001b[0m ~/Documents/tutorials/space/synthetic-data-workshop/.venv/lib/python3.11/site-packages/llama_index/core/node_parser/text/sentence.py\n",
375
+ "\u001b[0;31mType:\u001b[0m ModelMetaclass\n",
376
+ "\u001b[0;31mSubclasses:\u001b[0m "
377
+ ]
378
+ }
379
+ ],
380
+ "source": [
381
+ "?SentenceSplitter"
382
+ ]
383
+ },
384
+ {
385
+ "cell_type": "code",
386
+ "execution_count": 8,
387
+ "id": "ba57172b",
388
+ "metadata": {},
389
+ "outputs": [
390
+ {
391
+ "data": {
392
+ "text/plain": [
393
+ "(1024, 200)"
394
+ ]
395
+ },
396
+ "execution_count": 8,
397
+ "metadata": {},
398
+ "output_type": "execute_result"
399
+ }
400
+ ],
401
+ "source": [
402
+ "splitter = SentenceSplitter()\n",
403
+ "splitter.chunk_size, splitter.chunk_overlap"
404
+ ]
405
+ },
406
+ {
407
+ "cell_type": "markdown",
408
+ "id": "d51435a3",
409
+ "metadata": {},
410
+ "source": [
411
+ "Let's load an example text and see how different sizes of chunks look."
412
+ ]
413
+ },
414
+ {
415
+ "cell_type": "code",
416
+ "execution_count": 9,
417
+ "id": "14bdace0",
418
+ "metadata": {},
419
+ "outputs": [],
420
+ "source": [
421
+ "doc = Document.from_dict({\"text\": ds[200]['text']})"
422
+ ]
423
+ },
424
+ {
425
+ "cell_type": "code",
426
+ "execution_count": 10,
427
+ "id": "b0f2861c",
428
+ "metadata": {},
429
+ "outputs": [
430
+ {
431
+ "data": {
432
+ "text/plain": [
433
+ "[TextNode(id_='b77fc036-8f99-415c-8142-f9a672f450bc', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e59781d0-ad81-4411-9415-697b650e3d7e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='19c0d72fca6cf938248ff70d34f7af953ce06be4b57f1748652d7fa878b87185'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='5f7e3f35-a6dd-481b-968a-c70b2717c3d5', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='45627c584bb296bcb3c49c428092399194c37c6726210acedd5b22530af6d5a8')}, text='1. Short title; table of contents \\n(a) Short title \\nThis Act may be cited as the National Forest Organizational Camp Fee Improvement Act of 2003. (b) Table of contents \\nThe table of contents for this Act is as follows: Sec. 1. Short title; table of contents Sec. 2. Findings, purpose, and definitions Sec. 3. Fees for occupancy and use of National Forest System lands and facilities by organizational camps Sec. 4. Implementation Sec. 5. Relationship to other laws Sec. 6. Deposit and expenditure of use fees Sec. 7. Ministerial issuance or amendment authorization 2. Findings, purpose, and definitions \\n(a) Findings \\nCongress finds the following: (1) Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, and faith-based and community-based organizations, provide a valuable service to young people, individuals with a disability, and their families by promoting physical, mental, and spiritual health through activities conducted in a natural environment. (2) The 192,000,0000 acres of national forests and grasslands of the National Forest System managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. 
(3) The Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such organizational camps that, while based on the fair market value of the land in use, also recognize the benefits provided to society by such organizational camps, do not preclude the ability of such organizational camps from utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational camps or their sponsoring organizations. (4) Organizational camps should— (A) ensure that their facilities meet applicable building and safety codes, including fire and health codes; (B) have annual inspections as required by local law, including at a minimum inspections for fire and food safety; and (C) have in place safety plans that address fire and medical emergencies and encounters with wildlife. (b) Purpose \\nIt is the purpose of this Act to establish a land use fee system that provides for an equitable return to the Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve young people or individuals with a disability. (c) Definitions \\nIn this Act: (1) The term organizational camp means a public or semi-public camp that— (A) is developed on National Forest System lands by a nonprofit organization or governmental entity; (B) provides a valuable service to the public by using such lands as a setting to introduce young people or individuals with a disability to activities that they may not otherwise experience and to educate them on natural resource issues; and (C) does not have as its primary purpose raising revenue through commercial activities. (2) The term Secretary means the Secretary of Agriculture, acting through the Chief of the Forest Service. (3) The term individual with a disability has the meaning given the term in section 7 of the Rehabilitation Act of 1973 (29 U.S.C. 705). 
(4) The term children at risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as parental drug abuse, homelessness, or child abuse. (5) The term change in control means— (A) in the case of a corporation, the sale or transfer of a controlling interest in the corporation; (B) in the case of a partnership or limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability company; and (C) in the case of an individual, the sale or transfer of an organizational camp to another party. 3. Fees for occupancy and use of National Forest System lands and facilities by organizational camps \\n(a) Land use fee \\n(1) Percentage of land value \\nThe Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of National Forest System lands equal to five percent of the product of the following: (A) The total number of acres of National Forest System lands authorized for the organizational camp. (B) The estimated per-acre market value of land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture conducted by the National Agricultural Statistics Service. (2) Annual adjustment \\nThe land use fee determined under paragraph (1) for an organizational camp shall be adjusted annually by the annual compounded rate of change between the two most recent Censuses of Agriculture. (3) Reduction in fees \\n(A) Based on type of participants \\nThe Secretary shall reduce the land use fee determined under paragraph (1) for an organizational camp if the organizational camp is attended by individuals with a disability or children at risk.', start_char_idx=0, end_char_idx=4828, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n",
434
+ " TextNode(id_='5f7e3f35-a6dd-481b-968a-c70b2717c3d5', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e59781d0-ad81-4411-9415-697b650e3d7e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='19c0d72fca6cf938248ff70d34f7af953ce06be4b57f1748652d7fa878b87185'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='b77fc036-8f99-415c-8142-f9a672f450bc', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='bc3a3fa910ae2a4b931df0affbcb1e9ceb8a31876ea173f0411d363da53cb335'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='0191012f-76d7-46b3-bdd3-ba84581f04c3', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='d935674e66f4f56c1e706195b7d0e764369299c171aadf0e091a8167e1daecf3')}, text=\"(B) The estimated per-acre market value of land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture conducted by the National Agricultural Statistics Service. (2) Annual adjustment \\nThe land use fee determined under paragraph (1) for an organizational camp shall be adjusted annually by the annual compounded rate of change between the two most recent Censuses of Agriculture. (3) Reduction in fees \\n(A) Based on type of participants \\nThe Secretary shall reduce the land use fee determined under paragraph (1) for an organizational camp if the organizational camp is attended by individuals with a disability or children at risk. The amount of the reduction for a year shall bear the same ratio to the land use fee determined under paragraph (1) for the organizational camp as the total number of individuals with a disability and children at risk who attend the organizational camp bears to the total number of individuals who attend the organizational camp for the year. 
(B) Based on type of programs \\nAfter making the reduction required by subparagraph (A), the Secretary shall also reduce the land use fee determined under paragraph (1) for an organizational camp if the organizational camp provides youth programs for individuals attending the camp consisting of organized and supervised social, citizenship, character-building, or faith-based activities oriented to outdoor-recreation experiences. The amount of the reduction for a year shall be equal to 60 percent of the land use fee determined under paragraph (1), as adjusted under subparagraph (A). (C) Relation to minimum fee \\nNotwithstanding subparagraphs (A) and (B), the reductions made under this paragraph may not reduce the land use fee for an organizational camp below the minimum land use fee required to be charged under paragraph (4). (D) Special considerations \\nFor purposes of determining the amount of the land use fee reduction required under subparagraph (A) or (B), the Secretary may not take into consideration the existence of sponsorships or scholarships to assist individuals in attending the organizational camp. (4) Minimum land use fee \\nThe Secretary shall charge a minimum land use fee under paragraph (1) that represents, on average, the Secretary's cost annually to administer an organizational camp special use authorization in the National Forest Region in which the organizational camp is located. Notwithstanding paragraph (3) or subsection (d), the minimum land use fee shall not be subject to a reduction or waiver. (b) Facility use fee \\n(1) Percentage of facilities value \\nIf an organizational camp uses a Government-owned facility on National Forest System lands pursuant to section 7 of the Act of April 24, 1950 (commonly known as the Granger-Thye Act; 16 U.S.C. 
580d), the Secretary shall charge, in addition to the land use fee imposed under subsection (a), a facility use fee equal to five percent of the value of the authorized facilities, as determined by the Secretary. (2) Reduction in fees prohibited \\nNotwithstanding subsection (d), the facility use fees determined under paragraph (1) shall not be subject to a reduction or waiver. (c) Fee related to receipt of other revenues \\nIf an organizational camp derives revenue from the use of National Forest System lands or authorized facilities described in subsection (b) for purposes other than to introduce young people or individuals with a disability to activities that they may not otherwise experience and to educate them on natural resource issues, the Secretary shall charge, in addition to the land use fee imposed under subsection (a) and the facility use fee imposed under subsection (b), an additional fee equal to five percent of that revenue. (d) Work-in-lieu program \\nSubject to subsections (a)(4) and (b)(2), section 3 of the Federal Timber Contract Payment Modification Act (16 U.S.C. 539f) shall apply to the use fees imposed under this section. 4. Implementation \\n(a) Prompt implementation \\nThe Secretary shall issue direction regarding implementation of this Act by interim directive within 180 days after the date of the enactment of this Act. The Secretary shall implement this Act beginning with the first billing cycle for organizational camp special use authorizations occurring more than 180 days after the date of the enactment of this Act. (b) Phase-in of use fee increases \\nIn issuing any direction regarding implementation of this Act under subsection (a), the Secretary shall consider whether to phase-in any significant increases in annual land or facility use fees for organizational camps. 5. 
Relationship to other laws \\nExcept as specifically provided by this Act, nothing in this Act supersedes or otherwise affects any provision of law, regulation, or policy regarding the issuance or administration of authorizations for organizational camps regarding the occupancy and use of National Forest System lands. 6. Deposit and expenditure of use fees \\n(a) Deposit and availability \\nUnless subject to section 7 of the Act of April 24, 1950 (commonly known as the Granger-Thye Act; 16 U.S.C.\", start_char_idx=4143, end_char_idx=9275, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n",
435
+ " TextNode(id_='0191012f-76d7-46b3-bdd3-ba84581f04c3', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e59781d0-ad81-4411-9415-697b650e3d7e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='19c0d72fca6cf938248ff70d34f7af953ce06be4b57f1748652d7fa878b87185'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='5f7e3f35-a6dd-481b-968a-c70b2717c3d5', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='45627c584bb296bcb3c49c428092399194c37c6726210acedd5b22530af6d5a8')}, text=\"The Secretary shall implement this Act beginning with the first billing cycle for organizational camp special use authorizations occurring more than 180 days after the date of the enactment of this Act. (b) Phase-in of use fee increases \\nIn issuing any direction regarding implementation of this Act under subsection (a), the Secretary shall consider whether to phase-in any significant increases in annual land or facility use fees for organizational camps. 5. Relationship to other laws \\nExcept as specifically provided by this Act, nothing in this Act supersedes or otherwise affects any provision of law, regulation, or policy regarding the issuance or administration of authorizations for organizational camps regarding the occupancy and use of National Forest System lands. 6. Deposit and expenditure of use fees \\n(a) Deposit and availability \\nUnless subject to section 7 of the Act of April 24, 1950 (commonly known as the Granger-Thye Act; 16 U.S.C. 580d), use fees collected by the Secretary under this Act shall be deposited in a special account in the Treasury and shall remain available to the Secretary for expenditure, without further appropriation until expended, for the purposes described in subsection (c). 
(b) Transfer \\nUpon request of the Secretary, the Secretary of the Treasury shall transfer to the Secretary from the special account such amounts as the Secretary may request. The Secretary shall accept and use such amounts in accordance with subsection (c). (c) Use \\nUse fees deposited pursuant to subsection (a) and transferred to the Secretary under subsection (b) shall be expended for monitoring of Forest Service special use authorizations, administration of the Forest Service's special program, interpretive programs, environmental analysis, environmental restoration, and similar purposes. 7. Ministerial issuance or amendment authorization \\n(a) NEPA exception \\nThe ministerial issuance or amendment of an organizational camp special use authorization shall not be subject to the National Environmental Policy Act of 1969 (42 U.S.C. 4321 et seq.). (b) Rule of construction \\nFor purposes of subsection (a), the ministerial issuance or amendment of an authorization occurs only when the issuance or amendment of the authorization would not change the physical environment or the activities, facilities, or program of the operations governed by the authorization, and at least one of the following apply: (1) The authorization is issued upon a change in control of the holder of an existing authorization. (2) The holder, upon expiration of an authorization, is issued a new authorization. (3) The authorization is amended— (A) to effectuate administrative changes, such as modification of the land use fee or conversion to a new special use authorization form; or (B) to include nondiscretionary environmental standards or to conform with current law.\", start_char_idx=8318, end_char_idx=11200, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]"
436
+ ]
437
+ },
438
+ "execution_count": 10,
439
+ "metadata": {},
440
+ "output_type": "execute_result"
441
+ }
442
+ ],
443
+ "source": [
444
+ "splits = splitter.get_nodes_from_documents([doc])\n",
445
+ "splits"
446
+ ]
447
+ },
448
+ {
449
+ "cell_type": "code",
450
+ "execution_count": 11,
451
+ "id": "d6375311",
452
+ "metadata": {},
453
+ "outputs": [
454
+ {
455
+ "data": {
456
+ "text/html": [
457
+ "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>. Short title; table of contents \n",
458
+ "<span style=\"font-weight: bold\">(</span>a<span style=\"font-weight: bold\">)</span> Short title \n",
459
+ "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2003</span>. <span style=\"font-weight: bold\">(</span>b<span style=\"font-weight: bold\">)</span> Table of contents\n",
460
+ "The table of contents for this Act is as follows: Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>. Short title; table of contents Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span>. Findings, purpose,\n",
461
+ "and definitions Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Fees for occupancy and use of National Forest System lands and facilities by organizational\n",
462
+ "camps Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span>. Implementation Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">5</span>. Relationship to other laws Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">6</span>. Deposit and expenditure of use fees Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">7</span>.\n",
463
+ "Ministerial issuance or amendment authorization <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span>. Findings, purpose, and definitions \n",
464
+ "<span style=\"font-weight: bold\">(</span>a<span style=\"font-weight: bold\">)</span> Findings \n",
465
+ "Congress finds the following: <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"font-weight: bold\">)</span> Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, \n",
466
+ "and faith-based and community-based organizations, provide a valuable service to young people, individuals with a \n",
467
+ "disability, and their families by promoting physical, mental, and spiritual health through activities conducted in \n",
468
+ "a natural environment. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span><span style=\"font-weight: bold\">)</span> The <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">192</span>,<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">000</span>,<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0000</span> acres of national forests and grasslands of the National Forest System \n",
469
+ "managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span><span style=\"font-weight: bold\">)</span> The \n",
470
+ "Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such \n",
471
+ "organizational camps that, while based on the fair market value of the land in use, also recognize the benefits \n",
472
+ "provided to society by such organizational camps, do not preclude the ability of such organizational camps from \n",
473
+ "utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational\n",
474
+ "camps or their sponsoring organizations. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span><span style=\"font-weight: bold\">)</span> Organizational camps should— <span style=\"font-weight: bold\">(</span>A<span style=\"font-weight: bold\">)</span> ensure that their facilities meet \n",
475
+ "applicable building and safety codes, including fire and health codes; <span style=\"font-weight: bold\">(</span>B<span style=\"font-weight: bold\">)</span> have annual inspections as required by \n",
476
+ "local law, including at a minimum inspections for fire and food safety; and <span style=\"font-weight: bold\">(</span>C<span style=\"font-weight: bold\">)</span> have in place safety plans that \n",
477
+ "address fire and medical emergencies and encounters with wildlife. <span style=\"font-weight: bold\">(</span>b<span style=\"font-weight: bold\">)</span> Purpose \n",
478
+ "It is the purpose of this Act to establish a land use fee system that provides for an equitable return to the \n",
479
+ "Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve \n",
480
+ "young people or individuals with a disability. <span style=\"font-weight: bold\">(</span>c<span style=\"font-weight: bold\">)</span> Definitions \n",
481
+ "In this Act: <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"font-weight: bold\">)</span> The term organizational camp means a public or semi-public camp that— <span style=\"font-weight: bold\">(</span>A<span style=\"font-weight: bold\">)</span> is developed on National\n",
482
+ "Forest System lands by a nonprofit organization or governmental entity; <span style=\"font-weight: bold\">(</span>B<span style=\"font-weight: bold\">)</span> provides a valuable service to the \n",
483
+ "public by using such lands as a setting to introduce young people or individuals with a disability to activities \n",
484
+ "that they may not otherwise experience and to educate them on natural resource issues; and <span style=\"font-weight: bold\">(</span>C<span style=\"font-weight: bold\">)</span> does not have as its\n",
485
+ "primary purpose raising revenue through commercial activities. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span><span style=\"font-weight: bold\">)</span> The term Secretary means the Secretary of \n",
486
+ "Agriculture, acting through the Chief of the Forest Service. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span><span style=\"font-weight: bold\">)</span> The term individual with a disability has the \n",
487
+ "meaning given the term in section <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">7</span> of the Rehabilitation Act of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1973</span> <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">29</span> U.S.C. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">705</span><span style=\"font-weight: bold\">)</span>. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span><span style=\"font-weight: bold\">)</span> The term children at \n",
488
+ "risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as \n",
489
+ "parental drug abuse, homelessness, or child abuse. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">5</span><span style=\"font-weight: bold\">)</span> The term change in control means— <span style=\"font-weight: bold\">(</span>A<span style=\"font-weight: bold\">)</span> in the case of a \n",
490
+ "corporation, the sale or transfer of a controlling interest in the corporation; <span style=\"font-weight: bold\">(</span>B<span style=\"font-weight: bold\">)</span> in the case of a partnership or\n",
491
+ "limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability \n",
492
+ "company; and <span style=\"font-weight: bold\">(</span>C<span style=\"font-weight: bold\">)</span> in the case of an individual, the sale or transfer of an organizational camp to another party. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. \n",
493
+ "Fees for occupancy and use of National Forest System lands and facilities by organizational camps \n",
494
+ "<span style=\"font-weight: bold\">(</span>a<span style=\"font-weight: bold\">)</span> Land use fee \n",
495
+ "<span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"font-weight: bold\">)</span> Percentage of land value \n",
496
+ "The Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of \n",
497
+ "National Forest System lands equal to five percent of the product of the following: <span style=\"font-weight: bold\">(</span>A<span style=\"font-weight: bold\">)</span> The total number of acres \n",
498
+ "of National Forest System lands authorized for the organizational camp. <span style=\"font-weight: bold\">(</span>B<span style=\"font-weight: bold\">)</span> The estimated per-acre market value of \n",
499
+ "land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture \n",
500
+ "conducted by the National Agricultural Statistics Service. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span><span style=\"font-weight: bold\">)</span> Annual adjustment \n",
501
+ "The land use fee determined under paragraph <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"font-weight: bold\">)</span> for an organizational camp shall be adjusted annually by the annual\n",
502
+ "compounded rate of change between the two most recent Censuses of Agriculture. <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span><span style=\"font-weight: bold\">)</span> Reduction in fees \n",
503
+ "<span style=\"font-weight: bold\">(</span>A<span style=\"font-weight: bold\">)</span> Based on type of participants \n",
504
+ "The Secretary shall reduce the land use fee determined under paragraph <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"font-weight: bold\">)</span> for an organizational camp if the \n",
505
+ "organizational camp is attended by individuals with a disability or children at risk.\n",
506
+ "</pre>\n"
507
+ ],
508
+ "text/plain": [
509
+ "\u001b[1;36m1\u001b[0m. Short title; table of contents \n",
510
+ "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Short title \n",
511
+ "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of \u001b[1;36m2003\u001b[0m. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Table of contents\n",
512
+ "The table of contents for this Act is as follows: Sec. \u001b[1;36m1\u001b[0m. Short title; table of contents Sec. \u001b[1;36m2\u001b[0m. Findings, purpose,\n",
513
+ "and definitions Sec. \u001b[1;36m3\u001b[0m. Fees for occupancy and use of National Forest System lands and facilities by organizational\n",
514
+ "camps Sec. \u001b[1;36m4\u001b[0m. Implementation Sec. \u001b[1;36m5\u001b[0m. Relationship to other laws Sec. \u001b[1;36m6\u001b[0m. Deposit and expenditure of use fees Sec. \u001b[1;36m7\u001b[0m.\n",
515
+ "Ministerial issuance or amendment authorization \u001b[1;36m2\u001b[0m. Findings, purpose, and definitions \n",
516
+ "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Findings \n",
517
+ "Congress finds the following: \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, \n",
518
+ "and faith-based and community-based organizations, provide a valuable service to young people, individuals with a \n",
519
+ "disability, and their families by promoting physical, mental, and spiritual health through activities conducted in \n",
520
+ "a natural environment. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m The \u001b[1;36m192\u001b[0m,\u001b[1;36m000\u001b[0m,\u001b[1;36m0000\u001b[0m acres of national forests and grasslands of the National Forest System \n",
521
+ "managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. \u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m The \n",
522
+ "Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such \n",
523
+ "organizational camps that, while based on the fair market value of the land in use, also recognize the benefits \n",
524
+ "provided to society by such organizational camps, do not preclude the ability of such organizational camps from \n",
525
+ "utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational\n",
526
+ "camps or their sponsoring organizations. \u001b[1m(\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m)\u001b[0m Organizational camps should— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m ensure that their facilities meet \n",
527
+ "applicable building and safety codes, including fire and health codes; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m have annual inspections as required by \n",
528
+ "local law, including at a minimum inspections for fire and food safety; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m have in place safety plans that \n",
529
+ "address fire and medical emergencies and encounters with wildlife. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Purpose \n",
530
+ "It is the purpose of this Act to establish a land use fee system that provides for an equitable return to the \n",
531
+ "Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve \n",
532
+ "young people or individuals with a disability. \u001b[1m(\u001b[0mc\u001b[1m)\u001b[0m Definitions \n",
533
+ "In this Act: \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m The term organizational camp means a public or semi-public camp that— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m is developed on National\n",
534
+ "Forest System lands by a nonprofit organization or governmental entity; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m provides a valuable service to the \n",
535
+ "public by using such lands as a setting to introduce young people or individuals with a disability to activities \n",
536
+ "that they may not otherwise experience and to educate them on natural resource issues; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m does not have as its\n",
537
+ "primary purpose raising revenue through commercial activities. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m The term Secretary means the Secretary of \n",
538
+ "Agriculture, acting through the Chief of the Forest Service. \u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m The term individual with a disability has the \n",
539
+ "meaning given the term in section \u001b[1;36m7\u001b[0m of the Rehabilitation Act of \u001b[1;36m1973\u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m29\u001b[0m U.S.C. \u001b[1;36m705\u001b[0m\u001b[1m)\u001b[0m. \u001b[1m(\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m)\u001b[0m The term children at \n",
540
+ "risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as \n",
541
+ "parental drug abuse, homelessness, or child abuse. \u001b[1m(\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m)\u001b[0m The term change in control means— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m in the case of a \n",
542
+ "corporation, the sale or transfer of a controlling interest in the corporation; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m in the case of a partnership or\n",
543
+ "limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability \n",
544
+ "company; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m in the case of an individual, the sale or transfer of an organizational camp to another party. \u001b[1;36m3\u001b[0m. \n",
545
+ "Fees for occupancy and use of National Forest System lands and facilities by organizational camps \n",
546
+ "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Land use fee \n",
547
+ "\u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m Percentage of land value \n",
548
+ "The Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of \n",
549
+ "National Forest System lands equal to five percent of the product of the following: \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m The total number of acres \n",
550
+ "of National Forest System lands authorized for the organizational camp. \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m The estimated per-acre market value of \n",
551
+ "land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture \n",
552
+ "conducted by the National Agricultural Statistics Service. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m Annual adjustment \n",
553
+ "The land use fee determined under paragraph \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m for an organizational camp shall be adjusted annually by the annual\n",
554
+ "compounded rate of change between the two most recent Censuses of Agriculture. \u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m Reduction in fees \n",
555
+ "\u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m Based on type of participants \n",
556
+ "The Secretary shall reduce the land use fee determined under paragraph \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m for an organizational camp if the \n",
557
+ "organizational camp is attended by individuals with a disability or children at risk.\n"
558
+ ]
559
+ },
560
+ "metadata": {},
561
+ "output_type": "display_data"
562
+ }
563
+ ],
564
+ "source": [
565
+ "rich_print(splits[0].text)"
566
+ ]
567
+ },
568
+ {
569
+ "cell_type": "code",
570
+ "execution_count": 12,
571
+ "id": "4779af81",
572
+ "metadata": {},
573
+ "outputs": [],
574
+ "source": [
575
+ "splitter = SentenceSplitter(chunk_size=128, chunk_overlap=0)"
576
+ ]
577
+ },
578
+ {
579
+ "cell_type": "code",
580
+ "execution_count": 13,
581
+ "id": "6d026858",
582
+ "metadata": {},
583
+ "outputs": [
584
+ {
585
+ "data": {
586
+ "text/html": [
587
+ "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>. Short title; table of contents \n",
588
+ "<span style=\"font-weight: bold\">(</span>a<span style=\"font-weight: bold\">)</span> Short title \n",
589
+ "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2003</span>. <span style=\"font-weight: bold\">(</span>b<span style=\"font-weight: bold\">)</span> Table of contents\n",
590
+ "The table of contents for this Act is as follows: Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>. Short title; table of contents Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span>. Findings, purpose,\n",
591
+ "and definitions Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span>. Fees for occupancy and use of National Forest System lands and facilities by organizational\n",
592
+ "camps Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span>. Implementation Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">5</span>. Relationship to other laws Sec. <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">6</span>.\n",
593
+ "</pre>\n"
594
+ ],
595
+ "text/plain": [
596
+ "\u001b[1;36m1\u001b[0m. Short title; table of contents \n",
597
+ "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Short title \n",
598
+ "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of \u001b[1;36m2003\u001b[0m. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Table of contents\n",
599
+ "The table of contents for this Act is as follows: Sec. \u001b[1;36m1\u001b[0m. Short title; table of contents Sec. \u001b[1;36m2\u001b[0m. Findings, purpose,\n",
600
+ "and definitions Sec. \u001b[1;36m3\u001b[0m. Fees for occupancy and use of National Forest System lands and facilities by organizational\n",
601
+ "camps Sec. \u001b[1;36m4\u001b[0m. Implementation Sec. \u001b[1;36m5\u001b[0m. Relationship to other laws Sec. \u001b[1;36m6\u001b[0m.\n"
602
+ ]
603
+ },
604
+ "metadata": {},
605
+ "output_type": "display_data"
606
+ }
607
+ ],
608
+ "source": [
609
+ "splits = splitter.get_nodes_from_documents([doc])\n",
610
+ "rich_print(splits[0].text)"
611
+ ]
612
+ },
613
+ {
614
+ "cell_type": "code",
615
+ "execution_count": 14,
616
+ "id": "763f90a9",
617
+ "metadata": {},
618
+ "outputs": [
619
+ {
620
+ "data": {
621
+ "text/plain": [
622
+ "20"
623
+ ]
624
+ },
625
+ "execution_count": 14,
626
+ "metadata": {},
627
+ "output_type": "execute_result"
628
+ }
629
+ ],
630
+ "source": [
631
+ "sample_size = 12\n",
632
+ "documents = [Document.from_dict({\"text\": ds[i]['text']}) for i in range(10)]\n",
633
+ "splits = splitter.get_nodes_from_documents(documents)\n",
634
+ "len(splits)\n",
635
+ "# uncomment to see the output\n",
636
+ "# for split in splits:\n",
637
+ "# rich_print(split.text)"
638
+ ]
639
+ },
640
+ {
641
+ "cell_type": "markdown",
642
+ "id": "1ed1a8cd",
643
+ "metadata": {},
644
+ "source": [
645
+ "For this particular dataset, since the texts are quite dense in the topics they cover, it seems to make sense to aim for a smaller chunk size like 128. This will help us to ensure that we're capturing the specific topics in the text. If you are using a different dataset you might want to experiment with different chunk sizes to see what works best for your data."
646
+ ]
647
+ },
648
+ {
649
+ "cell_type": "markdown",
650
+ "id": "41fb3f85",
651
+ "metadata": {},
652
+ "source": [
653
+ "## Process our full dataset\n",
654
+ "\n",
655
+ "Now that we've decided on a chunk size, let's process our full dataset. We'll split each text into chunks and save these to a new dataset."
656
+ ]
657
+ },
658
+ {
659
+ "cell_type": "code",
660
+ "execution_count": 15,
661
+ "id": "ada9122f-a505-4bc0-80f6-228be2067891",
662
+ "metadata": {
663
+ "tags": []
664
+ },
665
+ "outputs": [],
666
+ "source": [
667
+ "def split_texts(\n",
668
+ " examples: Dict[str, Any],\n",
669
+ " text_column_name: str = \"text\",\n",
670
+ " id_column_name: Optional[str] = None,\n",
671
+ " splitter: Optional[SentenceSplitter] = None,\n",
672
+ "):\n",
673
+ " if splitter is None:\n",
674
+ " # if not provided, use the default splitter\n",
675
+ " splitter = SentenceSplitter()\n",
676
+ " texts = examples[text_column_name]\n",
677
+ " if id_column_name is None:\n",
678
+ " # Generate random ids if not provided\n",
679
+ " ids = [str(uuid.uuid4()) for _ in range(len(texts))]\n",
680
+ " else:\n",
681
+ " ids = examples[id_column_name]\n",
682
+ " sections = []\n",
683
+ " ids_ = []\n",
684
+ " for text, id_ in zip(texts, ids):\n",
685
+ " # Create a document for each text\n",
686
+ " document = Document(text=text)\n",
687
+ " # Split the document into nodes\n",
688
+ " nodes = splitter.get_nodes_from_documents([document])\n",
689
+ " # Extract the text from each node\n",
690
+ " sentences = [n.text for n in nodes]\n",
691
+ " # Extend the sections list with these sentences\n",
692
+ " sections.extend(sentences)\n",
693
+ " # Extend the ids_ list with the corresponding id, repeated for each sentence\n",
694
+ " ids_.extend([id_] * len(sentences))\n",
695
+ " return {\"section\": sections, \"id\": ids_}"
696
+ ]
697
+ },
698
+ {
699
+ "cell_type": "markdown",
700
+ "id": "d7221f82",
701
+ "metadata": {},
702
+ "source": [
703
+ "We can now split the full dataset. \n",
704
+ "\n",
705
+ "If you are using a different dataset remember to adjust the `text_column_name` if the name of the column containing the text for your dataset is different. If there is an `id` column you can specify that as well otherwise set this to `None` and the function will generate an id for each row."
706
+ ]
707
+ },
708
+ {
709
+ "cell_type": "code",
710
+ "execution_count": 16,
711
+ "id": "336c0f08",
712
+ "metadata": {},
713
+ "outputs": [],
714
+ "source": [
715
+ "splitter = SentenceSplitter(chunk_size=128, chunk_overlap=0)"
716
+ ]
717
+ },
718
+ {
719
+ "cell_type": "code",
720
+ "execution_count": 17,
721
+ "id": "3bb3807f-37ac-4e52-b37a-dc7ebc4a3446",
722
+ "metadata": {
723
+ "tags": []
724
+ },
725
+ "outputs": [
726
+ {
727
+ "data": {
728
+ "application/vnd.jupyter.widget-view+json": {
729
+ "model_id": "9aad33479c3348619a33327e168b5f45",
730
+ "version_major": 2,
731
+ "version_minor": 0
732
+ },
733
+ "text/plain": [
734
+ "Map (num_proc=8): 0%| | 0/125246 [00:00<?, ? examples/s]"
735
+ ]
736
+ },
737
+ "metadata": {},
738
+ "output_type": "display_data"
739
+ },
740
+ {
741
+ "data": {
742
+ "text/plain": [
743
+ "Dataset({\n",
744
+ " features: ['id', 'congress', 'bill_type', 'bill_number', 'bill_version', 'sections', 'sections_length', 'text', 'text_length', 'summary', 'summary_length', 'title'],\n",
745
+ " num_rows: 125246\n",
746
+ "})"
747
+ ]
748
+ },
749
+ "execution_count": 17,
750
+ "metadata": {},
751
+ "output_type": "execute_result"
752
+ }
753
+ ],
754
+ "source": [
755
+ "chunked_ds = ds.map(\n",
756
+ " split_texts,\n",
757
+ " batched=True,\n",
758
+ " num_proc=NUM_PROC,\n",
759
+ " remove_columns=list(ds.column_names),\n",
760
+ " fn_kwargs={\"text_column_name\": \"text\", \"id_column_name\": \"id\", \"splitter\": splitter},\n",
761
+ ")\n",
762
+ "ds"
763
+ ]
764
+ },
765
+ {
766
+ "cell_type": "code",
767
+ "execution_count": 18,
768
+ "id": "e1319f17",
769
+ "metadata": {},
770
+ "outputs": [
771
+ {
772
+ "data": {
773
+ "text/plain": [
774
+ "Dataset({\n",
775
+ " features: ['id', 'section'],\n",
776
+ " num_rows: 3446013\n",
777
+ "})"
778
+ ]
779
+ },
780
+ "execution_count": 18,
781
+ "metadata": {},
782
+ "output_type": "execute_result"
783
+ }
784
+ ],
785
+ "source": [
786
+ "chunked_ds"
787
+ ]
788
+ },
789
+ {
790
+ "cell_type": "code",
791
+ "execution_count": 19,
792
+ "id": "1c8c01ec",
793
+ "metadata": {},
794
+ "outputs": [
795
+ {
796
+ "data": {
797
+ "text/plain": [
798
+ "['(4) Coal \\nThe term coal means bituminous\\t\\t\\t\\tcoal, subbituminous coal, and lignite. (d) Aggregate\\t\\t\\t\\tcredits \\n(1) In\\t\\t\\t\\tgeneral \\nNo credit shall be allowed under this section with\\t\\t\\t\\trespect to any qualifying clean coal project unless such project is certified\\t\\t\\t\\tby the Secretary under subsection (e).',\n",
799
+ " '1. Short\\t\\t\\t title; table of contents \\n(a) Short\\t\\t\\t title \\nThis Act may be cited\\t\\t\\t as the Skilled Worker Immigration and\\t\\t\\t Fairness Act. (b) Table of\\t\\t\\t contents \\nThe table of contents for this Act is as follows: Sec. 1. Short title; table of\\t\\t\\t\\tcontents. Sec. 2. H–1B visas. Sec. 3. Employment-based immigration. Sec. 4. H–1B visa fraud and abuse protections. 2.',\n",
800
+ " 'Remote control locomotive use \\n(a) Prohibition \\nNo railroad carrier shall operate or cause to be operated on the general system of railroad transportation a remote control locomotive to carry hazardous materials. (b) Penalty \\n(1) A railroad carrier that knowingly violates this section or a rule issued under this section is liable to the United States Government for a civil penalty of at least $5,000 but not more than $50,000 for each violation.']"
801
+ ]
802
+ },
803
+ "execution_count": 19,
804
+ "metadata": {},
805
+ "output_type": "execute_result"
806
+ }
807
+ ],
808
+ "source": [
809
+ "sample_idx = random.sample(range(len(chunked_ds)),k=3)\n",
810
+ "chunked_ds.select(sample_idx)[:]['section']"
811
+ ]
812
+ },
813
+ {
814
+ "cell_type": "markdown",
815
+ "id": "f9f757c5",
816
+ "metadata": {},
817
+ "source": [
818
+ "## Pushing the data to the hub\n",
819
+ "\n",
820
+ "We can save the data locally to use in the next notebook but it's often easier to work with the data if we push it to the hub. This way we can easily access the data in the next notebook."
821
+ ]
822
+ },
823
+ {
824
+ "cell_type": "code",
825
+ "execution_count": 21,
826
+ "id": "93cb06cc",
827
+ "metadata": {},
828
+ "outputs": [
829
+ {
830
+ "data": {
831
+ "application/vnd.jupyter.widget-view+json": {
832
+ "model_id": "cbf35fcd72694d0e9e842daaccfd7a46",
833
+ "version_major": 2,
834
+ "version_minor": 0
835
+ },
836
+ "text/plain": [
837
+ "Uploading the dataset shards: 0%| | 0/4 [00:00<?, ?it/s]"
838
+ ]
839
+ },
840
+ "metadata": {},
841
+ "output_type": "display_data"
842
+ },
843
+ {
844
+ "data": {
845
+ "application/vnd.jupyter.widget-view+json": {
846
+ "model_id": "a06440fe1467488dadfb8a8e3cbda147",
847
+ "version_major": 2,
848
+ "version_minor": 0
849
+ },
850
+ "text/plain": [
851
+ "Creating parquet from Arrow format: 0%| | 0/862 [00:00<?, ?ba/s]"
852
+ ]
853
+ },
854
+ "metadata": {},
855
+ "output_type": "display_data"
856
+ },
857
+ {
858
+ "data": {
859
+ "application/vnd.jupyter.widget-view+json": {
860
+ "model_id": "33aaab93d1d74744a7a5bcdc14db7a18",
861
+ "version_major": 2,
862
+ "version_minor": 0
863
+ },
864
+ "text/plain": [
865
+ "Creating parquet from Arrow format: 0%| | 0/862 [00:00<?, ?ba/s]"
866
+ ]
867
+ },
868
+ "metadata": {},
869
+ "output_type": "display_data"
870
+ },
871
+ {
872
+ "data": {
873
+ "application/vnd.jupyter.widget-view+json": {
874
+ "model_id": "d2b8ac857f83403aa7f5be36e9a5b95d",
875
+ "version_major": 2,
876
+ "version_minor": 0
877
+ },
878
+ "text/plain": [
879
+ "Creating parquet from Arrow format: 0%| | 0/862 [00:00<?, ?ba/s]"
880
+ ]
881
+ },
882
+ "metadata": {},
883
+ "output_type": "display_data"
884
+ },
885
+ {
886
+ "data": {
887
+ "application/vnd.jupyter.widget-view+json": {
888
+ "model_id": "dfda42582a124c928a06726e6aebf5fd",
889
+ "version_major": 2,
890
+ "version_minor": 0
891
+ },
892
+ "text/plain": [
893
+ "Creating parquet from Arrow format: 0%| | 0/862 [00:00<?, ?ba/s]"
894
+ ]
895
+ },
896
+ "metadata": {},
897
+ "output_type": "display_data"
898
+ },
899
+ {
900
+ "data": {
901
+ "text/plain": [
902
+ "CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/bill_summary_us_chunks/commit/e9c23f8e002cda39422c1a39bc95c8e5cd37213b', commit_message='Upload dataset', commit_description='', oid='e9c23f8e002cda39422c1a39bc95c8e5cd37213b', pr_url=None, pr_revision=None, pr_num=None)"
903
+ ]
904
+ },
905
+ "execution_count": 21,
906
+ "metadata": {},
907
+ "output_type": "execute_result"
908
+ }
909
+ ],
910
+ "source": [
911
+ "chunked_ds.push_to_hub(\"davanstrien/bill_summary_us_chunks\")"
912
+ ]
913
+ },
914
+ {
915
+ "cell_type": "markdown",
916
+ "id": "c4f070a4",
917
+ "metadata": {},
918
+ "source": [
919
+ "## Next steps\n",
920
+ "\n",
921
+ "In the next notebook, we'll look at how we can use an LLM to generate synthetic data for fine-tuning our custom Sentence Transformers model. If you are running this notebook in the Synthetic Dataset Workshop Space you can find the next notebook in the workspace. If you are running this notebook locally you can find the next notebook in the Hugging Face repository. "
922
+ ]
923
+ }
924
+ ],
925
+ "metadata": {
926
+ "kernelspec": {
927
+ "display_name": "Python 3 (ipykernel)",
928
+ "language": "python",
929
+ "name": "python3"
930
+ },
931
+ "language_info": {
932
+ "codemirror_mode": {
933
+ "name": "ipython",
934
+ "version": 3
935
+ },
936
+ "file_extension": ".py",
937
+ "mimetype": "text/x-python",
938
+ "name": "python",
939
+ "nbconvert_exporter": "python",
940
+ "pygments_lexer": "ipython3",
941
+ "version": "3.11.1"
942
+ }
943
+ },
944
+ "nbformat": 4,
945
+ "nbformat_minor": 5
946
+ }
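The `split_texts` function in the notebook above turns each input row into several output rows and repeats the source id once per chunk. The following is a minimal, dependency-free sketch of that batching pattern. `toy_split` and `split_batch` are hypothetical names used only for illustration, and the word-count splitter is an assumed stand-in for LlamaIndex's `SentenceSplitter`, not its actual behavior.

```python
import uuid
from typing import Any, Dict, List, Optional


def toy_split(text: str, chunk_size: int = 5) -> List[str]:
    """Hypothetical stand-in for SentenceSplitter: group words into fixed-size chunks."""
    words = text.split()
    return [" ".join(words[i : i + chunk_size]) for i in range(0, len(words), chunk_size)]


def split_batch(
    examples: Dict[str, List[Any]],
    text_column_name: str = "text",
    id_column_name: Optional[str] = None,
    chunk_size: int = 5,
) -> Dict[str, List[Any]]:
    """Expand a batch of texts into chunks, repeating each row's id per chunk."""
    texts = examples[text_column_name]
    if id_column_name is None:
        # No id column: generate one random id per input row
        ids = [str(uuid.uuid4()) for _ in texts]
    else:
        ids = examples[id_column_name]
    sections: List[str] = []
    ids_: List[Any] = []
    for text, id_ in zip(texts, ids):
        chunks = toy_split(text, chunk_size)
        sections.extend(chunks)
        # Keep the two output columns aligned: one id entry per chunk
        ids_.extend([id_] * len(chunks))
    return {"section": sections, "id": ids_}


batch = {"id": ["a", "b"], "text": ["one two three four five six seven", "eight nine"]}
result = split_batch(batch, id_column_name="id", chunk_size=3)
print(result)
```

Because the function returns more rows than it receives, the notebook calls `ds.map` with `batched=True` and removes the original columns via `remove_columns`. Without that, the output columns would have a different length from the retained input columns.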
notebooks/sentence-similarity-datasets-creation/02_synthetic_data_creation.ipynb ADDED
The diff for this file is too large to render. See raw diff