diff --git "a/notebooks/sentence-similarity-datasets-creation/02_synthetic_data_creation.ipynb" "b/notebooks/sentence-similarity-datasets-creation/02_synthetic_data_creation.ipynb" new file mode 100644--- /dev/null +++ "b/notebooks/sentence-similarity-datasets-creation/02_synthetic_data_creation.ipynb" @@ -0,0 +1,2050 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a51a3962-7801-4906-aeff-74fdfb805b3e", + "metadata": { + "tags": [] + }, + "source": [ + "
\n", + " Tip: To run the default model for this notebook, it's suggested to run on 4xL4 GPUs. You can also opt for a smaller GPU, but you must swap out the used model. If you want to run on a smaller GPU it's suggested to use the `meta-llama/Meta-Llama-3-8B-Instruct` model.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "d47bb232", + "metadata": {}, + "source": [ + "# Creating a synthetic dataset for fine-tuning a Sentence Transformer model \n", + "\n", + "In the previous notebook, we prepared a dataset that we'll use to generate our synthetic dataset. We'll now focus on generating a dataset that we can use to train or fine-tune a sentence similarity model using the Sentence Transformers library. \n", + "\n", + "As a reminder, the type of dataset format we're working towards creating has three columns: \"anchor\", \"positive\", and \"negative\". In our case, \"anchor\" is the text from the domain we want our Sentence Transformers model to work well with. The \"positive\" text is a text which should be similar in some way to the \"anchor\" text, whilst the \"negative\" should be in some way dissimilar to our original text.\n", + "\n", + "As a very simple example, if our \"anchor\" text is \"Bill 179 introduces restrictions on the use of pesticides in Canada\", a \"positive\" example might be \"a law that regulating the use of pesticides\", and a \"negative\" example might be \"a bill related to education funding\". Whilst it is possible to train a Sentence Transformer model using only positive examples, the inclusion of negative examples can help the model to learn more about the space of possible sentences and improve its performance. The [Sentence Transformers documentation](https://www.sbert.net/docs/sentence_transformer/training_overview.html) goes into more detail about this.\n", + "\n", + "## Creating positive and negative pairs using an LLM\n", + "\n", + "One of the remarkable aspects of working with large language models, especially instruction-tuned models, is the significant control and flexibility they offer in text generation. This unique capability empowers us to create a synthetic dataset for training a sentence similarity model. We can leverage our input anchor text alongside a prompt to generate the positive and negative examples for similarity, giving us full control over the dataset's composition.\n", + "\n", + "### What do we mean by similarity? \n", + "\n", + "How will the model be used in practice? Since we are fine-tuning a model and have a lot of control over the dataset, we should consider what we want similarity to look like. \n", + "\n", + "In an RAG use case, we might default to creating a prompt that's something like \"based on this text, write a user query that would be satisfied by this text\" Whilst that could make sense, for many RAG applications, the embeddings are used to give extra context to an LLM based on a user prompt. This user prompt might take the form of a query, but it's more likely to be a question rather than the query we make to a database. Depending on the use case, we might want to generate prompts that are more like the type of questions we expect the model to use." 
+ ] + }, + { + "cell_type": "markdown", + "id": "6eedd4ef", + "metadata": {}, + "source": [ + "Let's start with our imports" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "fd71339b-7d22-4c9c-958e-f7b2f2ed2991", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import json\n", + "import random\n", + "\n", + "from datasets import concatenate_datasets, load_dataset\n", + "from huggingface_hub import DatasetCard, login\n", + "from outlines import generate, models\n", + "from pydantic import BaseModel, conlist, constr\n", + "from tqdm.auto import tqdm\n", + "from vllm import LLM, SamplingParams" + ] + }, + { + "cell_type": "markdown", + "id": "f47e5f95", + "metadata": {}, + "source": [ + "You can do this below if you haven't authenticated with the Hugging Face Hub yet." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5031cf30-3cfa-4bd9-ba9f-65bc52154f3f", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "a5df2a83236b4aa094ed05b50c0f5cec", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox(children=(HTML(value='
\n", + " Tip: If you are running on a single GPU update the `tensor_parallel_size` below to 1. \n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "02107e5b-22e2-4bd3-85ba-aad6246ba9d9", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/user/miniconda/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n", + " warnings.warn(\n", + "2024-06-19 14:03:15,692\tINFO worker.py:1753 -- Started a local Ray instance.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO 06-19 14:03:16 config.py:623] Defaulting to use mp for distributed inference\n", + "INFO 06-19 14:03:16 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).\n", + "INFO 06-19 14:03:16 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='01-ai/Yi-1.5-34B-Chat', speculative_config=None, tokenizer='01-ai/Yi-1.5-34B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=01-ai/Yi-1.5-34B-Chat)\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:03:19 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:03:19 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n", + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:03:19 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n", + "INFO 06-19 14:03:20 utils.py:623] Found nccl from library libnccl.so.2\n", + "INFO 06-19 14:03:20 pynccl.py:65] vLLM is using nccl==2.20.5\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:03:20 utils.py:623] Found nccl from library libnccl.so.2\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:03:20 utils.py:623] Found nccl from library libnccl.so.2\n", + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:03:20 utils.py:623] Found nccl from library libnccl.so.2\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:03:20 pynccl.py:65] vLLM is using nccl==2.20.5\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:03:20 pynccl.py:65] vLLM is using nccl==2.20.5\n", + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:03:20 pynccl.py:65] vLLM is using nccl==2.20.5\n", + "WARNING 06-19 14:03:20 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m WARNING 06-19 14:03:20 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. 
To silence this warning, specify disable_custom_all_reduce=True explicitly.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m WARNING 06-19 14:03:20 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n", + "WARNING 06-19 14:03:20 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Traceback (most recent call last):\n", + " File \"/home/user/miniconda/lib/python3.9/multiprocessing/resource_tracker.py\", line 201, in main\n", + " cache[rtype].remove(name)\n", + "KeyError: '/psm_e0bb0061'\n", + "Traceback (most recent call last):\n", + " File \"/home/user/miniconda/lib/python3.9/multiprocessing/resource_tracker.py\", line 201, in main\n", + " cache[rtype].remove(name)\n", + "KeyError: '/psm_e0bb0061'\n", + "Traceback (most recent call last):\n", + " File \"/home/user/miniconda/lib/python3.9/multiprocessing/resource_tracker.py\", line 201, in main\n", + " cache[rtype].remove(name)\n", + "KeyError: '/psm_e0bb0061'\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']\n", + "INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB\n", + "INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB\n", + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB\n", + "INFO 06-19 14:03:33 distributed_gpu_executor.py:56] # GPU blocks: 3187, # CPU blocks: 4369\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:03:42 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:03:42 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:03:42 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. 
To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:03:42 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n", + "INFO 06-19 14:03:42 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n", + "INFO 06-19 14:03:42 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:03:42 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:03:42 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44274)\u001b[0;0m INFO 06-19 14:04:02 model_runner.py:954] Graph capturing finished in 21 secs.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44272)\u001b[0;0m INFO 06-19 14:04:02 model_runner.py:954] Graph capturing finished in 21 secs.\n", + "\u001b[1;36m(VllmWorkerProcess pid=44273)\u001b[0;0m INFO 06-19 14:04:02 model_runner.py:954] Graph capturing finished in 21 secs.\n", + "INFO 06-19 14:04:03 model_runner.py:954] Graph capturing finished in 21 secs.\n" + ] + } + ], + "source": [ + "llm = LLM(\n", + " \"01-ai/Yi-1.5-34B-Chat\",\n", + " tensor_parallel_size=4,\n", + " tokenizer=\"01-ai/Yi-1.5-34B-Chat\",\n", + " gpu_memory_utilization=0.9,\n", + " enable_chunked_prefill=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a479e701", + "metadata": {}, + "source": [ + "Below are some configs for other models you might want to experiment with. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "ef4266cf-f970-4c7a-b9db-503b9f584922", + "metadata": {}, + "outputs": [], + "source": [ + "# llm = LLM(\n", + "# \"meta-llama/Meta-Llama-3-8B-Instruct\",\n", + "# tensor_parallel_size=4,\n", + "# tokenizer=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n", + "# gpu_memory_utilization=0.9,\n", + "# enable_chunked_prefill=True,\n", + "# )" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c8e9995a-dfbd-4493-8906-3556be75f145", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# llm = LLM(\n", + "# \"Qwen/Qwen2-72B-Instruct-AWQ\",\n", + "# quantization=\"AWQ\",\n", + "# tensor_parallel_size=4,\n", + "# tokenizer=\"Qwen/Qwen2-72B-Instruct-AWQ\",\n", + "# gpu_memory_utilization=0.8,\n", + "# enable_chunked_prefill=True,\n", + "# )" + ] + }, + { + "cell_type": "markdown", + "id": "27c0774b-44d6-4828-8514-65fe8d52f906", + "metadata": { + "tags": [] + }, + "source": [ + "### Constrained Generation using Outlines\n" + ] + }, + { + "cell_type": "markdown", + "id": "f92b36eb-71ce-4301-a0d0-c7a47dce9cde", + "metadata": {}, + "source": [ + "Now we have created an LLM and loaded our dataset, we can move to the next step of creating our dataset. As a reminder we are looking to create a dataset with three columns: \"anchor\", \"positive\", and \"negative\" using the anchor text from our dataset as a starting point.\n", + "\n", + "We can do something like this as a prompt:\n", + "\n", + "\"Given the following text, write a sentence that is similar to the text. Write a sentence that is dissimilar to the text.\"\n", + "\n", + "This prompt (might!) generate a sentence that is similar to the anchor text and a sentence that is dissimilar to the anchor text. However, we may want some way to have more certainty that the LLM produces the text we want (at least in terms of format). For example if we want to generate valid JSON the first token generated should be `{`. \n", + "\n", + "Since we want to have two types of output \"good\" and \"bad\" aka positive and negative examples, we may to produce data that looks something like:\n", + "\n", + "{\"good\": \"a sentence that is similar to the text\", \"bad\": \"a sentence that is dissimilar to the text\"}\n", + "\n", + "\n", + "A strong LLM with a good prompt will probably already do this but we can also use a technique called guided generation to enforce this more strongly. This has a few advantages:\n", + "\n", + "- the outputs are what we expect i.e. valid JSON, this makes processing them in later steps easier\n", + "- we can have some additional control over other parts of the generation process\n", + "\n", + "For doing guided generation we'll use a library called [Outlines](https://github.com/outlines-dev/outlines). Outlines is a library that allows you to create constraints on the generation process.\n", + "\n", + "> Outlines〰 is a Python library that allows you to use Large Language Model in a simple and robust way (with structured generation).\n", + "\n", + "One of the advantages of Outlines over some other libraries is that it does structured generation by directly altering the behavior of the models generation process rather than using prompting and lots of retries. The Outlines library has an integration with `vLLM` which we used to load our model." 
+ ] + }, + { + "cell_type": "markdown", + "id": "bed24b92-287a-4b2d-8807-2c8e031aba4c", + "metadata": {}, + "source": [ + "We can start by passing the `llm` object we just created into the Outlines `models.VLLM` class." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "4be1bb3b-e665-4d40-aa24-bf8d8acf67f4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "model = models.VLLM(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "3f2fff3d", + "metadata": {}, + "source": [ + "### Defining the constraints\n", + "\n", + "We can now define the constraints we want to use. We can do this in various ways with the Outlines library, but one nice way is to use a Pydantic class to represent the output we want. If you are not familiar with Pydantic, it is a\n", + "\n", + "> fast and extensible library for validating and serializing data using Python type hints.\n", + "\n", + "In our case we want two fields, \"good\" and \"bad\", which are both strings. We could define this as a Pydantic class like this:\n", + "\n", + "```python\n", + "from pydantic import BaseModel\n", + "\n", + "class Example(BaseModel):\n", + "    good: str\n", + "    bad: str\n", + "```\n", + "\n", + "This will ensure we get a JSON object with two string fields, \"good\" and \"bad\". However, when creating data for a sentence embedding task, it can be useful to have multiple generations for each anchor text. In particular, for the `bad` examples we may want the LLM to produce several candidates and then choose the \"hard\" negative example, i.e. the one that is most similar to the anchor text. This can help the model learn more quickly and deal with more complex examples.\n", + "\n", + "We can define this as a Pydantic class like this:\n", + "\n", + "```python\n", + "from typing import List\n", + "\n", + "from pydantic import BaseModel\n", + "\n", + "class Example(BaseModel):\n", + "    good: str\n", + "    bad: List[str]\n", + "```\n", + "\n", + "You can see that we now have a list of strings for the \"bad\" field. This will allow us to generate multiple negative examples for each anchor text.\n", + "\n", + "In addition to controlling the structure of the JSON, we can also define some additional constraints. For this example we'll create a Pydantic class that has:\n", + "\n", + "- a `good` field, which is a list of exactly 1 string\n", + "- a `bad` field, which is a list of exactly 3 strings\n", + "- a constraint on the minimum and maximum length of each string\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "3b96eb01-d186-442d-a15d-bbe170438dc2", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "class AbstractDescriptions(BaseModel):\n", + "    good: conlist(constr(min_length=20, max_length=200), min_length=1, max_length=1)  # type: ignore\n", + "    bad: conlist(constr(min_length=20, max_length=200), min_length=3, max_length=3)  # type: ignore\n", + "\n", + "\n", + "schema = AbstractDescriptions.model_json_schema()" + ] + }, + { + "cell_type": "markdown", + "id": "da0c6430", + "metadata": {}, + "source": [ + "Although it can be tempting to write a Pydantic class with a lot of constraints, it's often better to start with a simple class and then add more constraints as needed. Whilst Outlines will help guide the model towards the output you want, if that output is very hard for the LLM to generate you may get suboptimal results.",
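+ "\n", + "\n", + "For reference, further down we'll turn this Pydantic class into a JSON-constrained generator. The typical Outlines pattern looks roughly like the sketch below (the exact calls we use appear in the generation step later on):\n", + "\n", + "```python\n", + "# Sketch: build a JSON-constrained generator from the vLLM-backed model and\n", + "# our Pydantic class, then call it on a single prompt.\n", + "generator = generate.json(model, AbstractDescriptions)\n", + "# result = generator(\"...a prompt asking for good and bad descriptions...\")\n", + "# `result` would then be an AbstractDescriptions instance\n", + "```"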
+ ] + }, + { + "cell_type": "markdown", + "id": "720f6891-8650-4021-b992-dc98e0307075", + "metadata": {}, + "source": [ + "## Creating our prompt\n", + "\n", + "Now we turn to writing a prompt that will generate the data we want. In this case we write a prompt that is focused on the specific data we are working with (US legislation). We give the LLM a few shot example to help it understand the type of response we want as well as the format. \n", + "\n", + "In addition we pass in the constraints we defined above as a JSON Schema. This is another reason you may want to limit the number of constrains you have as the JSON Schema will get larger and in turn the size of your prompts will grow a lot. " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "7695058d-487b-4db0-97e2-3d5ef5767f45", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def format_prompt(text: str) -> str:\n", + " return f\"\"\"Write one good and three bad abstract descriptions of the following text. Output the descriptions in a JSON file with keys ‘good’ and ‘bad’.\n", + "Example:\n", + "Text: Community Energy Savings Program Act of 2019\\n\\nThis bill directs the Department of Energy to establish a grant program for states and Indian tribes to provide loans to consumers and communities that want to implement cost-effective energy efficiency measures.\n", + "Good description: A legislative proposal to promote energy efficiency through financial incentives\n", + "Bad description: A federal reform relating to the process for submitting planning applications related to oil pipelines\n", + "\n", + "Note: Descriptions can vary in abstraction, detail, and focus. Both good and bad descriptions should be short (max 20 words)\n", + "\n", + "Text to describe: {text}\n", + "Return a JSON object with the keys 'good' and 'bad' using this schema: {schema}.\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "df05525c", + "metadata": {}, + "source": [ + "We not add a new column to our dataset called `prompt` which contains the prompt we want to use." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "206fbcfd-3cea-4540-8dd0-7ab0fba08254", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "2d609b2a30d0402b8d162f4a4f2723f0", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Map: 0%| | 0/1000 [00:00 includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). Given a batch of prompts and sampling parameters, this Class generates texts from the model using an intelligent batching mechanism and efficient memory management.\n", + "\n", + "While we could still work to optimize throughput further, this Class represents a good starting point for synthetic data generation. " + ] + }, + { + "cell_type": "markdown", + "id": "54810b5c-b7e2-43b6-ad2c-0c4bd4adb29c", + "metadata": {}, + "source": [ + "### Generating our dataset\n", + "\n", + "We are now ready to generate our dataset. We'll use the `generator` object to generate our data. We'll run our dataset as batches to allow us to retry generating data that fails. 
\n", + "\n", + "In the below `process_batch` function, we roughly do the following:\n", + "- Split our dataset into an initial batch size\n", + "- Try and generate a response for the batch\n", + "- If we have errors for the batch, we split the batch that failed into a smaller subset and retryß\n", + "- We go through this process with a maximum number of retries and a minimum batch size\n", + "\n", + "There are ways we could further optimize this, but this works quite well for not skipping a whole batch whilst not wasting too much compute/time retrying tiny batches to generate a bit of extra data. \n" + ] + }, + { + "cell_type": "markdown", + "id": "3c63a0bc", + "metadata": {}, + "source": [ + "We can specify some generation parameters that will be passed to vLLM when generating the data. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3a3cfec0-96f4-4a35-acc4-6b932238599e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "params = SamplingParams(n=1, max_tokens=800, best_of=2, temperature=0.7)" + ] + }, + { + "cell_type": "markdown", + "id": "299b08f1", + "metadata": {}, + "source": [ + "We can now generate our data. Feel free to adjust the `batch_size`, `max_retries`, and `min_batch_size` to suit your needs." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "b277db6b-29f6-467c-8512-2675076b1309", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "9534747f9c0c4ef89eefac53b47039b3", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/10 [00:00= max_retries or batch_size <= min_batch_size:\n", + " print(\"Max retries reached or batch size too small. Skipping batch.\")\n", + " return []\n", + "\n", + " print(f\"Splitting batch and retrying (retry count: {retry_count + 1})\")\n", + " num_sub_batches = 2 ** (retry_count + 1)\n", + " sub_batch_size = max(batch_size // num_sub_batches, min_batch_size)\n", + " sub_batches = [\n", + " batch.shard(num_shards=num_sub_batches, index=i)\n", + " for i in range(num_sub_batches)\n", + " ]\n", + " print(f\"Sub batch size {sub_batch_size}\")\n", + " updated_sub_batches = []\n", + " for sub_batch in sub_batches:\n", + " updated_sub_batch = process_batch(\n", + " sub_batch, sub_batch_size, retry_count + 1, max_retries, min_batch_size\n", + " )\n", + " if len(updated_sub_batch) > 0:\n", + " updated_sub_batches.append(updated_sub_batch)\n", + "\n", + " return concatenate_datasets(updated_sub_batches) if updated_sub_batches else []\n", + "\n", + "\n", + "initial_batch_size = 100\n", + "num_batches = len(ds) // initial_batch_size + (len(ds) % initial_batch_size != 0)\n", + "dataset_parts = [ds.shard(num_shards=num_batches, index=i) for i in range(num_batches)]\n", + "\n", + "updated_parts = []\n", + "for part in tqdm(dataset_parts):\n", + " updated_part = process_batch(part, initial_batch_size)\n", + " if len(updated_part) > 0:\n", + " updated_parts.append(updated_part)\n", + "\n", + "if updated_parts:\n", + " ds = concatenate_datasets(updated_parts)\n", + "else:\n", + " print(\"No successfully updated parts. The dataset remains unchanged.\")" + ] + }, + { + "cell_type": "markdown", + "id": "6f7768ac", + "metadata": {}, + "source": [ + "If we look at our dataset we can see that we have a new column called `generations` which contains the generated data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "490bf0b4-469b-4594-94a9-98916a47441e", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['id', 'section', 'prompt', 'generations'],\n", + " num_rows: 1000\n", + "})" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "457cc887-7931-43dc-a261-74e27fba2f28", + "metadata": {}, + "source": [ + "Let's look at an example row" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "66b0a713-b2f6-4668-ad37-72073bbd8397", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': '116s4049rs',\n", + " 'section': 'rear admiral in the Navy, or an equivalent grade in the Space Force is under investigation for alleged misconduct or pending the disposition of an adverse personnel action at the time of retirement, the Secretary of the military department concerned may— (A) conditionally determine the highest permanent grade of satisfactory service on active duty of the officer pending completion of the investigation or resolution of the personnel action, as applicable; and (B) retire the officer in that conditional grade, subject to subsection (e).',\n", + " 'prompt': \"Write one good and three bad abstract descriptions of the following text. Output the descriptions in a JSON file with keys ‘good’ and ‘bad’.\\nExample:\\nText: Community Energy Savings Program Act of 2019\\n\\nThis bill directs the Department of Energy to establish a grant program for states and Indian tribes to provide loans to consumers and communities that want to implement cost-effective energy efficiency measures.\\nGood description: A legislative proposal to promote energy efficiency through financial incentives\\nBad description: A federal reform relating to the process for submitting planning applications related to oil pipelines\\n\\nNote: Descriptions can vary in abstraction, detail, and focus. 
Both good and bad descriptions should be short (max 20 words)\\n\\nText to describe: rear admiral in the Navy, or an equivalent grade in the Space Force is under investigation for alleged misconduct or pending the disposition of an adverse personnel action at the time of retirement, the Secretary of the military department concerned may— (A) conditionally determine the highest permanent grade of satisfactory service on active duty of the officer pending completion of the investigation or resolution of the personnel action, as applicable; and (B) retire the officer in that conditional grade, subject to subsection (e).\\nReturn a JSON object with the keys 'good' and 'bad' using this schema: {'properties': {'good': {'items': {'maxLength': 200, 'minLength': 20, 'type': 'string'}, 'maxItems': 1, 'minItems': 1, 'title': 'Good', 'type': 'array'}, 'bad': {'items': {'maxLength': 200, 'minLength': 20, 'type': 'string'}, 'maxItems': 3, 'minItems': 3, 'title': 'Bad', 'type': 'array'}}, 'required': ['good', 'bad'], 'title': 'AbstractDescriptions', 'type': 'object'}.\",\n", + " 'generations': '{\"good\":[\"Retirement of rear admiral under investigation for misconduct may be conditional on the outcome of the investigation and resolution of personnel actions.\"],\"bad\":[\"A proposal to change the retirement benefits for military personnel based on their years of service.\",\"Legislation to implement new training programs for naval officers.\",\"Investigation process for harassment complaints in the military.\"]}'}" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds[0]" + ] + }, + { + "cell_type": "markdown", + "id": "4d048ff2-49bc-44fd-947b-facdc3977c85", + "metadata": {}, + "source": [ + "We'll do some more work to format the dataset in a format that is compatible with Sentence Transformers training APIs but let's already push to the raw dataset to the Hub. We can push this to a config called `raw`." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "ee860819-fb00-4795-a75f-63d5b7961dee", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ec47fd798c884451b528398b7a2de379", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Uploading the dataset shards: 0%| | 0/1 [00:00