Insights on dataset

#2
by TahaKhan - opened

Hello @nmitchko ,

I am a data science beginner and would like to achieve something similar with a dataset I have:
I want to fine-tune Guanaco on a specific product and its technical documentation.
To do so, I am preparing a strategy to create and/or collect a high-quality dataset for fine-tuning.

For this purpose, I am looking to learn from the experience of people who have already done this, and I found you while searching for people with experience fine-tuning Guanaco.

I wanted to ask whether you would be willing to share some examples from the dataset you used in your fine-tuning?
Would it be possible for you to share a few examples from each category of your dataset?

I would also be happy to hear any other insights you would like to share.

Looking forward to your response. :)

Best,
Taha.

Hi @TahaKhan

Firstly, I fine-tune using QLoRA. This lets me train a LoRA on limited hardware by running the base model in 4-bit while keeping the LoRA weights in full precision.
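
Here's a rough sketch of what that setup looks like with the transformers / peft / bitsandbytes stack. The checkpoint name, LoRA rank, and target modules below are placeholders rather than my exact values:

# QLoRA sketch: load the base model in 4-bit, attach full-precision LoRA adapters.
# Checkpoint name and LoRA hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "your-base-model"  # e.g. a LLaMA/Guanaco checkpoint of your choice

# 4-bit (NF4) quantization keeps the frozen base weights small enough for limited hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Only the LoRA adapter weights are trained, and they stay in full precision.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()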

As for datasets, you can use OpenAssistant to create a high-quality dataset for your task.

For product documentation, you shouldn't fine-tune a model on your first go. Instead, try using a context database first and test the results against the base model. If that approach has limitations, then explore creating a fine-tuned model.

Here is an example of using a model on PDF documentation without fine-tuning. LangFlow is a great way to test this out; see the "PDF Loader" flow.
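
To make the context-database idea concrete, here is a minimal retrieval sketch; the embedding model and the documentation chunks are illustrative assumptions, and tools like LangFlow wire up the same pattern for you:

# Retrieval sketch: embed documentation chunks, pull the closest ones into the
# prompt, and let the base (un-finetuned) model answer from that context.
# The embedding model name and the sample chunks are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "Product X supports firmware upgrades over USB or the web console.",
    "To factory-reset Product X, hold the power button for ten seconds.",
    # ... one entry per section of your documentation
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question, k=3):
    """Return the k documentation chunks most similar to the question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

question = "How do I reset Product X?"
context = "\n".join(retrieve(question))
prompt = f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
# `prompt` then goes to the base model; no fine-tuning involved.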

As for the dataset format, here is what the input data looks like:

medconcat.json

[
    {"instruction": "Answer this question truthfully", 
     "input": "What to expect if I have Varicose veins  (Outlook/Prognosis)?", 
     "output": "Varicose veins tend to get worse over time. You can ease discomfort and slow varicose vein progression with self care."},
    { "instruction": "...." ...
]
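
If it helps, here is a quick way to sanity-check a file in this format before training (this assumes every record needs exactly those three keys):

# Sanity-check the instruction dataset before training.
# Assumes each record must contain the three keys shown above.
import json

with open("medconcat.json") as f:
    records = json.load(f)

required = {"instruction", "input", "output"}
for i, rec in enumerate(records):
    missing = required - rec.keys()
    if missing:
        raise ValueError(f"record {i} is missing keys: {missing}")

print(f"{len(records)} records look well-formed")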

During each training pass, each item in the dataset gets parsed into the Alpaca instruction format:

instruction = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

and the training target is for the model to produce the {output} tokens immediately after the prompt for each instruction.
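
Concretely, the prompt and the output are concatenated into one token sequence and the loss is only computed on the output tokens. A minimal sketch of that step (the tokenizer checkpoint and max length are placeholders; the actual training scripts handle this for you):

# Sketch of turning one dataset item into a training example: prompt tokens are
# masked with -100 so the loss only scores the model on producing the {output}.
# Tokenizer checkpoint and max_length are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed base tokenizer

def build_example(prompt, output, max_length=512):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(output, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

    input_ids = (prompt_ids + output_ids)[:max_length]
    # -100 tells the loss to ignore these positions (the prompt);
    # only the {output} tokens contribute to the training objective.
    labels = ([-100] * len(prompt_ids) + output_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}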

Thank you @nmitchko for your insights and your advice.

Please bear with me on a seemingly dumb question:

  • Why can I not use Guanaco itself to create a high-quality dataset? You proposed using OpenAssistant, so there must be some wisdom behind it. Can you please tell me what that is? :D

Again, I thank you very much!

While you may think an AI model can auto-generate your data for you, a dataset built from human input is much higher quality than anything you would get from Guanaco alone.
