Freeman Lewin's picture

Freeman Lewin

EmetTheGolum
·

AI & ML interests

Data and Data Aquisition

Recent Activity

Organizations

Emet 's profile picture

EmetTheGolum's activity

reacted to cfahlgren1's post with ❤️ about 1 month ago
view post
Post
3107
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / add split column
- Converting string messages into list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
  • 1 reply
·