The cleaning process consists of: - Joining the separate splits together / add split column - Converting string messages into list of structs - Removing empty system prompts
Fascinating point from @thomwolf at Web Summit: AI misuse (deepfakes, fake news) is actually easier to make with closed models, not with open-source ones.
This challenges the common narrative that open-source AI is inherently more dangerous. The reality is more nuanced - while we may think open source is technically easier to misuse, closed models' accessibility and product-focused design appear to be driving more actual harm.
Important context for current AI safety discussions and regulation debates.