davanstrien 's Collections

synthetic-data-generation-demos

A collection of demos for various approaches to synthetic data generation


  • Note Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.


  • Note Instruction pre-training is a new approach that enhances LLM pretraining by using instruction-response pairs from an instruction synthesizer instead of raw data.


  • Note Magpie is a data synthesis pipeline that creates high-quality alignment data without relying on prompt engineering or seed questions. Instead, it generates instruction data by prompting aligned LLMs with a pre-query template.


  • Note This is a demo for Bonito, an open-source model for conditional task generation, which involves converting unannotated text into task-specific synthetic instruction tuning data.