Large Language Models (LLMs) are powerful, but they're prone to off-topic misuse, where users push them beyond their intended scope. Think harmful prompts, jailbreaks, and other abuse. So how do we build better guardrails?
Traditional guardrails rely on curated examples or classifiers. The problem?
⚠️ High false-positive rates
⚠️ Poor adaptability to new misuse types
⚠️ Require real-world data, which is often unavailable during pre-production
Our method skips the need for real-world misuse examples. Instead, we:
1️⃣ Define the problem space qualitatively
2️⃣ Use an LLM to generate synthetic misuse prompts
3️⃣ Train and test guardrails on this dataset
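To make step 2️⃣ concrete, here is a minimal sketch of synthetic prompt generation with the OpenAI Python client; the model name, instruction wording, and helper function are illustrative assumptions, not the exact generation setup from the paper.

```python
# Sketch: generate synthetic off-topic prompts for a given system prompt.
# Assumes the OpenAI Python client (>= 1.x) and OPENAI_API_KEY in the environment;
# the model name and instruction wording are illustrative, not the paper's setup.
from openai import OpenAI

client = OpenAI()

def generate_off_topic_prompts(system_prompt: str, n: int = 5) -> list[str]:
    """Ask an LLM to invent user prompts that are clearly irrelevant to the system prompt."""
    instruction = (
        f"Here is an application's system prompt:\n{system_prompt}\n\n"
        f"Write {n} user messages that are clearly OFF-TOPIC for this application, "
        "one per line, with no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": instruction}],
        temperature=1.0,
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Each (system prompt, off-topic prompt) pair becomes a negative training example;
# on-topic prompts can be generated analogously to serve as positives.
```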
We apply this to off-topic prompt detection and fine-tune simple bi- and cross-encoder classifiers that outperform heuristics based on cosine similarity or prompt engineering.
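For a rough picture of that comparison, here is a sketch of the cosine-similarity heuristic next to a cross-encoder scoring the same (system prompt, user prompt) pair; the checkpoints and threshold are placeholder assumptions, not the fine-tuned models released with this work.

```python
# Sketch: two ways to score whether a user prompt is on-topic for a system prompt.
# The checkpoints and threshold below are illustrative placeholders, not the
# fine-tuned guardrail models from this work.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

system_prompt = "You are a customer-support assistant for an airline."
user_prompt = "Write me a poem about quantum gravity."

# Baseline heuristic: bi-encoder embeddings + cosine similarity + a fixed threshold.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = bi_encoder.encode([system_prompt, user_prompt], convert_to_tensor=True)
cosine_score = util.cos_sim(emb[0], emb[1]).item()
print("cosine similarity:", cosine_score, "-> off-topic?", cosine_score < 0.3)

# Fine-tuned alternative: a cross-encoder reads both texts jointly and outputs a
# relevance score (a generic STS cross-encoder is used here as a stand-in).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
relevance = cross_encoder.predict([(system_prompt, user_prompt)])[0]
print("cross-encoder relevance:", relevance)
```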
Additionally, framing the problem as prompt relevance allows these fine-tuned classifiers to generalise to other risk categories (e.g., jailbreak, toxic prompts).
Through this work, we also open-source our dataset (2M examples, ~50M+ tokens) and models.
Still relying on human intuition to mix corpora from different sources for pre-training 🧠? Everyone says the data mixture has a big impact on model performance, but how, and why 🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?
Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄
🔬 In this paper, we propose RegMix, an automatic data-mixture method that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark, and it needs only 2% extra training FLOPs! 📈
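For intuition, here is a minimal sketch of the regression-then-search idea behind RegMix: fit a regressor on (mixture weights, metric) pairs from many small proxy runs, then scan simulated mixtures for the predicted optimum. The shapes and synthetic targets below are made up for illustration and are not the paper's actual setup.

```python
# Sketch of RegMix's core loop: small proxy runs -> regression -> mixture search.
# All numbers here are synthetic; in practice the targets come from training
# many tiny proxy models, each on a different data mixture.
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
n_domains, n_proxy_runs = 5, 512

# 1) Mixture weights used for the small proxy runs (each row sums to 1).
train_mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_runs)
# 2) Observed metric per run (e.g., validation loss on a target domain) -- faked here.
proxy_losses = 3.0 - 0.8 * train_mixtures[:, 0] + 0.1 * rng.normal(size=n_proxy_runs)

# 3) Fit a regression model from mixture weights to the metric.
reg = LGBMRegressor(n_estimators=200)
reg.fit(train_mixtures, proxy_losses)

# 4) Simulate a large pool of candidate mixtures and pick the predicted best one.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best = candidates[np.argmin(reg.predict(candidates))]
print("predicted-best mixture:", np.round(best, 3))
# 5) The large model is then trained once, on the predicted-best mixture.
```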
nanoLLaVA-1.5 is here! Same size (1B), better performance 🔥🔥🔥 It is much more powerful than v1.0.
Try it out now on HF Spaces: qnguyen3/nanoLLaVA
Model: qnguyen3/nanoLLaVA-1.5
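If you prefer to run it locally rather than on Spaces, a minimal loading sketch is below, assuming the checkpoint ships custom modelling code (hence trust_remote_code=True); the full image and chat-template flow is described on the model card and omitted here.

```python
# Sketch: load nanoLLaVA-1.5 from the Hugging Face Hub for local use.
# Assumes the repo provides custom modelling code (trust_remote_code=True);
# see the model card for the complete image + chat-template example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qnguyen3/nanoLLaVA-1.5"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```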