I asked 8 LLMs to "Tell me a bedtime story about bears and waffles."
Claude 3.5 Sonnet and GPT-4o gave me the worst stories: no conflict, no moral, zero creativity.
In contrast, smaller models were quite creative and wrote stories involving talking waffle trees and bears ostracized for their love of waffles.
Here you can see a comparison between Claude 3.5 Sonnet and NeuralDaredevil-8B-abliterated. They both start with a family of bears but quickly diverge in terms of personality, conflict, etc.
I mapped it to the hero's journey to have some kind of framework. Prompt engineering can definitely help here, but it's still disappointing that the larger models don't create better stories right off the bat.
Do you know why smaller models outperform the frontier models here?
I wrote an article about abliteration and how NeuralDaredevil-8B was created. Beyond removing alignment, I believe it's an interesting technique with a lot of potential. It's basically fine-tuning without retraining.
In this article, we see how it works, implement it in Google Colab, and heal the abliterated model to recover the performance lost to this technique. The final model is uncensored and high-quality, with the highest MMLU score on the Open LLM Leaderboard (8B category).
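To make the idea more concrete, here is a minimal sketch of the core step as abliteration is commonly described: estimate a "refusal direction" as the normalized difference between mean activations on harmful and harmless prompts, then project that direction out of a weight matrix. The tiny hand-written vectors stand in for real residual-stream activations, and the function names are hypothetical. This is an illustration under those assumptions, not the code from the article.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: the activations were collected at one layer and one
    token position for two sets of prompts.
    """
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two activation means are identical")
    return [x / norm for x in diff]

def orthogonalize(weight, direction):
    """Return W - r (r^T W): W can no longer write along r.

    `weight` has shape (d_model, d_in) and `direction` has length d_model.
    """
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how much each input column writes along the direction.
    coeffs = [sum(direction[i] * weight[i][j] for i in range(d_model))
              for j in range(d_in)]
    return [[weight[i][j] - direction[i] * coeffs[j] for j in range(d_in)]
            for i in range(d_model)]

# Toy data (made up for illustration only).
harmful = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
harmless = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
r = refusal_direction(harmful, harmless)
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_abliterated = orthogonalize(W, r)
```

No gradient step is involved: the weights are edited directly, which is why the technique feels like fine-tuning without retraining.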
AutoMerger created the best 7B model on the Open LLM Leaderboard
By randomly combining top models from the Open LLM Leaderboard, AutoMerger created YamshadowExperiment28-7B. The model is three weeks old and has been at the top of the leaderboard for a week now. It was created through a simple SLERP merge of:
1/ On the Open LLM Leaderboard, it managed to outperform the excellent M7-7b model, which has been the #1 7B model for a while now.
2/ On the YALL leaderboard, YamshadowExperiment28-7B is ranked as the 9th best-performing automerge (but note that the scores are very close to each other). Compared to others, it does not perform particularly well on AGIEval or Bigbench.
3/ Thanks to @sam-paech, I have scores on EQ-Bench, where it managed to outperform all of my previous models. It even surpasses recent models such as DBRX instruct, Qwen1.5 32B Chat, and Cohere's Command R+.
Surprisingly, it does not support ChatML or Mistral Instruct, unlike my other merges (which are part of its family tree). Alpaca works well 99% of the time, but the model can sometimes produce a lot of "INST" tokens for no reason.
In my experiments, YamshadowExperiment28-7B doesn't seem smarter than other successful merges like AlphaMonarch. On the contrary, I found several mathematical or reasoning problems where it fails.
Considering these results, it looks like it might overfit the Open LLM Leaderboard. I guess it's anything but surprising when you randomly merge 156 models.
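For readers unfamiliar with the SLERP merge mentioned above, here is a minimal sketch of spherical linear interpolation between two flattened weight vectors. It assumes both tensors have the same shape and falls back to plain linear interpolation when the vectors are nearly collinear. The function name and the `eps` threshold are illustrative choices, not necessarily what AutoMerger or its underlying merge library does.

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between vectors v0 and v1.

    t=0 returns v0 and t=1 returns v1. The angle is measured between the
    normalized vectors, while the original (unnormalized) vectors are
    the ones being blended.
    """
    def lerp():
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]

    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    if n0 < eps or n1 < eps:
        return lerp()  # the angle is undefined for a zero vector

    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    dot = max(-1.0, min(1.0, dot))  # guard against rounding errors
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        return lerp()  # nearly collinear: SLERP degenerates to LERP

    s0 = math.sin((1.0 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors.
merged = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
```

In a real merge, this would be applied tensor by tensor, possibly with a different `t` per layer. That choice of `t` is one of the parameters a random search has to pick.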
- GGUF: perfect for inference on CPUs (and LM Studio)
- GPTQ/EXL2: fast inference on GPUs
- AWQ: super fast inference on GPUs with vLLM (https://github.com/vllm-project/vllm)
- HQQ: extreme quantization with decent 2-bit and 3-bit models
Once the model is converted, it is automatically uploaded to the Hugging Face Hub. To quantize a 7B model, GGUF only needs a T4 GPU, while the other methods require an A100 GPU.
Merging models has become a powerful way to compress information and build strong models for cheap. Right now, the process is still quite experimental: Which models should we merge? Which parameters should we use? We have some intuition but no principled approach.
I made a small tool to make things a little clearer. It lets you visualize the family tree of any model on the Hub. It also displays the type of license each model in the tree uses: permissive (green), noncommercial (red), and unknown (gray). It should help people select the right license based on the parent models.
In addition, I hope it can be refined to extract more information about these models: do models from very different branches work better when merged? Can we select them based on the weight difference? There are a lot of questions to explore in this new space. :)
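The license coloring described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `parents` and `licenses` dictionaries are hypothetical stand-ins for metadata the real tool would read from the Hub, the model names are made up, and the license sets are deliberately incomplete.

```python
# Illustrative, incomplete license sets (assumptions, not the tool's lists).
PERMISSIVE = {"apache-2.0", "mit", "bsd-3-clause"}
NONCOMMERCIAL = {"cc-by-nc-4.0", "cc-by-nc-sa-4.0"}

def license_color(license_id):
    """Map a license identifier to green, red, or gray."""
    if not license_id:
        return "gray"
    lid = license_id.lower()
    if lid in PERMISSIVE:
        return "green"
    if lid in NONCOMMERCIAL:
        return "red"
    return "gray"

def ancestors(model, parents):
    """Collect every ancestor of `model`, visiting each one only once."""
    seen = []
    stack = list(parents.get(model, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen

# Hypothetical family tree: a merge of two merges sharing one base model.
parents = {"my-merge": ["merge-a", "merge-b"],
           "merge-a": ["base"], "merge-b": ["base"]}
licenses = {"merge-a": "apache-2.0", "merge-b": "cc-by-nc-4.0"}

colors = {m: license_color(licenses.get(m)) for m in ancestors("my-merge", parents)}
```

If any ancestor comes out red, the merged model probably should not be released under a permissive license, which is the kind of check the visualization makes easy to do by eye.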