That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!
Details:
* Based on ModernBERT-base with 149M parameters.
* Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
* Immediate FA2 and unpadding support for super-efficient inference.
* Trained with Matryoshka support, i.e. two valid output dimensionalities: 768 and 256.
* Maximum sequence length of 8192 tokens!
* Trained in two stages: unsupervised contrastive data -> high-quality labeled datasets.
* Integrated into Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
* Apache 2.0 licensed: fully commercially permissive.
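Since the model ships with Sentence Transformers integration and Matryoshka support, a minimal usage sketch could look like the following. The model id (`nomic-ai/modernbert-embed-base`) and the `search_query:` / `search_document:` prefixes are assumptions carried over from Nomic's earlier embedding models, not details confirmed in the post:

```python
# Minimal sketch: the model id and the query/document prefixes below are
# assumptions based on Nomic's earlier nomic-embed-text conventions.
from sentence_transformers import SentenceTransformer

# truncate_dim=256 keeps only the first 256 Matryoshka dimensions;
# omit it to get the full 768-dimensional embeddings.
model = SentenceTransformer("nomic-ai/modernbert-embed-base", truncate_dim=256)

queries = ["search_query: What is Matryoshka representation learning?"]
documents = [
    "search_document: Matryoshka training nests smaller, still-useful embeddings inside the full-size vector.",
    "search_document: ModernBERT is a drop-in upgrade over the original BERT encoder.",
]

query_emb = model.encode(queries)    # shape: (1, 256)
doc_emb = model.encode(documents)    # shape: (2, 256)

# Cosine similarity between the query and each document.
print(model.similarity(query_emb, doc_emb))
```

The 256-dimensional output trades a little retrieval accuracy for a roughly 3x smaller index, which is the usual reason to use the Matryoshka truncation.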
* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine, where there are lots of aliases).
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait", etc., as one sees in o1.
* Use the resulting data for SFT & RL.
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3-point boost across medical benchmarks, and SFT on this data already gives a strong improvement.
Applying this strategy to other domains could be quite promising, provided the training data can be formulated as verifiable problems!
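For concreteness, here is a rough Python sketch of the sample-verify-rewrite loop described above. The search-strategy names and the callables for the policy model and the GPT-4o judge/rewriter are hypothetical placeholders, not the actual implementation:

```python
import random
from typing import Callable, Optional

# Illustrative search strategies mixed during sampling (names are placeholders).
SEARCH_STRATEGIES = ["explore_new_path", "backtrack", "verify_step", "correct_error"]

def build_sft_example(
    question: str,
    reference_answer: str,
    sample_cot: Callable[[str, str], str],           # policy model: (question, strategy) -> CoT
    judge_correct: Callable[[str, str, str], bool],  # GPT-4o judge; handles aliases exact match misses
    rewrite_to_stream: Callable[[list[str]], str],   # GPT-4o rewriter; adds "hmm, wait"-style transitions
    max_tries: int = 8,
) -> Optional[dict]:
    """Sample CoTs with mixed search strategies until one is judged correct,
    then merge all attempts into a single o1-style reasoning stream for SFT."""
    attempts: list[str] = []
    for _ in range(max_tries):
        strategy = random.choice(SEARCH_STRATEGIES)
        cot = sample_cot(question, strategy)
        attempts.append(cot)
        if judge_correct(question, reference_answer, cot):
            return {
                "question": question,
                "reasoning": rewrite_to_stream(attempts),
                "answer": reference_answer,
            }
    return None  # questions the model never solves are dropped
```

The same skeleton carries over to other domains: only the judge has to change, which is why verifiability of the training problems is the key requirement.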