Massive Text Embedding Benchmark

non-profit

https://github.com/embeddings-benchmark

embeddings-benchmark

Activity Feed

AI & ML interests

Massive Text Embeddings Benchmark

Recent Activity

Muennighoff updated a dataset about 1 hour ago

mteb/arena-results

orionweller updated a dataset about 12 hours ago

mteb/results

KennethEnevoldsen new activity about 19 hours ago

mteb/leaderboard:Why is MTEB unusable after UI upgrade?

View all activity

mteb's activity

Muennighoff

updated a dataset about 1 hour ago

mteb/arena-results

Viewer • Updated about 1 hour ago • 3.95k • 8.53k • 4

orionweller

updated a dataset about 12 hours ago

mteb/results

Updated about 12 hours ago • 8.96k • 1

KennethEnevoldsen

in mteb/leaderboard about 19 hours ago

Why is MTEB unusable after UI upgrade?

#158 opened 1 day ago by

people-search-assistant

Mina76

updated a dataset about 23 hours ago

mteb/music-genre

Viewer • Updated about 23 hours ago • 1.89k • 5

Samoed

in mteb/leaderboard 1 day ago

Why is MTEB unusable after UI upgrade?

#158 opened 1 day ago by

people-search-assistant

Missing Tasks?

#157 opened 2 days ago by

Chao985

KennethEnevoldsen

in mteb/leaderboard 2 days ago

General free-text feature encoder

#156 opened 2 days ago by

thomas-pocreau

Andrianos

authored 2 papers 5 days ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published 14 days ago • 31

PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models

Paper • 2409.12060 • Published Sep 18, 2024

dokato

authored 3 papers 9 days ago

vatolinalex

authored a paper 9 days ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published 14 days ago • 31

rasdani

authored a paper 9 days ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published 14 days ago • 31

mmhamdy

posted an update 11 days ago

Post

2710

🎉 We're excited to introduce MemoryCode, a novel synthetic dataset designed to rigorously evaluate LLMs' ability to track and execute coding instructions across multiple sessions. MemoryCode simulates realistic workplace scenarios where a mentee (the LLM) receives coding instructions from a mentor amidst a stream of both relevant and irrelevant information.

💡 But what makes MemoryCode unique?! The combination of the following:

✅ Multi-Session Dialogue Histories: MemoryCode consists of chronological sequences of dialogues between a mentor and a mentee, mirroring real-world interactions between coworkers.

✅ Interspersed Irrelevant Information: Critical instructions are deliberately interspersed with unrelated content, replicating the information overload common in office environments.

✅ Instruction Updates: Coding rules and conventions can be updated multiple times throughout the dialogue history, requiring LLMs to track and apply the most recent information.

✅ Prospective Memory: Unlike previous datasets that cue information retrieval, MemoryCode requires LLMs to spontaneously recall and apply relevant instructions without explicit prompts.

✅ Practical Task Execution: LLMs are evaluated on their ability to use the retrieved information to perform practical coding tasks, bridging the gap between information recall and real-world application.

📌 Our Findings

1️⃣ While even small models can handle isolated coding instructions, the performance of top-tier models like GPT-4o dramatically deteriorates when instructions are spread across multiple sessions.

2️⃣ This performance drop isn't simply due to the length of the context. Our analysis indicates that LLMs struggle to reason compositionally over sequences of instructions and updates. They have difficulty keeping track of which instructions are current and how to apply them.

🔗 Paper: From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions (2502.13791)
📦 Code: https://github.com/for-ai/MemoryCode