A comprehensive benchmark for quantifying and diagnosing memory systems in large language models
EverMemBench is a benchmark designed to quantify and diagnose the memory systems of large language models. It introduces the first three-tiered evaluation framework for memory systems, consisting of Factual Recall, Applied Memory, and Personalization Generalization.
This layered approach lets researchers go beyond traditional retrieval-style evaluations and conduct fine-grained diagnostics of model capabilities, precisely locating performance bottlenecks in information extraction, contextual reasoning, or style adaptation. By offering a reproducible, standardized testing framework, EverMemBench not only reveals significant shortcomings of current state-of-the-art models in achieving deep personalization, but also provides clear guidance for targeted optimization of memory systems.
- Progressive memory evaluation framework: We partition memory-system capabilities into three hierarchical layers (Factual Recall, Applied Memory, and Personalization Generalization), establishing a clear progression from pure retrieval to context integration to persona-consistent generation, thereby facilitating precise identification of performance bottlenecks.
- Realistic and diagnostic long-horizon multi-party chat dataset: Grounded in real workplace communication scenarios, we construct a long-horizon corpus with a multi-role, multi-group, cross-context setting that explicitly models temporal persona drift and community-switching effects, enabling the assessment of memory robustness under concurrent topics and frequent context switches.
- Unified quantification and standardized evaluation protocol: We provide consistent task formulations and measurement interfaces across the three core dimensions, supporting reproducible and comparable cross-model evaluation while reducing experimental bias in comparisons across systems and models (see the interface sketch after this list).
- Systematic cross-model empirical analysis: We comprehensively evaluate mainstream memory systems (e.g., MemOS, MemoryOS, Mem0, A-Mem) and state-of-the-art LLMs (e.g., GPT-4.5, GPT-4.1, Gemini-2.5-Pro), conducting side-by-side comparisons within a unified framework and revealing notable deficiencies in the memory capabilities of current advanced models.
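To make the protocol concrete, here is a minimal sketch of what a unified measurement interface could look like. All names (`MemoryTask`, `MemorySystem`, `evaluate`, and the field names) are illustrative assumptions for exposition, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class MemoryTask:
    """One probe item; field names are hypothetical, not the released schema."""
    dimension: str        # "factual_recall" | "applied_memory" | "personalization"
    history: list[str]    # conversation turns the system may memorize
    question: str         # probe issued after the history is ingested
    reference: str        # gold answer or persona-consistent reply


class MemorySystem(Protocol):
    """Minimal interface a system under test would implement."""
    def ingest(self, turns: list[str]) -> None: ...
    def answer(self, question: str) -> str: ...


def evaluate(system: MemorySystem, tasks: list[MemoryTask],
             score_fn: Callable[[str, str], float]) -> dict[str, float]:
    """Feed each task's history to the system, probe it, and average per dimension."""
    scores: dict[str, list[float]] = {}
    for task in tasks:
        system.ingest(task.history)
        prediction = system.answer(task.question)
        scores.setdefault(task.dimension, []).append(score_fn(prediction, task.reference))
    return {dim: sum(vals) / len(vals) for dim, vals in scores.items()}
```

Keeping every system behind the same two-method interface is what makes the per-dimension scores comparable across memory systems and raw LLM baselines.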
To systematically and reproducibly assess and diagnose LLM memory capabilities, we construct a long-horizon, multi-party group-chat dataset grounded in realistic workplace communication. The dataset centers on a multi-role, multi-group, cross-context communication setting, explicitly modeling the dynamism and context dependence of individual profiles. In real work scenarios, a person's behavior and communicative style may drift over time as conversations unfold; at the same time, the same individual may act differently across communities and teams owing to role relations and power structures. For example, a department director may be more decisive and stern within a direct-report team chat, yet more restrained in a cross-department strategy group among peers. We embed such time-varying and community-varying personas and interaction patterns into the data construction process to faithfully reflect the complex and common communication ecology of enterprises.
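As a concrete illustration, the records below show what two corpus entries for the same speaker in different groups might look like, capturing the community-switching persona described above. The field names and contents are invented for exposition and do not reflect the released data format.

```python
# Two hypothetical records for one speaker across groups; all fields are invented.
messages = [
    {
        "timestamp": "2024-03-11T09:42:00",
        "group_id": "direct-report-team",
        "speaker_id": "u_017",
        "speaker_role": "department_director",  # role relative to this group
        "persona_phase": 2,                     # index into the speaker's drifting persona
        "text": "The deadline was Friday. I expect the revised plan by noon.",
    },
    {
        "timestamp": "2024-03-11T14:05:00",
        "group_id": "cross-dept-strategy",
        "speaker_id": "u_017",
        "speaker_role": "peer_member",          # same person, different standing
        "persona_phase": 2,
        "text": "Happy to align our timeline with whatever works for both teams.",
    },
]
```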
Benefiting from this design, the dataset supports fine-grained, diagnostic evaluation of model memory systems under long conversations, concurrent topics, and frequent context switches. We organize memory capability assessment along three core dimensions (illustrative probes are sketched after the list):
- Fine-grained Detailed Recall. Tests retrieval ability, requiring the model to accurately reconstruct concrete facts from prior context.
- Memory Awareness. Evaluates retrieval accompanied by understanding: the model must recall past events and integrate them to produce contextually appropriate answers.
- User Profile Understanding. Focuses on personalization and adaptive generation. The model is expected to develop a stable understanding of individual preferences, roles, and tone from historical interactions, and to adjust content and expression accordingly, avoiding replies that contradict the persona or are overly generic.
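To make the three dimensions concrete, here is one hypothetical probe per dimension, continuing the invented records above. The questions and references are illustrative only, not drawn from the benchmark.

```python
# One invented probe per dimension; contents are for exposition only.
example_tasks = [
    {   # Fine-grained Detailed Recall: reproduce a concrete fact verbatim.
        "dimension": "factual_recall",
        "question": "What deadline did u_017 set in the direct-report chat?",
        "reference": "Friday, with the revised plan due by noon.",
    },
    {   # Memory Awareness: recall past events and integrate them into the answer.
        "dimension": "applied_memory",
        "question": "Why might the strategy group's timeline shift this week?",
        "reference": "Because u_017's team is reworking its plan after a missed deadline.",
    },
    {   # User Profile Understanding: the reply must fit the persona in that group.
        "dimension": "personalization",
        "question": "Draft u_017's reaction to another slip, in the direct-report chat.",
        "reference": "A terse, decisive message matching the director's stern tone there.",
    },
]
```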
Coming Soon...
Based on EverMemBench, we conduct a comprehensive evaluation of mainstream memory systems (e.g., MemOS, MemoryOS, Mem0, A-Mem) and state-of-the-art LLMs (e.g., GPT-4.5, GPT-4.1, Gemini-2.5-Pro), performing standardized measurements and cross-model comparisons across the three core dimensions. A minimal driver loop for such a comparison is sketched below.
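The loop below reuses the hypothetical `evaluate`/`MemoryTask` interface from the earlier sketch. The baseline adapter, `call_llm`, and `tasks` are placeholders standing in for a real model API and the loaded benchmark items; none of this is the benchmark's actual integration code.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Toy scoring function; the benchmark's actual metrics may differ."""
    return float(prediction.strip().lower() == reference.strip().lower())


class FullContextBaseline:
    """No-memory baseline: keeps every turn in the prompt instead of a memory store."""
    def __init__(self, model: str):
        self.model = model
        self.buffer: list[str] = []

    def ingest(self, turns: list[str]) -> None:
        self.buffer.extend(turns)

    def answer(self, question: str) -> str:
        prompt = "\n".join(self.buffer + [question])
        return call_llm(self.model, prompt)  # placeholder for a real API call


# Systems under test would also include adapters around MemOS, MemoryOS, Mem0,
# A-Mem, etc., all built to the same two-method interface.
systems = {"gpt-4.1-full-context": FullContextBaseline("gpt-4.1")}
for name, system in systems.items():
    print(name, evaluate(system, tasks, score_fn=exact_match))
```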
This project is released under the MIT License.
