Benchmark
How well does Cosmos remember? LongMemEval benchmark results
A transparent lab note on the current Cosmos retrieval stack. This page measures the shipping local BM25 path through SQLite FTS5 and MemoryStoreV2, with the benchmark limits spelled out instead of hidden.
Executive summary
Cosmos retrieves the right memory in the top 10 results for 95.2% of LongMemEval questions.
It reaches 88.1% recall@10 on LongMemEval_s using only the current local BM25 retrieval path.
At 10x accumulated memory, recall@10 remains 75.5%, showing where BM25 alone starts to lose ranking power.
Citation
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory by Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu.
Paper: Wu et al., ICLR 2025. Dataset: xiaowu0162/longmemeval.
End-to-end QA
70.8%
354 / 500 LongMemEval questions answered from retrieved context.
Retrieval completeness
88.1%
recall@10, with hit@10 at 95.2% on the full 500-question set.
Correct when retrieval misses
7.7%
Correct only 7.7% of the time when the right evidence is absent; the reader mostly abstains instead of guessing.
10x memory scale
75.5%
recall@10 on LongMemEval_m, where the haystack grows to about 500 sessions across a 150-question streamed run.
What the numbers say
The right memory usually lands near the top.
The strongest signal is retrieval, because those metrics are deterministic and do not depend on a judge model. QA is included as a practical end-to-end check, but it should be read with the model caveat below.
| metric | @1 | @3 | @5 | @10 | @20 |
|---|---|---|---|---|---|
| hit@k — needle found | 70.8% | 85.4% | 89.6% | 95.2% | 97.6% |
| recall@k — completeness | 43.9% | 73.0% | 80.1% | 88.1% | 93.2% |
| precision@k | 70.8% | 43.3% | 29.4% | 16.4% | 8.8% |
| ndcg@k | 70.8% | 72.0% | 74.8% | 78.0% | 79.6% |
Precision falls as k grows because each question has only about 1.9 gold sessions. MRR and hit@1 are the cleaner read on whether the memory is noisy.
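For readers who want to check the table, here is a minimal sketch of how these per-question metrics are commonly computed from a ranked list of session ids. It assumes binary relevance (a session is either gold or not) and unique ids in the ranking. The function names are illustrative and are not taken from the Cosmos harness, whose exact definitions may differ.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold session appears in the top k, else 0.0."""
    return 1.0 if any(s in gold for s in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold sessions that appear in the top k."""
    return len(set(ranked[:k]) & gold) / len(gold)


def precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the k slots that hold a gold session."""
    return len(set(ranked[:k]) & gold) / k


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold session, or 0.0 if none is returned."""
    for rank, session_id in enumerate(ranked, start=1):
        if session_id in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, session_id in enumerate(ranked[:k], start=1)
        if session_id in gold
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0


# Made-up example: two gold sessions, found at ranks 2 and 4.
ranked_ids = ["s7", "s2", "s9", "s4"]
gold_ids = {"s2", "s4"}
print(hit_at_k(ranked_ids, gold_ids, 3), recall_at_k(ranked_ids, gold_ids, 3))
```

Under these definitions, precision@k can never exceed the number of gold sessions divided by k. With about 1.9 gold sessions per question, average precision@10 is capped near 19%, which is why the precision row shrinks even while hit@k and recall@k improve.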
What this proves
- Cosmos local BM25 retrieval can surface the right evidence for almost every LongMemEval_s question by top-10.
- Retrieval is not strongly biased toward recent evidence inside a fixed history.
- When retrieval misses, the answerer mostly abstains instead of inventing an answer.
What this does not claim
- It does not prove a literal 3-5 year memory horizon; LongMemEval does not contain multi-year calendar spans.
- It is not a claim about every possible retrieval architecture or ranking method.
- End-to-end QA percentages are judge-dependent. Retrieval metrics are the model-independent baseline.
Scale
At 10x memory volume, the fact is kept but harder to surface.
LongMemEval_m grows the haystack from roughly 50 sessions to roughly 500. This scale run uses a 150-question streamed benchmark slice. Local BM25 still retrieves useful context, but the needle moves lower as more lexically similar sessions compete.
| metric | ~50 sessions (LongMemEval_s) | ~500 sessions (LongMemEval_m) |
|---|---|---|
| hit@5 | 89.6% | 72.7% |
| hit@10 | 95.2% | 81.3% |
| recall@5 | 80.1% | 67.2% |
| recall@10 | 88.1% | 75.5% |
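The size of the drop is easier to read as a difference. The short sketch below takes the eight figures reported in this section and computes the absolute drop in percentage points and the relative drop for each metric. Note that the two runs are not perfectly like-for-like, since the 500-session run covers a 150-question slice rather than all 500 questions.

```python
# Figures copied from this page: ~50 sessions (LongMemEval_s) vs ~500 sessions (LongMemEval_m).
small = {"hit@5": 89.6, "hit@10": 95.2, "recall@5": 80.1, "recall@10": 88.1}
large = {"hit@5": 72.7, "hit@10": 81.3, "recall@5": 67.2, "recall@10": 75.5}


def drops(before: dict, after: dict) -> dict:
    """Map each metric to (absolute drop in points, relative drop in percent)."""
    result = {}
    for name, b in before.items():
        a = after[name]
        result[name] = (round(b - a, 1), round(100 * (b - a) / b, 1))
    return result


for name, (points, relative) in drops(small, large).items():
    print(f"{name}: -{points} points ({relative}% relative)")
```

The largest absolute drop is at hit@5, which fits the observation that the needle is still retrieved but slides lower in the ranking.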
End-to-end QA
A practical reader test, after retrieval.
Cosmos retrieves the top-5 sessions, an answer model reads only that retrieved context, and a separate judge model grades the answer against the gold response. The answerer never sees the gold answer.
| question type | QA accuracy |
|---|---|
| Single-session user | 92.9% |
| Knowledge update | 92.3% |
| Single-session assistant | 69.6% |
| Multi-session | 62.4% |
| Temporal reasoning | 58.7% |
| Preference | 56.7% |
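The end-to-end procedure described above (retrieve the top 5, answer from that context only, then judge against the gold response) can be sketched as a small loop. This is an illustration under stated assumptions, not the Cosmos harness: `retrieve`, `answer`, and `judge` are hypothetical callables the caller supplies, and a "retrieval miss" is assumed here to mean that no gold session appears in the retrieved set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set, Tuple


@dataclass
class QAItem:
    question: str
    gold_answer: str
    gold_sessions: Set[str]


@dataclass
class Retrieved:
    session_id: str
    text: str


def run_qa(
    items: Sequence[QAItem],
    retrieve: Callable[[str, int], List[Retrieved]],   # hypothetical retriever
    answer: Callable[[str, List[str]], str],           # hypothetical answer model
    judge: Callable[[str, str, str], bool],            # hypothetical judge model
    k: int = 5,
) -> List[Tuple[bool, bool]]:
    """Return one (gold_was_retrieved, judged_correct) pair per question."""
    rows = []
    for item in items:
        hits = retrieve(item.question, k)
        # The answerer sees only the retrieved text, never the gold answer.
        predicted = answer(item.question, [h.text for h in hits])
        # Only the judge compares the prediction with the gold answer.
        correct = judge(item.question, predicted, item.gold_answer)
        retrieved_gold = any(h.session_id in item.gold_sessions for h in hits)
        rows.append((retrieved_gold, correct))
    return rows


def accuracy(rows: Sequence[Tuple[bool, bool]], retrieved: Optional[bool] = None) -> Optional[float]:
    """Accuracy over all rows, or only rows where retrieval hit (True) or missed (False)."""
    subset = [ok for hit, ok in rows if retrieved is None or hit == retrieved]
    return sum(subset) / len(subset) if subset else None
```

Splitting accuracy by whether the gold session was retrieved is what makes a "correct when retrieval misses" figure possible: `accuracy(rows, retrieved=False)` isolates the questions where the answerer had no supporting evidence.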
Method
How the run was measured
Dataset: official LongMemEval from Hugging Face, xiaowu0162/longmemeval.
Retrieval scoring is deterministic: reset Cosmos, ingest each chat session as one memory, query with the question, then compare returned session ids to the benchmark gold answer sessions.
Engine under test: Cosmos MemoryStoreV2 + BM25Search, the same local BM25 path used by MCP memory search.
End-to-end QA was run on the full 500-question LongMemEval_s set.
Answer + judge model: Claude Opus 4.8. This model is used only for the end-to-end QA score; the retrieval table above is model-independent.
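To make the retrieval loop above concrete, here is a minimal sketch of the same idea using Python's built-in `sqlite3` module and an FTS5 table ranked by `bm25()`. It is not the MemoryStoreV2 or BM25Search code: the table name, column names, and the query-building helper are assumptions chosen for illustration, and it requires an SQLite build with FTS5 enabled.

```python
import re
import sqlite3
from typing import Iterable, List, Tuple


def build_index(sessions: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Start from an empty in-memory store and ingest each session as one row."""
    conn = sqlite3.connect(":memory:")
    # session_id is stored but not indexed, so only the content is searched.
    conn.execute("CREATE VIRTUAL TABLE memories USING fts5(session_id UNINDEXED, content)")
    conn.executemany("INSERT INTO memories(session_id, content) VALUES (?, ?)", list(sessions))
    return conn


def to_match_query(question: str) -> str:
    """Turn free text into a safe FTS5 query: quoted terms joined with OR."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    unique = dict.fromkeys(tokens)  # drop duplicates, keep order
    return " OR ".join(f'"{t}"' for t in unique)


def search(conn: sqlite3.Connection, question: str, k: int = 10) -> List[str]:
    """Return up to k session ids, best BM25 match first."""
    match = to_match_query(question)
    if not match:
        return []
    rows = conn.execute(
        # bm25() returns lower (more negative) values for better matches.
        "SELECT session_id FROM memories WHERE memories MATCH ? "
        "ORDER BY bm25(memories) LIMIT ?",
        (match, k),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up sessions for illustration only.
conn = build_index([
    ("s1", "I adopted my dog Biscuit from the shelter in Austin last spring."),
    ("s2", "Can you suggest a pasta recipe for dinner tonight?"),
    ("s3", "My laptop battery drains quickly after the update."),
])
ranked = search(conn, "Which shelter did I adopt my dog from?", k=10)
gold = {"s1"}
print(ranked, any(s in gold for s in ranked))
```

Scoring then reduces to comparing the returned ids with the gold session ids for each question, which is why the retrieval numbers do not depend on any answer or judge model.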
Reproduce
Commands used for this benchmark
python3.11 -m benchmarks.longitudinal.longmemeval.harness --file s
python3.11 run_m.py
python3.11 qa_bundle.py
The retrieval run is local and deterministic. The answer and judge model named in the method note affects the end-to-end QA score, so the public claim should be read from the model-independent retrieval metrics first.
FAQ
Questions this page should answer directly
These are the short answers search engines, AI clients, and skeptical readers usually want first.
Is the injected memory noisy?
Not much. The cleanest signals are hit@1, MRR, and the missed-retrieval behavior. When the right evidence is not retrieved, the reader is correct only 7.7% of the time and usually abstains instead of confidently inventing an answer.
Is the memory complete enough to be useful?
On LongMemEval_s, Cosmos reaches 95.2% hit@10 and 88.1% recall@10. That means the needed evidence is usually present in the top retrieved set using only the current local BM25 retrieval path.
How long is it remembered?
This benchmark does not prove a literal multi-year calendar horizon. What it does show is that older memories inside the benchmark remain retrievable and that, when memory volume grows by roughly 10x, recall@10 falls from 88.1% to 75.5% because BM25 alone has a harder ranking problem.
Want the product view?
This benchmark is one piece of the story. Cosmos also captures project lessons, runtime problems, and code structure for AI tools.