Benchmark

How well does Cosmos remember? LongMemEval benchmark results

A transparent lab note on the current Cosmos retrieval stack. This page measures the shipping local BM25 path through SQLite FTS5 and MemoryStoreV2, with the benchmark limits spelled out instead of hidden.

LongMemEval · Wu et al., ICLR 2025 · 500 questions · about 50 sessions per haystack · Local BM25 retrieval

Executive summary

Cosmos retrieves the right memory in the top 10 results for 95.2% of LongMemEval questions.

It reaches 88.1% recall@10 on LongMemEval_s using only the current local BM25 retrieval path.

At 10x accumulated memory, recall@10 falls from 88.1% to 75.5%, showing where BM25 alone starts to lose ranking power.

Citation

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory by Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu.

Paper: Wu et al., ICLR 2025. Dataset: xiaowu0162/longmemeval.

End-to-end QA

70.8%

354 / 500 LongMemEval questions answered from retrieved context.

Retrieval completeness

88.1%

recall@10, with hit@10 at 95.2% on the full 500-question set.

Correct when retrieval misses

7.7%

Correct only 7.7% of the time when the right evidence is absent; the reader mostly abstains instead of guessing.

10x memory scale

75.5%

recall@10 on LongMemEval_m, where each haystack grows to about 500 sessions, measured on a 150-question streamed run.

What the numbers say

The right memory usually lands near the top.

The strongest signal is retrieval, because those metrics are deterministic and do not depend on a judge model. QA is included as a practical end-to-end check, but it should be read with the model caveat below.

metric	@1	@3	@5	@10	@20
hit@k — needle found	70.8%	85.4%	89.6%	95.2%	97.6%
recall@k — completeness	43.9%	73.0%	80.1%	88.1%	93.2%
precision@k	70.8%	43.3%	29.4%	16.4%	8.8%
ndcg@k	70.8%	72.0%	74.8%	78.0%	79.6%

Precision falls as k grows because each question has only about 1.9 gold sessions. MRR and hit@1 are the cleaner read on whether the memory is noisy.
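For readers who want to check how the rows in the table relate to each other, here is a minimal sketch of the per-question metrics under their standard binary-relevance definitions. The function name `retrieval_metrics` and its dictionary keys are illustrative and are not taken from the Cosmos harness, which may compute or aggregate these differently.

```python
from math import log2


def retrieval_metrics(ranked_ids, gold_ids, k):
    """Score one question. ranked_ids is the retriever's ordered output,
    gold_ids are the sessions that contain the evidence."""
    gold = set(gold_ids)
    top_k = ranked_ids[:k]
    hits = [sid in gold for sid in top_k]
    n_hits = sum(hits)

    # DCG with binary relevance; the ideal ranking puts every gold session first.
    dcg = sum(1.0 / log2(rank + 2) for rank, hit in enumerate(hits) if hit)
    ideal = sum(1.0 / log2(rank + 2) for rank in range(min(len(gold), k)))

    # Reciprocal rank of the first gold session anywhere in the ranking.
    first = next((i for i, sid in enumerate(ranked_ids) if sid in gold), None)

    return {
        "hit": 1.0 if n_hits else 0.0,       # at least one gold session in top k
        "recall": n_hits / len(gold),         # share of gold sessions in top k
        "precision": n_hits / k,              # share of top k that is gold
        "ndcg": dcg / ideal if ideal else 0.0,
        "rr": 0.0 if first is None else 1.0 / (first + 1),
    }


if __name__ == "__main__":
    # Two gold sessions, one of them ranked second.
    print(retrieval_metrics(["s7", "s2", "s9", "s4"], ["s2", "s4"], k=3))
```

Under these definitions hit@1, precision@1, and ndcg@1 are the same quantity, which matches the identical 70.8% in the @1 column. They also put a ceiling on precision: with about 1.9 gold sessions per question, and assuming no question has more than 10, average precision@10 cannot exceed roughly 1.9 / 10 = 19%, so the measured 16.4% sits close to that ceiling.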

What this proves

Cosmos local BM25 retrieval can surface the right evidence for almost every LongMemEval_s question by top-10.
Retrieval is not strongly biased toward recent evidence inside a fixed history.
When retrieval misses, the answerer mostly abstains instead of inventing an answer.

What this does not claim

It does not prove a literal 3-5 year memory horizon; LongMemEval does not contain multi-year calendar spans.
It is not a claim about every possible retrieval architecture or ranking method.
End-to-end QA percentages are judge-dependent. Retrieval metrics are the model-independent baseline.

Scale

At 10x memory volume, the fact is kept but harder to surface.

LongMemEval_m grows the haystack from roughly 50 sessions to roughly 500. This scale run uses a 150-question streamed benchmark slice. Local BM25 still retrieves useful context, but the needle moves lower as more lexically similar sessions compete.

metric	about 50 sessions (LongMemEval_s)	about 500 sessions (LongMemEval_m)
hit@5	89.6%	72.7%
hit@10	95.2%	81.3%
recall@5	80.1%	67.2%
recall@10	88.1%	75.5%

End-to-end QA

A practical reader test, after retrieval.

Cosmos retrieves the top-5 sessions, an answer model reads only that retrieved context, and a separate judge model grades the answer against the gold response. The answerer never sees the gold answer.
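The flow above can be sketched as a small loop. This is an illustration under stated assumptions, not the harness itself: `retrieve`, `answer`, and `judge` are hypothetical callables standing in for Cosmos retrieval, the answer model, and the judge model, and the question fields (`question`, `gold_answer`, `gold_session_ids`) are assumed names. A retrieval miss is assumed here to mean that no gold session appears in the top-k context handed to the reader.

```python
def run_qa(questions, retrieve, answer, judge, k=5):
    """Retrieve top-k sessions, answer from that context only, then grade."""
    records = []
    for q in questions:
        sessions = retrieve(q["question"], k)  # list of (session_id, text)
        context = "\n\n".join(text for _, text in sessions)
        # The answerer receives the question and retrieved context, never the gold answer.
        predicted = answer(q["question"], context)
        records.append({
            "retrieval_hit": any(sid in q["gold_session_ids"] for sid, _ in sessions),
            # Only the judge compares the prediction against the gold answer.
            "correct": bool(judge(q["question"], q["gold_answer"], predicted)),
        })
    return records


def accuracy(records, retrieval_hit=None):
    """Overall accuracy, or accuracy restricted to retrieval hits or misses."""
    rows = [r for r in records
            if retrieval_hit is None or r["retrieval_hit"] == retrieval_hit]
    return sum(r["correct"] for r in rows) / len(rows) if rows else float("nan")
```

Calling `accuracy(records)` gives the headline QA score, while `accuracy(records, retrieval_hit=False)` gives the "correct when retrieval misses" figure: a low value there means the reader is not answering from guesses or prior knowledge when the evidence is absent.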

question type	QA accuracy
Single-session user	92.9%
Knowledge update	92.3%
Single-session assistant	69.6%
Multi-session	62.4%
Temporal reasoning	58.7%
Preference	56.7%

Method

How the run was measured

Dataset: official LongMemEval from Hugging Face, xiaowu0162/longmemeval.

Retrieval scoring is deterministic: reset Cosmos, ingest each chat session as one memory, query with the question, then compare returned session ids to the benchmark gold answer sessions.
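The reset, ingest, query, compare loop can be sketched with Python's built-in `sqlite3` module and an FTS5 table. This is a stand-in for illustration only: it does not use MemoryStoreV2 or BM25Search, the table and column names are assumptions, and it requires a Python build whose SQLite includes FTS5.

```python
import re
import sqlite3


def build_store(sessions):
    """Fresh in-memory FTS5 index with one row per chat session."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE memories USING fts5(session_id UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO memories (session_id, body) VALUES (?, ?)", sessions
    )
    return conn


def search(conn, question, k=10):
    """Return up to k session ids ranked by FTS5's bm25() score."""
    # Quote each token and OR them together so punctuation in the
    # question is never parsed as FTS5 query syntax.
    tokens = re.findall(r"\w+", question.lower())
    if not tokens:
        return []
    match = " OR ".join(f'"{t}"' for t in tokens)
    rows = conn.execute(
        "SELECT session_id FROM memories WHERE memories MATCH ? "
        "ORDER BY bm25(memories) LIMIT ?",  # lower bm25() value means better match
        (match, k),
    )
    return [row[0] for row in rows]


def score_question(sessions, question, gold_ids, k=10):
    """Reset the store, ingest the haystack, query, compare ids to gold."""
    conn = build_store(sessions)
    ranked = search(conn, question, k)
    conn.close()
    found = set(ranked) & set(gold_ids)
    return {"hit": bool(found), "recall": len(found) / len(gold_ids)}
```

Because nothing in this loop calls a model or samples randomly, the same haystack and question always produce the same ranking, which is what makes the retrieval table reproducible.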

Engine under test: Cosmos MemoryStoreV2 + BM25Search, the same local BM25 path used by MCP memory search.

End-to-end QA was run on the full 500-question longmemeval_s set.

Answer + judge model: Claude Opus 4.8. This model is used only for the end-to-end QA score; the retrieval table above is model-independent.

Reproduce

Commands used for this benchmark

benchmarks/longitudinal/longmemeval

python3.11 -m benchmarks.longitudinal.longmemeval.harness --file s
python3.11 run_m.py
python3.11 qa_bundle.py

The retrieval run is local and deterministic. The answer and judge model named in the method note affects the end-to-end QA score, so the public claim should be read from the model-independent retrieval metrics first.

FAQ

Questions this page should answer directly

These are the short answers search engines, AI clients, and skeptical readers usually want first.

Is the injected memory noisy?

Not much. The cleanest signals are hit@1, MRR, and the missed-retrieval behavior. When the right evidence is not retrieved, the reader is correct only 7.7% of the time and usually abstains instead of confidently inventing an answer.

Is the memory complete enough to be useful?

On LongMemEval_s, Cosmos reaches 95.2% hit@10 and 88.1% recall@10. That means the needed evidence is usually present in the top retrieved set using only the current local BM25 retrieval path.

How long is it remembered?

This benchmark does not prove a literal multi-year calendar horizon. What it does show is that older memories inside the benchmark remain retrievable and that, when memory volume grows by roughly 10x, recall@10 falls from 88.1% to 75.5% because BM25 alone has a harder ranking problem.

