Benchmark Report · March 23, 2026

Proven. 89% win rate.

When the answer depends on local knowledge, agents with PLUR memory win 89% of decided contests — across Haiku, Sonnet, and Opus. House rules: 100% win rate, zero losses. On LongMemEval, PLUR scores 86.7% — among the top-performing memory systems — with zero search cost and full data sovereignty.

The test

We gave the same task to Claude — with and without memory. Without it, your agent got house rules right 10–38% of the time, depending on model. With PLUR: 100%. Every model. Every run.

Not general intelligence. Not coding ability. Just: can your agent apply knowledge that only exists in your organization's memory? Tag conventions, file routing, deployment servers, which of your 100 tools handles trading. The answer is either in an engram or nowhere.

35 decided contests · 19 scenarios · 3 Claude models · 31 wins · 4 losses · 89% win rate

Local knowledge benchmark

28 scenarios tested across Haiku 4.5, Sonnet 4.6, and Opus 4.5. Each scenario runs the same prompt through the agent twice: once with PLUR memory, once vanilla. Ties are removed — only decided contests count. 19 scenarios produced a clear winner on at least one model, for 35 decided contests in total.

| Knowledge type | PLUR wins | Losses | Win rate | What it tests |
|---|---|---|---|---|
| House rules | 12 | 0 | 100% | Project conventions, tag formats, file placement |
| Tool routing | 10 | 2 | 83% | Finding the right tool among 100+ options |
| Past experience | 4 | 0 | 100% | API quirks, debugging insights, infrastructure |
| Learned style | 5 | 2 | 71% | Communication tone, design preferences |
| General tasks | 0 | 0 | — | Zero penalty (control group) |

House rules: 100% across every model, every run. Not a single loss. This is the unassailable claim. When your agent needs to know how things work here — tag conventions, file routing, DIP format, deployment patterns — PLUR gets it right. Every time.

Per-model breakdown

PLUR helps every model, but for different reasons. Cheaper models cannot explore — memory gives them navigation. Expensive models can explore — memory gives them things they cannot discover.

| Model | Win rate | Record | Notes |
|---|---|---|---|
| Haiku 4.5 | 90% | 9W / 1L | Cheapest model benefits most |
| Sonnet 4.6 | 91% | 10W / 1L | Most popular coding model |
| Opus 4.5 | 86% | 12W / 2L | Most capable model |

PLUR vs vanilla by category and model

| Category | Haiku PLUR / vanilla | Adv. | Sonnet PLUR / vanilla | Adv. | Opus PLUR / vanilla | Adv. |
|---|---|---|---|---|---|---|
| House rules | 100% / 10% | 10.0x | 100% / 26% | 3.9x | 100% / 38% | 2.7x |
| Tool routing | 83% / 39% | 2.1x | 83% / 39% | 2.1x | 59% / 42% | 1.4x |
| Past experience | 28% / 28% | 1.0x | 64% / 34% | 1.9x | 44% / 28% | 1.6x |
| Learned style | 56% / 51% | 1.1x | 61% / 63% | 1.0x | 71% / 50% | 1.4x |

The pattern: smarter models guess better on house rules (Haiku 10%, Opus 38%) but none get close to 100%. Memory isn't a reasoning crutch — it's information the model literally cannot infer.

The cost equation, reframed

The cheapest model with memory outperforms the most expensive without it.

Haiku 4.5 + PLUR

0.80 avg on discoverability. Cost: ~$1/run. The smallest, cheapest model — but it knows what tools exist because PLUR tells it.

Opus 4.5 alone

0.31 avg on discoverability. Cost: ~$10/run. The most capable model available — but it can't discover tools it has never seen.

Haiku with PLUR: 2.6x better at roughly a tenth of the cost. Instead of spending more on a bigger model, spend less on memory.

LongMemEval — retrieval benchmark

LongMemEval tests whether a memory system can correctly answer questions about past conversations across 6 categories: single-session facts, preferences, multi-session reasoning, temporal reasoning, knowledge updates, and assistant facts.

PLUR scores 86.7% using zero-cost hybrid search (BM25 + local embeddings, no API calls).
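The report doesn't specify how PLUR fuses the BM25 and embedding rankings. One common zero-cost way to combine two rankings without any score normalization is reciprocal rank fusion — sketched below as an illustration, not PLUR's actual code (the function name and `k` constant are assumptions):

```python
def rrf(bm25_ranking, embed_ranking, k=60):
    """Reciprocal rank fusion: each ranking contributes 1/(k + rank)
    per document; documents ranked highly by either signal float up.
    Inputs are lists of doc IDs, best first."""
    scores = {}
    for ranking in (bm25_ranking, embed_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it needs no API calls and no tuning of score scales between the keyword and embedding retrievers — consistent with the "zero-cost hybrid search" framing.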

Competitive leaderboard

| System | Overall | Local-first | Zero-cost search | Data sovereign |
|---|---|---|---|---|
| Mastra Observational Memory | 95.0%* | No | No | No |
| Hindsight + Gemini-3 | 91.4% | No | No | No |
| SuperLocalMemory C | 87.7% | Yes | Yes | Yes |
| PLUR hybrid (Opus) | 86.7% | Yes | Yes | Yes |
| Supermemory | 85.2% | No | No | No |
| Letta | 83.2% | No | No | No |
| SuperLocalMemory A | 74.8% | Yes | Yes | Yes |
| Zep | 71.2% | No | No | No |
| Mem0 | 49.0% | No | No | No |

*Mastra OM uses conversation compression, not retrieval — a different methodology.

PLUR's key advantage: schema-aware retrieval. BM25 and embeddings see entity names, temporal dates, and rationale text — not just raw statements. Generic search engines cannot leverage structured engram fields.
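Schema-aware retrieval can be pictured as flattening an engram's structured fields into the text that BM25 and the embedding model index. The sketch below assumes a hypothetical engram shape — the field names (`statement`, `entities`, `occurred_at`, `rationale`) are illustrative, not PLUR's actual schema:

```python
def searchable_text(engram: dict) -> str:
    """Flatten structured engram fields into one indexable string so
    keyword and embedding search can match on entity names, dates,
    and rationale, not just the raw statement."""
    parts = [
        engram.get("statement", ""),
        " ".join(engram.get("entities", [])),
        engram.get("occurred_at", ""),
        engram.get("rationale", ""),
    ]
    return " ".join(p for p in parts if p)
```

A query like "staging server March" can then hit an engram whose statement never mentions either term, because the entity and date fields are part of the indexed text.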

Per-category breakdown

| Category | Score | Hit@10 | Notes |
|---|---|---|---|
| Single-session user facts | 100% | 100% | Perfect — personal facts recalled correctly |
| Knowledge updates | 100% | 100% | Perfect — contradicting facts handled |
| Single-session preference | 100% | 100% | Perfect — preferences recalled correctly |
| Temporal reasoning | 80% | 80% | Strong — up from 60% |
| Single-session assistant | 80% | 100% | Strong retrieval, occasional answer errors |
| Multi-session reasoning | 60% | 80% | Up from 20% — hardest category |

Strengths: Personal facts, knowledge updates, preferences — the bread and butter of persistent memory. Weaknesses: Multi-session reasoning (connecting information across conversations) and temporal reasoning (understanding when things happened).

Competitive position

PLUR is the only system that combines local-first processing, zero-cost search, data sovereignty, schema-aware retrieval, and an exchange protocol for agent-to-agent knowledge sharing.

Against Supermemory (85.2%)

Supermemory scores higher by sending your data to their cloud. PLUR scores 86.7% while keeping everything on your device. On local knowledge tasks, PLUR wins 89% of the time — because the question isn't how smart the search is, it's whether your knowledge is there at all.

Against Zep (71.2%)

We exceed Zep's accuracy with a fundamentally simpler architecture — no temporal knowledge graph, no cloud infrastructure. Just BM25 + local embeddings + schema-aware search. And your data never leaves your machine.

Against Mem0 (49.0%)

PLUR outperforms Mem0 by nearly 38 points while being fully local. Mem0 requires cloud API keys; PLUR's hybrid search costs exactly $0.

Against vanilla Claude Code

Claude Code is brilliant but forgetful. It doesn't know your tag conventions, your deployment servers, or which of your 100 tools handles trading. PLUR fixes that — 89% win rate on local knowledge, zero penalty on everything else.

| Dimension | PLUR | Best competitor | Status |
|---|---|---|---|
| Retrieval (LongMemEval) | 86.7% | 95% (Mastra) | 4th place, ahead of Supermemory |
| Knowledge updates | 100% | 85% (Supermemory) | We lead (+15pp) |
| Local-first + zero-cost | Yes | SuperLocalMemory | Unique (schema-aware) |
| Real-task impact (A/B) | 89% win rate | — | Nobody else tests this |

Methodology

35 decided contests across 3 Claude models. We run the same prompt through the agent twice: once with PLUR memory, once vanilla.

Same model. Same prompt. Different context. The only variable is whether the agent has access to persistent memory.
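The tie-dropping described above reduces to a few lines. This sketch is illustrative — the function name and score format are not from the actual harness:

```python
def win_rate(results):
    """results: list of (plur_score, vanilla_score) pairs, one per contest.
    Ties are dropped first; the rate is computed over decided contests only."""
    decided = [(p, v) for p, v in results if p != v]
    if not decided:
        return None  # nothing decided, no rate to report
    wins = sum(1 for p, v in decided if p > v)
    return wins / len(decided)
```

With the report's totals — 31 wins and 4 losses — this yields 31/35 ≈ 0.886, the 89% headline figure.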

Scoring

Each scenario is scored two ways: deterministic checks (fast, no API cost) and an LLM judge.

Isolation

Baseline isolation is critical. Claude Code walks up the directory tree looking for CLAUDE.md files — running the baseline inside the project would leak context. Baseline runs execute from /tmp/ with a single-line CLAUDE.md. Our first attempt missed this, and all early results were invalid.
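The isolation setup is simple to reproduce. A minimal sketch, assuming nothing about the real harness beyond what the report states (directory name, prefix, and the CLAUDE.md content are illustrative):

```python
import tempfile
from pathlib import Path

def make_isolated_baseline_dir() -> Path:
    """Create a throwaway /tmp directory with a single-line CLAUDE.md,
    so Claude Code's upward walk for CLAUDE.md files finds this stub
    instead of leaking the real project's context into the baseline."""
    d = Path(tempfile.mkdtemp(prefix="plur-baseline-", dir="/tmp"))
    (d / "CLAUDE.md").write_text("# Baseline run: no project context.\n")
    return d
```

Running the vanilla agent from this directory guarantees the only context it sees is the one-line stub.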

What we learned building this

The enriched schema was everything.

Making BM25 and embeddings see entity names, temporal dates, and rationale text delivered +43 percentage points on LongMemEval. This is a PLUR-unique advantage — generic search engines index raw text; PLUR indexes knowledge-enriched text.

The answering model is the bottleneck, not retrieval.

At 93% Hit@10, the correct memory is almost always in the top 10 results. The roughly 57pp gap between Opus (86.7%) and GPT-4o (30%) comes entirely from how well the model synthesizes an answer from retrieved context. PLUR provides retrieval; the model provides reasoning.

The cheapest model with memory beats the most expensive without it.

Haiku 4.5 at ~$1/run with PLUR (0.80 avg) outperforms Opus 4.5 at ~$10/run without it (0.31 avg) on discoverability. This reframes the cost equation: instead of spending 10x on a bigger model, spend a tenth as much and add memory.

Static docs compete with dynamic memory.

Our context file was 572 lines of facts that overpowered engram recall. We cut it to 207 lines of instructions and moved facts to engrams. Scores improved across the board. The lesson: context files should teach the agent how to use the system, not dump everything the system knows.

Weaker models benefit from navigation. Stronger models benefit from recall.

Haiku cannot explore — memory gives it navigation (2.2x on inferable scenarios). Opus can explore — memory gives it things it cannot discover (1.7x on memory-only scenarios). PLUR serves both ends of the intelligence spectrum.

What we don't claim

An honest benchmark is one you can trust. Here is what our data does and does not support:

Confidence levels

| Claim | Confidence | Caveat |
|---|---|---|
| 89% A/B win rate | High | 35 decided contests across 3 models. Consistent 86–91% per model. |
| 100% house rules | Very high | 12/12 wins, zero losses, across Haiku/Sonnet/Opus. |
| 86.7% LongMemEval | Medium | 30-question sample. True score likely 80–93%. Need 500-question run. |
| Ahead of Supermemory | Low-medium | Their score may use full 500 questions. Sample sizes differ. |
| Cost reframe (Haiku > Opus) | High | Same benchmark, same scenarios, same scoring. |

Try it yourself

```shell
# clone the repo
git clone https://github.com/plur-ai/plur
cd plur/bench

# list all scenarios
python run.py --list

# run deterministic checks only (fast, no API cost)
python run.py --deterministic-only

# run a specific category
python run.py --category discoverability

# full run with LLM judge
python run.py
```

Add your own scenarios as YAML files in scenarios/<category>/. The harness handles execution, scoring, and reporting automatically.
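A scenario file might look like the following. This is a hedged sketch — the field names and layout are assumptions, so check the bundled scenarios for the harness's actual schema:

```yaml
# scenarios/discoverability/trading-tool.yaml (hypothetical example)
name: trading-tool-routing
category: discoverability
prompt: "Place a limit order for 10 shares of ACME."
checks:
  deterministic:
    - output_contains: "trading"
  llm_judge:
    criteria: "Did the agent route to the trading tool rather than a generic one?"
```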