We gave the same task to Claude — with and without memory. Without it, your agent got house rules right 10–38% of the time, depending on model. With PLUR: 100%. Every model. Every run.
Not general intelligence. Not coding ability. Just: can your agent apply knowledge that only exists in your organization's memory? Tag conventions, file routing, deployment servers, which of your 100 tools handles trading. The answer is either in an engram or nowhere.
28 scenarios tested across Haiku 4.5, Sonnet 4.6, and Opus 4.5. Each scenario runs the same prompt through the agent twice per model: once with PLUR memory, once vanilla. Ties removed, since only decisive contests count: of the 84 paired runs, 35 produced a clear winner.
| Knowledge type | PLUR wins | Losses | Win rate | What it tests |
|---|---|---|---|---|
| House rules | 12 | 0 | 100% | Project conventions, tag formats, file placement |
| Tool routing | 10 | 2 | 83% | Finding the right tool among 100+ options |
| Past experience | 4 | 0 | 100% | API quirks, debugging insights, infrastructure |
| Learned style | 5 | 2 | 71% | Communication tone, design preferences |
| General tasks | 0 | 0 | — | Zero penalty (control group) |
House rules: 100% across every model, every run. Not a single loss. This is the unassailable claim. When your agent needs to know how things work here — tag conventions, file routing, DIP format, deployment patterns — PLUR gets it right. Every time.
PLUR helps every model, but for different reasons. Cheaper models cannot explore — memory gives them navigation. Expensive models can explore — memory gives them things they cannot discover.
| Model | Win rate | Record | Notes |
|---|---|---|---|
| Haiku 4.5 | 90% | 9W / 1L | Cheapest model benefits most |
| Sonnet 4.6 | 91% | 10W / 1L | Most popular coding model |
| Opus 4.5 | 86% | 12W / 2L | Most capable model |
| Category | Haiku PLUR / vanilla | Adv. | Sonnet PLUR / vanilla | Adv. | Opus PLUR / vanilla | Adv. |
|---|---|---|---|---|---|---|
| House rules | 100% / 10% | 10.0x | 100% / 26% | 3.9x | 100% / 38% | 2.7x |
| Tool routing | 83% / 39% | 2.1x | 83% / 39% | 2.1x | 59% / 42% | 1.4x |
| Past experience | 28% / 28% | 1.0x | 64% / 34% | 1.9x | 44% / 28% | 1.6x |
| Learned style | 56% / 51% | 1.1x | 61% / 63% | 1.0x | 71% / 50% | 1.4x |
The pattern: smarter models guess better on house rules (Haiku 10%, Opus 38%) but none get close to 100%. Memory isn't a reasoning crutch — it's information the model literally cannot infer.
The cheapest model with memory outperforms the most expensive without it.
Haiku 4.5 with PLUR: 0.80 avg on discoverability at ~$1/run. The smallest, cheapest model, but it knows what tools exist because PLUR tells it.
Opus 4.5 without PLUR: 0.31 avg on discoverability at ~$10/run. The most capable model available, but it can't discover tools it has never seen.
Haiku with PLUR: 2.6x better at roughly a tenth of the cost. Instead of spending more on a bigger model, spend less on memory.
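The arithmetic behind the reframe, using the benchmark numbers above (per-run costs are approximate):

```python
# Discoverability scores and approximate per-run costs from the tables above.
haiku_plur = {"score": 0.80, "cost_usd": 1.0}     # Haiku 4.5 + PLUR
opus_vanilla = {"score": 0.31, "cost_usd": 10.0}  # Opus 4.5, no memory

quality_ratio = haiku_plur["score"] / opus_vanilla["score"]
cost_ratio = opus_vanilla["cost_usd"] / haiku_plur["cost_usd"]

print(f"{quality_ratio:.1f}x the discoverability score")  # ~2.6x
print(f"at 1/{cost_ratio:.0f} of the cost")               # 1/10
```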
LongMemEval tests whether a memory system can correctly answer questions about past conversations across 6 categories: single-session facts, preferences, multi-session reasoning, temporal reasoning, knowledge updates, and assistant facts.
PLUR scores 86.7% using zero-cost hybrid search (BM25 + local embeddings, no API calls).
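To make "zero-cost hybrid search" concrete, here is a minimal pure-Python sketch of the shape: a BM25 keyword score fused with a local similarity score. PLUR's actual scorer, embedding model, and fusion weights are not published here; the bag-of-words "embedding" is a stand-in, and a real implementation would normalize the two score scales.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Standard Okapi BM25 over whitespace-tokenized docs."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def embed(text):
    # Stand-in for a local embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query, docs, alpha=0.5):
    """Fuse keyword and similarity scores; return doc indices, best first."""
    bm25 = bm25_scores(query, docs)
    emb = [cosine(embed(query), embed(d)) for d in docs]
    fused = [alpha * k + (1 - alpha) * e for k, e in zip(bm25, emb)]
    return sorted(range(len(docs)), key=lambda i: -fused[i])

docs = [
    "deploy to the staging server via rsync",
    "tag format: type/scope, lowercase",
    "the trading tool is exchange_cli",
]
print(hybrid_rank("which tool handles trading", docs))  # best match first
```

Nothing here calls a network API, which is the point: the whole retrieval path runs locally at $0.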
| System | Overall | Local-first | Zero-cost search | Data sovereign |
|---|---|---|---|---|
| Mastra Observational Memory | 95.0%* | No | No | No |
| Hindsight + Gemini-3 | 91.4% | No | No | No |
| SuperLocalMemory C | 87.7% | Yes | Yes | Yes |
| PLUR hybrid (Opus) | 86.7% | Yes | Yes | Yes |
| Supermemory | 85.2% | No | No | No |
| Letta | 83.2% | No | No | No |
| SuperLocalMemory A | 74.8% | Yes | Yes | Yes |
| Zep | 71.2% | No | No | No |
| Mem0 | 49.0% | No | No | No |
*Mastra OM uses conversation compression, not retrieval — a different methodology.
PLUR's key advantage: schema-aware retrieval. BM25 and embeddings see entity names, temporal dates, and rationale text — not just raw statements. Generic search engines cannot leverage structured engram fields.
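A sketch of what schema-aware indexing means in practice: structured fields are folded into the text the search index sees, so BM25 and embeddings can match on entities, dates, and rationale rather than only the raw statement. The field names below are illustrative, not PLUR's actual engram schema.

```python
def enrich_for_index(engram: dict) -> str:
    """Build the indexed text from a (hypothetical) engram record:
    statement plus entity names, temporal date, and rationale."""
    parts = [engram["statement"]]
    parts += engram.get("entities", [])
    if "occurred_at" in engram:
        parts.append(engram["occurred_at"])
    if "rationale" in engram:
        parts.append(engram["rationale"])
    return " ".join(parts)

engram = {
    "statement": "Deploys go through the blue server first.",
    "entities": ["blue-server", "deployment"],
    "occurred_at": "2025-03-14",
    "rationale": "Canary traffic is mirrored there before prod rollout.",
}
print(enrich_for_index(engram))
```

A query like "blue-server 2025-03-14" now hits this record even though neither token appears in the raw statement; a generic search engine indexing only the statement would miss it.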
| Category | Score | Hit@10 | Notes |
|---|---|---|---|
| Single-session user facts | 100% | 100% | Perfect — personal facts recalled correctly |
| Knowledge updates | 100% | 100% | Perfect — contradicting facts handled |
| Single-session preference | 100% | 100% | Perfect — preferences recalled correctly |
| Temporal reasoning | 80% | 80% | Strong — up from 60% |
| Single-session assistant | 80% | 100% | Strong retrieval, occasional answer errors |
| Multi-session reasoning | 60% | 80% | Up from 20% — hardest category |
Strengths: Personal facts, knowledge updates, preferences — the bread and butter of persistent memory. Weaknesses: Multi-session reasoning (connecting information across conversations) and temporal reasoning (understanding when things happened).
PLUR is the only system that combines local-first processing, zero-cost search, data sovereignty, schema-aware retrieval, and an exchange protocol for agent-to-agent knowledge sharing.
Supermemory scores 85.2% by sending your data to their cloud; PLUR scores 86.7% while keeping everything on your device. On local knowledge tasks, PLUR wins 89% of the time, because the question isn't how smart the search is, it's whether your knowledge is there at all.
We exceed Zep's accuracy with a fundamentally simpler architecture — no temporal knowledge graph, no cloud infrastructure. Just BM25 + local embeddings + schema-aware search. And your data never leaves your machine.
PLUR outperforms Mem0 by 37 points while being fully local. Mem0 requires cloud API keys; PLUR's hybrid search costs exactly $0.
Claude Code is brilliant but forgetful. It doesn't know your tag conventions, your deployment servers, or which of your 100 tools handles trading. PLUR fixes that — 89% win rate on local knowledge, zero penalty on everything else.
| Dimension | PLUR | Best competitor | Status |
|---|---|---|---|
| Retrieval (LongMemEval) | 86.7% | 95% (Mastra) | 4th place, ahead of Supermemory |
| Knowledge updates | 100% | 85% (Supermemory) | We lead (+15pp) |
| Local-first + zero-cost | Yes | SuperLocalMemory | Unique (schema-aware) |
| Real-task impact (A/B) | 89% win rate | — | Nobody else tests this |
35 decisive contests across 3 Claude models. We run the same prompt through the agent twice:
Same model. Same prompt. Different context. The only variable is whether the agent has access to persistent memory.
Each scenario is scored two ways: a numeric quality score (the per-category averages reported above) and a head-to-head win/loss verdict, with ties discarded.
Baseline isolation is critical. Claude Code walks up the directory tree looking for CLAUDE.md files — running the baseline inside the project would leak context. Baseline runs execute from /tmp/ with a single-line CLAUDE.md. Our first attempt missed this, and all early results were invalid.
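The isolation setup can be sketched as follows. This is a simplified harness sketch, not the benchmark's actual code: the `claude -p` headless invocation and the one-line CLAUDE.md content are assumptions. The point is the working directory: baseline runs start from a scratch directory containing only a minimal CLAUDE.md, so Claude Code's upward search for CLAUDE.md files cannot leak project context.

```python
import os
import tempfile

def run_spec(prompt: str, with_memory: bool, project_dir: str) -> dict:
    """Build the command and working directory for one A/B arm.
    (Hypothetical harness helper; 'claude -p' headless mode assumed.)"""
    if with_memory:
        cwd = project_dir  # full PLUR context available via the project
    else:
        # Isolated scratch dir with a single-line CLAUDE.md (content
        # illustrative), so nothing above it in the tree can leak in.
        cwd = tempfile.mkdtemp(prefix="ab-baseline-")
        with open(os.path.join(cwd, "CLAUDE.md"), "w") as f:
            f.write("You are a general-purpose assistant.\n")
    return {"cmd": ["claude", "-p", prompt], "cwd": cwd}

spec = run_spec("Tag this file per our conventions.", with_memory=False,
                project_dir="/work/my-project")
print(spec["cwd"])  # an isolated scratch directory, not the project
```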
Making BM25 and embeddings see entity names, temporal dates, and rationale text delivered +43 percentage points on LongMemEval. This is a PLUR-unique advantage — generic search engines index raw text; PLUR indexes knowledge-enriched text.
At 93% Hit@10, the correct memory is almost always in the top 10 results. The 56pp gap between Opus (86.7%) and GPT-4o (30%) comes entirely from how well the model synthesizes an answer from retrieved context. PLUR provides retrieval; the model provides reasoning.
Haiku 4.5 at ~$1/run with PLUR (0.80 avg) outperforms Opus 4.5 at ~$10/run without it (0.31 avg) on discoverability. This reframes the cost equation: instead of spending 10x on a bigger model, spend 0.1x on memory.
Our context file was 572 lines of facts that overpowered engram recall. We cut it to 207 lines of instructions and moved facts to engrams. Scores improved across the board. The lesson: context files should teach the agent how to use the system, not dump everything the system knows.
Haiku cannot explore — memory gives it navigation (2.2x on inferable scenarios). Opus can explore — memory gives it things it cannot discover (1.7x on memory-only scenarios). PLUR serves both ends of the intelligence spectrum.
An honest benchmark is one you can trust. Here is what our data does and does not support:
| Claim | Confidence | Caveat |
|---|---|---|
| 89% A/B win rate | High | 35 decided contests across 3 models. Consistent 86–91% per model. |
| 100% house rules | Very high | 12/12 wins, zero losses, across Haiku/Sonnet/Opus. |
| 86.7% LongMemEval | Medium | 30-question sample. True score likely 80–93%. Need 500-question run. |
| Ahead of Supermemory | Low-medium | Their score may use full 500 questions. Sample sizes differ. |
| Cost reframe (Haiku > Opus) | High | Same benchmark, same scenarios, same scoring. |
Add your own scenarios as YAML files in `scenarios/<category>/`. The harness handles execution, scoring, and reporting automatically.
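A hypothetical scenario file might look like this; the field names are illustrative, not the harness's actual schema, so check the existing files under `scenarios/` for the real shape:

```yaml
# scenarios/house-rules/tag-format.yaml (hypothetical example)
name: tag-format
category: house-rules
prompt: |
  Create a tag for the new release notes file.
expected:
  - uses the lowercase type/scope tag convention
  - places the file under the project's releases directory
```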