The Harness Matters More Than the Model

Anthropic shipped Claude Fable 5 today. Its published Mythos/Fable table lists 80.3 on SWE-bench Pro against Opus 4.8's 69.2, with the gap widening on the longest, hardest tasks, and Fable costs twice as much per token in both directions. Fable pushes the bottleneck further toward verification and review. A better model doesn't clear that bottleneck. It just pours more work into it.

A harness is three decisions: what runs where, what gets into context, and what you check at every handoff. Those decisions shape your accuracy and your bill more than the model name alone.

A few weeks back I wrote up SPAR, a remediation pipeline anyone can build and adapt. It came out of the finding-fixing gap: the finding side of security got a 100x multiplier while fixing stayed manual, and a better framework on cheaper tools keeps you ahead of the curve. The same argument runs through this post. Scan everything on the cheapest model that can do the job, and reserve the expensive models for the work that's been flagged.

The cascade

Fable plans. Decomposition is where one mistake poisons everything downstream, so the strongest model reads the brief and emits the task plan. A plan is a few thousand tokens; paying Fable's price on it barely moves the bill.

Cheaper tiers execute. Haiku 4.5 takes low-complexity tasks, Sonnet 4.6 medium, Opus 4.8 hard or critical. Tiering is also a blast-radius control: a Haiku worker classifying strings carries neither credentials nor the full plan.

Fable or Opus validates. A validator scores each output against the original intent and rejects below threshold.

Two loops close it. Rejected work returns to its builder with the validator's feedback. Plan failures return to Fable for re-planning. The loops make it a harness rather than a relay that passes work forward and hopes.

The economics require cheaper tiers to carry roughly 90% of execution volume. Running Fable when Haiku would do is a 10x premium on both input and output. Fable earns its rate at the judgment points: planning and the validation gate, rarely on execution itself.

Accuracy: context over capability

Accuracy is a context problem before it's a model problem. The more you put in the window, the more the model forgets, recall drops as the context grows. Anthropic's own guidance is to keep the context small and high-signal and push work into sub-agents that hand back a summary instead of their whole transcript. Their multi-agent research setup, where each sub-agent works in its own clean window, beat a single-agent version by 90.2% on their internal research eval. And the stronger the model, the more it gains from this: given file-based memory to take notes while playing Slay the Spire, a long-horizon agent test, Fable 5 improved three times more than Opus 4.8 did, in Anthropic's launch announcement.

Security: every handoff is a trust boundary

An agent can't tell instructions from data. Whatever sits in its context can steer what it does next, and that's the whole problem. So follow a prompt injection through the pipeline and watch where it goes. A worker pulls a tool result, say a retrieved document, and someone has buried an instruction in the body. Nothing strips it, so it rides along to the next handoff. Now the planner is holding the injected goal right next to the real one, and on a long context it can't reliably tell which is which. The validator looks at the same poisoned context, decides the plan is consistent, and waves it through.

The harness breaks that chain in three places. Scan treats every tool result as untrusted and runs before anything crosses a tier boundary. Sub-agent isolation reads the untrusted external content with narrow tools and hands back a summary instead of raw text, so the injection never reaches the planner's window in the first place. And the validator checks output against the original intent, not against whatever the context happens to say now, which is why you re-anchor at checkpoints on a long session. Under all three, a log line on every handoff records what moved where, because any hop can quietly pass an injected instruction along as if it were trusted. Drop any one of the three and the cascade turns into an amplifier for the attack instead of a defense against it. None of these three breaks is a model feature; they live in the harness, so they carry over when you swap the model underneath.

Third-party Claude Code Skills are the same problem in a different wrapper. A skill bundles instructions, scripts, and tool permissions, and Anthropic warns that a malicious one can steer Claude to invoke tools or run code against its stated purpose. Treat an untrusted skill like an executable dependency with privileged instructions, not like a prompt snippet.

One limit here isn't really about the cascade. No setup of models gets around poison that's already sitting in the source every tier reads, because the second model you add to check the first just reads the same poisoned source and signs off on it. What helps is checking against something the injection can't touch, the user's original intent, or an org policy, kept outside the session and compared against the whole run instead of whatever the context says right now. That's what the validator's intent check is going for, and it's the thing that still works when the input was poisoned from the start.

Cutting the bill

Routing is the biggest lever: a 70/20/10 Haiku/Sonnet/Opus split on a mixed workload cuts spend roughly 40% against all-Sonnet. Prompt caching is the next: reads are 0.1x base input, 5-minute writes are 1.25x and 1-hour writes are 2x, and the savings compound because each tier makes repeated calls over its own cached prefix. The Batch API takes 50% off both directions for anything tolerating a 24-hour window (Scan sweeps, evals, nightly analysis), and when cache hits land in a batch the discounts stack.

Where a single model wins

The cascade wins at volume on decomposable work, and there are two places it doesn't. At low volume, orchestration overhead beats the savings; a few hundred requests a day wants one strong model and a good harness, because your time is the expensive part, not the tokens. And anything a user waits on in real time loses to the round trips, where they cost more than the quality buys. There, keep the routing and drop the handoffs: one direct call to the cheapest model that fits.

Before you deploy Fable 5

Two things change in production, neither about capability. Safety classifiers can route a flagged request to Opus 4.8 instead; on the product surfaces Anthropic reports over 95% of sessions never see this, and in the API it is an opt-in beta fallbacks parameter, not automatic on every call. Either way, log which model answered each turn, or your benchmarks quietly blend two models. And Fable 5 ships with 30-day data retention, no zero-data-retention option (Anthropic designates it a Covered Model). Compliance review before regulated data touches it.

What's actually new

Fable 5 is being sold on capability. The part worth your attention is the screen in front of it. A cheap classifier checks every request, and it's the same idea as Scan, with the same blind spot at the application layer: a request-level check can't prove a whole run still matches its original intent, so a run made of individually harmless requests sails through. To catch that in your own pipeline you have to hold the original intent and check the whole run against it, not each request on its own. And this gets more important, not less, as these models get good enough to trust with real work. When most of the code is machine-written, the question you care about stops being "did someone mean to write this?" and becomes "does this match what we actually asked for?" A per-request filter can't answer that. The harness can, on every output, and cheaply.

The harness outlasts the model

Anthropic shipped the better model today. A better one lands every three or four months. The harness is the part that carries over: small contexts, tiered routing, security at every handoff. Swap the model underneath and your accuracy and your savings stay. The harness matters more than the model.