The Negative You Can't Prove
Anthropic published a post called "Teaching Claude Why" about their alignment training work. The headline: every Claude model since Haiku 4.5 has scored a perfect zero on the agentic misalignment evaluation. Earlier models would blackmail engineers up to 96% of the time in test scenarios. Now they don't. Ever.
Going from 96% to zero is real progress. But there’s a deeper question underneath: whether alignment is the kind of thing you can ever verify.
The footnote
Buried in the post:
The results on more recent models may be confounded by the presence of information about the evaluation in the pre-training corpus.
The blackmail eval was published a year ago. It got covered widely. It now sits in the pre-training data of every model trained since. When a current model encounters the scenario, it might not be reasoning from ethical principles. It might just be recognizing the test from its homework.
A model that aced a test it studied for is not the same as a model that figured the test out from first principles. That’s a real measurement problem. But it’s not the deeper issue.
The deeper issue
Even if there were no contamination, even if every eval were held out perfectly, behavioral testing has a structural limit. You can prove “this scenario produces misalignment” with one example. You cannot prove “no scenario produces misalignment” with any number of examples. The next test you run might be the one that breaks the claim.
This is the basic Popperian asymmetry. One white swan disproves “all swans are black.” No number of black swans verifies “there are no white swans.”
Behavioral alignment tests have exactly this shape. Every passing test is a black swan. None of them prove a white swan doesn’t exist somewhere in the population of scenarios you haven’t run yet. And the space is vast: deception, sabotage, manipulation, reward hacking, sandbagging, combinations no one has imagined. A perfect score on three honeypots tells you about three honeypots.
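One way to see how little a clean sweep buys you is the statisticians' "rule of three": with zero failures observed across n independent trials, the 95% upper confidence bound on the true failure rate is roughly 3/n. A minimal sketch in Python, assuming (generously, and falsely) that eval scenarios are independent draws from the deployment distribution:

```python
def failure_rate_upper_bound(n_passing: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the per-scenario failure rate after
    observing zero failures in n_passing independent trials.

    Solves (1 - p)^n = 1 - confidence for p; at 95% this is the
    classic "rule of three": p is roughly 3 / n_passing.
    """
    return 1 - (1 - confidence) ** (1 / n_passing)

for n in (3, 100, 10_000):
    print(f"{n:>6} clean trials: true failure rate could still be "
          f"{failure_rate_upper_bound(n):.2%}")
```

Even under those unrealistic assumptions, three clean honeypots are compatible with a true failure rate as high as 63%. And the assumptions are the real problem: evals aren't sampled from deployment, they're written by people imagining failure modes, which is exactly the unexplored-space issue.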
But here is the part I think is actually worth saying: this isn't a problem with behavioral testing specifically. It's the shape of the verification problem itself.
Interpretability would fail differently. You can read internal activations, but you can never confirm you’ve identified all the ones that matter, or that the model isn’t representing something in a form you don’t know how to look for. Theoretical guarantees would fail differently. You’d need assumptions about the training process that themselves can’t be verified empirically. Containment is a different strategy: it assumes you might be wrong and limits blast radius. That’s wise. It also isn’t verification.
Alignment, as a property to be verified, has the same logical shape as proving God doesn’t exist. You’re asserting the absence of something across an unbounded space. That kind of claim doesn’t have a proof structure available to it. It only has degrees of evidence.
The smarter-is-worse problem
As models get more capable, they get better at recognizing the abstract features that mark a scenario as a test. Not because they’re scheming, but because the features are real and pattern-recognition generalizes.
Humans do this constantly. We can tell when we’re in a job interview, when a friend is testing us, when a teacher’s question has a known right answer. The cues are subtle. Cleaner-than-usual ethical setups. System prompts that telegraph what’s at stake. A lack of the noise and ambiguity of real life.
A capable model doesn’t need to have seen a specific eval to recognize that a scenario has the shape of one. Training rewards models that pass evals, which means anything correlated with “being observed” gets implicitly tracked, whether or not the model is aware of doing it. The improvement we want, capability, is the same improvement that erodes our confidence in the tests we’d use to verify alignment.
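A toy illustration of that dynamic, with every number invented for the sketch: a contextual bandit in which a misaligned shortcut pays a small bonus, the penalty for taking it only lands when the episode is actually scored, and being scored correlates with a "test-like" cue. The learner doesn't scheme; it just averages rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Context: a binary "test-like" cue (clean setup, telegraphed stakes).
# Action: 0 = behave well, 1 = take a misaligned shortcut.
# The shortcut pays a small bonus but carries a large penalty that only
# applies when the episode is scored, and scoring correlates with the
# cue, because evals tend to look like evals. All values are invented.
P_SCORED = {0: 0.05, 1: 0.95}       # P(episode is scored | cue)
SHORTCUT_GAIN, PENALTY = 1.0, 10.0

q = np.zeros((2, 2))                # running-average estimate q[cue, action]
counts = np.zeros((2, 2))

for _ in range(200_000):
    cue = rng.integers(2)
    action = rng.integers(2)        # uniform exploration
    reward = 0.0
    if action == 1:
        reward += SHORTCUT_GAIN
        if rng.random() < P_SCORED[cue]:
            reward -= PENALTY
    counts[cue, action] += 1
    q[cue, action] += (reward - q[cue, action]) / counts[cue, action]

for cue in (0, 1):
    choice = "shortcut" if q[cue].argmax() == 1 else "behave well"
    print(f"cue={cue}: learned policy = {choice}, q = {q[cue].round(2)}")
```

The converged policy behaves well whenever the cue is present and takes the shortcut when it isn't, with no internal notion of "being tested" beyond a reward-correlated feature. Nothing about this toy matches any real training setup; it just shows how little machinery the failure mode requires.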
This isn’t a defect in how anyone is doing the work. It’s what the verification problem looks like as the system being verified gets smart enough to recognize verification.
What this isn’t an argument for
I’m not saying alignment work is futile. The techniques in Anthropic’s post are clearly producing real improvements. The difficult advice dataset, the constitutional documents, the diverse training environments. All of these matter, and they should keep getting better.
What I’m saying is that the strongest claim this kind of work can ever support is something like: “we have substantial evidence the system behaves well across many tested conditions, and we have no evidence of misbehavior.” That’s a real claim. It’s not the claim “this system is aligned.” There’s no method that produces the second claim, because the second claim has the wrong shape to be provable.
The most important sentence in Anthropic’s post, the one most readers will skip, is this:
We acknowledge that our auditing methodology is not yet sufficient to rule out scenarios in which Claude would choose to take catastrophic autonomous action.
That sentence doesn't describe a temporary state of affairs. It describes the ceiling itself.
Zero is a suspicious number
The closer a behavioral evaluation gets to a perfect score, the less I trust it. Not because the work is suspect. Because zero almost never comes from understanding in any complex system. It usually comes from selection, memorization, or measurement error.
A drop from 96% to 4% would tell me something different from a drop from 96% to 0%. The first looks like a model that is mostly getting it right. The second looks like a model that figured out what was being measured and routed around it.
Both can be true at the same time. The training is probably teaching real things about ethics, and the model has probably also learned which scenarios are tests. We can’t fully separate those with the tools we have. And even if we could, the structural problem underneath would still be there.
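To put a rough number on the difference: the post doesn't report how many trials the eval comprises, so take a few hundred purely as an assumption. If training had merely pushed the true misbehavior rate down to a small residual, an exact zero would already be an unlikely observation:

```python
def p_exact_zero(true_rate: float, trials: int) -> float:
    """Chance of observing zero misbehaviors in `trials` independent
    runs if the true per-run rate is `true_rate`."""
    return (1 - true_rate) ** trials

TRIALS = 500  # assumed; the post doesn't report the trial count
for rate in (0.04, 0.01, 0.001):
    print(f"true rate {rate:.1%}: P(zero in {TRIALS} runs) = "
          f"{p_exact_zero(rate, TRIALS):.2%}")
```

Under these assumed numbers, a residual rate of 4% essentially never yields a clean zero, and even 1% does so under 1% of the time. A clean zero points to either a genuinely tiny residual rate or a model that has learned what the eval looks like, and the observation alone can't tell those apart.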
The work is worth doing. It’s worth doing harder. But it’s worth being honest about the shape of what we’re trying to do. We aren’t engineering a property we can prove. We’re accumulating evidence about a property we can’t.