The Doctrine

The Responsible Scaling Policy was Anthropic's bet that self-imposed constraints, stated publicly in advance, could function as a governance mechanism where external regulation didn't yet exist.
The bet has a specific structure. Anthropic would define, in advance, capability thresholds (called ASLs, for AI Safety Levels) at which its models would require specific safety mitigations before deployment. At ASL-2 (current frontier models), standard mitigations suffice. At ASL-3, Anthropic commits to deploying only with specific security and access controls. At ASL-4, every deployment path depends on safeguards the company has not yet built. The policy commits Anthropic to halt deployment if mitigations aren't in place, even if the business consequences of halting are severe.
What makes this a governance instrument rather than a communications document is the asymmetry of the commitment. Anthropic can update the RSP, and it has. But each update requires public disclosure, an explanation, and a revision that can be compared against the prior commitments. The policy is not a promise of no exceptions; it is a structure that makes exceptions legible and therefore costly. A company can quietly ignore an internal safety policy. It cannot quietly revise a public commitment without the revision being noticed.
The RSP is a first-of-its-kind governance document that commits the company to specific safety and security measures that scale with a model's capabilities. Inspired by the Biosafety Levels used in infectious disease research, the RSP categorizes models into "AI Safety Levels" (ASL). Under this framework, ASL-3 systems — those showing high potential for misuse in biological weapons or cyber-attacks — require hardened weight protection and multi-layered deployment controls. ASL-4 systems, with the capability for autonomous R&D or catastrophic misuse, trigger potentially prohibitive security requirements and a possible deployment pause.
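The gating logic described above can be sketched in a few lines. This is only an illustration of the commitment's shape, not the RSP's actual procedure: the mitigation names, the `REQUIRED_MITIGATIONS` table, and the `deployment_gate` function are hypothetical, paraphrased from the prose here rather than taken from the policy text. The one design choice worth noting is that the gate fails closed: a level with no defined mitigation set (ASL-4 in this account) cannot be deployed.

```python
from dataclasses import dataclass

# Hypothetical mitigation names, paraphrased from the essay; the real RSP
# defines its requirements in far more detail.
REQUIRED_MITIGATIONS = {
    2: {"standard_mitigations"},
    3: {
        "standard_mitigations",
        "hardened_weight_protection",
        "multi_layered_deployment_controls",
    },
    # ASL-4 is deliberately absent: the required safeguards are not yet defined.
}


@dataclass(frozen=True)
class GateDecision:
    deploy: bool
    missing: frozenset


def deployment_gate(asl_level: int, mitigations_in_place: set) -> GateDecision:
    """Allow deployment only if every required mitigation is in place."""
    required = REQUIRED_MITIGATIONS.get(asl_level)
    if required is None:
        # Fail closed: an undefined requirement set means no deployment.
        return GateDecision(deploy=False, missing=frozenset({"undefined_requirements"}))
    missing = frozenset(required - set(mitigations_in_place))
    return GateDecision(deploy=not missing, missing=missing)


decision = deployment_gate(3, {"standard_mitigations"})
print(decision.deploy, sorted(decision.missing))
```

The point of the sketch is that business considerations appear nowhere in the function's inputs; only the capability level and the mitigations actually in place determine the outcome.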
The interpretability program is the other load-bearing element of the doctrine, and it operates on a different timescale. The RSP governs deployment decisions in the near term; it is a commitment mechanism for the years ahead. The interpretability work at Anthropic is a research investment in a question that may take decades to resolve: whether the internal representations of a large neural network can be understood well enough to verify what the model is actually doing, not just what it appears to be doing.
Chris Olah's circuits program, which Olah began at Google Brain, continued at OpenAI, and brought with him to Anthropic, starts from a specific commitment: that the model is not a black box, that understanding is possible, and that the work of understanding is to map the mechanism. The early results were striking. The program distinguished monosemantic from polysemantic neurons, traced circuits responsible for specific behaviors, and found curve detectors that recurred across architectures. The harder question, whether interpretability at the scale of frontier models is tractable at all, remains open.
Superposition can be described, by analogy with physics, as the "strongly coupled" regime of feature representation. Features are encoded non-orthogonally, which lets a network represent more features than it has neurons. In the neuron basis, a single neuron is polysemantic, meaning it responds to many unrelated concepts, and reverse-engineering a circuit in this basis is intractable in practice. The "S-dual" move, to continue the analogy, is an algorithmic one: find a "weakly coupled" dual basis in which each feature gets its own sparsely active coordinate (monosemantic). This is what Sparse Autoencoders (SAEs) are designed to do.
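To make the change of basis concrete, here is a minimal sketch of a sparse autoencoder's forward pass and training objective in NumPy. It assumes the common formulation (a ReLU encoder into an overcomplete feature space, a linear decoder, and an L1 sparsity penalty) and should not be read as Anthropic's exact architecture. The dimensions, the `l1_coeff` value, and the random untrained weights are all illustrative assumptions.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into sparse features, then reconstruct them."""
    # ReLU keeps feature activations non-negative; most should end up at zero
    # once the weights are trained under the sparsity penalty.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # (batch, n_features)
    x_hat = f @ W_dec + b_dec                         # (batch, d_model)
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return reconstruction + l1_coeff * sparsity


# Illustrative sizes: the feature space is wider than the activation space,
# which is what lets each feature claim its own coordinate.
rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4

x = rng.normal(size=(batch, d_model))  # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
print(f.shape, x_hat.shape, float(loss))
```

Training would minimize this loss over many activation samples. The interpretability claim is that, after training, individual columns of `f` tend to track single concepts where individual neurons did not.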
The doctrine holds RSP and interpretability together because they require each other. RSP without interpretability is a policy without a verification method — you can commit to not deploying dangerous models, but you can't verify what's dangerous without some way to look inside the model. Interpretability without RSP is a research program without an operational consequence — it produces knowledge without a commitment to use that knowledge as a deployment gate.
Whether this relationship is currently load-bearing is a legitimate question. The RSP's deployment decisions today rely primarily on capability evaluations — behavioral tests, red-teaming, structured elicitation — not on interpretability results. The interpretability program is not yet mature enough to be the primary safety gate. What the doctrine commits to is that interpretability results will become the primary safety gate as the program matures. This is a bet about the next ten years, not the next twelve months.
While Anthropic was founded on the premise of safety-first, its own Series C fundraising pitch deck used the very same aggressive scaling logic to attract investors. The deck stated, "We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles." This suggests that Anthropic's leadership accepts the "winner-takes-all" dynamic of the scaling laws they helped discover. The schism, therefore, was not about whether to build the singularity, but who should control the "kill switch" when it arrives.
There is an alternative reading of the doctrine — not cynical, but precise — that the safety posture and the commercial identity are the same artifact. Anthropic is not a safety lab that also sells AI access and a consumer product. It is a company that has made its safety research the core of its brand differentiation in a market where all of its major competitors are racing to build the same technology. The RSP is a commitment device, yes — but it is also the clearest statement of what makes Anthropic distinct from OpenAI, Google DeepMind, and every other frontier lab that doesn't publish its deployment gates.
This reading is not a critique. The fact that the doctrine serves commercial purposes doesn't mean it isn't also a genuine governance mechanism. The two are not mutually exclusive. But it matters for understanding what would cause the doctrine to bend. It won't bend under commercial pressure alone: if the doctrine is what differentiates the company, commercial pressure argues against bending. It would bend, if it bends, under a combination of competitive threat and internal erosion of the belief that the policy is actually working.
Editorial aside: The "Anthropic Mythos" documents in the corpus — the analyses of Anthropic's technical breakthroughs, the strategic positioning papers — read as both a record of genuine capability progress and as a catalogue of narrative moves. The doctrine is not only a policy; it is a story Anthropic tells about itself. Both things are true simultaneously, and both things can be analyzed.
What the doctrine has produced, five years in, is a company with higher institutional coherence than most AI labs — where research decisions and deployment decisions and hiring decisions all reference the same organizing principles — and genuine uncertainty about whether those principles are the right ones. That uncertainty is not a failure of the doctrine. It is the honest condition of anyone who has written down what they believe and has been honest enough to notice that writing things down doesn't settle whether they're true.
Doctrine without talent to execute it is a philosophy paper — which means the next question is who Anthropic has actually hired, and what that pattern reveals.