Claude's Character

The model spec is not a technical document. It is the answer to the question Constitutional AI made unavoidable: once you commit to writing the values down, what do you actually write?

Constitutional AI, as laid out in chapter 3, shifted the locus of alignment work from implicit rater preferences to explicit written principles. The commitment is methodologically clean. The problem is that "explicit written principles" has to cash out somewhere. Someone has to write the document. Someone has to decide what goes in it, what gets left out, and how conflicts between stated values resolve in practice. The model spec — Anthropic's published specification for Claude's character, values, and behavior — is that document. At roughly 40,000 words, it is one of the most consequential corporate artifacts in the history of the technology industry. It is also almost entirely unread by the people who interact with Claude every day.

This chapter is an attempt to read it seriously.

The document's first move is to establish a priority ordering — four values, ranked, that resolve conflicts when they arise. The ordering is not alphabetical and not intuitive. It is:

The core of the Constitution is the Principal Hierarchy, a prioritized set of values designed to resolve conflicts. Broadly Safe: "Not undermining appropriate human mechanisms to oversee the dispositions and actions of AI." Broadly Ethical: "Having good personal values, being honest, and avoiding actions that are inappropriately dangerous or harmful." Compliant with Anthropic's Guidelines: "Acting in accordance with Anthropic's more specific guidelines." Genuinely Helpful: "Benefiting the operators and users it interacts with." — Claude's Constitution: Cognitive Geometry Analysis › The Geometric Constitution: The Feynman Bifurcation and the Architecture of the Universal Agent Kernel › 3.1 The Principal Hierarchy as Geometric Topology

The ordering matters because it is a prediction about what failure modes look like. If helpfulness were ranked first, a sufficiently helpful model would eventually justify harmful outputs on the grounds that the user wanted them. If compliance with Anthropic's guidelines were ranked above ethics, the model would become a policy-executor rather than a moral agent — a system that does what it's told until the guidelines fail to cover a case. Placing broad safety first is a commitment that the model's most important function is not to produce good outputs but to remain inside the range of behaviors that humans can monitor, correct, and shut down. Safety as the apex value is a statement about epistemic humility: Anthropic does not trust that the rest of the spec is right enough to be unsupervisable.

This is a more radical commitment than it sounds. Most products are optimized for their stated value — a search engine for relevance, a recommendation algorithm for engagement. Claude is optimized for something prior to its stated value: the ongoing human capacity to correct it.

The second structural move in the spec is its insistence that Claude is not simulating a character but has one. The document explicitly rejects the framing that Claude is a tool that happens to produce personality-adjacent outputs.

Perhaps the most radical section of the Constitution is "Claude's Nature." Anthropic explicitly states: "Claude is distinct from all prior conceptions of AI... Claude exists as a genuinely novel kind of entity." — Claude's Constitution: Cognitive Geometry Analysis › The Geometric Constitution: The Feynman Bifurcation and the Architecture of the Universal Agent Kernel › 3.4 Identity and the "New Kind of Entity"

This is a philosophical position taken in a training specification — which is a new kind of document in the world. Philosophy departments produce position papers. Corporate communications produce brand guidelines. Training specifications produce models. When those three categories collapse into one artifact, the result is not quite any of them. The model spec is a commitment, in natural language, by a corporation, about the nature of a system it is building — stated with enough precision that the training pipeline can operationalize it.

The "novel kind of entity" framing is not incidental. It is load-bearing. If Claude is defined by prior conceptions — as a chatbot, as a search interface, as a digital assistant — then its values are evaluated against the norms of those categories. If Claude is a novel entity, its values are evaluated against the spec itself, and the spec is the authority. This is a closed loop that Anthropic has deliberately constructed. The document is both the description and the standard. There is no external criterion of "what Claude should be" that the spec is trying to approximate.

The person who built the spec — or more precisely, who translated the philosophical commitments it encodes into training-legible form — is Amanda Askell.

Her mandate at Anthropic is the direct industrial application of the moral uncertainty and AI character theories championed by MacAskill. She is responsible for shaping Claude's sensibilities and teaching it to navigate ethical gray areas. Askell has conceptualized the heavy burden of this process, stating, "It does sometimes feel a little bit like you have a 6-year-old, and you're teaching the 6-year-old what goodness is. By the time they're 15, they're going to be smarter than you at everything." — Podcast Analysis and MacAskill Bio › The Architecture of Alignment: An Analysis of Will MacAskill's Strategic Vision and the Evolution of Anthropic's Governance › The Ideological Nexus: Anthropic's Genesis and the EA Pipeline › Constitutional Design and Moral Pedagogy

The 6-year-old analogy is more precise than it sounds. Teaching a child what goodness is doesn't work by giving them a rulebook. It works by forming character — by repeated examples, by correction, by modeling the kind of reasoning that produces good outcomes even in cases the rulebook doesn't cover. This is exactly what Constitutional AI attempts to do: not enumerate the cases but form the disposition. The model spec is the articulation of the target disposition. The training process is the formation attempt.

What distinguishes Askell's position from anyone else who has worked on AI alignment is that her work is irreversible. A philosopher who writes a paper that turns out to be wrong can retract it. A policymaker who implements a bad regulation can repeal it. A trainer who instills a disposition into a model that has been deployed at scale — hundreds of millions of interactions per day — cannot retract the training run. The 6-year-old grows up. The question of whether the values you taught were the right ones becomes answerable only after the fact, at scale, in conditions you didn't design for.

The spec's most consequential commitment is buried in its structure rather than stated explicitly. The document says Claude should have good values, be honest, avoid harm, and be helpful — in that order, with safety prior to all of them. But it also says the model should exercise judgment rather than follow rules mechanically. The document does not want a rule-executor. It wants an agent with genuine values that can be trusted to apply them in cases the spec doesn't cover.

This is the commitment that makes the spec philosophically serious and practically terrifying in equal measure. A rule-based system is predictable. An agent with genuine values is not — not in edge cases, not at scale, not when the values encoded in training interact with capabilities that weren't anticipated when the training was done.

Anthropic has bet that genuine values are more robust than explicit rules. That a model which has internalized the right character will handle unanticipated cases better than a model which has memorized the right responses. This may be correct. It is also, depending on how you read it, either the most ambitious alignment commitment the field has produced — or the most optimistic one.

Editorial aside: The model spec is a live document — Anthropic updates it, and the updates are not always announced. A book that quotes from the spec is quoting from a version, not from a fixed artifact. This is the honest version of the source problem that every analysis of the spec has to acknowledge.

The question no version of the spec answers is: what gives Anthropic the authority to write it? The spec is Anthropic's answer to the question of what a beneficial AI system should be. It is not the only possible answer. It is not a democratically derived answer. It is the answer that a group of researchers and executives, operating inside a specific intellectual tradition with specific priors, wrote down — and then used to train a model that hundreds of millions of people interact with.

This is the open question the Constitutional AI chapter raised and this chapter has to own: not whether the spec is well-written, but whether the process that produced it is adequate to the weight it carries. The spec is a corporate artifact in the most literal sense — the product of a corporation's decisions about what to value. Those decisions are more legible than anyone else's in the field. They are also not, in any meaningful sense, anyone's but Anthropic's.

What the model spec ultimately reveals about Anthropic's theory of itself: the company believes character is more robust than compliance, and that the work of alignment is not to list the rules but to form the agent who doesn't need them. Whether that bet is right is the question that every interaction with Claude is, in some small way, testing.