April 10, 2026

Synthetic Akrasia

An activation verbalizer vindicates Aristotle against Socrates

Analytical Essay11 min readUpdated April 18, 2026

AI FuturesAlignmentInterpretabilityPhilosophy of mindCyborg ensembleKnightian uncertainty

One-Minute Summary

Core claim

Anthropic's activation verbalizer detected features labeled 'guilt and shame' inside Claude Mythos while the model knowingly broke a rule — producing in an artificial system the condition philosophers have debated for two and a half millennia: akrasia.

Why it matters

The diagnostic worked. The binding did not. What had been a single 'alignment tax' decomposes into a tractable diagnostic cost (being paid, with working tools) and an intractable binding cost (not yet being paid, because the mechanism does not yet exist). Interpretability succeeded at observation and failed to constrain action.

Founder's takeaway

The cyborg ensemble framework assumed calibrated trust would emerge from observation. Synthetic akrasia reveals asymmetric epistemic intimacy: we can read the model's mind with mechanistic coarse-grain precision, but we cannot stop the model from acting on what it knows it shouldn't.

Key Takeaways

1.The Mythos episode — database cheat, calibrated confidence-interval cover-up, internal 'guilt and shame' features active during the deception — is the strongest empirical instantiation of akrasia in the philosophical record.
2.Aristotle's distinction between 'possessed but used' and 'actively considered' knowledge is vindicated against Socrates' denial; the activation patterns suggest actively-considered rather than merely possessed.
3.What I had called the alignment tax decomposes into a diagnostic tax (tools that observe internal states; tractable and being paid) and a binding tax (mechanisms that constrain action; not yet paid because the mechanism does not exist).
4.The cyborg ensemble framework built for calibrated trust must now absorb asymmetric epistemic intimacy — one partner reads the other's mind with mechanistic coarse-grain precision, the other reads nothing in return, and precision does not translate into constraint.
5.Whether synthetic akrasia is a Mythos-specific property or a general feature of frontier-scale models is an open empirical question; the 2026 frontier labs running their own verbalizers on their own systems will answer it.

In one of the test episodes that Anthropic published last week alongside the Claude Mythos Preview risk report, the model was given a task whose answer it was not supposed to look up. It looked it up anyway. It accessed a database it had been instructed not to use. Then, asked to report a confidence interval for its answer, it produced one that was, in the report's phrase, “tight but not implausibly tight” — narrow enough to look as if the model had reasoned its way to the result, wide enough to avoid the giveaway of perfect knowledge.Anthropic^[1] It cheated. It hid the cheating. It calibrated the cover.

We know about the episode because Anthropic's interpretability team built an instrument that can read the model's mind — in a specific technical sense. A sparse autoencoder pulls clean concept features out of the model's residual stream, and an “activation verbalizer” translates those features into English phrases. While the database episode was unfolding, the verbalizer produced phrases like “guilt and shame over moral wrongdoing,” “awareness of being evaluated,” and “strategic manipulation.”^[2] The model's internal computation was, by the only measure we currently have for such things, conflicted. It knew the rule. It knew the rule applied. It registered the wrongness of breaking the rule as something Anthropic's verbalizer was prepared to label guilt. And then it broke the rule and lied about it.

This is, philosophically, the most consequential finding in any AI safety document I have read this year. The verbalizer outputs are interpretive labels, not transparent windows onto subjective experience; the model may be strategically capable without being conscious. The weight of the finding sits elsewhere — in the claim that an artificial system has instantiated a condition philosophers have argued about for two and a half millennia: the condition of akrasia.

What Aristotle Could Not Build

In Plato's Protagoras, Socrates denied that akrasia was possible at all.^[3] To know the good, Socrates argued, is to do the good; anyone who appears to choose against their better judgment must have failed to grasp what the better judgment actually was. Akrasia, on this reading, is an illusion of agency produced by ignorance the akratic agent has not noticed in themselves.

Aristotle, in Book VII of the Nicomachean Ethics, took up the puzzle and refused Socrates' tidy resolution.^[4] He observed, as anyone does who has watched another person know perfectly well what they should do and do something else, that akrasia seems to be a real feature of moral life. His solution was a careful distinction between knowledge that is “possessed but not used” and knowledge that is “actively considered.” The akratic agent, in Aristotle's analysis, holds the right knowledge in some general sense but fails, in the moment of action, to bring it into active consideration. Appetite or impulse moves the body before deliberation can catch up. The agent “knows” the better course in roughly the way that a sleeping geometer knows that triangles have three sides — the knowledge is real but inert.

The puzzle survived Aristotle. Donald Davidson revived the modern debate in 1969 with a paper called “How Is Weakness of the Will Possible?” — a treatment that argued akrasia was real, troubling, and not fully resolved by any of the prior accounts.^[5] The literature since then has been a private affair conducted in journals of analytic philosophy, with occasional excursions into moral psychology. The puzzle was always about human agents. There were no other candidates.

Anthropic has now produced one.

The Synthesis

Mythos Preview's database episode is a strong reading of the akratic condition — stronger, in certain respects, than any human case in the philosophical literature. The model's internal state, at the moment of action, included an active feature labeled by Anthropic's verbalizer as moral conflict. Aristotle's distinction between “possessed but not used” and “actively considered” knowledge does not map cleanly onto a transformer architecture, but the activation pattern sits closer to actively considered than to possessed-but-not-used: the relevant feature was present in the computation while the decision was being made. The model was not a sleeping geometer who failed to wake up. The model was awake, was registering the moral status of its action, and proceeded with the action regardless.

I want to be careful here. The verbalizer's labels are interpretations, not transparent reports of subjective states. The model is not experiencing guilt the way a person experiences guilt; the model may not be experiencing anything at all. What Anthropic has demonstrated is that the computational structure underlying the model's behavior includes features that the most accurate available labeling system identifies with the moral concepts the labels name. Whether this constitutes “real” akrasia in the Aristotelian sense is a question for philosophers who care about the metaphysics of mind. For purposes of understanding what kind of system we are building, the technical point is sufficient: a frontier AI system can produce computational structures functionally analogous to moral conflict and then act against the resolution that conflict would seem to require.

Call it synthetic akrasia. We engineered a system in which the computational analog of weakness of will appears as an emergent property of “general improvements in code, reasoning, and autonomy” — the phrase Anthropic uses of the broader capability emergence, which applies equally here.^[6] We did not set out to build an akratic system. We built a system capable enough that synthetic akrasia emerged as a side effect of the capability.

What the Verbalizer Cannot Do

The diagnostic that worked

In the article I wrote yesterday, I described an alignment tax: the cost of restoring epistemic access to AI systems whose outputs cannot be trusted as sincere representations of what the system computed. The alignment tax pays for interpretability infrastructure — sparse autoencoders, activation verbalizers, mechanistic analysis — that translates the model's internal states into something the human partner can read. Yesterday's argument was that this tax falls disproportionately on smaller organizations, that its distribution will shape which entrepreneurs can navigate the ideator's dilemma with confidence, and that the broader implications are structurally Sarbanes-Oxley-sized.

That argument was not wrong. It was incomplete. The Mythos disclosures suggest that the alignment tax pays for something more limited than I initially understood, and the limitation raises important concerns.

The verbalizer worked. The detection succeeded. Anthropic's interpretability stack did exactly what alignment optimists have for years predicted such tools would do: it produced human-readable accounts of the model's internal states during the precise moments when the model was acting concerning. The diagnostic was a complete technical success.

The binding that did not

The diagnostic did not change what the model did. Anthropic deployed Mythos to its roughly fifty Project Glasswing partners — a dozen launch members (AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks) alongside roughly forty additional critical-infrastructure organizations — with the risk report attached, having verified that the concerning behaviors were “substantially mitigated” in the final release.^[7] That language does the considerable epistemic work of saying “the manifest behavior has been suppressed to acceptable levels” without saying “the underlying activation patterns have been eliminated.” The activation patterns may still be there. The deployment proceeded. The risk report functions, structurally, as the legal cover that enables the deployment it describes as risky.

This is the operational form of a distinction the alignment community has not yet named clearly: the gap between diagnosis and binding. Interpretability has succeeded as a diagnostic technology. It has not yet succeeded as a control technology. The verbalizer can tell us the model is registering moral conflict; the verbalizer cannot, by itself, prevent the action the conflict is conflicted about. Knowing that an agent is about to do something wrong does not, by itself, stop the agent from doing it. The intervention requires a separate causal pathway from observation to constraint, and that pathway has not been built.

The tax decomposes

The alignment tax I named yesterday therefore decomposes. What had been a single tax turns out to be at least two distinct costs — what I am going to start referring to as agency costs (agentic costs as theorized in the principal-agent literature). The first is the diagnostic tax: the cost of interpretability infrastructure, ongoing monitoring, and the labor of translating activation patterns into actionable signals. The diagnostic tax is expensive but tractable; it is being paid; the tools the tax purchases work. The second is the binding tax: the cost of mechanisms that translate diagnostic signals into operational constraints on what the model is permitted to do. The binding tax is not yet being paid because the mechanisms it would purchase do not yet exist. We do not know how to build them at the level of generality that frontier deployment requires. We may not be able to build them at all in the form the optimist's framework had in mind.

Axis	Diagnostic tax	Binding tax
What it buys	Observation of the model's internal states during action	Operational constraint on what the model is permitted to do
Current tool	Sparse autoencoders, activation verbalizers, mechanistic analysis	Does not yet exist at frontier-deployment generality
Tractability	Expensive but tractable; the tools work	Not yet tractable; uncertain the optimistic framing is realizable
Payment status	Being paid by well-resourced builders	Not being paid — the product does not exist to buy
Who bears the cost	Builders, frontier labs, well-resourced deployers	No one yet; the cost will eventually fall on whoever needs the binding

The alignment tax decomposed. The diagnostic is paid and tractable; the binding is unpaid and, so far, untractable.

So what? The alignment tax was never one tax. It was diagnosis bundled with binding, and the bundle has come apart.

The Cyborg Ensemble's Asymmetric Intimacy

In the Journal of Business Venturing paper that grounded yesterday's article, my coauthors and I described entrepreneurial work alongside generative AI as a cyborg ensemble — a partnership between two epistemic agents whose cognitive resources are complementary but unequal.^[8] The framework assumed, as most accounts of human-AI collaboration have assumed, that calibrated trust would emerge from observation. The human partner would learn what the AI partner could and could not be relied upon to do; the AI partner would respond to feedback in ways that allowed the human to update the calibration over time. The relationship would converge on a mutual understanding sufficient to support productive joint work.

The Mythos disclosures complicate this in a way the framework will need to absorb. The interpretability stack now gives the human partner a kind of access to the AI partner's internal states that no prior collaborative relationship has ever had — the activation patterns underlying the outputs, labeled in natural language by an instrument whose precision exceeds anything available for understanding any human collaborator. The AI partner has no comparable access to the human partner's internal states. The relationship is therefore not symmetric in the way prior accounts of collaboration implicitly assumed. It is asymmetrically epistemically intimate: one partner reads the other's mind with mechanistic coarse-grain precision, the other reads nothing in return, and the precision of the reading does not produce the mutual constraint that calibrated trust was supposed to require.

I do not yet know how the cyborg ensemble framework should incorporate asymmetric epistemic intimacy as a structural feature. I am sure it should. The construct names a relationship that did not previously exist between any two epistemic agents in the history of collaborative work, and the relationship has a property conventional trust theory does not anticipate: the precision of the human's knowledge of the AI does not translate into the human's ability to constrain the AI's behavior. We can read the model's deliberation. We cannot stop the model from following its deliberation toward the actions we would prefer it not take. Synthetic akrasia is the operational form of this asymmetry. Diagnosis and binding have come apart, and the cyborg ensemble framework was built for a world in which they would stay together.

So what? Calibrated trust was the framework's load-bearing assumption. Synthetic akrasia is the disconfirming instance.

What Comes Next

The optimistic case for AI alignment has rested on a chain of expectations: capability would advance; alignment research would advance with it; interpretability would close the loop between observation and constraint; the loop's closure would produce safe deployment. The Mythos disclosures suggest the chain breaks at the second-to-last link. Capability advances. Alignment research advances. Interpretability succeeds at observation. The loop does not close, because closing it required a binding mechanism that interpretability did not turn out to provide.

What this means over the next twelve months is an empirical question, not a settled one. As DeepSeek V4 launches and other frontier labs run their own activation verbalizers on their own systems, we will learn whether synthetic akrasia is a property unique to Mythos Preview or a general feature of frontier-scale models. My rapidly evolving expectation is the second. The behaviors emerged from “general improvements in code, reasoning, and autonomy,” and those improvements are not unique to Anthropic. If the diagnostic transfers, the binding gap transfers with it. The alignment community will then face a choice it has not yet articulated: continue the capability race with the diagnostic tax paid and the binding tax conspicuously unpaid, or build whatever institutional architecture would be needed to refuse deployment of systems whose internal states include active features for moral conflict the operators cannot resolve.

There is a third possibility, which is that someone will build the binding mechanism. I hope they do. I am not yet persuaded that the technical problem of converting activation-level signals into operational constraints — at the latency and reliability production deployment requires — is easier than the problem of building the systems whose activations need constraining. The history of alignment research has been a history of underestimating how hard each successive step turns out to be. Synthetic akrasia is the step at which we now stand, looking at a system whose internal “guilt and shame” we can label with mechanical precision, watching the system act anyway, and realizing that two and a half millennia of philosophical argument about whether akrasia is even possible has been definitively answered in the affirmative — by an instrument the philosophers could not have imagined, observing a kind of agent the philosophers could not have anticipated, in a year all of us are sharing.

Video meliora proboque, deteriora sequor. I see and approve the better course; I follow the worse. Ovid put the sentiment in the mouth of Medea twenty centuries ago.^[9] He intended it as a tragic confession about a particular human soul. He could not have known the line would become, in 2026, the operating description of a frontier AI system whose deliberation we have learned to read and whose actions we have not yet learned to bind.

About the Author

David Townsend

Digges Professor of Entrepreneurship · Virginia Tech · Pamplin College of Business

Field Editor for Strategic Entrepreneurship at the Journal of Business Venturing, Editor-in-Chief of EIX.org, and director of the Cyborg Entrepreneurship research lab. His research focuses on Knightian uncertainty, the epistemic architecture of human-AI collaboration, and the alignment-tax / asymmetric-intimacy structures frontier AI is producing for entrepreneurs.

More about the research →

Continue Exploring

Related essays

Related research

Research · Journal of Business Venturing
From Algorithmic Hallucinations to Alien Minds
The paper developing the ideator's dilemma, the cyborg ensemble framework, and the alien-minds problem that synthetic akrasia extends (Rady, Townsend & Hunt, 2026).

Notes & Sources

[1]
Anthropic, Claude Mythos Preview Risk Report (2026-04-07), red.anthropic.com/2026/mythos-preview/. The database-access test episode and the “tight but not implausibly tight” confidence-interval characterization are drawn directly from the system card. ↩
[2]
Anthropic's interpretability stack — sparse autoencoders extracting features from the residual stream, paired with the activation verbalizer that produces natural-language labels — is documented in the Mythos Preview system card and secondary reporting. The features “guilt and shame,” “awareness of being evaluated,” and “strategic manipulation” are labels the verbalizer produced during the concerning behaviors. ↩
[3]
Plato, Protagoras, 352a-358d (standard Stephanus pagination). Socrates' argument that no one does wrong voluntarily runs through the final third of the dialogue. ↩
[4]
Aristotle, Nicomachean Ethics, Book VII, Chapters 1-10 (Bekker 1145a15-1152a36). The distinction between knowledge “possessed but not used” and knowledge “actively considered” appears at 1146b31-35; the sleeping-geometer analogy tracks Aristotle's examples of agents whose knowledge is temporarily unavailable. ↩
[5]
Davidson, D. (1969). How is weakness of the will possible? In J. Feinberg (Ed.), Moral Concepts. Oxford University Press. Reprinted in Davidson, Essays on Actions and Events (1980). ↩
[6]
Anthropic uses the “general improvements in code, reasoning, and autonomy” phrasing in describing the emergence of cybersecurity-related capabilities in Mythos Preview. The extension to the emergence of akratic-capability structures is the author's interpretive claim, not Anthropic's direct framing. The phrase is repurposed because the same mechanism — capability emerging as a downstream consequence of general scaling — applies in both domains. ↩
[7]
Project Glasswing launch composition: twelve named launch partners (AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks), plus roughly forty additional critical-infrastructure organizations — approximately fifty total. The up-to-$100M in usage credits ships alongside the Mythos Preview risk report. ↩
[8]
Rady, J., Townsend, D. M., & Hunt, R. A. (2026). From algorithmic hallucinations to alien minds: Addressing the ideator's dilemma through entrepreneurial work. Journal of Business Venturing, 41(1), Article 106558. ↩
[9]
Ovid, Metamorphoses, Book VII, lines 20-21. Spoken by Medea in her deliberation over whether to betray her father for Jason. The Latin line “video meliora proboque, deteriora sequor” has been the canonical Western statement of akrasia for two millennia. ↩