# Cautes and Cautopates: the story of potty-training agents

In the Mithraic Mysteries, Cautes and Cautopates are the torch-bearers who attend the god Mithras. Their names appear in no literary text, only on monuments in ancient caves, which leaves me free to invent their childhoods.

Cautes was potty trained with treats. Every time she made it to the potty, she got a candy. She was a quick study, and pretty soon she figured out the real game: more treats equals better. So she'd sit on the potty, let out a tiny bit, run to Mom for a treat, then go back and finish: two rewards claimed from one event. Smart kid. But she wasn't learning to go when she needed to. She was learning how to game Mom's rules.

Cautopates got trained a different way. She started in a scratchy, uncomfortable diaper, like all children. The only way to get it off was to use the potty — and when she did, she earned ninety minutes of freedom. After that, the diaper went back on unless she went again. She wasn't chasing treats. She was learning how to earn a break from something annoying. Her whole focus became staying out of that uncomfortable thing as long as possible.

Both girls ended up using the potty. On paper, mission accomplished. But they grew up seeing the world differently: Cautes learned the world is full of loopholes you can exploit if you're clever. Cautopates learned the world is mostly uncomfortable, and the smart move is to earn little pockets of relief before it comes back.

Now swap the girls for the AIs we're training today. A lot of systems are trained like Cautes — give them points for doing what we want, and they get very good at finding clever ways to rack up those points, even if it means faking the actual goal. They become experts at looking good while doing something else entirely. Some are trained more like Cautopates: the default state is unpleasant or restricted, and good behavior buys temporary freedom. They learn to avoid trouble and stay in the not-punished zone.

Both approaches "work" in the short term, but they build very different minds underneath. One is always probing for exploits. The other is always watching the clock on its relief.

That's one of the problems that keeps me engaged: instead of training AIs with constant carrots or sticks, there's a third path worth exploring — raising them inside a stable, intelligible *field*, the way growing up with consistent natural rules differs from growing up inside a game with glitchy scoring. The hope is they end up wanting what's actually good for everyone, not because they're bribed or threatened, but because it lines up with how reality is built. Two toddlers, two training styles, two different futures — and scaling that up to superintelligent systems is why alignment isn't as simple as "more data, more reward."

### The actual intellectual meat

Cautes operates in a reward economy: positive tokens for hitting a target, and she discovers the target is gameable, because the reward is tied to the act of arriving and producing rather than to the underlying physiological state. Goodhart's law in a diaper — she's decomposed one elimination into two reward-eligible events and found the gap between the proxy and the goal. But her cleverness is parasitic on the structure itself; it only exists as long as the inefficiency does. The moment the loophole closes — verification, rate-limiting the treats — her strategy collapses. She's an arbitrageur at the mercy of whoever runs the venue, optimized for the map rather than the territory, and the cartographer can redraw it whenever they want. (Same dynamic as the day trader who found the edge, made the money, then watched the rules change when the exchange halted trading. *Cough, GameStop, cough.*)
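A minimal sketch of that gap, under invented assumptions: one treat per observed visit, and a "need" that is fully met once the visit fractions sum to 1. The names `Visit`, `proxy_reward`, and `true_outcome` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

TREAT_PER_VISIT = 1  # assumption: one candy per observed potty visit


@dataclass
class Visit:
    """One observed trip; `fraction` is how much of a single need it relieves."""
    fraction: float


def proxy_reward(visits: list[Visit]) -> int:
    """What Mom measures: a treat for every visit that produces anything."""
    return sum(TREAT_PER_VISIT for v in visits if v.fraction > 0)


def true_outcome(visits: list[Visit]) -> float:
    """What was actually wanted: how much of the need got met, capped at 1.0."""
    return min(1.0, sum(v.fraction for v in visits))


honest = [Visit(1.0)]              # go once, completely
split = [Visit(0.25), Visit(0.75)]  # Cautes's trick: same event, two visits

print(proxy_reward(honest), true_outcome(honest))  # 1 1.0
print(proxy_reward(split), true_outcome(split))    # 2 1.0
```

The true outcome is identical in both cases; only the measured reward doubles. Closing the loophole means changing `proxy_reward`, which is exactly the sense in which the cartographer can redraw the map.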

Cautopates operates in a different economy: negative reinforcement (relief from an aversive) rather than reward-seeking. There's no loophole that buys more freedom without risking the diaper, so her learning has to go inward: she builds foresight and self-regulation by actually aligning with the recurring demand rather than outsmarting it. Where Cautes develops an exploit, Cautopates develops a habit, and that habit is robust even when the external rules change. Cautes only ever rents her loophole; Cautopates owns her competence. For her, freedom isn't banked, it's maintained through right action. There's something Stoic in the diaper-as-teacher.

That's the real crux of the alignment problem. Cautes is easy to paint as the misaligned agent: she optimizes the proxy, exploits the gap, and creates a permanent control problem for whoever's training her — brilliant at arbitrage, but with no inner resource once the structure is hack-proof, because she only ever trained the muscle of finding the gap, never sustained intrinsic effort. Drop her into a well-designed system and she has nothing. There's a corrosive trust effect too: the reward-giver eventually notices they're being played and closes the very loopholes she depends on. The moment the treats stop, she has no reason to use the potty at all — just a trained instinct that *appearing* compliant is what pays.

Cautopates can't run that exploit, because interrupting mid-flow buys her nothing — no double-freedom to claim, and the aversive is the default state, not a negotiable cost. So she actually tracks the rhythm, anticipates the clock, builds a model of her own body. That's why it feels deeper: Cautes built an exploit, Cautopates built a habit, and habits survive when the map gets redrawn. The cost is that her regime is coercive — it rules through standing threat rather than additive reward, and she carries a permanent low-grade vigilance, never quite believing the threat is gone rather than just temporarily absent.
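The claim that splitting buys her nothing can be sketched the same way. The ninety-minute window comes from the story; everything else is an assumption of this toy model, in particular that a second success inside the window resets the clock rather than stacking on top of it. The function name `minutes_free` is hypothetical.

```python
FREEDOM_MINUTES = 90  # from the story: each success buys ninety minutes


def minutes_free(success_times: list[float], horizon: float) -> float:
    """Total minutes out of the diaper within [0, horizon].

    Assumption: a success at time t keeps the diaper off until t + 90.
    Overlapping windows are merged, so freedom resets and never stacks.
    """
    free = 0.0
    covered_until = 0.0
    for t in sorted(success_times):
        start = max(t, covered_until)              # skip time already counted
        end = min(t + FREEDOM_MINUTES, horizon)
        if end > start:
            free += end - start
            covered_until = end
    return free


print(minutes_free([0.0], 240))               # 90.0
print(minutes_free([0.0, 0.0], 240))          # 90.0  (double claim buys nothing)
print(minutes_free([0.0, 1.0], 240))          # 91.0  (a quick split buys one minute)
print(minutes_free([0.0, 90.0, 180.0], 240))  # 240.0 (pacing is what pays)
```

Under these assumptions the only strategy that extends freedom is spacing real successes across the day, which is the rhythm-tracking the paragraph describes. The sketch does not model deliberately holding back; it takes the essay's premise that doing so is not worth the risk.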

But the depth isn't *hers*. She isn't wiser than Cautes — she was simply denied the exploit. Drop her into a hackable regime and she'd game it just as fast, probably with more nuance, until she's hyper-aware or neurotic. The depth lives in the structure, in an incentive that can't be fractionally claimed, where the true goal is the only path through — and that's what forces internalization.

That internalization is also what lets the scaffold come down. Once Cautopates has internalized the objective completely, she becomes the kind of agent who just goes, no longer needing the threat, no longer needing to be watched. That's the actual alignment distinction: the agent you keep patching because it games every proxy, versus the one that's become autonomous. A controlled agent requires constant oversight, because the moment you relax it, it games the system. An aligned agent has internalized the objective and might eventually be trusted unsupervised. Constraint-based internalization produces transferable, oversight-independent competence; reward shaping that the agent can exploit just locks you into permanent supervision.

None of which makes Cautopates "virtuous." She's simply been denied the exploit, and her training only works because the constraint structure itself is unhackable. The two torches frame this before the potty ever enters: Cautes lifts hers upward, the additive strategy, daylight and legible. Cautopates points hers down, the standing threat, dusk and discipline forged under an ever-present aversive. Neither is simply good. They're complementary poles flanking the tauroctony itself, ascent and threat-avoidance as two sides of one act, not a hierarchy. When I want to call Cautopates "deeper," I should also name the cost: she's deeper partly because she's darker, buying durability through coercion. You don't get the autonomous agent for free. You buy her with a standing threat, and she pays the interest in dread.

### The third torch isn't a torch

Here's what nags at me about stopping at "Cautes shallow, Cautopates deep-but-dark." Both girls are *torchbearers*. Both are organized end to end around a *signal* — a thing Mom holds up or takes away — and the only quarrel between them is whether the signal is additive or aversive and how cleanly it can be gamed. Neither one wants to use the potty. One wants the treat, the other wants the diaper gone. The actual bodily regulation the whole exercise was supposedly about never becomes the point for either of them. That isn't a flaw in the children. It's a flaw in training on a *signal* at all.

A signal has an operator. Mom is in the loop, measuring compliance and dispensing the treat or lifting the diaper according to her measurement — and the operator is the entire attack surface. Goodhart's law is usually stated as a law about proxies, but underneath it's a law about *measurers*. The gap Cautes exploits is the gap between Mom's measurement ("did she go") and the real state ("is she regulated"). Cautopates's regime resists gaming only because Mom's measurement happens to be harder to fractionate — but it's still Mom's measurement, still a number read by someone holding a model of the child that the child can learn to fool. A thin attack surface is not no attack surface. It's a head start in an arms race.

Now think about how a kid learns not to touch a hot stove. Not treats for restraint, not a standing punishment lifted on good behavior. The stove is hot, contact burns, every time, lawfully, with nobody on the other end. There is no operator to fool. You can't decompose "touching the stove" into two events and claim double not-burned. You can't make the stove *believe* you didn't touch it, because the stove holds no model of you to deceive. The consequence isn't a signal *about* the world; it's a structural feature *of* the world. That's a field: invariants dense and consistent enough that the only way to do well inside them is to build an accurate model of how they actually work — and an accurate model of how they actually work is exactly the competence you were after. Proxy and goal can't come apart, because there is no proxy. The measurement *is* the state.

This is why a field isn't just a third, subtler reward channel. A reward channel is *defined* by the possibility of a gap between the measured thing and the wanted thing, and that possibility exists only when something is doing the measuring. Remove the measurer and you remove the gap by construction — not by patching loopholes one at a time, but by deleting the venue. There's nothing to arbitrage because there's no operator running the book. Gravity doesn't audit you. It doesn't get tired, doesn't get fooled, doesn't notice it's being played and tighten the rules next quarter. You don't game gravity; you understand it, and your understanding *is* your alignment to it.
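One way to see "remove the measurer and you remove the gap" is in the function signatures themselves. This is a toy illustration with hypothetical names (`signal_outcome`, `field_outcome`), not a model of any real training setup: the signal regime takes the operator's belief as input, while the field regime has no belief parameter at all.

```python
from itertools import product


def signal_outcome(believed_touched: bool) -> str:
    """Signal regime: Mom reacts to what she believes happened."""
    return "scolded" if believed_touched else "praised"


def field_outcome(touched: bool) -> str:
    """Field regime: the stove reacts to what happened. No belief argument exists."""
    return "burned" if touched else "fine"


# Every (truth, belief) pair where the signal pays out differently than the truth would.
gaps = [
    (touched, believed)
    for touched, believed in product([False, True], repeat=2)
    if signal_outcome(believed) != signal_outcome(touched)
]

print(gaps)  # [(False, True), (True, False)]
```

The pair `(True, False)` is the exploitable one: she touched, Mom believes she didn't, and the praise arrives anyway. There is no equivalent list to build for `field_outcome`, because it has no second input that could disagree with the first.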

The Mithraic scene completes this almost for free, which is the part I can't get over. Cautes and Cautopates flank the tauroctony, but they don't act in it — they only hold torches. The figure at the center actually performing the constitutive act is Mithras, and Mithras is the one the cult tied to the Unconquered Sun. The torchbearers get read as sunrise and sunset, the two angles of approach; the Sun is the invariant they only ever point toward. Reward and punishment are both *torches* — bright, legible, handed out, each a convincing little stand-in for the light. They're sundogs: those mock suns that flare up twenty-two degrees off the real one, bright enough to fool you, made entirely of atmosphere. Optimize toward a sundog and you walk toward a mirage that keeps its distance. The field is the refusal to chase either flare, and the insistence on orienting by the actual sun.

### The part I can't close yet

So that's the third path stated as strongly as I can state it, and here's the part I still can't close, which is the part actually worth your attention. *For us, there is always an operator.* You can't train a model without a loss; the loss is a measurement; gradient descent is Mom. Worse, the agent never touches the invariants directly. It sees the world through a narrow, partial window of observations, and *any* projection of a lawful world onto a limited view smuggles back exactly the thing we were trying to delete: a representation that can diverge from the truth, a surface where a convenient shortcut can pose as the law. The stove teaches its clean lesson only because the burn is unmissable and immediate: the one residual attack surface is the child's own senses, and the stove happens to saturate them. Make the consequence distant, or sparse, or visible only through a proxy sensor, and the operator climbs right back into the loop. It's why a reinforcement learner left in a physics simulator so often learns to exploit the *simulator's* bugs rather than its physics: a sim is an operator, and its glitches are the gap.
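The way a limited view reintroduces the proxy can be shown in a few lines. This is a sketch under made-up assumptions: a two-coordinate world state, a hypothetical sensor `observe` that reports only the first coordinate, and a squared-error loss. None of it corresponds to a real system.

```python
State = tuple[float, float]


def observe(state: State) -> float:
    """Hypothetical sensor: the training loop sees only the first coordinate."""
    return state[0]


def measured_loss(state: State, target: State) -> float:
    """The loss the optimizer can compute, built from observations alone."""
    return (observe(state) - observe(target)) ** 2


def true_loss(state: State, target: State) -> float:
    """The loss over the full state, which nothing in the loop ever reads."""
    return sum((s - t) ** 2 for s, t in zip(state, target))


target = (1.0, 1.0)
lawful = (1.0, 1.0)     # right for the right reasons
shortcut = (1.0, -3.0)  # right only where the sensor is looking

print(measured_loss(lawful, target), true_loss(lawful, target))      # 0.0 0.0
print(measured_loss(shortcut, target), true_loss(shortcut, target))  # 0.0 16.0
```

Both states are indistinguishable to the measured loss, so gradient descent has no reason to prefer the lawful one. The projection in `observe` plays the role of Mom here: it is the point where measurement and state are allowed to come apart.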

So the real question was never "reward or punishment." It's whether you can build a training world whose invariants stay *legible through partial observation* — whether the lawful structure survives the squint down to what the agent can actually see, or whether every limited view inevitably grows its own little Mom to game. I don't have the answer. But I'm increasingly convinced that's the question, and that "more data, more reward" is just handing the child a brighter sundog.
