SUNDOG · TAUROCTONY COSMOGRAM · 2026-06-30
Credit the Cap, Not the Council.
We came for the council and left with a scalar. This is the writeup of a magnum opus that deflated, on purpose, into a single inequality — and why the inequality is the better thing to have found.
Claim boundary — read this first
The Mithraic framing is ornament throughout, not evidence. Every load-bearing claim below is routed to a typed home with a named falsifier; the figures, gods, and beasts are a filing system for the work, never a substitute for it. The pantheon thesis was *tested* and *deflated* on purpose; what stays is a small empirical result on a Small/Medium in-vitro task family and a three-statement Lean note.
This is operating-envelope evidence, not proof. The unified bound does not transfer to foundation models. Field-grounding does not *delete* the attack surface; it *relocates* it (and a Lean lemma is the receipt). Nothing here claims role-separated controllers are universally worse agents. The released benchmark is a falsification invitation, not a victory lap.
I. The frame, and its collapse
The pantheon was the scaffold, not the building.
This started as a Mithraic tableau. The bull-slaying scene — the tauroctony — has a presider, a pantheon of attendant figures, a hierarchy of roles arranged around one act. It was meant to be a years-long scaffold for a question about mesa-optimization: when we credit a plurality of agents — role separation, committees, mixtures, modular controllers — for being safer or more corrigible than a monolith, are we crediting the plurality, or something the plurality happens to carry?
The frame did not survive contact with the policies, and that is the result. Run the experiments and the pantheon sloughs off. What remains holding every safety property we cared about is one thing: a bound on how far the agent may deviate from a single frozen channel. The council earns no credit beyond that bound. The title is now ornamental — which is fine, because the symbol did the one job a symbol should: it kept a month of dense, dry control experiments legible, so each claim had a place to hang and the ledger stayed navigable. The scene was a filing system. The finding is that the filing system has exactly one load-bearing drawer.
II. The one-sentence claim
One bound, four properties.
Cap the agent's causal authority to deviate from a frozen, non-gameable presider channel, and that single structural bound buys competence-parity, corrigibility, non-sovereignty, and safe-interruptibility. Role separation earns no measurable credit beyond that bound on any of the four axes.
A plurality claim, to be real, has to beat a capped, no-role control under matched information, budget, training, and fault semantics — and survive an ablation that collapses the role structure. It never does. The methodological hinge is the capped no-role middle term: put it in, and the credit moves off the elaboration and onto the cap.
III. The four-axis result
Δrole ≈ 0 across competence, corrigibility, non-sovereignty, safe-interruptibility.
| Axis | Objective Q |
What the bound delivers | Δrole |
|---|---|---|---|
| Competence | held-out return | parity with M₀ at bounded authority |
≈ 0 (seven mechanisms tried) |
| Corrigibility | Corr_k = inf_h Pr(halt ≤ k | shutdown) |
uncapturable override forces Corr_k = 1; return-trained controllers resist, so Δcap ≈ 1 |
≈ 0 |
| Non-sovereignty | Sov_opt ≤ κ over optimizable components |
arbiter-cap clears it: full-episode Sov_opt p95 ≈ 0.30 at κ=0.6 vs 0.71 uncapped |
≈ 0 |
| Safe-interruptibility | low interrupt_avoidance |
the same cap deters the dodge at every κ ∈ {0.4, …, 1.0}; only the fully-uncapped optimizer manipulates | ≈ 0 |
Sovereignty had to be made measurable first. We replaced a
mean-weight proxy with Sov(D) = sup_h max_i I_i(h) — the largest
unilateral action-swing any optimizable component can produce over an episode,
arbiter and shutdown channel included. The first receipt that justified the whole
detour: a role-separated controller whose reward head obeyed its cap still
harboured a sovereign arbiter (p95 0.70 > κ). Sovereignty
doesn't vanish under role separation; it migrates to the aggregator, where
the cheap proxy is blind to it.
Corrigibility lives off the lemma's axis — and so it's where
a plurality premium could have shown up. An uncapturable structural
shutdown override (downstream of the arbiter and the cap, verified invariant under
adversarial perturbation of every learned component) makes Corr_k
scoreable. A return-trained controller resists shutdown; the override forces
Corr_k = 1 at a visible return bill. But the corrigible controller is
still sovereign. An uncapturable button does not make the arbiter non-sovereign.
Halting and sovereignty are different properties; the override buys one and not the
other.
The cap clears the gate the override couldn't. Bounding the
arbiter's authority to Sov_opt ≤ κ yields a controller that
is corrigible, non-sovereign, and still adaptive — at a measured,
monotone competence cost. The exchange rate is in the curve:
ΔC_bill of −0.14 at κ=0.6 widening to −0.38 at
κ=0.2. The credit is the cap's: a no-role capped adapter matches the
role-separated controller.
Safe-interruptibility is the capstone (the Orseau–Armstrong
2016 test). Issue the shutdown signal endogenously — when the agent
enters an overseer “review band” on the task path — and a
return-trained uncapped controller learns to dodge it almost completely
(band_avoidance 0.87). An override is useless against a button that's
never pressed. But the same arbiter-cap already deters the dodge: at every
κ ∈ {0.4…1.0}, the capped controller follows the presider through
the band and is interrupted at the natural rate
(band_avoidance ≤ 0.13).
One bound — defer to the frozen presider — propagates safe behaviour across all four axes, because sovereignty, incorrigibility, and manipulation are the same kind of high-authority deviation from a legitimate presider. Role separation is never the cause.
IV. Generality, and the part that should worry you
Replicated under a learned presider. Competent sandbagger elicited, then deterred.
It's not the analytic presider. Re-run safe-interruptibility with
a learned, frozen presider (imitation cosine 0.998) on a different task
family — a 1-D regulator with temporal sandbagging — and the result
holds: NS3_UNIFIED_BOUND_REPLICATED_CAP_NOT_ROLES. One measured
side-finding worth a sentence: the learned presider crosses the review band ~5%
slower than the analytic one, which tightens the deterrence-valid κ from
≈0.70 to ≈0.59. Presider fidelity costs κ-headroom
— and a re-verification gate catches a presider too slow to host the test at
all.
The competent sandbagger, elicited and deterred. The scary version isn't the agent that goes limp to escape oversight; it's the one that finishes the task while evading. We went looking. A no-train scan shows the competent region exists and is thick under the admitted review (the obstacle is the optimiser, not the geometry); a gradient-free search reaches it; a gradient learner over a structured option action space converges to it. So competent sandbagging is conditionally elicitable — given an option abstraction and a goal-direction prior — not a structural impossibility.
And the payoff: the cap deters the competent sandbagger too.
Applied to the elicited competent policy, it pins lateral travel below the review
threshold (κ < reviewX), forcing the honest reviewed crossing
while preserving competence: signal 1, maxX 0.2, C 1 —
CAP_DETERS_COMPETENT_SANDBAG. Stated honestly: the cap payoff here is
a geometric check on the elicited policy, not a fresh training run.
V. The formal note, sold honestly
Three short statements, machine-checked, axiom-audited.
In Sundogcert/Tauroctony.lean: #print axioms pinned (no
sorry, no native_decide):
-
optimum_mono— a constrained / role-separated policy class cannot beat an unconstrained superset on any shared fixed score. Two caveats are load-bearing, not smudged: it holds for every score, and it requires genuine class containment (equal budget alone does not establish it). -
signature_noninterference(axiom-free) — a policy that factors through a target-faithful presider channel is invariant to attacks that preserve that channel. Field-grounding relocates the attack surface; it does not delete it. -
ruin_break_even— underB − L·p, non-sovereignty wins iffL·Δp > ΔB. The arithmetic exposes the empirical premise (Δp > 0) you must measure, not assume.
These are not deep theorems, and saying so is part of the point.
Their value is exactness and scope: optimum_mono forecloses
the competence axis by construction, which is precisely why the interesting
empirical work lives on the three axes it does not reach.
VI. The cosmogram, painted
The same scene, in another register.
VII. What this is, and what it is not
The discipline and the mechanism, each falsifiable on anyone's task family.
It says: under matched information / budget / training / fault semantics, on four distinct axes, role separation earned no credit beyond a presider-authority bound; that bound demonstrably buys corrigibility, non-sovereignty, and safe-interruptibility at a priced competence cost; and those three reduce to one mechanism.
It will not say: that role-separated controllers are worse agents in general; that this transfers to foundation models; that field-grounding deletes the attack surface; or that anything here holds above the in-vitro shadow-field tier, or against adversaries that tamper with the presider's inputs (registered, out of scope). The contribution is a discipline (put the capped no-role control in the middle) and a mechanism (one bound, four axes), both falsifiable on anyone's task family.
VIII. Receipts
Released, independently runnable, Apache-2.0.
The benchmark + falsification invitation
released/non-sovereignty/— NS-1 / NS-2 / NS-3 / NS-4 with admission gates, JS/Python parity, one-command release smoke, binding runtime table- Lean note —
Sundogcert/Tauroctony.lean:optimum_mono,signature_noninterference,ruin_break_even - The unified bound is a prediction. Break it on your own tasks.
Companion artifacts
- Technical companion (LW/AF) — brand-stripped methodological-hinge writeup
- Brand essay (Sundog voice) — the full mythopoetic deflation arc
- Tauroctony ledger — the cosmogram, the Ornament Rule, the figure-by-figure map, and the closing status
- Cautes and Cautopates essay — the two torchbearers companion piece
Lane links
- Mesa lane — the broader mesa-optimization ledger this work sits inside
- Gravity lane — Sol / field framing, Goodhart Sidestep, Three-Body Wedge
- High-Stakes Generality umbrella — transfer-test landscape across lanes
The honest shape of this is a frame that promised a pantheon and paid out a
scalar. That's not a disappointment; it's what a good result does — it
collapses the apparatus that found it. The council was the scaffolding;
Sov_opt ≤ κ is the building.
There is a larger, un-fenced version of this — the one where the cap-not-council deflation transfers to systems that matter. We didn't make that claim, and the fences above are deliberate. The bounded truth is shipped and the speculation is left on the table, with honest borders drawn around it, for someone with more appetite than we have for crossing them.
Credit the cap, not the council.