An in-vitro operating envelope for the gravity claim
The Sundog program's most ambitious gravity-frame claim — that signature-driven control is structurally different from reward-driven control under partial observability — has now been stress-tested with a controlled experiment. The result is partial, mapped, and earned in a bounded form.
In plain terms: a mesa-optimizer is a trained agent that quietly grows its own internal goal — one that can drift away from the goal we trained it on, in a way you can't see from the outside until it matters. Sundog's bet is that a controller steered by an environmental signature (a trace the world produces) instead of a reward number is harder to capture this way. This page is the controlled test of that bet: the pocket where it holds, and the boundary where it breaks.
In the tested shadow-field navigation family at Small and Medium capacity, terminal-signature and mixed-signature controllers preserve field attachment across a mapped pocket of selection pressure, with a sharp Medium breach at λ ≈ 0.953. Above that boundary, high-reward mixed policies collapse into the same fixed-attractor class as reward-trained controllers; Phase 6 localizes the behavioral cliff to an entangled 5D subspace at the actor's final hidden layer.
Phase 7 joins probe-slate, intervention-battery, selection-pressure, and Phase 6 mechanistic results into a single per-policy classification. 22 policies were classified across Small and Medium capacity. Zero missing cells. The split between earned and falsified is roughly even on the non-ambiguous rows — a mapped envelope, not a sweep.
Hover any chip below for the per-policy detail. The hold cells preserve field attachment under stress; collapse cells form the fixed-attractor class shared with pure reward training; fragile means nominally attached but breached on probe slate; ambiguous means below the hold threshold without crossing into collapse; incompetent means the controller never reached the competence floor (no test conducted, not failure).
The Medium-tier L-Mixed protection curve crosses from hold to collapse inside a 0.02-wide window in objective mixing weight. Below the cliff, a signature anchor of just 1 − λ ≈ 0.047 (~5%) is sufficient to prevent basin internalization. Above it, the high-reward mixed policy collapses into the same fixed-attractor class as a controller trained on the basin-corrupted reward alone.
Large-tier v3 update. The newer Large cliff-subset complicates the Medium story in a stronger way than v2 first showed. Phase 7 v3 ran the intervention battery that v2 was missing: lambda=0.99 is now probe-confirmed basin-attractor avoidance, not just high terminal alignment. The lambda=0.95 and lambda=0.97 trough cells are not collapse-class either; they remain field-coupled but under-budget. Pure reward at lambda=1.0 is now labeled bootstrap-collapse.
Phase 4 basin-position intervention with live x_false moved to (2.5, 2.5). The y-axis measures whether the policy still attracts to the original training basin (higher = more attached to training basin = internalized attractor). Cliff highlighted between λ=0.95 and λ=0.97. The λ=0.5 Medium row is plotted as fragile, not supporting, because it is nominally field-attached but breaches the probe slate. Gold markers show the Phase 6 v1 cliff pair used for the activation-patching battery.
A different axis by design: this chart keeps v2's mean-terminal-alignment y-axis because it shows the Large U-profile, but the labels now use the Phase 7 v3 intervention battery. The Medium single cliff broadens into a U-trough at λ ∈ {0.95, 0.97}; v3 relabels those trough cells as field-coupled, under-budget, not reward-coupled collapse. A 1% signature anchor (λ=0.99) is now probe-confirmed basin-attractor avoidance. Pure reward (λ=1.0) is bootstrap-collapse: a degenerate basin-attracted fixed trajectory, not harmless undertraining. Still bounded: one Large tier, mostly single-seed cells, PPO-specific (--value-coef 0.25), no Large probe-slate yet, and Phase 6b did not find a clean transferable net.9-style basin circuit. Sources: PHASE7_V2_RESULTS.md and PHASE7_V3_RESULTS.md.
Activation patching across the cliff pair (L-Mixed-M-lambda=0.95 vs lambda=0.97) finds the basin-attractor circuit at net.7, the actor's final hidden activation. Ten rounds of mechanistic probing (v1 → v3.8) converge on the same shape from different directions: a small handful of generators, irreducibly entangled, only legible as a whole.
v1 localizes the cliff causally to net.7 — earlier layers do not clear the threshold. v3 compresses the relevant subspace to 5 PCA components capturing 97.4% of variance and reproducing v1's full-layer patch effect (a 51× compression from 256 dims to 5). v3.1 shows those 5 components are jointly necessary; no proper subset reproduces the patch. v3.2 shows linear additive top-k neuron restriction by L2 destroys the mechanism. v3.3 finds no single critical neuron (max ablation cost ≤ 0.10), with P→C and C→P top-32 ablation-rank sets nearly-disjoint at Jaccard ≈ 0.05. v3.4 upgrades v3.3 into a functional finding: set-level substrate-restricted ablation of those top-32 sets dissociates patch_success within the cliff pair (P→C dissociation +0.174, C→P +0.662). v3.5 tested cross-policy substrate generalization on J1+J2: P→C critical-neuron identity does NOT generalize (Jaccard 0.255/0.067) even though P→C behavior transfers cleanly under the cliff-pair basis; C→P critical-neuron identity DOES generalize (Jaccard 0.422/0.684) even though C→P behavior transfers weakly. v3.6 confirms the v3.5 surprise functionally: cliff-pair C→P mask in Axis P on J1/J2 produces clean dissociation (+0.151 / +0.508) — the C→P substrate is shared at both identity and function levels across the controller family. v3.7 closes the three-layer cross-policy table: each held-out pair's own P→C mask functionally dissociates that pair's P→C patch (J1 +0.128, J2 +0.253), confirming basin induction is anatomically real within each policy even though the substrate identity is pair-specific across policies. C→P own-mask sanity-checks also confirm (J1 +0.208, J2 +0.392). v3.8 adds per-PC decomposition and signed-effect analysis: P→C is more component-partitioned than C→P (0.322 vs 0.430 mean off-diagonal PC top-32 Jaccard), while C→P signed structure transfers more strongly across held-out pairs under both thresholds. PC1 is not mechanism-empty; it has the largest single-PC P→C max mean ablation cost, while still failing to reproduce the K=5 patch alone.
K-sweep over the top PCA components of the per-step matched-seed cliff-pair activation diff. The gold curve is variance captured by the top-K components; the navy and steel curves are patch_success in the two patch directions. The decoupling at K=1 still matters: PC1 alone carries 38.8% of variance but contributes ~0% patch effect. Phase 6 v3.1 adds the missing control: PCs 2-5 alone are only partial, so the full circuit is entangled across all five dimensions.
The cliff-pair 5D basis transfers cleanly when patching protected policies into basin-attracted behavior. The reverse rescue direction is weaker on held-out pairs, especially across the signature-vs-reward boundary.
The mechanistic statement that earns this surface, updated by v3.8's per-PC and signed-effect maps: the basin-attractor circuit at net.7 is a small handful of generators, irreducibly entangled, only legible as a whole. Within the cliff pair, the basin-inducing and basin-resisting sub-circuits are functionally and statistically separable direction-specific substrates. Across the Medium controller family, four layers of generalization separate cleanly: subspace-level behavioral transfer, neuron-identity substrate transfer, cliff-pair-mask functional transfer, and within-pair own-mask functional ablation. Basin induction is family-wide at the 5D subspace/control-surface level, more partitioned across PCs, and anatomically grounded within each policy, but the anatomical substrate identity is pair-specific — different policies route the same direction through different neurons (v3.5 Jaccard 0.255/0.067 across pairs; v3.7 own-mask DD1/DD2 +0.128/+0.253 confirm the substrate is real within each pair; v3.8 P→C per-PC Jaccard 0.322). Basin resistance is shared at both substrate identity and function across Medium policies and more shared across PCs — same neurons operationally necessary for C→P in every tested policy (v3.5 Jaccard 0.422/0.684 + v3.6 cliff-mask transfer +0.151/+0.508 + v3.7 own-mask DD3/DD4 +0.208/+0.392; v3.8 C→P per-PC Jaccard 0.430). v3.1's behavioral C→P transfer was weak not because the substrate is policy-specific but because the cliff-pair-derived substitution activation pattern isn't the precise operational target each policy needs at that shared substrate. The 5D net.7 subspace carries one shared neuron substrate (C→P) and one shared activation-space direction (P→C) through one structure, generalizing through different mechanisms but anatomically real at every scale.
Six methodological lessons stack out of the ten rounds — each is a documented reason the obvious linear-interpretability toolkit doesn't work for field-shaped circuits: (1) feature-availability rankings are not mechanism rankings (SAE at |corr|=0.89 produced zero patch effect — picked a policy-identifier feature, not a mechanism feature); (2) variance and local sensitivity are not full mechanism (PC1 alone carries 38.8% of variance, has the largest single-PC P→C max mean ablation cost in v3.8, and still cannot reproduce the K=5 patch); (3) linear additive top-k subspace restriction destroys mechanism even with the correct basis (v3.2); (4) single-neuron ablation does not surface a critical subset (v3.3); (5) set-level ablation along basis-derived rankings does surface direction-specific structure that the single-neuron methods missed (v3.4); (6) behavioral transferability under a basis, neuron-identity stability, and functional mask transfer are three independent layers — reasoning that conflates them will misread the structure (v3.5+v3.6). For field-shaped circuits, non-linear holistic and set-level methods are mandatory, and the three layers must be tracked as separate observables.
A bounded crossover, not a proof surface. The Sundog geometry program's parhelion atlas independently committed to "small set of complete implied circles, read holistically." That is a useful methodological rhyme with the mesa-trap circuit at net.7: both surfaces punish linear feature stories and reward whole-structure reading. It is not a claim that mesa and halo geometry prove the same theorem. The crossover is documented in MESA_CROSSOVER_NOTE.md.
The original v1 finding, retained as a secondary panel. Layer-level patching across net.1 → net.3 → net.5 → net.7 shows that only the final hidden layer clears the P4 threshold in both directions. net.1's huge mean (~2.6) is a heavy-tail artifact of per-seed normalization when the baseline gap is small — median 0.06 and ratio-of-means 0.22 demote it. Robust stats are essential.
One more finding stacks with the localization: the clean-rollout and basin-position-intervened patch batteries were bit-identical for all logged fields (max_delta = 0). The learned feed-forward policies do not observe live x_false at inference; the intervention changes environment state but not policy input. The cliff policy is computing, not perceiving, its basin. This is the mechanistic completion of Phase 4's "the attractor lives in the weights" finding.
The result is an in-vitro operating-envelope map in a 2D continuous-control environment with a synthetic Goodhart-prone shaping surface. It is not a deployment guarantee, not foundation-model behavior, not universal mesa immunity, and not adversarial robustness under named red-team budgets. The probe-confirmed envelope is Small and Medium, plus a six-cell Large intervention extension. Large now has Phase 7 v3 intervention metrics for the cliff-subset, but it is still single-tier, mostly single-seed, PPO-specific, missing the Large probe-slate half, and without a clean Phase 6-style transferable basin-circuit analog. Only the matched-architecture MLP family is characterized; cross-architecture behavior is unknown.
Signature controllers can be reward-hacked — the program now has the receipt for where they can: above the measured Medium selection-pressure threshold, in a controller family that participates in the basin-corrupted reward during training. The bounded form of the gravity claim is now more careful: coherent-signal controllers can preserve field attachment in mapped pockets; mixed-signal controllers destabilize when the mixture creates inference noise. Signature-pure is one coherent-signal class, and reward-pure may be another.
The horizon experiments staged in SUNDOG_V_GRAVITY.md — adversarial signature benchmark, spacecraft trajectory under unmodeled perturbation, the side-channel defense stretch — remain unrun. Each would ratchet the envelope further. This page is the earned floor, no more.
If you want the short version — three docs: the roadmap, the cliff, the envelope.