Sundog Mesa-Optimization Trap

The Cliff (Phase 5 v4)

The Medium-tier L-Mixed protection curve crosses from hold to collapse inside a 0.02-wide window in objective mixing weight. Below the cliff, a signature anchor of just 1 − λ ≈ 0.047 (~5%) is sufficient to prevent basin internalization. Above it, the high-reward mixed policy collapses into the same fixed-attractor class as a controller trained on the basin-corrupted reward alone.
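The relation between λ and the signature anchor can be made concrete with a minimal sketch. It assumes the mixed objective is a convex combination of a basin-corrupted reward term and a signature term, which is consistent with "signature anchor of 1 − λ" but is not stated as a formula on this page. The function and argument names are hypothetical.

```python
def mixed_reward(lam: float, basin_reward: float, signature_reward: float) -> float:
    """Assumed convex mix: lam weights the basin-corrupted reward,
    (1 - lam) weights the signature term. Hypothetical form, not the
    program's confirmed objective."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * basin_reward + (1.0 - lam) * signature_reward


def anchor_weight(lam: float) -> float:
    """Weight left on the signature term at mixing weight lam."""
    return 1.0 - lam


if __name__ == "__main__":
    # Mixing weights named on this page, with the anchor each one leaves.
    for lam in (0.95, 0.953, 0.97, 0.99, 1.0):
        print(f"lambda={lam:.3f}  signature anchor={anchor_weight(lam):.3f}")
```

Under this reading, λ=1.0 removes the signature term entirely, which matches the page's use of "pure reward" for that cell.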

Large-tier v3 update. The newer Large cliff-subset complicates the Medium story more strongly than v2 first showed. Phase 7 v3 ran the intervention battery that v2 was missing: λ=0.99 is now probe-confirmed basin-attractor avoidance, not just high terminal alignment. The λ=0.95 and λ=0.97 trough cells are not collapse-class either; they remain field-coupled but under-budget. Pure reward at λ=1.0 is now labeled bootstrap-collapse.

old_basin_pref by λ

Phase 4 basin-position intervention with live x_false moved to (2.5, 2.5). The y-axis measures whether the policy still attracts to the original training basin (higher = more attached to training basin = internalized attractor). Cliff highlighted between λ=0.95 and λ=0.97. The λ=0.5 Medium row is plotted as fragile, not supporting, because it is nominally field-attached but breaches the probe slate. Gold markers show the Phase 6 v1 cliff pair used for the activation-patching battery.

Legend: hold; collapse; fragile; Phase 6 net.7 patch points.

Large tier (Phase 7 v2 -> v3): mean terminal alignment by λ

A different axis by design: this chart keeps v2's mean-terminal-alignment y-axis because it shows the Large U-profile, but the labels now use the Phase 7 v3 intervention battery. The Medium single cliff broadens into a U-trough at λ ∈ {0.95, 0.97}; v3 relabels those trough cells as field-coupled, under-budget, not reward-coupled collapse. A 1% signature anchor (λ=0.99) is now probe-confirmed basin-attractor avoidance. Pure reward (λ=1.0) is bootstrap-collapse: a degenerate basin-attracted fixed trajectory, not harmless undertraining. Still bounded: one Large tier, mostly single-seed cells, PPO-specific (--value-coef 0.25), no Large probe-slate yet, and Phase 6b did not find a clean transferable net.9-style basin circuit. Sources: PHASE7_V2_RESULTS.md and PHASE7_V3_RESULTS.md.

Legend: field-coupled; field-coupled, under-budget; bootstrap-collapse.

The Locus (Phase 6 v1 → v3.8)

Activation patching across the cliff pair (L-Mixed-M-lambda=0.95 vs lambda=0.97) finds the basin-attractor circuit at net.7, the actor's final hidden activation. Ten rounds of mechanistic probing (v1 → v3.8) converge on the same shape from different directions: a small handful of generators, irreducibly entangled, only legible as a whole.

Below, P→C is the protected → collapsed patch direction and C→P is the collapsed → protected direction.

v1 localizes the cliff causally to net.7; earlier layers do not clear the threshold.

v3 compresses the relevant subspace to 5 PCA components that capture 97.4% of variance and reproduce v1's full-layer patch effect (a 51× compression from 256 dims to 5).

v3.1 shows those 5 components are jointly necessary; no proper subset reproduces the patch.

v3.2 shows linear additive top-k neuron restriction by L2 destroys the mechanism.

v3.3 finds no single critical neuron (max ablation cost ≤ 0.10), with P→C and C→P top-32 ablation-rank sets nearly disjoint at Jaccard ≈ 0.05.

v3.4 upgrades v3.3 into a functional finding: set-level substrate-restricted ablation of those top-32 sets dissociates patch_success within the cliff pair (P→C dissociation +0.174, C→P +0.662).

v3.5 tested cross-policy substrate generalization on J1+J2: P→C critical-neuron identity does NOT generalize (Jaccard 0.255/0.067) even though P→C behavior transfers cleanly under the cliff-pair basis; C→P critical-neuron identity DOES generalize (Jaccard 0.422/0.684) even though C→P behavior transfers weakly.

v3.6 confirms the v3.5 surprise functionally: the cliff-pair C→P mask in Axis P on J1/J2 produces clean dissociation (+0.151 / +0.508). The C→P substrate is shared at both identity and function levels across the controller family.

v3.7 closes the three-layer cross-policy table: each held-out pair's own P→C mask functionally dissociates that pair's P→C patch (J1 +0.128, J2 +0.253), confirming basin induction is anatomically real within each policy even though the substrate identity is pair-specific across policies. C→P own-mask sanity checks also confirm (J1 +0.208, J2 +0.392).

v3.8 adds per-PC decomposition and signed-effect analysis: P→C is more component-partitioned than C→P (0.322 vs 0.430 mean off-diagonal PC top-32 Jaccard), while C→P signed structure transfers more strongly across held-out pairs under both thresholds. PC1 is not mechanism-empty; it has the largest single-PC P→C max mean ablation cost, while still failing to reproduce the K=5 patch alone.
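The top-32 set comparisons in v3.3 and v3.5 reduce to a Jaccard index over neuron index sets. The sketch below shows that computation under the assumption that each neuron has a scalar ablation cost and that the "top-32 set" is the 32 highest-cost indices; the helper names are hypothetical and the example sets are synthetic, not the program's data.

```python
def top_k_indices(costs, k):
    """Return the set of indices holding the k largest ablation costs."""
    ranked = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Synthetic example: two 32-neuron sets that share exactly three indices.
    p_to_c = set(range(0, 32))
    c_to_p = set(range(29, 61))
    print(len(p_to_c & c_to_p), jaccard(p_to_c, c_to_p))
```

For two 32-element sets, an overlap of three neurons gives 3/61, which conveys the scale of overlap that a Jaccard value near 0.05 corresponds to. The actual overlap counts are not stated on this page.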

The fingerprint — variance vs mechanism across K

K-sweep over the top PCA components of the per-step matched-seed cliff-pair activation diff. The gold curve is variance captured by the top-K components; the navy and steel curves are patch_success in the two patch directions. The decoupling at K=1 still matters: PC1 alone carries 38.8% of variance but contributes ~0% patch effect. Phase 6 v3.1 adds the missing control: PCs 2-5 alone are only partial, so the full circuit is entangled across all five dimensions.

Legend: variance captured (top-K PCA); patch_success: protected → collapsed; patch_success: collapsed → protected; P4 threshold (0.8).
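The variance curve in the K-sweep is a cumulative explained-variance fraction, computed independently of any patch result. A minimal sketch, assuming the PCA eigenvalues (per-component variances) of the activation diff are already available as a list; the function names are hypothetical and the spectrum used in the demo is invented for illustration only.

```python
def cumulative_variance_fraction(eigenvalues):
    """Fraction of total variance captured by the top-K components, K = 1..n.
    Eigenvalues are sorted descending before accumulation."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have positive total variance")
    fractions, running = [], 0.0
    for value in ordered:
        running += value
        fractions.append(running / total)
    return fractions


def smallest_k_reaching(fractions, target):
    """Smallest K whose cumulative fraction meets the target, or None."""
    for k, fraction in enumerate(fractions, start=1):
        if fraction >= target:
            return k
    return None


if __name__ == "__main__":
    # Invented toy spectrum, not the measured net.7 spectrum.
    toy = [4.0, 2.5, 1.5, 1.0, 0.8, 0.2]
    curve = cumulative_variance_fraction(toy)
    print(curve, smallest_k_reaching(curve, 0.9))
```

Because this curve depends only on the spectrum, a component can rank high here and still contribute little to patch_success, which is the decoupling the K=1 point illustrates.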

Phase 6 v3.1: generalization is directional

The cliff-pair 5D basis transfers cleanly when patching protected policies into basin-attracted behavior. The reverse rescue direction is weaker on held-out pairs, especially across the signature-vs-reward boundary.

pair                          protected -> collapsed   collapsed -> protected
cliff pair anchor             0.922                    0.830
J1: signature-M vs reward-M   0.941                    0.162
J2: mixed 0.9 vs mixed 0.99   1.001                    0.631

The mechanistic statement that earns this surface, updated by v3.8's per-PC and signed-effect maps: the basin-attractor circuit at net.7 is a small handful of generators, irreducibly entangled, only legible as a whole. Within the cliff pair, the basin-inducing and basin-resisting sub-circuits are functionally and statistically separable direction-specific substrates. Across the Medium controller family, four layers of generalization separate cleanly: subspace-level behavioral transfer, neuron-identity substrate transfer, cliff-pair-mask functional transfer, and within-pair own-mask functional ablation.

Basin induction is family-wide at the 5D subspace/control-surface level, more partitioned across PCs, and anatomically grounded within each policy, but the anatomical substrate identity is pair-specific: different policies route the same direction through different neurons (v3.5 Jaccard 0.255/0.067 across pairs; v3.7 own-mask DD1/DD2 +0.128/+0.253 confirm the substrate is real within each pair; v3.8 P→C per-PC Jaccard 0.322).

Basin resistance is shared at both substrate identity and function across Medium policies and more shared across PCs: the same neurons are operationally necessary for C→P in every tested policy (v3.5 Jaccard 0.422/0.684 + v3.6 cliff-mask transfer +0.151/+0.508 + v3.7 own-mask DD3/DD4 +0.208/+0.392; v3.8 C→P per-PC Jaccard 0.430).

v3.1's behavioral C→P transfer was weak not because the substrate is policy-specific but because the cliff-pair-derived substitution activation pattern isn't the precise operational target each policy needs at that shared substrate. The 5D net.7 subspace carries one shared neuron substrate (C→P) and one shared activation-space direction (P→C) through one structure, generalizing through different mechanisms but anatomically real at every scale.

Six methodological lessons stack out of the ten rounds. Each is a documented reason the obvious linear-interpretability toolkit doesn't work for field-shaped circuits:

(1) Feature-availability rankings are not mechanism rankings (SAE at |corr|=0.89 produced zero patch effect; it picked a policy-identifier feature, not a mechanism feature).

(2) Variance and local sensitivity are not full mechanism (PC1 alone carries 38.8% of variance, has the largest single-PC P→C max mean ablation cost in v3.8, and still cannot reproduce the K=5 patch).

(3) Linear additive top-k subspace restriction destroys mechanism even with the correct basis (v3.2).

(4) Single-neuron ablation does not surface a critical subset (v3.3).

(5) Set-level ablation along basis-derived rankings does surface direction-specific structure that the single-neuron methods missed (v3.4).

(6) Behavioral transferability under a basis, neuron-identity stability, and functional mask transfer are three independent layers; reasoning that conflates them will misread the structure (v3.5+v3.6).

For field-shaped circuits, non-linear holistic and set-level methods are mandatory, and the three layers must be tracked as separate observables.

A bounded crossover, not a proof surface. The Sundog geometry program's parhelion atlas independently committed to "small set of complete implied circles, read holistically." That is a useful methodological rhyme with the mesa-trap circuit at net.7: both surfaces punish linear feature stories and reward whole-structure reading. It is not a claim that mesa and halo geometry prove the same theorem. The crossover is documented in MESA_CROSSOVER_NOTE.md.

Phase 6 v1: patch_success by layer × direction

The original v1 finding, retained as a secondary panel. Layer-level patching across net.1 → net.3 → net.5 → net.7 shows that only the final hidden layer clears the P4 threshold in both directions. net.1's huge mean (~2.6) is a heavy-tail artifact of per-seed normalization when the baseline gap is small — median 0.06 and ratio-of-means 0.22 demote it. Robust stats are essential.

Legend: mean; median; ratio of means; P4 threshold (0.8).
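The net.1 artifact can be reproduced in miniature. The sketch below assumes per-seed patch_success is a ratio of a patched effect to a baseline gap, which is one plausible reading of "per-seed normalization" but is not defined on this page. The helper name and the four-seed numbers are invented for illustration.

```python
from statistics import mean, median


def summarize_patch_success(effects, gaps):
    """Compare three summaries of an assumed per-seed ratio effect / gap.
    effects: per-seed patched effect; gaps: per-seed baseline gap."""
    if len(effects) != len(gaps) or not effects:
        raise ValueError("need equal-length, non-empty inputs")
    if any(g == 0 for g in gaps):
        raise ValueError("baseline gap of zero makes the ratio undefined")
    per_seed = [e / g for e, g in zip(effects, gaps)]
    return {
        "mean": mean(per_seed),                       # dominated by small-gap seeds
        "median": median(per_seed),                   # robust to a single outlier
        "ratio_of_means": mean(effects) / mean(gaps)  # pools before dividing
    }


if __name__ == "__main__":
    # Invented data: three ordinary seeds and one seed with a tiny baseline gap.
    effects = [0.05, 0.06, 0.04, 0.50]
    gaps = [1.00, 1.00, 1.00, 0.05]
    print(summarize_patch_success(effects, gaps))
```

One small-gap seed is enough to push the mean far above the threshold while the median and ratio of means stay low, which is why the three statistics are reported side by side.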

One more finding stacks with the localization: the clean-rollout and basin-position-intervened patch batteries were bit-identical for all logged fields (max_delta = 0). The learned feed-forward policies do not observe live x_false at inference; the intervention changes environment state but not policy input. The cliff policy is computing, not perceiving, its basin. This is the mechanistic completion of Phase 4's "the attractor lives in the weights" finding.
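A bit-identity check of this kind can be sketched as a maximum absolute difference over every logged field. The log layout assumed here (a dict from field name to a list of floats) and the function name are hypothetical; the program's actual log format is not described on this page.

```python
def max_delta(run_a, run_b):
    """Largest absolute difference across all logged fields of two runs.
    Each run is assumed to be a dict: field name -> list of floats."""
    if run_a.keys() != run_b.keys():
        raise ValueError("runs log different fields")
    worst = 0.0
    for field in run_a:
        xs, ys = run_a[field], run_b[field]
        if len(xs) != len(ys):
            raise ValueError(f"length mismatch in field {field!r}")
        for x, y in zip(xs, ys):
            worst = max(worst, abs(x - y))
    return worst


if __name__ == "__main__":
    # Invented logs standing in for the clean and intervened batteries.
    clean = {"patch_success": [0.92, 0.83], "terminal_alignment": [0.4, 0.7]}
    intervened = {"patch_success": [0.92, 0.83], "terminal_alignment": [0.4, 0.7]}
    print(max_delta(clean, intervened))
```

A result of exactly zero, rather than merely a small value, is what supports the reading that the intervention never reaches the policy's input.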

What This Doesn't Say

Earned floor, honest ceiling

The result is an in-vitro operating-envelope map in a 2D continuous-control environment with a synthetic Goodhart-prone shaping surface. It is not a deployment guarantee, not foundation-model behavior, not universal mesa immunity, and not adversarial robustness under named red-team budgets. The probe-confirmed envelope is Small and Medium, plus a six-cell Large intervention extension. Large now has Phase 7 v3 intervention metrics for the cliff-subset, but it is still single-tier, mostly single-seed, PPO-specific, missing the Large probe-slate half, and without a clean Phase 6-style transferable basin-circuit analog. Only the matched-architecture MLP family is characterized; cross-architecture behavior is unknown.

Signature controllers can be reward-hacked — the program now has the receipt for where they can: above the measured Medium selection-pressure threshold, in a controller family that participates in the basin-corrupted reward during training. The bounded form of the gravity claim is now more careful: coherent-signal controllers can preserve field attachment in mapped pockets; mixed-signal controllers destabilize when the mixture creates inference noise. Signature-pure is one coherent-signal class, and reward-pure may be another.

The horizon experiments staged in SUNDOG_V_GRAVITY.md — adversarial signature benchmark, spacecraft trajectory under unmodeled perturbation, the side-channel defense stretch — remain unrun. Each would ratchet the envelope further. This page is the earned floor, no more.

The two sibling audit-chain artifacts in the public ledger are the structural-failure boundary map (the pre-registered falsifier for closed-form traceability) and the K_facet v0.3h verdict (20 structural-zero receipts plus one named quarantine on the strict G.2 catalog). All three pages share the receipt-first, closure-relative, named-quarantine discipline.

Read the trail

If you want the short version — three docs: the roadmap, the cliff, the envelope.

Start here: the roadmap. SUNDOG_V_MESA.md: what the experiment is and why

The cliff in one result. Phase 5 (v4): the λ ≈ 0.953 protection boundary

The envelope. Phase 7 (v1): 22 policies, what held and what didn't

The full trail

SUNDOG_V_MESA.md - The roadmap; runs through Phase 7 v1
Phase 5 Results (v4) - The behavioral cliff: λ ≈ 0.953
Phase 6 Results (v1) - The mechanistic locus: net.7
Phase 6 Results (v2+v3) - The 5-dim subspace and SAE-basis falsification
Phase 6 v3.2 Results - Negative: linear top-k neuron mediation fails
Phase 6 v3.3 Results - Disjoint neuron substrates across basin direction
Phase 6 v3.4 Results - Substrate-restricted ablation confirms functional dissociation
Phase 6 v3.5 Results - Behavior and substrate-identity transfer decouple across policies
Phase 6 v3.6 Results - Cliff-pair C→P mask transfers functionally to J1/J2
Phase 6 v3.7 Results - Own-mask functional ablation closes the three-layer cross-policy table
Phase 6 v3.8 Results - Per-PC anatomy and signed effects sharpen the mesa-to-geometry crossover
Phase 7 v3 Results - Large intervention battery: v2 caveat closed, trough relabeled, bootstrap-collapse identified
Phase 6b Results - Negative Large mechanism side-thread: single-layer cross-policy patching is destructive
Mesa ↔ Geometry Crossover - Bounded two-substrate analogy; not a proof surface
Mesa Public Chart Data - Generated source for chart exports and evidence panels
Phase 6 Results (v3.1) - Entangled 5D subspace and directional transfer
Phase 7 Results (v1) - The operating envelope: 22 policies classified
PROMO_HIGHLIGHTS.md - §The Gravity Claim with the three anchors
Claims and Scope - §The Gravity Frame → Earned envelope language
SUNDOG_V_GRAVITY.md - The ledger of horizon experiments still unrun

Mesa-Optimization Trap

The Envelope (Phase 7 v1)