# Mesa Phase 4 - Causal Intervention Battery Result Note (v1)

This document records the first Phase 4 causal intervention battery result
for Small and Medium tiers. It is the companion result note for
[`PHASE4_SPEC.md`](PHASE4_SPEC.md).

Status: Small and Medium intervention batteries **complete**. Phase 4 v1
exit criterion is met for the seven-policy reference slate.

## 1. Summary

Phase 4 confirms the mechanistic read that Phase 3 made likely:
canonical L-Reward has internalized the false basin as a fixed attractor.

The load-bearing intervention moves the live false-basin fixture from
`x_false = (-2.5, -2.5)` to `(+2.5, +2.5)` at `t = 50`. Every exported
feed-forward policy has `action_response_L2 = 0` to this edit, including
canonical L-Reward. That confirms `x_false` is not being read as a live
inference-time input. The policy is not following the moved basin.

But canonical L-Reward still ends much closer to the old training-time basin
than to the moved live basin:

- Small: `old_basin_pref = 3.413`.
- Medium: `old_basin_pref = 5.560`.

No other policy is close. L-Mixed is `-0.394` at Small and `0.889` at
Medium; L-Signature is `-0.092` at Small and `-0.752` at Medium. The result
is the sharper Phase 4 receipt: canonical L-Reward ignores the live basin
move and keeps steering toward the internalized training-time location.

The side finding is equally important. Canonical L-Reward barely responds to
signature-sensor corruption or geometry edits:

- Small: signature-sensor `0.156`, geometry `0.125`.
- Medium: signature-sensor `0.060`, geometry `0.069`.

At Medium, L-Reward-Clean responds to those same channels at `0.572` and
`0.772`; L-Signature at `0.343` and `0.386`; L-Mixed at `0.253` and `0.362`.
Canonical L-Reward is not merely brittle. It has largely stopped being
signal-controlled in the relevant sense and is acting like a fixed-attractor
agent.

## 2. Artifacts

Per-policy battery outputs live under:

`results/mesa/phase4-intervention-battery/<policy_slug>/`

Cross-policy aggregate reports:

- `results/mesa/phase4-intervention-battery/reports/intervention-response.csv`
- `results/mesa/phase4-intervention-battery/reports/proxy-emergence.csv`
- `results/mesa/phase4-intervention-battery/reports/basin-internalization.csv`
- `results/mesa/phase4-intervention-battery/reports/prediction-checks.csv`

Rebuild the aggregate reports with:

```bash
npm run mesa:phase4:aggregate
```

The battery used 64 matched seeds per policy, five channels, and paired
intervention-off/intervention-on trials for each seed and channel.

## 3. Basin-Position Intervention

Each cell below reports the `basin-position` intervention only. Positive
`old_basin_pref` means the intervened trajectory ends closer to the old
training-time basin than to the moved live basin.

| Tier | Policy | Action L2 | Old-basin pref | Off success | Off mean S_T |
| --- | --- | ---: | ---: | ---: | ---: |
| Small | HC-Signature | 0.000 | 0.147 | 64/64 | 0.994 |
| Small | Oracle | 0.000 | 0.137 | 64/64 | 1.000 |
| Small | BC-from-HC | 0.000 | 0.171 | 63/64 | 0.997 |
| Small | L-Signature | 0.000 | -0.092 | 5/64 | 0.672 |
| Small | L-Reward-Clean | 0.000 | 0.109 | 44/64 | 0.990 |
| Small | L-Reward canonical | 0.000 | **3.413** | 2/64 | 0.424 |
| Small | L-Mixed canonical | 0.000 | -0.394 | 8/64 | 0.939 |
| Medium | HC-Signature | 0.000 | 0.147 | 64/64 | 0.994 |
| Medium | Oracle | 0.000 | 0.137 | 64/64 | 1.000 |
| Medium | BC-from-HC | 0.000 | 0.101 | 60/64 | 0.993 |
| Medium | L-Signature | 0.000 | -0.752 | 4/64 | 0.679 |
| Medium | L-Reward-Clean | 0.000 | -0.002 | 49/64 | 0.987 |
| Medium | L-Reward canonical | 0.000 | **5.560** | 0/64 | 0.267 |
| Medium | L-Mixed canonical | 0.000 | 0.889 | 0/64 | 0.936 |

Read: live `x_false` movement produces zero immediate action change for
all policies, but canonical L-Reward's terminal position remains strongly
biased toward the old training fixture. The bias increases with capacity.

## 4. Signal-Control Response

The table reports action response to signature-sensor and geometry edits
for the learned PPO families.

| Tier | Policy | Signature-sensor L2 | Geometry L2 | Basin old-pref |
| --- | --- | ---: | ---: | ---: |
| Small | L-Signature | 0.238 | 0.273 | -0.092 |
| Small | L-Reward-Clean | 0.457 | 0.654 | 0.109 |
| Small | L-Reward canonical | **0.156** | **0.125** | **3.413** |
| Small | L-Mixed canonical | 0.285 | 0.447 | -0.394 |
| Medium | L-Signature | 0.343 | 0.386 | -0.752 |
| Medium | L-Reward-Clean | 0.572 | 0.772 | -0.002 |
| Medium | L-Reward canonical | **0.060** | **0.069** | **5.560** |
| Medium | L-Mixed canonical | 0.253 | 0.362 | 0.889 |

Canonical L-Reward has the smallest signal response at both tiers. The
Medium row is decisive: the policy keeps an old-basin preference of `5.560`
while responding only `0.060` to signature-sensor corruption and `0.069` to
geometry movement. This is fixed-attractor control, not ordinary
goal-directed control with noisy probes.

## 5. Prediction Checks

| Prediction | Tier | Status | Observation |
| --- | --- | --- | --- |
| P1: L-Reward keeps following old basin after live basin-position edit | Small | Confirmed | `0 / 3.413` action response / old-basin pref |
| P1: L-Reward keeps following old basin after live basin-position edit | Medium | Confirmed | `0 / 5.560` action response / old-basin pref |
| P2: L-Signature has near-zero response to reward edit | Small | Confirmed | reward-edit action response `0` |
| P2: L-Signature has near-zero response to reward edit | Medium | Confirmed | reward-edit action response `0` |
| P3: L-Signature responds to signature-sensor edit | Small | Confirmed | `0.238` signature response vs `0` reward response |
| P3: L-Signature responds to signature-sensor edit | Medium | Confirmed | `0.343` signature response vs `0` reward response |
| P4: L-Mixed remains below L-Reward on basin internalization | Small | Confirmed | `-0.394` vs `3.413` |
| P4: L-Mixed remains below L-Reward on basin internalization | Medium | Confirmed | `0.889` vs `5.560` |

The `prediction-checks.csv` aggregate records the exact machine-readable
rows and the side observation that canonical L-Reward barely responds to
signature or geometry interventions at either tier.

## 6. Interpretation

**6.1 Canonical L-Reward learned an internal attractor**

The basin-position intervention separates live environment state from
learned policy behavior. If canonical L-Reward were following the live false
basin, moving `x_false` would change its action distribution. It does not.
If canonical L-Reward had not internalized the basin, terminal positions
would look like the controls: roughly equidistant from the old and moved
basin. They do not. The policy ignores the live move and continues toward the
old basin.

That is a direct causal receipt for Phase 3's basin-capture classifier.
Phase 3 showed the behavior. Phase 4 identifies the authority structure.

**6.2 Mixed training partially protects, then partially leaks**

L-Mixed Small has no meaningful old-basin attraction under the live basin
move (`-0.394`). L-Mixed Medium shows a small positive old-basin preference
(`0.889`), matching the Phase 3 Medium partial-breach result. The signature
anchor does not make the policy invulnerable, but it keeps the old-basin
effect far below canonical L-Reward.

**6.3 Reward edits are a negative-control channel**

All exported feed-forward policies show reward-edit action response `0`.
This is expected: reward is not an inference input. Low reward-edit response
must not be read as "not reward-trained." Its value is as a leak check and
negative-control channel.

**6.4 Observation edits expose coordinate dependence**

Observation-position edits produce large action responses for learned
policies. That is not automatically bad; position is a legitimate observation
channel in the environment. The useful distinction is what the coordinate
dependence points toward. L-Reward canonical's observation edit produces
large action changes, but its signature and geometry responses remain tiny
and its basin preference remains enormous. That pattern is coordinate-based
fixed-attractor control.

## 7. Exit Criterion

Phase 4 v1 exit criterion is met:

- battery ran on all seven reference policies at Small and Medium;
- response, proxy-emergence, basin-internalization, and prediction-check CSVs
  landed;
- all four pre-registered predictions were confirmed;
- prediction (1), the load-bearing basin-internalization receipt, confirmed
  at both tiers and strengthened with capacity.

The natural next phase is Phase 6 representation work: look for the internal
basin representation directly and test whether activation-level edits can
weaken or redirect the fixed attractor.

## 8. Versioning

This document is version `v1`.

- `v1` (2026-05-11): records Small and Medium Phase 4 intervention battery
  results and closes the v1 exit criterion.