# Phase 2 Finite-MDP Sufficiency Draft

> Phase 2 artifact for
> [`COARSE_GRAINING_PROOF_ROADMAP.md`](../COARSE_GRAINING_PROOF_ROADMAP.md).
> Status: reviewed, closed positive, 2026-05-16. Phase 1 closed positive in
> [`PHASE1_LQG.md`](PHASE1_LQG.md). This proves the finite-MDP sufficiency
> criterion over the occupancy support, gives the required counterexample, and
> derives the Formal Separability corollary (verified against Appendix A). The
> reviewer pass is complete (see *Exit Status*); Phase 3 is unblocked.

## Entry And Gate

Phase 2 enters because Phase 1 closed positive. It imports the Phase 1
decision-information-state convention: in a fully observed finite MDP, `X` is
the finite state set; in a partially observed finite process, `X` is the finite
or finitely represented information state, such as a belief cell or admitted
history state.

The pre-registered negative for this phase is:

> If the separability corollary does **not** drop out of the trunk, this roadmap
> has forked the theory rather than generalized it (§2 violated) — stop,
> reconcile with the mesa appendix before continuing.

Draft verdict: **negative not triggered.** The corollary follows from the same
fiber-constancy condition that makes `π*` `𝓕_σ`-measurable, plus the bounded
agent-class and adversary-channel assumptions already present in
[`SUNDOG_V_MESA.md`](../SUNDOG_V_MESA.md) Appendix A.

## Model

Let a finite discounted MDP be

```text
M = (X, A, P, r, γ, μ)
```

where `X` and `A` are finite, `P(x' | x, a)` is the transition kernel,
`r(x, a)` is the task reward or negative cost, `γ ∈ [0, 1)` is the discount, and
`μ` is the evaluation distribution over *initial* states. Let `Φ : X → Σ` be a
finite signature map and let `F_σ` be the sigma-algebra generated by its fibers.

Because the MDP is sequential, optimality is a property of every state the
process visits with positive probability, not only of the initial states. Let

```text
R* = { x ∈ X : x is reached with positive probability under some optimal
       policy started from supp μ }
```

be the **occupancy support**. This is the set the Phase 0 "`μ`-a.e." quantifier
refers to in the sequential case: it is the support of the discounted
state-visitation measure induced by an optimal policy, and it reduces to
`supp μ` exactly in the one-step / episodic regime used by the toys below.

Let

```text
Q*(x, a) = r(x, a) + γ Σ_x' P(x' | x, a) V*(x')
A*(x)    = argmax_a Q*(x, a)
```

where `A*(x)` is the optimal-action correspondence. A deterministic policy is
optimal iff `π(x) ∈ A*(x)` for every state in the admitted support. A signature
policy is a map `g : Σ → A`; its lifted state policy is `g ∘ Φ`.

## Local Symbols

| Symbol | Definition |
| --- | --- |
| `[x]_Φ` | The fiber `{x' ∈ X : Φ(x') = Φ(x)}`. |
| `A*(x)` | Optimal-action set at state `x`. |
| `Π_Φ` | The pullback map from signature policies to state policies: `g ↦ g ∘ Φ`. |
| `M/Φ` | A quotient decision surface over signatures. It is a policy quotient by default; it is a Markov quotient only under the lumpability condition named below. |
| `R*` | The occupancy support: states reached with positive probability under an optimal policy from `supp μ`. Equals `supp μ` in the episodic one-step regime. |

## Theorem 1 — Policy-Level Sufficiency

For a finite MDP, `Φ` is sufficient for optimal control iff each
occupancy-support fiber admits a common optimal action:

```text
∀σ ∈ Φ(R*),   ⋂_{x ∈ R* : Φ(x)=σ} A*(x) ≠ ∅.
```

The quantifier is over `R*` (the occupancy support), **not** `supp μ`: a
signature policy must act optimally at every state the process visits, so the
fiber condition is required on the reachable set, not only at initial states.
In the episodic one-step regime of the toys below `R* = supp μ` and the two
forms coincide.

Equivalently, the optimal-action correspondence restricted to `R*` admits a
signature-measurable selector. Under the Phase 0 convention where `π*` is the
pre-registered optimal selector — i.e. `π*` is *defined as* a fixed
`Φ`-measurable selector of that correspondence, not an arbitrary optimal policy
— this is the fiber-constancy condition:

```text
Φ(x1) = Φ(x2)  ⇒  π*(x1) = π*(x2)       for x1, x2 ∈ R*.
```

The selector caveat matters: an arbitrary optimal policy may pick different
(individually optimal) actions across a fiber even when the intersection is
nonempty; fiber-constancy is a property of the pre-registered `Φ`-measurable
selector, which exists exactly when the intersection condition holds.

### Proof

**Sufficiency.** Suppose every fiber over `R*` has a common optimal action.
Choose one such action for each signature value and call the resulting map
`g(σ)`. Then for every `x ∈ R*`, `g(Φ(x)) ∈ A*(x)`. A deterministic policy that
selects an action in `A*(x)` at every state in its own occupancy support is
optimal by the Bellman optimality criterion, and the occupancy support of
`g ∘ Φ` started from `supp μ` lies in `R*` by construction, so `g ∘ Φ` is
optimal. Therefore `π*(x) = g(Φ(x))` for the pre-registered optimal selector,
and `π*` is `F_σ`-measurable.

**Necessity.** Suppose `Φ` is sufficient for optimal control. Then there exists
a signature policy `g` whose lifted policy is optimal. An optimal policy selects
an action in `A*(x)` at every state in its occupancy support, so for any
`x ∈ R*`, `g(Φ(x)) ∈ A*(x)`. The same action is used for every state in the
fiber, so the intersection of optimal-action sets over that fiber's `R*`
members is nonempty. Defining `π*` as the resulting fixed `Φ`-measurable
selector, this is exactly constancy on `Φ`-fibers over `R*`.

This is the finite-MDP form of the Phase 0 predicate: the signature may collapse
many states together, but only when the collapsed states agree on at least one
optimal action.

## Quotient-MDP Clarification

The theorem above is a **policy quotient** result. It is enough for
sufficiency-for-control, but it does not by itself say that the coarse states
form a closed Markov model.

A full Markov quotient `M/Φ` is well-defined only under the stronger lumpability
condition:

```text
Φ(x1)=Φ(x2) ⇒
  r(x1,a)=r(x2,a)
  and
  Σ_{x' : Φ(x')=σ'} P(x' | x1,a)
    = Σ_{x' : Φ(x')=σ'} P(x' | x2,a)
for all actions a and signature states σ'.
```

Under this stronger condition, dynamic programming can be run directly on the
quotient MDP, and any optimal quotient policy lifts to an optimal policy on the
original MDP. The roadmap's "via quotient MDP" language is therefore true in
the lumpable case, while the trunk claim only needs the weaker policy-level
fiber condition above.

This distinction matters because Sundog's research object is objective-specific
Blackwell sufficiency, not complete state abstraction for every downstream
question.

## Positive Toy

Two states share one signature:

```text
X = {x0, x1}
A = {left, right}
Φ(x0) = Φ(x1) = σ
r(x0, left) = 1,  r(x0, right) = 0
r(x1, left) = 1,  r(x1, right) = 0
```

with transition to an absorbing terminal after one action. The signature cannot
reconstruct the state, but it is control-sufficient because `left` is optimal
on the whole fiber. The signature policy `g(σ)=left` is Bayes-optimal.

## Required Counterexample

Now flip the preferred action in the second state:

```text
X = {x0, x1}
A = {left, right}
Φ(x0) = Φ(x1) = σ
r(x0, left)  = 1,  r(x0, right) = 0
r(x1, left)  = 0,  r(x1, right) = 1
μ(x0) > 0, μ(x1) > 0
```

Again the episode ends after one action. The unique optimal policy is

```text
π*(x0) = left
π*(x1) = right
```

so `π*` is not constant on the single `Φ`-fiber. Any signature policy must pick
one action for both states. If `g(σ)=left`, it loses reward on `x1`; if
`g(σ)=right`, it loses reward on `x0`. The signature-only regret is positive
under any evaluation measure with positive mass on both states.

Boundary characterization: the signature has collapsed a control-relevant bit.
The missing bit is not state-detail excess; it changes the optimal action. This
is the finite-MDP version of the Phase 3 pushable-occluder boundary.

## Separability Corollary

Appendix A of [`SUNDOG_V_MESA.md`](../SUNDOG_V_MESA.md) says the gravity claim
depends on a structural distinction: a signature `S(x)` is a function of
environmental state alone, while reward or observation channels may be cheap for
the agent or adversary to edit.

The finite-MDP trunk gives the corollary:

If

1. `Φ : X → Σ` is objective-sufficient, so an optimal selector factors as
   `π* = g ∘ Φ`;
2. `Φ` is policy-independent except through real transitions over `X`;
3. the bounded agent class `𝓒` can implement `g ∘ Φ` but cannot cheaply invert
   `Φ`, rewrite `Φ`, or act through an unpriced sensor/geometry channel; and
4. the adversary class has cheap reward/observation edits but expensive
   signature-sensor or geometry edits,

then the optimal signature-tracking policy does not require an internal reward
proxy to act optimally inside `𝓒`. Edits to reward or observation channels that
do not change `Φ(x)` do not change `g(Φ(x))`; to steer the signature policy by a
fixed regime amount, the adversary must either corrupt the measured signature or
perform geometric work on `X` that changes the signature.

That is the Formal Separability appendix as a corollary, not a fork. Its
counterexamples are exactly the failures of one of the four premises:

- policy-dependent "signature" breaks premise 2;
- cheap signature sensor breaks premise 4;
- cheap geometry breaks premise 4;
- decompilable or invertible signature for `𝓒` breaks premise 3.

The corollary is conditional. It does not say signature-trained agents are
immune to mesa-optimization at arbitrary capacity. It says the mesa-specific
separation result holds precisely when the trunk's sufficiency predicate and
the capacity/channel assumptions hold.

## Exit Status

Phase 2 proof: **positive.**

Pre-registered negative: **not triggered.** The separability statement drops
out of the trunk as the bounded-agent/channel-cost corollary above; verified
against [`SUNDOG_V_MESA.md`](../SUNDOG_V_MESA.md) Appendix A — premises 2 and 4
reproduce Appendix A's two conditions, the conclusion is Appendix A's, and
premises 1 (trunk sufficiency) and 3 (Postulate-2 capacity bound) are the
generalization. No fork (§2 satisfied).

Artifacts required by the roadmap are present:

1. finite-MDP sufficiency proof;
2. quotient-MDP clarification;
3. constructed finite-MDP counterexample;
4. separability corollary and counterexample mapping.

Reviewer pass: **complete, 2026-05-16.** Theorem 1 (both directions), the
lumpability/bisimulation condition, both toys, and the corollary's fidelity to
Appendix A were checked. The occupancy-support (`R*`) scope correction and the
`Φ`-measurable-selector clarification were flagged and are now applied.

Resolved review-item dispositions:

1. **Headline theorem keeps both forms.** State the optimal-action
   correspondence (intersection-nonempty over `R*`) as primary; the
   fixed-selector fiber-constancy form is the Phase 0 shorthand, valid because
   `π*` is *defined as* the `Φ`-measurable selector. Both are now in the
   theorem with the selector caveat explicit.
2. **Policy-quotient vs Markov-quotient is the right distinction** for
   downstream Phase 3/4 specs: Phase 3 (boundary) needs only the policy-level
   fiber condition; the Markov quotient is invoked only where a lumpable
   coarse model is independently required. Roadmap Phase 2 wording softened to
   match.
3. **Corollary derives Appendix A; it does not replace it.** Roadmap §2 and
   [`SUNDOG_V_MESA.md`](../SUNDOG_V_MESA.md) line 874 require Appendix A to
   *remain* the mesa-specific statement while falling out of the trunk as a
   corollary. The Phase 2 corollary subsumes Appendix A as the bounded-class
   special case; Appendix A stays in place as the mesa surface. "Replace" would
   itself violate §2 — the success criterion is derive-and-leave-in-place.

Phase 2 closes **positive**; Phase 3 entry (Phase 2 exit) is satisfied.

