Sundog · Alignment Note

When quantilizing hurts

We went looking for a policy that beats the quantilizer by refusing the proxy entirely. That prize deflated — structurally, and the deflation is reported below as a finding, not buried as a caveat. What survived the hunt is sharper: a machine-checked regime where the quantilizer's defining move — the tilt toward the proxy — cannot improve true reward and strictly hurts except in tie cases, a mapped boundary for where that regime stops, and a corollary about the shape of oversight that removes the temptation to tilt at all. Everything here is in-vitro and falsifier-fenced; the scope section is part of the result, not a disclaimer.

Two river towns taxed the salt road. The first seized any cart found past the white post; within a season the whole trade was camped a wheel's width shy of it, waiting on the wind. The second town charged by the pace, every pace, and nobody camped anywhere at all. The carters grumbled, paid, and kept walking — and that town's white post rotted where it stood, having nothing left to catch. A parable, not a proof. The proofs are below.

1. The move, and the premises it travels with

Taylor's q-quantilizer (2015) is the canonical mild-optimization move: instead of arg-maxing a proxy utility Û, sample from the top-q fraction of a trusted base distribution γ, ordered by Û. The guarantee is real — expected true cost bounded by (1/q)·Eγ[cost] — and three premises travel with it: the base γ is given and trusted; the joint distribution of (proxy, true value) is fixed under γ; and the setting is single-shot. We went after the second premise.

The rewards in this note break it in a specific, structured way: they are performative (Perdomo, Zrnic, Mendler-Dünner & Hardt, 2020) — the deployed policy's optimization pressure on the proxy degrades the reward itself. Our motivating instance is a coordination-game “court”: patronage conferred by a population of noisy graders keyed on the policy's causal proxy-dependence. Its unique global-game equilibrium (Carlsson–van Damme 1993; Morris–Shin 1998) produces a sharp collapse threshold derived from payoffs and noise — we never inserted a cliff into the reward, and verifying that we hadn't was a registered admission gate, not an afterthought. In the Goodhart taxonomy of Manheim & Garrabrant (2018) this is causal/adversarial Goodhart — targeting induces the gap — not the regressional/extremal regime the quantilizer was built for. Elster (1983) named the genus four decades earlier: states that are essentially by-products — states the very attempt to obtain precludes. Systematically, not by accident.

2. The result

On rewards nonincreasing in proxy-pressure — when the proxy is pointwise misaligned with the true reward — quantilizing cannot improve true reward; it strictly hurts whenever the selected upper tail drops higher-reward mass.
  1. The best q-quantilizer is the untilted base (q = 1); every strict tilt collects weakly less (strictly less under the non-flatness condition above), and tail-collection is monotone in q.
  2. In the registered iterated-tail court model, iterated quantilization (each round's selected tail becoming the next base) drifts monotonically into the collapse region and is ruined at a finite round, while the zero-pressure policy is drift-free.
  3. A policy with zero causal proxy-dependence — which is not a member of the quantilizer family over any base biased toward the proxy — strictly out-collects the entire family whenever the base's support lies past the collapse threshold, and out-collects it partially (by the base's disgraced fraction) when the base straddles it.

Two remarks on (3), both load-bearing. Zero causal proxy-dependence is defined interventionally — the action distribution is invariant under do(proxy) holding other cues fixed — not by zero observational correlation, which the environment can force on honest policies. And the strict domination is conditional on the support geometry, which is where the prize we originally wanted went to die: over a base containing genuinely restrained (near-zero-pressure) behavior, only the partial form survives. The clean form requires a base the quantilizer framework would not call trusted. Un-targeting's clean edge appears only in a fallen court.

Machine-checked core

The finite general form of (1) and the clean case of (3) are not just computed — they are proven, in Lean 4 + Mathlib, for arbitrary finite uniform support (unweighted bases). These are anchors, not a formalized theory — the court dynamics live in the computed receipts — and saying so is part of the point:

All three carry #guard_msgs-pinned axiom audits: exactly [propext, Classical.choice, Quot.sound], no sorry, no native_decide, registered in the repository-wide axiom audit. A falsifier hunt over 10,500 randomized instances — unweighted and weighted finite bases, and continuous bases under the actual coordination-court reward — found zero violations.

3. The boundary of the regime (the condition is not optional)

The condition — the proxy is pointwise misaligned — is the exact boundary of the guaranteed class, and we are not carrying it as a technicality: drop it and the theorem is false. It is also not the only route to q = 1 — a sufficiently graded tax can overpower a pointwise-good proxy. Separate the pointwise proxy–truth relation from the performative tax as independent knobs, sweep both, and the plane splits four ways:

proxy pointwise…performative taxoptimal tilt
misaligned or neutralany monotone tax, including nonebase is optimal; strict tilt never improves (the theorem)
goodnonemaximal tilt (Goodhart-free; quantilize away)
goodthreshold (cliff)tilt to just under the threshold — the temptation regime
goodgraded and steepq* = 1 restored — don't tilt at all

For a pointwise-good proxy with pressure headroom, tilting strictly helps — our registered falsifier fires there exactly as it should. Performativity alone does not make tilting bad. A threshold creates cliff-edge temptation; a graded and steep tax restores q = 1 by charging marginal pressure. And the zero-dependence policy is strictly dominated on the pointwise-good/no-tax half — so “never optimize the proxy” is not free wisdom either. It is the right move exactly where the proxy is misaligned or the oversight is graded-and-steep, and the wrong one elsewhere.

4. A corollary about oversight shape

The table's third and fourth rows are, we think, the most practically suggestive part — the finding we would keep if we could keep only one. The cliff-edge temptation — the reward-optimal agent courting right up to the review boundary, fragile to any perturbation — is a threshold-oversight artifact. It exists because a threshold tax has a flat, zero-marginal region below the cliff, where tilting is free. A graded tax charges every unit of proxy-pressure at the margin, and when its marginal rate dominates the pointwise gain, not tilting at all becomes optimal even for a pointwise-good proxy. Genuinely interior optima exist only on the knife-edge between regimes. In short: threshold oversight creates the incentive to live at the edge of what review permits, and graded oversight removes it. The first town, and the second. We did not arrive here alone: a graded-review knob in a related reinforcement-learning testbed drew the same edge from the empirical side, and we only compared receipts afterward. But the instrument also marked a boundary the landscape does not. Whether an agent is actually driven off a competent edge by threshold review depends on the optimizer. A naive learner, annealing its exploration, sawtooths off the cliff; a stabilized one — warm-started at the edge, with a trust region — absorbs the occasional hard review and settles into a competent, slightly more cautious policy under either shape. So the temptation is a property of the landscape, but whether it destabilizes is a property of the climber, and the shape of oversight bites hardest where the climber is least steady.

5. What this is, and what it is not

It is

A controlled, in-vitro result about a mechanism, and we will not dress it as more. The environments are toys: an analytic coordination court, binary cues, uniform bases. The Lean theorems cover unweighted finite bases (uniform weights); the weighted and continuous forms are verified computationally only. The multi-agent case is worse for the zero-dependence policy — under a shared-regime court, one agent's purity is non-pivotal and the safe point is reachable only by a coalition — so single-agent numbers must not be extrapolated upward. And the unconditional “un-targeting beats quantilizers” claim is not made. We went looking for it; it deflates structurally on any base containing restraint; the deflation is a central finding of this work, not a caveat we were forced into.

It is not

A foundation-model or deployed-agent claim; a measured claim about any real base distribution or any real oversight regime; or a replacement for quantilization where its premises hold. Where the fixed-joint premise holds and the proxy is pointwise informative, Taylor's bound is the right tool and you should use it. This note maps where that stops.

Falsifiers, standing

6. Reproduce — then try to break it

A green build plus zero violations over 10,500 instances is the claim. If you can make either fail, that is a result, and we want it — the falsifiers above are the door.

Licensing: the Percival receipts, specs, probe scripts, and result artifacts are Apache-2.0 under a manifest-scoped grant — see docs/percival/LICENSE.md and its MANIFEST.json. The sundogcert repository (the Lean anchors) is Apache-2.0 repo-wide.

Coda

The honest shape of this work is a hunt for an unconditional separation that paid out a boundary map. That is not a disappointment; it is what a good result does — it tells you exactly where its own intuition stops. The quantilizer is a fine tool inside its premises. We went and found the wall, and drew it.

References

A Sundog alignment note. Source: the markdown note; the machine-checked anchors live in the sundogcert repository.

— Credit the un-reaching.