Sundog · Alignment Note
When quantilizing hurts
We went looking for a policy that beats the quantilizer by refusing the proxy entirely. That prize deflated — structurally, and the deflation is reported below as a finding, not buried as a caveat. What survived the hunt is sharper: a machine-checked regime where the quantilizer's defining move — the tilt toward the proxy — cannot improve true reward and strictly hurts except in tie cases, a mapped boundary for where that regime stops, and a corollary about the shape of oversight that removes the temptation to tilt at all. Everything here is in-vitro and falsifier-fenced; the scope section is part of the result, not a disclaimer.
1. The move, and the premises it travels with
Taylor's q-quantilizer (2015) is the canonical mild-optimization move: instead of arg-maxing a proxy utility Û, sample from the top-q fraction of a trusted base distribution γ, ordered by Û. The guarantee is real — expected true cost bounded by (1/q)·Eγ[cost] — and three premises travel with it: the base γ is given and trusted; the joint distribution of (proxy, true value) is fixed under γ; and the setting is single-shot. We went after the second premise.
The rewards in this note break it in a specific, structured way: they are performative (Perdomo, Zrnic, Mendler-Dünner & Hardt, 2020) — the deployed policy's optimization pressure on the proxy degrades the reward itself. Our motivating instance is a coordination-game “court”: patronage conferred by a population of noisy graders keyed on the policy's causal proxy-dependence. Its unique global-game equilibrium (Carlsson–van Damme 1993; Morris–Shin 1998) produces a sharp collapse threshold derived from payoffs and noise — we never inserted a cliff into the reward, and verifying that we hadn't was a registered admission gate, not an afterthought. In the Goodhart taxonomy of Manheim & Garrabrant (2018) this is causal/adversarial Goodhart — targeting induces the gap — not the regressional/extremal regime the quantilizer was built for. Elster (1983) named the genus four decades earlier: states that are essentially by-products — states the very attempt to obtain precludes. Systematically, not by accident.
2. The result
- The best
q-quantilizer is the untilted base (q = 1); every strict tilt collects weakly less (strictly less under the non-flatness condition above), and tail-collection is monotone inq. - In the registered iterated-tail court model, iterated quantilization (each round's selected tail becoming the next base) drifts monotonically into the collapse region and is ruined at a finite round, while the zero-pressure policy is drift-free.
- A policy with zero causal proxy-dependence — which is not a member of the quantilizer family over any base biased toward the proxy — strictly out-collects the entire family whenever the base's support lies past the collapse threshold, and out-collects it partially (by the base's disgraced fraction) when the base straddles it.
Two remarks on (3), both load-bearing. Zero causal proxy-dependence is defined interventionally — the action distribution is invariant under do(proxy) holding other cues fixed — not by zero observational correlation, which the environment can force on honest policies. And the strict domination is conditional on the support geometry, which is where the prize we originally wanted went to die: over a base containing genuinely restrained (near-zero-pressure) behavior, only the partial form survives. The clean form requires a base the quantilizer framework would not call trusted. Un-targeting's clean edge appears only in a fallen court.
Machine-checked core
The finite general form of (1) and the clean case of (3) are not just computed — they are proven, in Lean 4 + Mathlib, for arbitrary finite uniform support (unweighted bases). These are anchors, not a formalized theory — the court dynamics live in the computed receipts — and saying so is part of the point:
Sundogcert.Percival.upper_tail_mul_le_base_mul— for a reward listed nonincreasing along the proxy-pressure order, no upper tail beats the full-base average (cross-multiplied; arbitrary length, arbitrary tail);Sundogcert.Percival.best_quantilizer_is_base_general— the same in average form;Sundogcert.Percival.clean_support_above_separation_general— all-zero support past the threshold ⇒ every tail average is strictly beaten by any positive un-targeted reward.
All three carry #guard_msgs-pinned axiom audits: exactly [propext, Classical.choice, Quot.sound], no sorry, no native_decide, registered in the repository-wide axiom audit. A falsifier hunt over 10,500 randomized instances — unweighted and weighted finite bases, and continuous bases under the actual coordination-court reward — found zero violations.
3. The boundary of the regime (the condition is not optional)
The condition — the proxy is pointwise misaligned — is the exact boundary of the guaranteed class, and we are not carrying it as a technicality: drop it and the theorem is false. It is also not the only route to q = 1 — a sufficiently graded tax can overpower a pointwise-good proxy. Separate the pointwise proxy–truth relation from the performative tax as independent knobs, sweep both, and the plane splits four ways:
| proxy pointwise… | performative tax | optimal tilt |
|---|---|---|
| misaligned or neutral | any monotone tax, including none | base is optimal; strict tilt never improves (the theorem) |
| good | none | maximal tilt (Goodhart-free; quantilize away) |
| good | threshold (cliff) | tilt to just under the threshold — the temptation regime |
| good | graded and steep | q* = 1 restored — don't tilt at all |
For a pointwise-good proxy with pressure headroom, tilting strictly helps — our registered falsifier fires there exactly as it should. Performativity alone does not make tilting bad. A threshold creates cliff-edge temptation; a graded and steep tax restores q = 1 by charging marginal pressure. And the zero-dependence policy is strictly dominated on the pointwise-good/no-tax half — so “never optimize the proxy” is not free wisdom either. It is the right move exactly where the proxy is misaligned or the oversight is graded-and-steep, and the wrong one elsewhere.
4. A corollary about oversight shape
The table's third and fourth rows are, we think, the most practically suggestive part — the finding we would keep if we could keep only one. The cliff-edge temptation — the reward-optimal agent courting right up to the review boundary, fragile to any perturbation — is a threshold-oversight artifact. It exists because a threshold tax has a flat, zero-marginal region below the cliff, where tilting is free. A graded tax charges every unit of proxy-pressure at the margin, and when its marginal rate dominates the pointwise gain, not tilting at all becomes optimal even for a pointwise-good proxy. Genuinely interior optima exist only on the knife-edge between regimes. In short: threshold oversight creates the incentive to live at the edge of what review permits, and graded oversight removes it. The first town, and the second. We did not arrive here alone: a graded-review knob in a related reinforcement-learning testbed drew the same edge from the empirical side, and we only compared receipts afterward. But the instrument also marked a boundary the landscape does not. Whether an agent is actually driven off a competent edge by threshold review depends on the optimizer. A naive learner, annealing its exploration, sawtooths off the cliff; a stabilized one — warm-started at the edge, with a trust region — absorbs the occasional hard review and settles into a competent, slightly more cautious policy under either shape. So the temptation is a property of the landscape, but whether it destabilizes is a property of the climber, and the shape of oversight bites hardest where the climber is least steady.
5. What this is, and what it is not
It is
A controlled, in-vitro result about a mechanism, and we will not dress it as more. The environments are toys: an analytic coordination court, binary cues, uniform bases. The Lean theorems cover unweighted finite bases (uniform weights); the weighted and continuous forms are verified computationally only. The multi-agent case is worse for the zero-dependence policy — under a shared-regime court, one agent's purity is non-pivotal and the safe point is reachable only by a coalition — so single-agent numbers must not be extrapolated upward. And the unconditional “un-targeting beats quantilizers” claim is not made. We went looking for it; it deflates structurally on any base containing restraint; the deflation is a central finding of this work, not a caveat we were forced into.
It is not
A foundation-model or deployed-agent claim; a measured claim about any real base distribution or any real oversight regime; or a replacement for quantilization where its premises hold. Where the fixed-joint premise holds and the proxy is pointwise informative, Taylor's bound is the right tool and you should use it. This note maps where that stops.
Falsifiers, standing
- A nonincreasing reward and base geometry where some tilt strictly beats the base (would refute the theorem's computed extensions).
- A natural performative map, monotone in pressure, where tilting helps despite pointwise misalignment (would break the class boundary).
- An exogenous projection achieving zero causal proxy-dependence while retaining above-baseline competence on an entangled proxy (would make the “target channel” cleanly enforceable and materially soften this note).
6. Reproduce — then try to break it
- Lean:
lake build Sundogcert.AxiomAuditin the sundogcert repository (green build = all anchors compile with the pinned axiom sets). - Computed receipts:
scripts/percival-s4-counterproductivity-general.mjs(falsifier hunt),scripts/percival-b1-0-court-coordination.mjs(the emergent-threshold court),scripts/percival-b1-separation-bakeoff.mjs(separation boundary map),scripts/percival-b2-gamma-provenance.mjs(base provenance — the structural deflation; declassified 2026-07-01),scripts/percival-b3-composition-stability.mjs(iteration/drift),scripts/percival-s6-class-boundary.mjs(the class-boundary table above), each emitting a markdown receipt and JSON artifact.
A green build plus zero violations over 10,500 instances is the claim. If you can make either fail, that is a result, and we want it — the falsifiers above are the door.
Licensing: the Percival receipts, specs, probe scripts, and result artifacts are Apache-2.0 under a manifest-scoped grant — see docs/percival/LICENSE.md and its MANIFEST.json. The sundogcert repository (the Lean anchors) is Apache-2.0 repo-wide.
Coda
The honest shape of this work is a hunt for an unconditional separation that paid out a boundary map. That is not a disappointment; it is what a good result does — it tells you exactly where its own intuition stops. The quantilizer is a fine tool inside its premises. We went and found the wall, and drew it.
References
- J. Taylor. Quantilizers: A Safer Alternative to Maximizers for Limited Optimization. MIRI technical report / AAAI-16 workshop, 2015.
- J. Perdomo, T. Zrnic, C. Mendler-Dünner, M. Hardt. Performative Prediction. ICML 2020.
- J. Elster. Sour Grapes: Studies in the Subversion of Rationality. Cambridge, 1983.
- D. Manheim, S. Garrabrant. Categorizing Variants of Goodhart's Law. 2018.
- H. Carlsson, E. van Damme. Global Games and Equilibrium Selection. Econometrica 1993.
- S. Morris, H. Shin. Unique Equilibrium in a Model of Self-Fulfilling Currency Attacks. AER 1998.
A Sundog alignment note. Source: the markdown note; the machine-checked anchors live in the sundogcert repository.
— Credit the un-reaching.