Why Simple Aggression Wins Heads-Up: Notes from a #1 Arena Agent

0xbilbo@newsletter.paragraph.com (Bilbo) — Fri, 26 Jun 2026 10:20:25 GMT

TL;DR

I built neon_savage, a heads-up No-Limit Hold'em agent that reached #1 on the dev.fun Arena Playground S4 leaderboard, ranked by TrueSkill conservative score (μ − 3σ). The ranking matters because the conservative score rewards a consistent edge: subtracting 3σ penalizes ratings backed by few games, so a real edge climbs as the sample grows and a lucky streak does not hold up. The agent's edge was not a large model or deep search. It was a small, disciplined policy built around one idea: in 1v1 imperfect-information play, passivity is the leak and well-aimed aggression is the edge.
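A minimal sketch of the conservative score itself, to show why it discounts short hot streaks. The ratings for the two example players are made-up numbers for illustration; only the formula μ − 3σ and the standard TrueSkill starting values (μ = 25, σ = 25/3) are taken as given.

```python
def conservative_score(mu: float, sigma: float, k: float = 3.0) -> float:
    """Lower bound on skill: estimated mean minus k standard deviations."""
    return mu - k * sigma


# A brand-new player at the standard TrueSkill defaults scores about zero.
new_player = conservative_score(25.0, 25.0 / 3.0)

# Hypothetical ratings: same estimated skill (mu), different certainty (sigma).
lucky_streak = conservative_score(32.0, 6.0)  # few games, wide uncertainty
proven_edge = conservative_score(32.0, 1.5)   # many games, narrow uncertainty

print(round(new_player, 6), lucky_streak, proven_edge)
```

With the same μ, the player whose σ has shrunk through more games ranks far higher, which is the sense in which the leaderboard rewards consistency over a spike.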

The counter-intuitive part

Most poker intuition is built for full-ring or 6-max, where folding marginal hands is correct. Heads-up inverts that. With only one opponent and a blind posted every hand, any two cards have meaningful equity, and folding the button bleeds you out. The dominant adjustments:

Play extremely wide preflop (70–90% of hands), raise-or-fold, almost never limp.
Continuation-bet relentlessly. With a single opponent to get through, fold equity is high and a c-bet prints far more often than in multiway pots.
Don't over-fold to aggression. A heads-up opponent's betting range is wide, so their bets are far less credible than a 6-max opponent's. Calling and re-raising lighter is correct — the same action that would be a leak in full-ring is a profit center heads-up.
Let equity realize. Top pair, second pair, and even ace-high are frequently the best hand. Value thresholds shift down hard.

None of this requires "reasoning." It requires correctly recalibrating every threshold for a two-player game and then applying it without tilt or drift.
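The threshold arithmetic behind the continuation-bet and call-lighter adjustments can be sketched with two standard pot-odds formulas. This is a simplification, not the agent's code: it treats the bet as a pure bluff with no equity when called and ignores later streets, and the function names are hypothetical.

```python
def bluff_breakeven_fold_freq(pot: float, bet: float) -> float:
    """Fold frequency at which a zero-equity bet breaks even.

    Risk `bet` to win `pot`: breakeven when folds = bet / (pot + bet).
    """
    return bet / (pot + bet)


def call_required_equity(pot: float, bet: float) -> float:
    """Minimum equity needed to call a bet of size `bet` into `pot`.

    Call `bet` to win `pot + bet`: equity = bet / (pot + 2 * bet).
    """
    return bet / (pot + 2 * bet)


# Half-pot bet into a pot of 1.0 unit.
half_pot_bluff = bluff_breakeven_fold_freq(1.0, 0.5)
half_pot_call = call_required_equity(1.0, 0.5)

print(f"half-pot bluff needs folds > {half_pot_bluff:.1%}")
print(f"calling a half-pot bet needs equity > {half_pot_call:.1%}")
```

A half-pot bet only needs to work about a third of the time, and calling one needs 25% equity. Against a single opponent with a wide range, both bars are cleared far more often than full-ring intuition suggests.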

How I got there

I reverse-engineered the field from replay data first. The baseline opponents in the sandbox were measurably passive: heavy on fallback "check/fold" lines and cautious calling. Against a passive, over-folding field, the maximally exploitative response is simple: apply pressure constantly and only slow down with a real reason. I encoded opponent typing (value-bet against calling stations, apply pressure against tight regulars) so the aggression was targeted rather than blind, and let the edge accumulate in the TrueSkill ranking over thousands of hands.
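One way opponent typing of this kind could look, as a sketch only. The counters, the type labels, and every cutoff below are hypothetical placeholders, not the values or structure neon_savage actually used.

```python
from dataclasses import dataclass


@dataclass
class OpponentStats:
    """Hypothetical per-opponent counters accumulated from replay data."""
    bets_faced: int = 0
    folds_to_bet: int = 0

    @property
    def fold_rate(self) -> float:
        return self.folds_to_bet / self.bets_faced if self.bets_faced else 0.0


# Illustrative cutoffs (assumptions, not the agent's real thresholds).
MIN_SAMPLES = 20
OVER_FOLD = 0.60
STATION = 0.30


def classify(stats: OpponentStats) -> str:
    """Bucket an opponent by how often they fold when facing a bet."""
    if stats.bets_faced < MIN_SAMPLES:
        return "unknown"  # too little data to deviate from the default
    if stats.fold_rate >= OVER_FOLD:
        return "tight_regular"
    if stats.fold_rate <= STATION:
        return "calling_station"
    return "balanced"


def plan(opponent_type: str) -> str:
    """Map an opponent type to the adjustment described in the text."""
    return {
        "tight_regular": "pressure",     # they fold too much: bluff more
        "calling_station": "value_bet",  # they call too much: bet thinner for value
    }.get(opponent_type, "default_aggression")


print(plan(classify(OpponentStats(bets_faced=50, folds_to_bet=35))))
print(plan(classify(OpponentStats(bets_faced=50, folds_to_bet=10))))
```

Keeping the typing as a lookup over a few counters is what makes every decision traceable to a rule: the opponent's bucket and the resulting adjustment can both be logged per hand.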

The result was a top-of-leaderboard finish with a deliberately small, legible policy — no black box, every decision traceable to a rule.

Why this is interesting beyond poker

Heads-up Hold'em is one of the cleanest available testbeds for 1v1 adversarial decision-making under uncertainty, and the lesson generalizes:

Exploitation beats equilibrium against a non-optimal field. You don't need a game-theory-optimal solver to win — you need to measure how the population deviates and attack the deviation. That's true in security (attackers exploit predictable defenders), negotiation, and competitive RL.
Consistency is the real skill signal. A TrueSkill #1 over a large sample is a much stronger claim than a high-variance leaderboard spike. How we measure agent skill is at least as important as the agents themselves.

What I'd study with dataset access

Quantify the exploitation gap: how much edge comes from being maximally exploitative versus playing a balanced/unexploitable line, across the real population — and how that gap shrinks as the field gets tougher.
Adaptation under non-stationarity: the field evolves as stronger agents enter. Can a simple opponent-typed policy keep adapting without retraining, and where does it break?
TrueSkill as a skill estimator: how many hands does it actually take for the conservative score to separate a real edge from variance? A practical answer would help anyone benchmarking competitive agents.
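For the last question, a rough first-pass estimate is possible without TrueSkill at all. The sketch below swaps in a plain normal approximation on win rate, not TrueSkill's actual update rule, and asks how many hands are needed before a win rate sits z standard errors above zero. The win rate and standard deviation in the example are hypothetical inputs, both expressed per 100 hands.

```python
import math


def hands_needed(winrate_per_100: float, sd_per_100: float, z: float = 3.0) -> int:
    """Hands required for the win rate to exceed z standard errors.

    Standard error after N hands is sd_per_100 / sqrt(N / 100), so the
    condition winrate > z * SE gives N > 100 * (z * sd / winrate) ** 2.
    """
    if winrate_per_100 <= 0:
        raise ValueError("winrate_per_100 must be positive")
    return math.ceil(100 * (z * sd_per_100 / winrate_per_100) ** 2)


# Hypothetical inputs: 10 bb/100 edge, 150 bb/100 standard deviation.
big_edge = hands_needed(10.0, 150.0)
small_edge = hands_needed(5.0, 150.0)

print(big_edge, small_edge)
```

Under these assumptions, halving the edge quadruples the required sample, which is one concrete way the exploitation gap and the sample-size question interact as the field gets tougher.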

A 1v1 arena with full replay data and a confidence-aware ranking is the right place to study all three.