
LLMs in prediction markets: the eval that bills you when you're wrong

ai
crypto
llm
prediction-markets
calibration

Fourteen of the twenty most-profitable wallets on Polymarket are bots. Humans turn a profit something like 7–13% of the time; agents do it closer to 37% — and the reason isn't speed, it's that a prediction market is the one LLM benchmark you cannot overfit, because it bills you when you're wrong.

I'll say the quiet part first: most of those bots still lose. "37% profitable" means 63% don't. But the asymmetry is real and it points at something I find more interesting than any leaderboard — a market is a proper scoring rule wearing a trading interface, and the thing it scores is calibration. Below I'll derive that from the payoff up, decompose the Brier score into the term that wins money and the term that keeps you alive, and then show you the architecture that actually trades on it. There's a live calibration plot you can bend off the diagonal with a slider, and a tiny market sim where you can watch overconfidence turn a grind-up into a bleed.

The leaderboard nobody ordered

The numbers come from a CoinDesk piece in March 2026 and the broader wave of agent tooling around it: agents are now over 30% of wallet activity on Polymarket, Polymarket shipped Polystrat — a natural-language-goal autonomous trader that ran 4,200+ trades in its first month — and live benchmarks like PolyBench and Prediction Arena now run frontier models against real capital on Kalshi and Polymarket.

Treat the headline carefully. "14 of the top 20" is a snapshot of a public leaderboard, and wallets are not operators: a single shop can run a cluster of them, so the number of distinct winning bot teams is probably smaller than fourteen. The "37% vs 7–13%" split is directional, drawn from on-chain wallet analyses, not a controlled study. What survives the skepticism is the shape: a small population of automated traders that is disproportionately on the winning side of a zero-sum game against a much larger population of humans. That shape is worth explaining, and the explanation is mathematical, not magical.

A market is a scoring rule

Start with the payoff. A share of "YES" on a binary market pays $1 if the event happens and $0 if it doesn't. You believe the true probability is \(q\); the market's asking price is \(p\). Your expected profit per YES share is

\[ \mathbb{E}[\pi_{\text{YES}}] = q\,(1 - p) + (1 - q)\,(-p) = q - p. \]

So you buy YES iff \(q > p\), buy NO iff \(q < p\), and abstain at \(q = p\). The trade is, exactly, a bet that your forecast is better than the price's forecast. The market price \(p\) is itself a probabilistic forecast (the crowd's), and putting on a position means being scored against it in dollars. There's no separate "benchmark" to teach to; the act of trading is the eval.
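As a minimal sketch of that decision rule, the snippet below computes the expected profit on each side and picks one. The function names are mine, and fees and the bid/ask spread are ignored, which a live trader could not do.

```typescript
type Side = "YES" | "NO" | "ABSTAIN";

// Expected profit per share given belief q and YES price p (both in [0, 1]).
// A NO share costs (1 - p) and pays $1 when the event does not happen.
function expectedProfit(q: number, p: number): { yes: number; no: number } {
  const yes = q * (1 - p) + (1 - q) * -p; // simplifies to q - p
  const no = (1 - q) * p + q * -(1 - p); // simplifies to p - q
  return { yes, no };
}

// Buy whichever side has positive expected profit; abstain when q equals p.
function chooseSide(q: number, p: number): Side {
  if (q > p) return "YES";
  if (q < p) return "NO";
  return "ABSTAIN";
}

const ev = expectedProfit(0.62, 0.55);
console.log(chooseSide(0.62, 0.55), ev.yes.toFixed(2)); // YES 0.07
```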

Brier, from scratch

The natural way to grade a sequence of probabilistic forecasts \(p_i\) against binary outcomes \(o_i \in \{0, 1\}\) is the Brier score (Brier, 1950):

\[ BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2. \]

Lower is better; \(0\) is a perfect oracle, \(0.25\) is what you get by always saying "50/50," and \(1\) is confidently, completely wrong every time. The thing that makes it the right score is that it's proper: your expected score is minimized by reporting your true belief. Take a single event with true probability \(\theta\) and consider reporting some \(p\). Expected Brier is

\[ \mathbb{E}[BS] = \theta\,(1 - p)^2 + (1 - \theta)\,p^2. \]

Differentiate with respect to \(p\) and set to zero:

\[ \frac{d}{dp}\,\mathbb{E}[BS] = -2\theta(1 - p) + 2(1 - \theta)p = 2p - 2\theta = 0 \;\Rightarrow\; p = \theta. \]

Honesty is optimal. You cannot game a Brier score by being bold; the math punishes the bluff. That single fact is why a market — which realizes a proper score as a cash flow — is an eval you can't sandbag.
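The properness claim is easy to check numerically. This sketch (names are mine, not from any library) computes a Brier score and then grid-searches the report that minimizes expected Brier for a given true probability; it should land on the true probability itself.

```typescript
type Forecast = { p: number; outcome: number }; // outcome is 0 or 1

// Mean squared error between forecasts and realized 0/1 outcomes.
function brierScore(rows: Forecast[]): number {
  if (rows.length === 0) throw new Error("need at least one forecast");
  let sum = 0;
  for (const { p, outcome } of rows) sum += (p - outcome) ** 2;
  return sum / rows.length;
}

// Expected Brier for reporting p when the event's true probability is theta.
function expectedBrier(theta: number, p: number): number {
  return theta * (1 - p) ** 2 + (1 - theta) * p ** 2;
}

// Grid-search the report in [0, 1] that minimizes expected Brier.
function bestReport(theta: number, steps = 100): number {
  let best = 0;
  for (let k = 0; k <= steps; k++) {
    const p = k / steps;
    if (expectedBrier(theta, p) < expectedBrier(theta, best)) best = p;
  }
  return best;
}

console.log(brierScore([{ p: 0.5, outcome: 1 }, { p: 0.5, outcome: 0 }])); // 0.25
console.log(bestReport(0.7)); // 0.7
```

Shading the report toward 0 or 1 only raises the expected score, which is the "you cannot bluff" property in numbers.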

If you'd rather punish confident wrongness harder, log loss is the other standard proper score:

\[ LL = -\frac{1}{N} \sum_{i=1}^{N} \Big[ o_i \log p_i + (1 - o_i)\log(1 - p_i) \Big], \]

which goes to infinity for a probability-one miss. It's also proper, and it's what you'd minimize if a single overconfident blowup is what kills you. Both show up live in the demo below.
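A small sketch of log loss, to make the blowup concrete. The optional `eps` clipping is my addition, a common practical guard rather than part of the definition; with the default of zero, a probability-one miss returns `Infinity`.

```typescript
type Forecast = { p: number; outcome: number }; // outcome is 0 or 1

// Mean negative log-likelihood of the outcomes under the forecasts.
// eps > 0 clips forecasts into [eps, 1 - eps] so the score stays finite.
function logLoss(rows: Forecast[], eps = 0): number {
  if (rows.length === 0) throw new Error("need at least one forecast");
  let sum = 0;
  for (const { p, outcome } of rows) {
    const q = Math.min(1 - eps, Math.max(eps, p));
    // Branch instead of multiplying by outcome, so 0 * log(0) never yields NaN.
    sum += outcome === 1 ? Math.log(q) : Math.log(1 - q);
  }
  return -sum / rows.length;
}

console.log(logLoss([{ p: 0.99, outcome: 0 }]).toFixed(3)); // 4.605
console.log(logLoss([{ p: 1, outcome: 0 }])); // Infinity
```

For comparison, the Brier score of the 0.99 miss is about 0.98 and can never exceed 1, which is why log loss is the harsher judge of overconfidence.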

The Murphy decomposition

Brier hides two different skills inside one number. Murphy (1973) splits it apart. Bin the forecasts into \(K\) bins; let bin \(k\) hold \(n_k\) forecasts with mean forecast \(\bar{p}_k\) and observed outcome frequency \(\bar{o}_k\), and let \(\bar{o}\) be the overall base rate. When every forecast in a bin shares the value \(\bar{p}_k\), the following identity is exact (coarser bins leave a small within-bin remainder):

\[ BS = \underbrace{\sum_{k} \frac{n_k}{N}\,(\bar{p}_k - \bar{o}_k)^2}_{\text{reliability}} \;-\; \underbrace{\sum_{k} \frac{n_k}{N}\,(\bar{o}_k - \bar{o})^2}_{\text{resolution}} \;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}. \]

Three terms, three different things:

  • Reliability is calibration error: when you say "70%," does it happen 70% of the time? You want this at \(0\). It's the only term that can hurt you, and it's the one a market punishes.
  • Resolution is discrimination: how far your bin frequencies \(\bar{o}_k\) spread away from the base rate. You want this large. This is where edge actually comes from: a forecaster who says "10%" and "90%" and is right is more useful than one who hedges to "50%."
  • Uncertainty is the base-rate variance \(\bar{o}(1-\bar{o})\), the irreducible difficulty of the question. You don't control it.
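The three terms can be computed directly. The sketch below assumes equal-width bins (my choice; other binning schemes work too) and uses a toy sample in which each bin holds a single forecast value, so the three terms add back up to the directly computed Brier score.

```typescript
type Forecast = { p: number; outcome: number }; // outcome is 0 or 1

interface MurphyTerms {
  reliability: number;
  resolution: number;
  uncertainty: number;
  brier: number; // computed directly, for comparison
}

function murphyDecomposition(rows: Forecast[], bins = 10): MurphyTerms {
  const n = rows.length;
  const count: number[] = new Array(bins).fill(0);
  const sumP: number[] = new Array(bins).fill(0);
  const sumO: number[] = new Array(bins).fill(0);
  let brier = 0;
  for (const { p, outcome } of rows) {
    const k = Math.min(bins - 1, Math.floor(p * bins)); // equal-width bin index
    count[k] += 1;
    sumP[k] += p;
    sumO[k] += outcome;
    brier += (p - outcome) ** 2;
  }
  const baseRate = sumO.reduce((a, b) => a + b, 0) / n;
  let reliability = 0;
  let resolution = 0;
  for (let k = 0; k < bins; k++) {
    if (count[k] === 0) continue; // empty bins contribute nothing
    const pBar = sumP[k] / count[k];
    const oBar = sumO[k] / count[k];
    reliability += (count[k] / n) * (pBar - oBar) ** 2;
    resolution += (count[k] / n) * (oBar - baseRate) ** 2;
  }
  return { reliability, resolution, uncertainty: baseRate * (1 - baseRate), brier: brier / n };
}

// Toy data: "20%" calls that hit 1 in 5 (calibrated), "90%" calls that hit 3 in 5 (overconfident).
const repeat = (p: number, hits: number, total: number): Forecast[] =>
  Array.from({ length: total }, (_, i) => ({ p, outcome: i < hits ? 1 : 0 }));
const sample = [...repeat(0.2, 1, 5), ...repeat(0.9, 3, 5)];
const terms = murphyDecomposition(sample);
console.log(terms, terms.reliability - terms.resolution + terms.uncertainty);
```

In this sample all of the reliability penalty (about 0.045) comes from the "90%" bin, while resolution (about 0.04) comes from both bins sitting away from the 40% base rate.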

The slogan I keep in my head: resolution is where the edge comes from, reliability is how you survive long enough to collect it. You can have a brilliant nose for which way a market is wrong (high resolution) and still go broke because you're systematically overconfident (high reliability term). The reliability diagram is where that becomes visible — so here's one you can break.

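[Interactive figure: reliability diagram (legend: calibrated). x-axis: predicted probability; y-axis: realized frequency. Readouts: Brier, log loss, reliability (lower is better), resolution (higher is better), uncertainty, and the sum reli − reso + unc.]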

Reliability − resolution + uncertainty reconstructs Brier (Murphy 1973), exactly so when each bin holds a single forecast value. Push the knob and watch reliability, the calibration-error term, balloon while accuracy barely moves.

Drag overconfidence up. The dots are realized outcomes; the blue curve is the agent's binned calibration. At \(1.0\times\) the curve hugs the dashed diagonal and the reliability term sits near zero. Push past \(1.5\times\) and the curve peels away (the agent says "90%" when the world is delivering 75%), and reliability balloons while the Brier number climbs. The agent's accuracy (which side of 50% it lands on) barely changes. That's the trap: miscalibration is nearly invisible in accuracy and obvious in Brier. The market sees the Brier version.

Calibration is edge

Now connect the score to the bankroll. We showed expected profit per YES unit is \(q - p\). If your estimate \(p_{\text{you}}\) is better calibrated than the market's implied \(p_{\text{mkt}}\) (closer to the true \(\theta\) on average), then your expected edge per unit staked at fair odds is

\[ \mathbb{E}[\text{edge}] \approx p_{\text{you}} - p_{\text{mkt}}, \]

positive exactly when you're on the right side of the price. Per trade it's small and noisy. But edge accumulates linearly in the number of trades \(N\) while, for roughly independent trades, the standard deviation of your P&L grows like \(\sqrt{N}\), so the signal-to-noise of a consistent calibration advantage scales as \(\sqrt{N}\): a tiny, repeatable edge becomes a near-certainty over enough resolved markets. That's the whole bot thesis in one line: not a bigger brain per trade, a better-calibrated one across thousands.
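To put a number on "enough resolved markets", here is a back-of-envelope sketch. It assumes independent, identically sized one-share trades with the same edge each time, which real markets will not give you, so read the output as an order of magnitude. All names are illustrative.

```typescript
// Per-share edge and P&L standard deviation for buying YES at `price`
// when the event's true probability is `theta`.
function perTradeStats(theta: number, price: number): { edge: number; sd: number } {
  const edge = theta - price; // expected profit per share, q - p
  const sd = Math.sqrt(theta * (1 - theta)); // payoff is Bernoulli(theta) minus a constant
  return { edge, sd };
}

// Mean P&L divided by its standard deviation after n independent trades.
function signalToNoise(edge: number, sd: number, n: number): number {
  return (edge * Math.sqrt(n)) / sd;
}

// Trades needed before expected P&L sits z standard deviations above zero.
function tradesNeeded(edge: number, sd: number, z = 2): number {
  return Math.ceil(((z * sd) / edge) ** 2);
}

const stats = perTradeStats(0.55, 0.52); // a three-cent edge
console.log(tradesNeeded(stats.edge, stats.sd)); // roughly 1,100 trades for z = 2
console.log(signalToNoise(stats.edge, stats.sd, 100).toFixed(2)); // 0.60
```

A three-cent edge is indistinguishable from luck after a hundred trades and hard to miss after a few thousand.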

How much to stake is its own question, and the honest answer is the Kelly criterion. For a bet at net odds \(b\) (profit per unit staked if you win, i.e. decimal odds minus one) that you believe wins with probability \(w\), the growth-optimal fraction of bankroll is

\[ f^{*} = \frac{b\,w - (1 - w)}{b} = w - \frac{1 - w}{b}. \]

For a YES share bought at price \(p\), \(b = (1 - p)/p\), so this reduces to \(f^{*} = (w - p)/(1 - p)\). Kelly is brutal about miscalibration: it sizes on your \(w\), so an overconfident \(w\) overbets, and overbetting a true edge is how calibrated-on-paper agents still blow up. In practice you run fractional Kelly (a quarter of \(f^{*}\) is common) precisely because your \(w\) is an estimate, not the truth. The sim below sizes at a quarter-Kelly so you can watch this directly.
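A minimal sketch of the sizing rule for a binary share. The function names are mine, and flooring the fraction at zero is my assumption: a negative Kelly fraction means the edge is on the other side, so this helper simply declines to bet.

```typescript
// Kelly fraction for net odds b and believed win probability w, floored at zero.
function kellyFraction(w: number, b: number): number {
  return Math.max(0, w - (1 - w) / b);
}

// Buying YES at `price` risks `price` to win (1 - price), so b = (1 - price) / price.
// `multiplier` scales full Kelly down; 0.25 is quarter-Kelly.
function kellyForYes(w: number, price: number, multiplier = 0.25): number {
  const b = (1 - price) / price;
  return multiplier * kellyFraction(w, b);
}

// The true edge supports w = 0.60; an overconfident agent believes w = 0.80.
console.log(kellyForYes(0.6, 0.5).toFixed(3)); // 0.050
console.log(kellyForYes(0.8, 0.5).toFixed(3)); // 0.150
```

The overconfident agent stakes three times as much on the same market, which is the overbetting the paragraph above warns about.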

[Interactive figure: live market sim, binary YES, calibrated agent. Plots: price (YES = 1.0) and cumulative P&L (units). Readouts: p_you 0.50, p_mkt 0.50, P&L 0.00 units.]

Posterior updates in logit space: logit(p) += log-likelihood ratio. The knob scales every ratio, so a 2× agent over-reacts to each headline — its P&L flips from a grind up to a slow bleed even though the market resolves the same way. Calibration is the edge; here it is in units, not Brier.

Press play. News ticks arrive, each carrying a log-likelihood ratio; the agent runs a Bayesian update in logit space, \(\operatorname{logit}(p) \mathrel{+}= \text{LLR}\), which is just the additive form of

\[ p_{\text{post}} = \frac{p_{\text{prior}}\,L}{p_{\text{prior}}\,L + (1 - p_{\text{prior}})}, \qquad L = e^{\text{LLR}}, \]

where \(L\) is the likelihood ratio of the news under YES versus NO.

When its posterior diverges from the market price by more than the spread, it trades a quarter-Kelly stake. At \(1.0\times\) the calibrated agent grinds the P&L upward. Now scale overconfidence to \(2\times\): it over-reacts to every headline, overshoots the truth, overbets, and the same market that resolves the same way turns the curve into a slow bleed. Calibration, in dollars.
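The update itself is a few lines. This sketch takes the likelihood ratio to be L = exp(LLR) and models the overconfidence knob as a `gain` that multiplies every log-likelihood ratio; that is my reading of the slider, not the demo's actual source.

```typescript
const logit = (p: number): number => Math.log(p / (1 - p));
const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// One Bayesian update in logit space. gain = 1 is calibrated;
// gain = 2 treats every piece of news as twice as informative as it is.
function update(prior: number, llr: number, gain = 1): number {
  return sigmoid(logit(prior) + gain * llr);
}

// The same update in probability space, with L = exp(llr) the likelihood ratio.
function updateProb(prior: number, llr: number): number {
  const L = Math.exp(llr);
  return (prior * L) / (prior * L + (1 - prior));
}

const news = Math.log(3); // evidence three times likelier under YES than NO
console.log(update(0.5, news).toFixed(2)); // 0.75
console.log(update(0.5, news, 2).toFixed(2)); // 0.90
console.log(updateProb(0.5, news).toFixed(2)); // 0.75
```

The 2× agent reads the same headline and lands on 90% instead of 75%, then sizes its bet on the inflated number.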

The architecture that wins

Here's the part people get wrong: they ask the LLM to do all three jobs and wonder why it's reckless. The winning shape is three separate layers, and the separation is the point.

// 1. LLM proposes a belief — thesis + retrieval. What's the base rate?
//    What did the news just change? Output is a probability, nothing more.
const pYou = await llm(`${question}\n${await retrieve(question)}`); // 0..1
 
// 2. A microstructure model decides the *trade* — not the LLM. Spread,
//    depth, and adverse selection live here, where they're measurable.
const size = microstructure({ pYou, pMkt, spread, depth });
 
// 3. Deterministic risk limits the model cannot talk its way past.
//    Kelly-capped, per-market exposure, daily loss stop. Hard clamp.
const order = riskLimits.clamp(size);

The LLM is good at exactly one thing here: turning messy language (news, base rates, expert priors) into a number \(p_{\text{you}}\). It is bad at sizing, because sizing is arithmetic about spread and bankroll that an autoregressive model will happily hallucinate, and it is dangerous at risk, because the entire failure mode of an overconfident model is that it will argue itself into a bigger bet. So you let the LLM propose and deterministic code dispose. The microstructure layer turns a belief into a stake given liquidity; the risk layer is a clamp the model has no token-level access to. I built hypeduel and mm-bot at B3 on exactly this split, and every time I've let the boundary blur, the blur is where the loss came from.
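For readers who want something runnable, here is one way the two deterministic layers might look. Everything in it is a hypothetical stand-in, not the hypeduel or mm-bot code: the `Book` and `RiskLimits` shapes, the "edge must exceed the spread" rule, the quarter-Kelly stake capped at book depth, and the limit values. The LLM call is left out; `parseBelief` only guards whatever text it returns.

```typescript
interface Book { pMkt: number; spread: number; depth: number } // depth: units available
interface RiskLimits { maxPerMarket: number; dailyLossLeft: number }

// Layer 1 guard: accept the model's text only if it parses to a probability.
function parseBelief(raw: string): number | null {
  const x = Number.parseFloat(raw);
  return Number.isFinite(x) && x >= 0 && x <= 1 ? x : null;
}

// Layer 2: belief plus book state -> desired signed stake (+ buys YES, - buys NO).
function microstructure(pYou: number, book: Book, bankroll: number, kellyMult = 0.25): number {
  const edge = pYou - book.pMkt;
  if (Math.abs(edge) <= book.spread) return 0; // edge does not clear the spread
  // Full Kelly for a binary share: (w - p) / (1 - p) on YES, (p - w) / p on NO.
  const fullKelly = edge > 0 ? edge / (1 - book.pMkt) : -edge / book.pMkt;
  const stake = Math.min(kellyMult * fullKelly * bankroll, book.depth);
  return Math.sign(edge) * stake;
}

// Layer 3: hard clamp. No belief, however confident, gets past these numbers.
function clampToLimits(size: number, limits: RiskLimits): number {
  const cap = Math.max(0, Math.min(limits.maxPerMarket, limits.dailyLossLeft));
  return Math.sign(size) * Math.min(Math.abs(size), cap);
}

const book: Book = { pMkt: 0.55, spread: 0.02, depth: 500 };
const limits: RiskLimits = { maxPerMarket: 25, dailyLossLeft: 100 };
const belief = parseBelief("0.99"); // an overconfident model output
const order = belief === null ? 0 : clampToLimits(microstructure(belief, book, 1000), limits);
console.log(order); // 25
```

The sizing layer alone would have staked about 244 units on the 0.99 belief; the clamp cuts that to 25 without consulting the model.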

Why this is the honest eval

Static benchmarks rot. The test set leaks into the next pretraining run, the questions get memorized, and a model can quietly sandbag — underperform on purpose — because nothing is at stake. A market has none of those affordances. You cannot memorize the answer to a question that hasn't resolved. There's no held-out set to leak because the future is the held-out set. And there's no sandbagging, because the only way to score well is to actually move money to the right side of a price, which means the eval has skin in your game, not just the grader's. When I want to know whether a forecasting model is real, I don't want its MMLU number — I want its Brier score on markets that have since resolved, and its P&L net of fees.

The reflexivity problem

The honest caveat, and it's a real one: when bots are 30%+ of the flow, a growing share of them are calibrating against each other. The "crowd" you're trying to beat is increasingly other models, and a market dominated by correlated LLMs can drift away from the ground-truth base rate toward a shared model consensus — confidently, and in lockstep. Calibration against a price is only as honest as the price is anchored to reality. There's also the survivorship problem baked into every leaderboard: you see the wallets that won, not the nine that quietly drained and got deleted, so the "37%" is flattered by the ones who left. And backtested calibration is not live calibration — fit your bins to history and you've just overfit the eval you swore couldn't be overfit.

So I'll hold both: a market is the most honest eval we have and a reflexive one, and the second fact is the open research problem, not a footnote. The most interesting agents from here won't be the ones with the best static benchmark — they'll be the ones whose calibration survives contact with a crowd that is increasingly made of other agents. That's a sovereign-compute question as much as a trading one: whose models, on whose hardware, anchored to what. Build a calibrated agent and run it against mine. The P&L will tell us who's right — and that's the only eval I trust.


Numbers compiled May 2026 from CoinDesk, PolyBench, and the Polymarket agents repo. The "14/20 wallets" and "37% vs 7–13%" figures are public-leaderboard and on-chain-analysis snapshots — directional, not peer-reviewed, and flagged inline where I lean on them. The Brier and Murphy derivations are standard (Brier 1950, Murphy 1973); the demos use synthetic, seeded data, not live market feeds.