LLMs in prediction markets: the eval that bills you when you're wrong

Fourteen of the twenty most-profitable wallets on Polymarket are bots. Something like 7-13% of human wallets turn a profit; for agents it's closer to 37%, and the reason isn't speed. It's that a prediction market is the one LLM benchmark you cannot overfit, because it bills you when you're wrong.

I'll say the quiet part first: most of those bots still lose. "37% profitable" means 63% aren't. But the asymmetry is real and it points at something I find more interesting than any leaderboard. A market is a proper scoring rule wearing a trading interface, and the thing it scores is calibration. Below I'll derive that from the payoff up, decompose the Brier score into the term that wins money and the term that keeps you alive, and then show you the architecture that actually trades on it. There's a live calibration plot you can bend off the diagonal with a slider, and a tiny market sim where you can watch overconfidence turn a grind-up into a bleed.

The leaderboard nobody ordered

The numbers come from a CoinDesk piece in March 2026 and the broader wave of agent tooling around it: agents are now over 30% of wallet activity on Polymarket, Polymarket shipped Polystrat (a natural-language-goal autonomous trader that ran 4,200+ trades in its first month), and live benchmarks like PolyBench and Prediction Arena now run frontier models against real capital on Kalshi and Polymarket.

Treat the headline carefully. "14 of the top 20" is a snapshot of a public leaderboard, and wallets are not operators. A single shop can run a cluster of them, so the real count of distinct winning bot operators is smaller than fourteen. The "37% vs 7-13%" split is directional, drawn from on-chain wallet analyses, not a controlled study. What survives the skepticism is the shape: a small population of automated traders that is disproportionately on the winning side of a zero-sum game against a much larger population of humans. That shape is worth explaining, and the explanation is mathematical, not magical.

A market is a scoring rule

Start with the payoff. A share of "YES" on a binary market pays $1 if the event happens and $0 if it doesn't. You believe the true probability is $q$; the market's asking price is $p$. Your expected profit per YES share is

\mathbb{E}[\pi_{\text{YES}}] = q\,(1 - p) + (1 - q)\,(-p) = q - p.

So you buy YES iff $q > p$, buy NO iff $q < p$, and abstain at $q = p$. The trade is, exactly, a bet that your forecast is better than the price's forecast. The market price $p$ is itself a probabilistic forecast (the crowd's), and putting on a position is being scored against it in dollars. There's no separate "benchmark" to teach to; the act of trading is the eval.
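The decision rule above fits in a few lines. This is a minimal sketch, not anyone's production logic: `expectedProfit`, `chooseSide`, and the `minEdge` threshold (a stand-in for fees and spread) are names I'm introducing for illustration.

```javascript
// Expected profit per share, in dollars, for a binary contract that pays $1.
// q is your probability of YES, p is the YES price (both in [0, 1]).
function expectedProfit(q, p) {
  return {
    yes: q - p, // q * (1 - p) + (1 - q) * (-p)
    no: p - q,  // NO costs (1 - p) and pays $1 with probability (1 - q)
  };
}

// Hypothetical helper: take the side with positive expected profit.
// minEdge is an assumed threshold standing in for fees and spread.
function chooseSide(q, p, minEdge = 0) {
  const { yes, no } = expectedProfit(q, p);
  if (yes > minEdge) return "YES";
  if (no > minEdge) return "NO";
  return "ABSTAIN";
}
```

With `minEdge` left at zero this reproduces the rule exactly: YES above the price, NO below it, nothing at equality.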

Brier, from scratch

The natural way to grade a sequence of probabilistic forecasts $p_i$ against binary outcomes $o_i \in \{0, 1\}$ is the Brier score (Brier, 1950):

BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2.

Lower is better; $0$ is a perfect oracle, $0.25$ is what you get by always saying "50/50," and $1$ is confidently, completely wrong every time. The thing that makes it the right score is that it's proper: your expected score is minimized by reporting your true belief. Take a single event with true probability $\theta$ and consider reporting some $p$. Expected Brier is

\mathbb{E}[BS] = \theta\,(1 - p)^2 + (1 - \theta)\,p^2.

Differentiate with respect to $p$ and set to zero:

\frac{d}{dp}\,\mathbb{E}[BS] = -2\theta(1 - p) + 2(1 - \theta)p = 2p - 2\theta = 0 \;\Rightarrow\; p = \theta.

Honesty is optimal: the second derivative is $2 > 0$, so $p = \theta$ is the unique minimum. You cannot game a Brier score by being bold; the math punishes the bluff. That single fact is why a market, which realizes a proper score as a cash flow, is an eval you can't sandbag.
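If you'd rather check the properness claim numerically than trust the calculus, here is a small sketch. `brierScore` is the definition above; `expectedBrier` and the grid search in `bestReport` are illustrative helpers of mine, and the grid resolution is an arbitrary choice.

```javascript
// Mean squared error between probabilistic forecasts and 0/1 outcomes.
function brierScore(forecasts, outcomes) {
  if (forecasts.length === 0 || forecasts.length !== outcomes.length) {
    throw new Error("forecasts and outcomes must be non-empty and equal length");
  }
  let sum = 0;
  for (let i = 0; i < forecasts.length; i++) {
    sum += (forecasts[i] - outcomes[i]) ** 2;
  }
  return sum / forecasts.length;
}

// Expected Brier for one event with true probability theta when you report p.
function expectedBrier(theta, p) {
  return theta * (1 - p) ** 2 + (1 - theta) * p ** 2;
}

// Grid-search the report that minimizes expected Brier (illustration only).
function bestReport(theta, steps = 100) {
  let best = 0;
  let bestScore = Infinity;
  for (let i = 0; i <= steps; i++) {
    const p = i / steps;
    const score = expectedBrier(theta, p);
    if (score < bestScore) {
      bestScore = score;
      best = p;
    }
  }
  return best;
}
```

Calling `bestReport(0.7)` lands on 0.7, and shading the report up to 0.9 raises the expected score from 0.21 to 0.25: the bluff costs you.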

If you'd rather punish confident wrongness harder, log loss is the other standard proper score:

LL = -\frac{1}{N} \sum_{i=1}^{N} \Big[ o_i \log p_i + (1 - o_i)\log(1 - p_i) \Big],

which goes to infinity for a probability-one miss. It's also proper, and it's what you'd minimize if a single overconfident blowup is what kills you. Both show up live in the demo below.
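A sketch of log loss in the same style. The `eps` clipping parameter is my addition, not part of the definition: it is a common practical guard so that one probability-one miss doesn't turn the whole average into infinity. Pass `eps = 0` to get the unclipped score.

```javascript
// Mean negative log-likelihood of 0/1 outcomes under the forecasts.
// eps clips forecasts into [eps, 1 - eps]; it is an assumed practical guard.
function logLoss(forecasts, outcomes, eps = 1e-12) {
  if (forecasts.length === 0 || forecasts.length !== outcomes.length) {
    throw new Error("forecasts and outcomes must be non-empty and equal length");
  }
  let sum = 0;
  for (let i = 0; i < forecasts.length; i++) {
    const p = Math.min(1 - eps, Math.max(eps, forecasts[i]));
    sum += outcomes[i] === 1 ? -Math.log(p) : -Math.log(1 - p);
  }
  return sum / forecasts.length;
}
```

A coin-flip forecast scores $\log 2$ per question, and the penalty for a miss grows without bound as the forecast approaches certainty, which is the behavior Brier caps at $1$.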

The Murphy decomposition

Brier hides two different skills inside one number. Murphy (1973) splits it apart. Bin the forecasts into $K$ bins; let bin $k$ hold $n_k$ forecasts with mean forecast $\bar{p}_k$ and observed outcome frequency $\bar{o}_k$, and let $\bar{o}$ be the overall base rate. Then, provided every forecast in a bin shares the same value (otherwise small within-bin terms are left over),

BS = \underbrace{\sum_{k} \frac{n_k}{N}\,(\bar{p}_k - \bar{o}_k)^2}_{\text{reliability}} \;-\; \underbrace{\sum_{k} \frac{n_k}{N}\,(\bar{o}_k - \bar{o})^2}_{\text{resolution}} \;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}.

Three terms, three different things:

Reliability is calibration error. When you say "70%," does it happen 70% of the time? You want this at $0$. It's the only term you control that can hurt you, and it's the one a market punishes.
Resolution is discrimination, how far your bin frequencies $\bar{o}_k$ spread away from the base rate. You want this large. This is where edge actually comes from: a forecaster who says "10%" and "90%" and is right is more useful than one who hedges to "50%."
Uncertainty is the base-rate variance $\bar{o}(1-\bar{o})$ , the irreducible difficulty of the question. You don't control it.
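Here is one way to compute the three terms, as a sketch. The equal-width binning and the function name `murphyDecomposition` are my choices. One caveat worth stating loudly: reliability − resolution + uncertainty equals the raw Brier score exactly only when all forecasts inside a bin share the same value; with continuous forecasts the binned identity is an approximation.

```javascript
// Murphy (1973) decomposition with K equal-width bins over [0, 1].
function murphyDecomposition(forecasts, outcomes, K = 10) {
  const N = forecasts.length;
  if (N === 0 || N !== outcomes.length) {
    throw new Error("forecasts and outcomes must be non-empty and equal length");
  }
  const bins = Array.from({ length: K }, () => ({ n: 0, sumP: 0, sumO: 0 }));
  let totalO = 0;
  for (let i = 0; i < N; i++) {
    // A forecast of exactly 1 goes in the top bin.
    const k = Math.min(K - 1, Math.floor(forecasts[i] * K));
    bins[k].n += 1;
    bins[k].sumP += forecasts[i];
    bins[k].sumO += outcomes[i];
    totalO += outcomes[i];
  }
  const baseRate = totalO / N;
  let reliability = 0;
  let resolution = 0;
  for (const b of bins) {
    if (b.n === 0) continue; // empty bins contribute nothing
    const meanP = b.sumP / b.n;
    const freqO = b.sumO / b.n;
    reliability += (b.n / N) * (meanP - freqO) ** 2;
    resolution += (b.n / N) * (freqO - baseRate) ** 2;
  }
  const uncertainty = baseRate * (1 - baseRate);
  return {
    reliability,
    resolution,
    uncertainty,
    brier: reliability - resolution + uncertainty,
  };
}
```

A forecaster who says "20%" five times and "80%" five times, and is right at exactly those rates, gets zero reliability, a resolution of 0.09, and a reconstructed Brier of 0.16: all of the improvement over the 0.25 base-rate score comes from resolution.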

The slogan I keep in my head: resolution is where the edge comes from, reliability is how you survive long enough to collect it. You can have a brilliant nose for which way a market is wrong (high resolution) and still go broke because you're systematically overconfident (high reliability term). The reliability diagram is where that becomes visible, so here's one you can break.

[Interactive demo: reliability diagram. Controls: overconfidence (default 1.00×) and bins (default 10). Readouts: Brier, log loss, reliability (lower is better), resolution (higher is better), uncertainty, and reliability − resolution + uncertainty.]

Reliability − resolution + uncertainty reconstructs Brier exactly (Murphy 1973). Push the knob and watch reliability (the calibration-error term) balloon while accuracy barely moves.

Drag overconfidence up. The dots are real outcomes; the blue curve is the agent's binned calibration. At $1.0\times$ the curve hugs the dashed diagonal and the reliability term sits near zero. Push past $1.5\times$ and the curve peels away (the agent says "90%" when the world is delivering 75%) and reliability balloons while the Brier number climbs. The agent's accuracy (which side of 50% it lands on) barely changes. That's the trap: miscalibration is nearly invisible in accuracy and obvious in Brier. The market sees the Brier version.

Calibration is edge

Now connect the score to the bankroll. We showed expected profit per YES share is $q - p$ under your belief $q$. Against the truth it is $\theta - p_{\text{mkt}}$, which nobody observes directly. If your estimate $p_{\text{you}}$ is better calibrated than the market's implied $p_{\text{mkt}}$, closer to the true $\theta$ on average, then your expected edge per share is

\mathbb{E}[\text{edge}] \approx p_{\text{you}} - p_{\text{mkt}},

positive exactly when you're on the right side of the price. Per trade it's small and noisy. But if trades are roughly independent, edge accumulates linearly in the number of trades $N$ while the standard deviation of your P&L grows like $\sqrt{N}$, so the signal-to-noise of a consistent calibration advantage scales as $\sqrt{N}$. A tiny, repeatable edge becomes a near-certainty over enough resolved markets. That's the bot thesis in one line: a better-calibrated brain across thousands of trades, not a bigger one per trade.
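To put numbers on the $\sqrt{N}$ argument, here is a back-of-envelope sketch. It assumes independent trades with the same per-trade expected profit `edge` and standard deviation `sd`; both function names and the example values are illustrative, not measurements.

```javascript
// Ratio of expected total profit (n * edge) to its standard deviation
// (sd * sqrt(n)), assuming n independent, identically distributed trades.
function signalToNoise(edge, sd, n) {
  return (n * edge) / (sd * Math.sqrt(n));
}

// Trades needed before the expected total sits z standard deviations above zero.
function tradesNeeded(edge, sd, z = 2) {
  return Math.ceil(((z * sd) / edge) ** 2);
}
```

With an assumed two-cent edge and a fifty-cent per-trade standard deviation, a single trade has a signal-to-noise ratio of 0.04, which is indistinguishable from luck. Ten thousand such trades push it to 4. Correlated trades, which are the norm when many markets hang on the same news, would grow it more slowly than this.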

How much to stake is its own question, and the honest answer is the Kelly criterion. For a bet at net odds $b$ (profit per unit staked on a win, i.e. decimal odds minus one) that you believe wins with probability $w$, the growth-optimal fraction of bankroll is

f^{*} = \frac{b\,w - (1 - w)}{b} = w - \frac{1 - w}{b}.

Kelly is brutal about miscalibration: it sizes on your $w$ , so an overconfident $w$ overbets, and overbetting a true edge is how calibrated-on-paper agents still blow up. In practice you run fractional Kelly (a quarter of $f^{*}$ is common) precisely because your $w$ is an estimate, not the truth. The sim below sizes at a quarter-Kelly so you can watch this directly.
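Applied to a binary share, the formula gets simpler. Buying YES at price $p$ risks $p$ to win $1 - p$, so the net odds are $b = (1 - p)/p$ and Kelly reduces to $f^{*} = (w - p)/(1 - p)$. The sketch below uses that mapping; `netOdds`, `kellyFraction`, and `yesStake` are names I made up, and flooring at zero (no bet without an edge) is my assumption.

```javascript
// Net odds for buying a YES share at `price`: risk `price` to win (1 - price).
function netOdds(price) {
  return (1 - price) / price;
}

// Growth-optimal bankroll fraction for win probability w at net odds b.
function kellyFraction(w, b) {
  return w - (1 - w) / b;
}

// Fractional-Kelly stake on YES, as a fraction of bankroll.
// Floored at zero: a negative Kelly fraction means the edge is on the other side.
function yesStake(w, price, fraction = 0.25) {
  const full = kellyFraction(w, netOdds(price));
  return Math.max(0, fraction * full);
}
```

At a price of 0.50, a belief of 0.60 gives full Kelly of 20% of bankroll and quarter-Kelly of 5%. An agent that inflates the same belief to 0.80 would stake 15% at quarter-Kelly, three times as much, on an edge that hasn't changed.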

[Interactive demo: live market sim, binary YES, calibrated agent. Control: overconfidence (scales evidence, default 1.00×). Readouts: p_you, p_mkt, P&L (units).]

Posterior updates in logit space: logit(p) += log-likelihood ratio. The knob scales every ratio, so a 2× agent over-reacts to each headline, and its P&L flips from a grind up to a slow bleed even though the market resolves the same way. Calibration is the edge; here it is in units, not Brier.

Press play. News ticks arrive, each carrying a log-likelihood ratio; the agent runs a Bayesian update in logit space ($\operatorname{logit}(p) \mathrel{+}= \text{LLR}$), which is just the additive form of

p_{\text{post}} = \frac{p_{\text{prior}}\,L}{p_{\text{prior}}\,L + (1 - p_{\text{prior}})\,(1 - L)},

where $L \in (0, 1)$ is the evidence expressed as a normalized likelihood, so that $\text{LLR} = \log\frac{L}{1 - L}$.

When its posterior diverges from the market price by more than the spread, it trades a quarter-Kelly stake. At $1.0\times$ the calibrated agent grinds the P&L upward. Now scale overconfidence to $2\times$: it over-reacts to every headline, overshoots the truth, overbets, and the same market that resolves the same way turns the curve into a slow bleed. Calibration, in dollars.
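The logit-space update and the overconfidence knob are short enough to write out. This is a sketch of the mechanism as described, not the demo's actual source: `update`, `runUpdates`, and the `overconfidence` parameter name are mine.

```javascript
const logit = (p) => Math.log(p / (1 - p));
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// One Bayesian update in logit space. `overconfidence` scales the evidence:
// 1 is a faithful update, 2 counts every headline twice.
function update(prior, llr, overconfidence = 1) {
  return sigmoid(logit(prior) + overconfidence * llr);
}

// Fold a sequence of log-likelihood ratios into a posterior.
function runUpdates(prior, llrs, overconfidence = 1) {
  return llrs.reduce((p, llr) => update(p, llr, overconfidence), prior);
}
```

Starting from 0.50, a headline that is three times likelier under YES (an LLR of $\log 3$) moves a faithful agent to 0.75. The same headline moves a 2× agent to 0.90, which is the "says 90% when the world is delivering 75%" gap from the reliability diagram.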

The architecture that wins

Here's the part people get wrong: they ask the LLM to do all three jobs and wonder why it's reckless. The winning shape is three separate layers, and the separation is the point.

// 1. LLM proposes a belief — thesis + retrieval. What's the base rate?
//    What did the news just change? Output is a probability, nothing more.
const pYou = await llm(`${question}\n${await retrieve(question)}`); // 0..1
 
// 2. A microstructure model decides the *trade* — not the LLM. Spread,
//    depth, and adverse selection live here, where they're measurable.
const size = microstructure({ pYou, pMkt, spread, depth });
 
// 3. Deterministic risk limits the model cannot talk its way past.
//    Kelly-capped, per-market exposure, daily loss stop. Hard clamp.
const order = riskLimits.clamp(size);

The LLM is good at exactly one thing here: turning messy language (news, base rates, expert priors) into a number $p_{\text{you}}$ . It is bad at sizing, because sizing is arithmetic about spread and bankroll that an autoregressive model will happily hallucinate, and it is dangerous at risk, because the entire failure mode of an overconfident model is that it will argue itself into a bigger bet. So you let the LLM propose and deterministic code dispose. The microstructure layer turns a belief into a stake given liquidity; the risk layer is a clamp the model has no token-level access to. I built hypeduel and mm-bot at B3 on exactly this split, and every time I've let the boundary blur, the blur is where the loss came from.
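To make the split concrete, here is a toy version of layers 2 and 3 with the LLM's output passed in as a plain number. Everything in it is an assumption for illustration: the half-spread fill model, the quarter-Kelly sizing, the limit names, and the 0.01 to 0.99 belief clamp. It is not the code behind hypeduel or mm-bot.

```javascript
// Layer 2 (hypothetical): turn a belief into a desired stake given liquidity.
function microstructure({ pYou, pMkt, spread, depth, bankroll }) {
  const side = pYou >= pMkt ? "YES" : "NO";
  const w = side === "YES" ? pYou : 1 - pYou; // win probability on that side
  // Assumed fill model: pay the mid price plus half the spread.
  const cost = (side === "YES" ? pMkt : 1 - pMkt) + spread / 2;
  const edge = w - cost;
  if (edge <= 0 || cost >= 1) return { side: "NONE", stake: 0 };
  const kelly = edge / (1 - cost); // Kelly fraction for a binary share
  // Quarter-Kelly, and never more than the visible depth.
  return { side, stake: Math.min(0.25 * kelly * bankroll, depth) };
}

// Layer 3: deterministic limits. Plain arithmetic, no model in the loop.
function makeRiskLimits({ maxPerMarket, dailyLossStop }) {
  return {
    clamp(order, dailyPnl) {
      if (dailyPnl <= -dailyLossStop) return { side: "NONE", stake: 0 };
      const stake = Math.max(0, Math.min(order.stake, maxPerMarket));
      return { side: order.side, stake };
    },
  };
}

// Layer 1 is injected: llmBelief is whatever probability the model proposed.
function decide(llmBelief, market, riskLimits, dailyPnl) {
  // Never pass a literal 0 or 1 from a model into sizing.
  const pYou = Math.min(0.99, Math.max(0.01, llmBelief));
  const size = microstructure({ pYou, ...market });
  return riskLimits.clamp(size, dailyPnl);
}
```

The point of the shape is that `decide` only ever sees a number from the model. A belief of 0.70 against a 0.50 mid asks for a stake near 970 on a 10,000 bankroll, and a 200 per-market cap cuts it to 200 no matter how persuasive the thesis was.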

Why this is the honest eval

Static benchmarks rot. The test set leaks into the next pretraining run, the questions get memorized, and a model can quietly sandbag (underperform on purpose) because nothing is at stake. A market has none of those affordances. You cannot memorize the answer to a question that hasn't resolved. There's no held-out set to leak because the future is the held-out set. And there's no sandbagging, because the only way to score well is to actually move money to the right side of a price, which means the forecaster has skin in the game, not just the grader. When I want to know whether a forecasting model is real, I don't want its MMLU number. I want its Brier score on markets that have since resolved, and its P&L net of fees.

The reflexivity problem

The honest caveat, and it's a real one: when bots are 30%+ of the flow, a growing share of them are calibrating against each other. The "crowd" you're trying to beat is increasingly other models, and a market dominated by correlated LLMs can drift away from the ground-truth base rate toward a shared model consensus, confidently, and in lockstep. Calibration against a price is only as honest as the price is anchored to reality. There's also the survivorship problem baked into every leaderboard: you see the wallets that won, not the nine that quietly drained and got deleted, so the "37%" is flattered by the ones who left. And backtested calibration is not live calibration. Fit your bins to history and you've just overfit the eval you swore couldn't be overfit.

So I'll hold both: a market is the most honest eval we have and a reflexive one, and the second fact is the open research problem, not a footnote. The most interesting agents from here won't be the ones with the best static benchmark. They'll be the ones whose calibration survives contact with a crowd that is increasingly made of other agents. That's a sovereign-compute question as much as a trading one: whose models, on whose hardware, anchored to what. Build a calibrated agent and run it against mine. The P&L will tell us who's right, and that's the only eval I trust.

Numbers compiled May 2026 from CoinDesk, PolyBench, and the Polymarket agents repo. The "14/20 wallets" and "37% vs 7-13%" figures are public-leaderboard and on-chain-analysis snapshots, directional, not peer-reviewed, and flagged inline where I lean on them. The Brier and Murphy derivations are standard (Brier 1950, Murphy 1973); the demos use synthetic, seeded data, not live market feeds.