The overthinking curve
The reasoning-model pitch is "let it think longer and it gets smarter." Then you watch a trace flip a correct answer to a wrong one at token 9,000 and "fix" it twice more on the way down. More thinking isn't free, and past a point it isn't even positive.
"Let it think longer" has quietly become the industry's whole story about reasoning models — expose a token budget, spend more of it, get a better answer. It's wrong past a point, and the point comes sooner than you'd guess. Accuracy-versus-thinking is an inverted U: it climbs, plateaus, then bends back down as the model second-guesses answers it already had right. The expensive part of test-time scaling isn't the thinking. It's not knowing when to stop.
Everything below is draggable. The curve, the tangent, the flip ratio — move the sliders and watch the optimal place to stop slide left.
"Let it think longer" is a half-truth
The mental model most people carry is a staircase: each extra thousand tokens of chain-of-thought buys a little more accuracy, monotonically, forever. If that were true you'd just turn the budget to max and walk away. It isn't true. Two 2026 papers measured the actual shape and it bends. When More Thinking Hurts (arXiv 2604.10739) traced accuracy against thinking budget across reasoning models and found a peak, then a decline. The Art of Scaling Test-Time Compute (arXiv 2512.02008) framed the consequence: once the curve has a peak, the real problem is budget allocation, not budget size. A third, independent result — Does Thinking More Always Help? (arXiv 2506.04210) — corroborates the decline. The staircase is a story we told ourselves because the first few thousand tokens really do behave like one.
The inverted U, measured
Let $A(t)$ be expected accuracy at a thinking budget of $t$ tokens. Define the marginal utility over a step of $\Delta$ tokens:
$$\mathrm{MU}(t) = A(t + \Delta) - A(t)$$
For an R1-32B-class model the marginal utility per 500-token step starts strongly positive in the low-budget regime and decays all the way to negative past the peak. The turning point is where $\mathrm{MU}(t) = 0$, i.e. $A'(t) \approx 0$ at a budget $t^*$: the top of the inverted U. Empirically the papers locate that peak separately for AIME and for GPQA Diamond.
For the demo I interpolate the curve with a local quadratic near the peak:
$$A(t) \approx A^* - k\,(t - t^*)^2, \qquad A^* = A(t^*),\; k > 0$$
so the marginal utility $A'(t) = -2k\,(t - t^*)$ is linear and crosses zero exactly at $t^*$. That parabola is an interpolation for the slider, not the paper's model. The empirical curve is the ground truth; the parabola just gives me a smooth thing to draw and differentiate. Your curve will differ. The shape won't.
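To make the turning point concrete, here is a minimal sketch that computes the finite-difference marginal utility from a measured budget/accuracy curve and reports the first budget where the next step stops adding accuracy. The function names and the sample parabola are hypothetical, chosen for illustration; the sample curve is not data from either paper.

```python
def marginal_utility(t, acc):
    """Finite-difference marginal utility between consecutive measured budgets.

    t, acc: equal-length lists of budgets (tokens, ascending) and accuracies.
    Returns a list of (t_i, acc[i + 1] - acc[i]) pairs.
    """
    return [(t[i], acc[i + 1] - acc[i]) for i in range(len(t) - 1)]


def turning_point(t, acc):
    """First budget at which the next step no longer adds accuracy, else None."""
    for ti, mu in marginal_utility(t, acc):
        if mu <= 0:
            return ti
    return None


# Hypothetical curve for illustration only: a parabola peaking at 12,000 tokens.
t = list(range(2000, 20001, 2000))
acc = [0.70 - 2e-9 * (ti - 12000) ** 2 for ti in t]
print(turning_point(t, acc))
```

On real measurements the differences are noisy, so it is probably worth smoothing the curve or requiring a few consecutive non-positive steps before calling the peak.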
The flip ratio: the model second-guessing itself
Here's the signal that turns "the curve bends" from a vibe into a number. Track the model's current answer $a_t$ as the trace grows. A negative flip is a step where $a_t$ goes correct → incorrect; a positive flip is incorrect → correct. Over a budget window, count the negatives $N^-$ and positives $N^+$:
$$\rho = \frac{N^-}{N^+}$$
When $\rho < 1$, thinking is net-helpful: more rescues than self-sabotage. When $\rho > 1$, the trace is destroying more correct answers than it creates. The crossover at $\rho = 1$ happens around 7K tokens (the paper reports the ratio just past that point, with a p-value). Drag the budget slider in the meter below and watch the bars cross:
The ratio crosses 1.0 around 7.2K tok — the flow turns adverse before accuracy peaks at 12K. Past it, the trace destroys more correct answers than it creates.
The subtle part: this crossover precedes the accuracy peak. Accuracy is a stock and the flip ratio is a flow — the flow turns adverse before the stock peaks, because for a while the positive flips you banked earlier still outweigh the negatives you're now accruing. The flip ratio is the early-warning light. It goes red before the accuracy you can see starts dropping.
Computing it is a few lines, not magic. Given a trace annotated with the running answer at each step:
```python
def flip_ratio(steps):
    """steps: list of (token, current_answer, is_correct) along one trace."""
    neg = pos = 0
    prev = None
    for _, _, correct in steps:
        if prev is not None:
            if prev and not correct:    # right -> wrong
                neg += 1
            elif not prev and correct:  # wrong -> right
                pos += 1
        prev = correct
    if pos == 0:
        # No rescues: infinite if there was any self-sabotage, 0.0 if no flips at all.
        return float("inf") if neg else 0.0
    return neg / pos
```
The honest caveat on the numbers: the crossover budget, the reported ratio, and its p-value are all model- and benchmark-specific (R1-32B / s1-32B on AIME / GPQA Diamond). Read them as evidence the crossover is real, not as a promise the exact token value is universal.
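The ratio above is per trace, while the crossover is a statement about a budget window across many traces. A minimal sketch of that pooling, assuming the same `(token, current_answer, is_correct)` annotation; the function name, the window size, and the toy traces are my own placeholders, not the papers' setup:

```python
from collections import defaultdict


def windowed_flip_ratio(traces, window=1000):
    """Pool flips from many traces into fixed-size token windows.

    traces: list of traces, each a list of (token, current_answer, is_correct).
    Returns {window_start: (neg, pos, ratio)}; ratio is None when pos == 0.
    """
    counts = defaultdict(lambda: [0, 0])  # window_start -> [neg, pos]
    for steps in traces:
        prev = None
        for token, _, correct in steps:
            if prev is not None and prev != correct:
                bucket = (token // window) * window
                # prev True means right -> wrong (negative); otherwise positive.
                counts[bucket][0 if prev else 1] += 1
            prev = correct
    result = {}
    for start in sorted(counts):
        neg, pos = counts[start]
        result[start] = (neg, pos, neg / pos if pos else None)
    return result


# Toy traces for illustration only.
toy = [
    [(500, "A", False), (1500, "B", True), (8500, "C", False)],
    [(500, "B", True), (8200, "C", False), (8900, "B", True)],
]
print(windowed_flip_ratio(toy))
```

Windows with only a handful of flips give a ratio that is mostly noise, so the window width is a tradeoff between resolution and sample size.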
Why more tokens can subtract accuracy
The mechanism isn't mysterious. A longer trace gives the model more chances to revisit an intermediate answer it already had right and "fix" it: to talk itself out of a correct answer and into a wrong one on the way to running out of budget. The entropy of the final answer rises deep into a trace as it churns through alternatives it should have committed away from. The sample-problem ticker in the first demo replays exactly this: early flips are mostly rescues, late flips are mostly self-sabotage, and somewhere past the crossover the model is reliably un-solving problems it had solved. That's not a bug you can prompt away. It's the cost of giving an uncertain process more room to wander.
A stopping rule with a price tag
If tokens were free you'd just stop at the peak and take the best accuracy. Tokens aren't free (that's the whole point of a budget), so price them in. Let $A(t)$ be accuracy at budget $t$, $t_{\max}$ the maximum budget, and $\lambda$ the price of compute in accuracy units per full budget. Define the cost-aware utility:
$$U(t) = A(t) - \lambda\,\frac{t}{t_{\max}}$$
Maximize it. Set the derivative to zero:
$$A'(t) = \frac{\lambda}{t_{\max}}$$
Read that out loud: stop where the marginal accuracy per token equals the price you've set on a token. At $\lambda = 0$ the condition is $A'(t) = 0$, which is the unconstrained peak: no price, no reason to stop early. For $\lambda > 0$ the optimal stop moves left of the peak: you deliberately surrender the last sliver of accuracy because those final tokens are a bad trade. With the quadratic $A(t) = A^* - k\,(t - t^*)^2$, the stop has a closed form, $t_\lambda = t^* - \lambda / (2 k\, t_{\max})$, and the slider in the first demo is literally dragging the tangent line of slope $\lambda / t_{\max}$ along the curve until it kisses the optimal stop.
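As a sanity check on the stopping rule, the sketch below assumes the demo-style parabola $A(t) = A^* - k\,(t - t^*)^2$ and compares the closed-form maximizer of $A(t) - \lambda\, t / t_{\max}$, which works out to $t^* - \lambda / (2 k\, t_{\max})$ under that assumption, against a brute-force grid search. All parameter values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def quadratic_stop(t_star, k, lam, t_max):
    """Closed-form stop for A(t) = a_star - k * (t - t_star)**2, clipped at 0."""
    return max(0.0, t_star - lam / (2 * k * t_max))


def grid_stop(t_star, k, a_star, lam, t_max, step=25):
    """Brute-force maximizer of A(t) - lam * t / t_max on a token grid."""
    best_t, best_u = 0, float("-inf")
    for t in range(0, t_max + 1, step):
        u = a_star - k * (t - t_star) ** 2 - lam * t / t_max
        if u > best_u:
            best_t, best_u = t, u
    return best_t


# Hypothetical curve parameters, for illustration only.
T_STAR, K, A_STAR, T_MAX = 12000, 2e-9, 0.70, 20000
for lam in (0.0, 0.1, 0.25):
    closed = quadratic_stop(T_STAR, K, lam, T_MAX)
    print(lam, round(closed), grid_stop(T_STAR, K, A_STAR, lam, T_MAX))
```

The two should agree to within one grid step; if they don't on your own fit, the curve probably isn't well approximated by a parabola over the range you swept.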
In a serving loop it's a sweep, not calculus:
```python
def stop_budget(t, acc, lam, t_max):
    """t, acc: measured budget/accuracy lists. lam: price per full budget."""
    best_t, best_u = t[0], float("-inf")
    for ti, ai in zip(t, acc):
        u = ai - lam * (ti / t_max)  # cost-aware utility at this budget
        if u > best_u:
            best_u, best_t = u, ti
    return best_t  # the token budget to cap this request at


def sweep(t, acc, t_max, lams=(0.0, 0.25, 0.5, 1.0)):
    """Print lambda -> (tokens, accuracy, % compute saved vs the peak)."""
    peak = t[acc.index(max(acc))]
    for lam in lams:
        tl = stop_budget(t, acc, lam, t_max)
        print(lam, tl, acc[t.index(tl)], f"{100 * (1 - tl / peak):.0f}% saved")
```
How much does it buy? That depends entirely on your curve and your $\lambda$, so watch the live card in the demo rather than quoting me a constant. On the default AIME-shaped curve, nudging $\lambda$ off zero moves the stop sharply left for a couple of accuracy points: the regime where you're trimming tokens that were buying almost nothing. $\lambda$ is a policy choice, not a discovered constant: it's the dial that says how many dollars you'll trade per accuracy point. Pick it once, on purpose, and the math tells you where to get off.
One budget for all problems is the real waste
Now the part that actually matters at fleet scale. The peak isn't fixed across problems: easy problems peak early, hard ones peak late. The reported spread between them is severalfold (some sets push wider). Flip the first demo to "compare easy/med/hard" and drag the single uniform budget line: there's no place to put it that's right for everyone. A single budget $B$ incurs regret on both tails:
$$\mathrm{Regret}(B) = \mathbb{E}_i\left[A_i(t_i^*) - A_i(B)\right]$$
where $A_i$ is problem $i$'s accuracy curve and $t_i^*$ its peak. You under-think the hard problems when $B < t_i^*$ and over-think the easy ones when $B > t_i^*$, where extra tokens push them past their peak into outright accuracy loss. Pick $B$ high and you cook the easy traffic; pick it low and you starve the hard. No single $B$ wins a mixed distribution. An adaptive budget, or a confidence- or flip-triggered early exit, dominates any fixed one. That's the entire argument for adaptive test-time compute, in one draggable line.
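The two-tailed regret is easy to see numerically. The sketch below assumes three hypothetical difficulty classes, each with its own demo-style parabola, and finds the best single budget by brute force. The class parameters are made up for illustration and are not the papers' measurements; per-class budgets set at each peak would have zero regret by construction.

```python
def acc_curve(t, a_star, t_star, k):
    """Demo-style parabola, clipped to [0, 1]. An illustration, not a fitted model."""
    return max(0.0, min(1.0, a_star - k * (t - t_star) ** 2))


# Hypothetical classes: (peak accuracy, peak budget in tokens, curvature).
CLASSES = {
    "easy": (0.95, 4000, 2e-9),
    "medium": (0.70, 12000, 2e-9),
    "hard": (0.40, 20000, 2e-9),
}


def uniform_regret(budget):
    """Mean accuracy lost by giving every class the same budget instead of its own peak."""
    losses = [a - acc_curve(budget, a, ts, k) for a, ts, k in CLASSES.values()]
    return sum(losses) / len(losses)


# Best single budget on a 500-token grid, and the regret it still leaves.
best = min(range(0, 24001, 500), key=uniform_regret)
print(best, round(uniform_regret(best), 4))
```

Even the best uniform budget lands between the classes and pays on both sides: the easy class has overshot its peak and the hard class hasn't reached its own.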
What to actually ship
Three moves, in order of how much they pay:
- A flip- or confidence-based early exit. Track the running answer; when the flip ratio over a window crosses 1, or answer entropy stops falling, stop. This is the cheapest win and it's a few lines around your existing decode loop.
- Per-difficulty budgets. Cheap a-priori difficulty proxy in, budget out. Stop spending hard-problem tokens on easy problems.
- Pick your $\lambda$ on purpose. The price you'll pay per accuracy point is a business decision. Set it, drop the tangent on your measured curve, stop there.
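The first move can be sketched around a stream of running answers. One caveat, loudly: at serving time you don't know which answers are correct, so the true flip ratio isn't observable. This sketch substitutes answer churn (how often the running answer changed over the last few checkpoints) as a proxy. That substitution, the function name, and the thresholds are my assumptions, not the papers' method.

```python
from collections import Counter, deque


def early_exit(checkpoints, window=3, max_changes=2):
    """Decide where to stop a trace from its running answers.

    checkpoints: iterable of (token, running_answer) pairs in trace order.
    Stops once the answer has changed `max_changes` times within the last
    `window` checkpoints, falling back to the answer held most often so far.
    Returns (token, answer); (None, None) for an empty stream.
    """
    recent = deque(maxlen=window)  # True where the answer changed at that checkpoint
    seen = Counter()
    prev = None
    token = None
    for token, answer in checkpoints:
        seen[answer] += 1
        if prev is not None:
            recent.append(answer != prev)
            if sum(recent) >= max_changes:
                # Churn detected: commit to the most frequently held answer.
                return token, seen.most_common(1)[0][0]
        prev = answer
    return token, prev  # stream ended without churn; keep the last answer


# Toy stream for illustration: settles on "42", then starts second-guessing.
stream = [(1000, "41"), (2000, "42"), (3000, "42"), (4000, "42"), (5000, "42"),
          (6000, "42"), (7000, "17"), (8000, "42"), (9000, "17")]
print(early_exit(stream))
```

Calibrate `window` and `max_changes` offline against labeled traces, where the real flip ratio can be computed, before trusting the proxy on live traffic.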
Honesty
Let me be precise about what's load-bearing and what's illustrative. The specific numbers (the peak budgets on AIME and GPQA, the flip-ratio crossover near 7K tokens and its p-value, the budget spread) are measured on specific models and benchmarks (R1-32B / s1-32B on AIME / GPQA Diamond). They are the measured shape, not universal constants; recalibrate on your own traffic. The quadratic is an interpolation for the demo, not the paper's fit, so don't quote the parabola as a finding. And $\lambda$ is a knob you choose, so any "X% compute saved for Y points" you read off the demo is a property of that curve and that $\lambda$, illustrating the tradeoff, not a guaranteed return. The defensible claim is the qualitative one, and it's strong: the curve has a peak, the flip ratio warns you before it, and there is a computable place to stop.
Where this points
The defensible move isn't "think more." It's "know your stopping point." I run decentralized inference at B3, where I'm paying per token across an open-weight fleet, so "is this marginal token earning its keep?" isn't academic; it's the unit economics. An open-weight fleet across sizes is already a natural cost/quality ladder, which makes it the right substrate for adaptive budgets: spend hard-problem tokens on hard problems, exit easy ones the moment the flip ratio turns, and let each request's $\lambda$ encode your cost stance instead of a flat budget that overthinks half your traffic.
So here's the builder's challenge. Wire a flip-ratio early exit into your serving loop (track the running answer, stop when the flip ratio crosses 1) and measure the tokens you save against the accuracy you keep. Then show me the number. The expensive part of test-time scaling was never the thinking. It's knowing when to stop, and now you can compute it.
Built from the primary papers, May 2026. The mechanism and the metrics come from When More Thinking Hurts and The Art of Scaling Test-Time Compute, with independent corroboration of the decline. The quadratic accuracy curve is my own interpolation for the sliders; every specific token count and p-value is the paper's, and model-specific — flagged inline so no one quotes the parabola or treats the ~7K crossover as a universal constant.