Skip to content
← writing

The determinism trap in verifiable inference

ai
crypto
verifiable-inference
floating-point
decentralized-inference

The first thing everyone proposes for verifying AI inference is to re-run the model and compare the outputs. The second thing they learn is that two honest GPUs don't agree. That's not a bug — that's IEEE-754.

This is the load-bearing post in a series on verifiable inference, because it explains why the naive thing fails — and the naive thing is what crypto-AI Twitter reaches for every single time. I run decentralized inference at B3 (b3iq): an open-weight fleet where nodes are paid to produce outputs we later have to check. The determinism trap is the first wall you hit the moment you try to check anything by re-running it. EigenAI named it the "determinism paradox" in early 2026, and the name is good, so I'll borrow it.

Just re-run it

Here is the scheme everyone writes on the whiteboard first. A node claims it ran model ff on input xx and returns y^\hat{y}. To verify, a checker runs f(x)f(x) itself and accepts if the answer matches:

# The naive verification protocol, in three lines.
y_recomputed = f(x)            # checker re-runs the claimed model
assert y_recomputed == y_claimed   # bit-exact equality
# accept ⇒ the node ran f(x) honestly

It feels airtight. Same model, same input, same answer — anything else is fraud. The trouble is the third line. == on floating-point tensors is a lie about what hardware does, and the whole post is unpacking why.

Floating-point addition is not associative

Start from the rounding model. Let fl()\mathrm{fl}(\cdot) be round-to-nearest in a floating-point format with unit roundoff uu — for fp32, u=2245.96×108u = 2^{-24} \approx 5.96\times10^{-8}; for fp16, u=2114.88×104u = 2^{-11} \approx 4.88\times10^{-4}. The standard model of floating-point arithmetic (Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic) says every individual operation is exact up to a relative perturbation bounded by uu:

fl(a+b)=(a+b)(1+δ),δu.\mathrm{fl}(a + b) = (a + b)(1 + \delta), \qquad |\delta| \le u.

Now sum three numbers two different ways. Group left:

fl(fl(a+b)+c)=((a+b)(1+δ1)+c)(1+δ2),\mathrm{fl}\big(\mathrm{fl}(a + b) + c\big) = \big((a+b)(1+\delta_1) + c\big)(1+\delta_2),

and group right:

fl(a+fl(b+c))=(a+(b+c)(1+δ3))(1+δ4).\mathrm{fl}\big(a + \mathrm{fl}(b + c)\big) = \big(a + (b+c)(1+\delta_3)\big)(1+\delta_4).

Subtract the two. The δ\delta's are independent — they come from rounding different intermediate values — so to first order the results differ by

Δ(a+b)δ1(b+c)δ3+outer rounding=O ⁣(umax(a+b,b+c)).\Delta \approx (a+b)\,\delta_1 - (b+c)\,\delta_3 + \text{outer rounding} = \mathcal{O}\!\big(u \cdot \max(|a+b|,\,|b+c|)\big).

Reordering a sum perturbs the result by up to roughly uu times the magnitude of the partial sums — not the final answer. That's the catch. If your partial sums are large even though the final answer is small, the error term is large relative to the answer.

Catastrophic cancellation pushes this to its limit. Take a=1016a = 10^{16}, b=1016b = -10^{16}, c=1c = 1:

(a+b)+c=0+1=1,a+(b+c)=1016+(1016+1)=0.(a + b) + c = 0 + 1 = 1, \qquad a + (b + c) = 10^{16} + (-10^{16} + 1) = 0.

The right grouping rounds b+c=1016+1b + c = -10^{16} + 1 straight back to 1016-10^{16} — there's no bit in the fp mantissa to hold the +1+1 at that exponent — so the cc term vanishes and you get 00. Same three constants, answers 11 and 00. Not the low bits. The whole answer.

import numpy as np
a, b, c = np.float32(1e16), np.float32(-1e16), np.float32(1.0)
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0

That's the entire post in three constants. Everything below is just "this also happens on ordinary vectors, and a GPU picks the grouping for you."

It happens on ordinary numbers too

The 101610^{16} example looks contrived. It isn't — it's just the loud version. Take a perfectly mundane ten-element vector with a couple of big canceling pairs and some fractional terms, and sum it left-to-right versus as a pairwise tree:

import numpy as np, struct
v = np.array([1000.0, 1.01655, -1000.0, 3.14159, 250.0,
              -250.0, 0.71726, 125.0, -125.0, 43.17452], dtype=np.float32)
 
def seq(xs):                       # left-to-right accumulate
    acc = np.float32(0)
    for x in xs: acc = np.float32(acc + x)
    return acc
 
def pairwise(xs):                  # tree reduction
    xs = list(xs)
    while len(xs) > 1:
        xs = [np.float32(xs[i] + xs[i+1]) for i in range(0, len(xs)-1, 2)] + \
             ([xs[-1]] if len(xs) % 2 else [])
    return xs[0]
 
s, p = seq(v), pairwise(v)
bits = lambda x: struct.pack('>f', x).hex()
print(f"{s:.6f}  {bits(s)}")       # 48.049900  4240331 9
print(f"{p:.6f}  {bits(p)}")       # 48.049911  4240331 c
print(np.allclose(s, p), s == p)   # True  False

To four decimals both read 48.0499. The raw bits end in 19 versus 1c. allclose says yes; == says no. Here's that running live — pick a reduction order for each side and watch the decimal stay still while the bits move:

sum the same 10-number fp32 vector two ways — compare the raw bits

Sequential, pairwise, and a random shuffle each land on a different fp32 bit pattern. There is no canonical order, which is the next problem.

Why two honest GPUs disagree

On a CPU running one thread, you at least get a deterministic order. A GPU doesn't sum left-to-right. A dot product iaibi\sum_i a_i b_i over nn terms is computed as a tree reduction across threads and warps, then an atomic or sequential combine across the partial sums. The shape of that tree is a function of the block size, which is a function of the matmul tiling, which is a function of m,n,km, n, k — and therefore of the batch size — and the cuBLAS heuristic that picks a kernel for this GPU and this library version. Change any one of those and the parenthesization changes, and by the section above, the low bits change with it.

So "same model, same input" does not give you "same bits." The reduction order is also a function of:

  • Batch size. Batch 1 versus a batched call selects a different kernel and a different tiling, so the sum is grouped differently. PyTorch's reproducibility docs say it outright: results need not be reproducible "between CPU and GPU executions, even when using identical seeds," and batching changes numerics.
  • GPU model and SM count. The tile that's optimal on an A100 isn't optimal on an H100; different occupancy, different tree.
  • cuBLAS / cuDNN / kernel version. A library bump can re-tune the heuristic and silently change which kernel — and which reduction order — runs.
  • Thread and SM scheduling. Atomic combines across blocks land in nondeterministic order unless you force otherwise.

None of these implies anyone cheated. They're four honest nodes with four legitimate kernels. Bit-exact re-execution flags all four as fraud.

The determinism paradox

Here's the trap stated crisply, the way EigenAI framed it: you cannot simultaneously demand bit-exactness from honest nodes and use bit-exactness as your fraud signal. The honest population is itself not bit-exact across hardware, so any rule sharp enough to catch a liar also convicts honest nodes running different silicon. Re-run-and-diff doesn't verify honesty. It detects hardware diversity and calls it fraud.

You get exactly three ways out, and the rest of the post is each one.

Escape 1 — make the math exact

The cleanest escape is to stop using floating-point. Non-associativity is a property of rounding; integer addition has no rounding, so it's exactly associative — (a+b)+c=a+(b+c)(a+b)+c = a+(b+c) for all integers, every order, every machine. Quantize every activation to fixed-point: pick QQ fractional bits, scale by 2Q2^{Q}, round once to an integer, and do all the accumulation in int64:

import numpy as np
Q = 16
to_fixed = lambda x: np.int64(round(float(x) * (1 << Q)))
v = [1000.0, 1.01655, -1000.0, 3.14159, 250.0, -250.0,
     0.71726, 125.0, -125.0, 43.17452]
fx = [to_fixed(x) for x in v]
print(sum(fx) == sum(reversed(fx)))   # True — order-independent, exactly

Toggle the same vector between fp32 and fixed-point. In float, shuffling the order changes the bits; in fixed-point both orders are byte-identical:

toggle float32 ↔ fixed-point — watch the bits go order-independent

This is not a niche trick — it's why zkML lives in a finite field. A zero-knowledge proof system can only constrain arithmetic over a prime field Fp\mathbb{F}_p, where every operation is exact integer arithmetic mod pp. The moment you commit to proving inference in ZK, you've already paid the quantization tax, and determinism comes free as a side effect. The cost is everything else: proving a single transformer forward pass is orders of magnitude slower than running it. I map that curve in the zkML cost curve.

Escape 2 — pin the kernels

If you want to stay in floating-point, you can try to nail down the one thing that varies: the order. PyTorch exposes the knobs.

import torch
torch.use_deterministic_algorithms(True)   # error out on nondeterministic kernels
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False      # stop auto-tuning the kernel choice
# + fixed batch size, pinned cuBLAS/cuDNN versions, one GPU SKU

This genuinely works — within its blast radius. Same GPU model, same library versions, same batch size, fixed reduction order: you get reproducible bits. That's enough for a single operator to reproduce their own runs, or for a homogeneous cluster you fully control.

Here's the caveat I won't soften: use_deterministic_algorithms(True) gives you determinism on the same hardware and the same library version. It does not give you cross-GPU bit-exactness. An A100 and an H100 with pinned everything will still disagree, because the deterministic kernel for each is a different deterministic kernel. For a permissionless fleet — which is the whole point of decentralized inference — you don't control the hardware, so this escape buys you reproducibility inside a node and nothing at all across nodes. It also costs throughput: deterministic kernels are slower, and forbidding the autotuner forfeits the fast paths.

Escape 3 — tolerate, and bleed soundness

So you give up on bit-exactness and add a tolerance. Accept if the recomputed output is close enough:

accept    y^f(x)ε.\text{accept} \iff \lVert \hat{y} - f(x) \rVert \le \varepsilon.

Pick ε\varepsilon big enough to forgive honest fp16 jitter and you stop falsely convicting honest nodes. This is the family that approaches like VeriLLM live in — sample some tokens, check the recomputed logits sit within a band. It's the most practical of the three. It's also where the adversary moves in.

Define the hideable gap as the largest model substitution that still passes:

Δ(ε)=sup{fg  :  g passes the challenge}.\Delta(\varepsilon) = \sup \big\{ \lVert f - g \rVert_\infty \;:\; g \text{ passes the challenge} \big\}.

A dishonest node serves a cheaper model gg instead of the honest ff and passes iff f(x)g(x)ε\lVert f(x) - g(x) \rVert \le \varepsilon on the challenged inputs. The problem is a quantitative coincidence: the ε\varepsilon you need to absorb honest jitter is order ulogitsu \cdot \lVert\text{logits}\rVert, which in logit space often lands around 10210^{-2} to 10110^{-1} — and on many tokens, that is larger than the logit shift from running g=int8(f)g = \mathrm{int8}(f) instead of ff. The cheaper, quantized model lives inside the band you built for honesty.

Drag ε\varepsilon and watch both failure modes at once. The green dots are honest reordered outputs; the orange diamonds are the cheaper model's. Too tight and the honest dots fall outside (false fraud). Too loose and the fraud model slips in:

ε-tolerance number line — drag ε to see honest dots and a fraud model

That's the soft underbelly. The tolerance you need for honesty is the tolerance the adversary needs for fraud, and they overlap. I want to be precise about how strong this claim is: it is a soundness concern, not a proof that every ε\varepsilon-scheme is broken. VeriLLM and its relatives shrink Δ(ε)\Delta(\varepsilon) on purpose — adversarial challenge sets, many sampled tokens, statistical aggregation across a sequence so a model swap that's invisible on one token is loud across a thousand. The gap narrows. It does not cleanly close. The formal sample-size version of "how many tokens must I challenge to push the cheating probability below β\beta" is its own post: are you getting the model you paid for.

A few things I'm being careful about

  • torch.use_deterministic_algorithms(True) is determinism on the same hardware and library version. It is not cross-GPU bit-exactness. Anyone selling it as the latter is wrong, and I've watched people sell it as the latter.
  • The "determinism paradox" name is EigenAI's framing from 2026 — credible and sticky, but it's their coinage, not a textbook term. Attribute it.
  • The hideable-gap argument is the soft underbelly of ε\varepsilon-schemes, not a fatality. Good challenge design measurably shrinks the gap. I'd rather state the concern sharply and let the next post quantify the defense than wave it away.

Where this leaves the field

Bit-exactness is a fantasy on floating-point silicon. So re-run-and-diff was never a verification layer — it was a hardware-homogeneity detector wearing a trench coat. You're left with exactly three real options, and they map onto the whole verifiable-inference frontier: make the math exact (ZK), trust the silicon (TEE), or make fraud expensive after the fact (opML). The next post puts all three on one chart — cost against trust — and shows why you're always paying for verification in one of those three currencies.

If you've ever shipped "just re-run and diff" as a verification layer: go check what your honest nodes actually return on two different GPUs. If you build something real on top of this — a challenge scheme that shrinks Δ(ε)\Delta(\varepsilon), a deterministic kernel set that survives hardware diversity — show me.