The determinism trap in verifiable inference
The first thing everyone proposes for verifying AI inference is to re-run the model and compare the outputs. The second thing they learn is that two honest GPUs don't agree. That's not a bug — that's IEEE-754.
This is the load-bearing post in a series on verifiable inference, because it explains why the naive thing fails — and the naive thing is what crypto-AI Twitter reaches for every single time. I run decentralized inference at B3 (b3iq): an open-weight fleet where nodes are paid to produce outputs we later have to check. The determinism trap is the first wall you hit the moment you try to check anything by re-running it. EigenAI named it the "determinism paradox" in early 2026, and the name is good, so I'll borrow it.
Just re-run it
Here is the scheme everyone writes on the whiteboard first. A node claims it ran model on input and returns . To verify, a checker runs itself and accepts if the answer matches:
# The naive verification protocol, in three lines.
y_recomputed = f(x) # checker re-runs the claimed model
assert y_recomputed == y_claimed # bit-exact equality
# accept ⇒ the node ran f(x) honestlyIt feels airtight. Same model, same input, same answer — anything else is fraud. The trouble is the third line. == on floating-point tensors is a lie about what hardware does, and the whole post is unpacking why.
Floating-point addition is not associative
Start from the rounding model. Let be round-to-nearest in a floating-point format with unit roundoff — for fp32, ; for fp16, . The standard model of floating-point arithmetic (Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic) says every individual operation is exact up to a relative perturbation bounded by :
Now sum three numbers two different ways. Group left:
and group right:
Subtract the two. The 's are independent — they come from rounding different intermediate values — so to first order the results differ by
Reordering a sum perturbs the result by up to roughly times the magnitude of the partial sums — not the final answer. That's the catch. If your partial sums are large even though the final answer is small, the error term is large relative to the answer.
Catastrophic cancellation pushes this to its limit. Take , , :
The right grouping rounds straight back to — there's no bit in the fp mantissa to hold the at that exponent — so the term vanishes and you get . Same three constants, answers and . Not the low bits. The whole answer.
import numpy as np
a, b, c = np.float32(1e16), np.float32(-1e16), np.float32(1.0)
print((a + b) + c) # 1.0
print(a + (b + c)) # 0.0That's the entire post in three constants. Everything below is just "this also happens on ordinary vectors, and a GPU picks the grouping for you."
It happens on ordinary numbers too
The example looks contrived. It isn't — it's just the loud version. Take a perfectly mundane ten-element vector with a couple of big canceling pairs and some fractional terms, and sum it left-to-right versus as a pairwise tree:
import numpy as np, struct
v = np.array([1000.0, 1.01655, -1000.0, 3.14159, 250.0,
-250.0, 0.71726, 125.0, -125.0, 43.17452], dtype=np.float32)
def seq(xs): # left-to-right accumulate
acc = np.float32(0)
for x in xs: acc = np.float32(acc + x)
return acc
def pairwise(xs): # tree reduction
xs = list(xs)
while len(xs) > 1:
xs = [np.float32(xs[i] + xs[i+1]) for i in range(0, len(xs)-1, 2)] + \
([xs[-1]] if len(xs) % 2 else [])
return xs[0]
s, p = seq(v), pairwise(v)
bits = lambda x: struct.pack('>f', x).hex()
print(f"{s:.6f} {bits(s)}") # 48.049900 4240331 9
print(f"{p:.6f} {bits(p)}") # 48.049911 4240331 c
print(np.allclose(s, p), s == p) # True FalseTo four decimals both read 48.0499. The raw bits end in 19 versus 1c. allclose says yes; == says no. Here's that running live — pick a reduction order for each side and watch the decimal stay still while the bits move:
Sequential, pairwise, and a random shuffle each land on a different fp32 bit pattern. There is no canonical order, which is the next problem.
Why two honest GPUs disagree
On a CPU running one thread, you at least get a deterministic order. A GPU doesn't sum left-to-right. A dot product over terms is computed as a tree reduction across threads and warps, then an atomic or sequential combine across the partial sums. The shape of that tree is a function of the block size, which is a function of the matmul tiling, which is a function of — and therefore of the batch size — and the cuBLAS heuristic that picks a kernel for this GPU and this library version. Change any one of those and the parenthesization changes, and by the section above, the low bits change with it.
So "same model, same input" does not give you "same bits." The reduction order is also a function of:
- Batch size. Batch 1 versus a batched call selects a different kernel and a different tiling, so the sum is grouped differently. PyTorch's reproducibility docs say it outright: results need not be reproducible "between CPU and GPU executions, even when using identical seeds," and batching changes numerics.
- GPU model and SM count. The tile that's optimal on an A100 isn't optimal on an H100; different occupancy, different tree.
- cuBLAS / cuDNN / kernel version. A library bump can re-tune the heuristic and silently change which kernel — and which reduction order — runs.
- Thread and SM scheduling. Atomic combines across blocks land in nondeterministic order unless you force otherwise.
None of these implies anyone cheated. They're four honest nodes with four legitimate kernels. Bit-exact re-execution flags all four as fraud.
The determinism paradox
Here's the trap stated crisply, the way EigenAI framed it: you cannot simultaneously demand bit-exactness from honest nodes and use bit-exactness as your fraud signal. The honest population is itself not bit-exact across hardware, so any rule sharp enough to catch a liar also convicts honest nodes running different silicon. Re-run-and-diff doesn't verify honesty. It detects hardware diversity and calls it fraud.
You get exactly three ways out, and the rest of the post is each one.
Escape 1 — make the math exact
The cleanest escape is to stop using floating-point. Non-associativity is a property of rounding; integer addition has no rounding, so it's exactly associative — for all integers, every order, every machine. Quantize every activation to fixed-point: pick fractional bits, scale by , round once to an integer, and do all the accumulation in int64:
import numpy as np
Q = 16
to_fixed = lambda x: np.int64(round(float(x) * (1 << Q)))
v = [1000.0, 1.01655, -1000.0, 3.14159, 250.0, -250.0,
0.71726, 125.0, -125.0, 43.17452]
fx = [to_fixed(x) for x in v]
print(sum(fx) == sum(reversed(fx))) # True — order-independent, exactlyToggle the same vector between fp32 and fixed-point. In float, shuffling the order changes the bits; in fixed-point both orders are byte-identical:
This is not a niche trick — it's why zkML lives in a finite field. A zero-knowledge proof system can only constrain arithmetic over a prime field , where every operation is exact integer arithmetic mod . The moment you commit to proving inference in ZK, you've already paid the quantization tax, and determinism comes free as a side effect. The cost is everything else: proving a single transformer forward pass is orders of magnitude slower than running it. I map that curve in the zkML cost curve.
Escape 2 — pin the kernels
If you want to stay in floating-point, you can try to nail down the one thing that varies: the order. PyTorch exposes the knobs.
import torch
torch.use_deterministic_algorithms(True) # error out on nondeterministic kernels
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False # stop auto-tuning the kernel choice
# + fixed batch size, pinned cuBLAS/cuDNN versions, one GPU SKUThis genuinely works — within its blast radius. Same GPU model, same library versions, same batch size, fixed reduction order: you get reproducible bits. That's enough for a single operator to reproduce their own runs, or for a homogeneous cluster you fully control.
Here's the caveat I won't soften: use_deterministic_algorithms(True) gives you determinism on the same hardware and the same library version. It does not give you cross-GPU bit-exactness. An A100 and an H100 with pinned everything will still disagree, because the deterministic kernel for each is a different deterministic kernel. For a permissionless fleet — which is the whole point of decentralized inference — you don't control the hardware, so this escape buys you reproducibility inside a node and nothing at all across nodes. It also costs throughput: deterministic kernels are slower, and forbidding the autotuner forfeits the fast paths.
Escape 3 — tolerate, and bleed soundness
So you give up on bit-exactness and add a tolerance. Accept if the recomputed output is close enough:
Pick big enough to forgive honest fp16 jitter and you stop falsely convicting honest nodes. This is the family that approaches like VeriLLM live in — sample some tokens, check the recomputed logits sit within a band. It's the most practical of the three. It's also where the adversary moves in.
Define the hideable gap as the largest model substitution that still passes:
A dishonest node serves a cheaper model instead of the honest and passes iff on the challenged inputs. The problem is a quantitative coincidence: the you need to absorb honest jitter is order , which in logit space often lands around to — and on many tokens, that is larger than the logit shift from running instead of . The cheaper, quantized model lives inside the band you built for honesty.
Drag and watch both failure modes at once. The green dots are honest reordered outputs; the orange diamonds are the cheaper model's. Too tight and the honest dots fall outside (false fraud). Too loose and the fraud model slips in:
That's the soft underbelly. The tolerance you need for honesty is the tolerance the adversary needs for fraud, and they overlap. I want to be precise about how strong this claim is: it is a soundness concern, not a proof that every -scheme is broken. VeriLLM and its relatives shrink on purpose — adversarial challenge sets, many sampled tokens, statistical aggregation across a sequence so a model swap that's invisible on one token is loud across a thousand. The gap narrows. It does not cleanly close. The formal sample-size version of "how many tokens must I challenge to push the cheating probability below " is its own post: are you getting the model you paid for.
A few things I'm being careful about
torch.use_deterministic_algorithms(True)is determinism on the same hardware and library version. It is not cross-GPU bit-exactness. Anyone selling it as the latter is wrong, and I've watched people sell it as the latter.- The "determinism paradox" name is EigenAI's framing from 2026 — credible and sticky, but it's their coinage, not a textbook term. Attribute it.
- The hideable-gap argument is the soft underbelly of -schemes, not a fatality. Good challenge design measurably shrinks the gap. I'd rather state the concern sharply and let the next post quantify the defense than wave it away.
Where this leaves the field
Bit-exactness is a fantasy on floating-point silicon. So re-run-and-diff was never a verification layer — it was a hardware-homogeneity detector wearing a trench coat. You're left with exactly three real options, and they map onto the whole verifiable-inference frontier: make the math exact (ZK), trust the silicon (TEE), or make fraud expensive after the fact (opML). The next post puts all three on one chart — cost against trust — and shows why you're always paying for verification in one of those three currencies.
If you've ever shipped "just re-run and diff" as a verification layer: go check what your honest nodes actually return on two different GPUs. If you build something real on top of this — a challenge scheme that shrinks , a deterministic kernel set that survives hardware diversity — show me.