Onchain, vibe coding breaks in week two

I love vibe coding. I do it most weekends. You describe the thing, the agent builds it, you poke at it, it mostly works, you ship. For a prototype that's the correct tool, and I'll defend that to anyone. Nothing's at stake. A wrong button is a shrug and a re-prompt.

Then you point the same workflow at something onchain, and around week two it stops being charming.

A wrong output is a wrong transfer

Here's the difference nobody warns you about. In a normal app, when the agent gets something wrong, you get a bug. A weird layout, a 500, a typo in a label. You fix it and move on. Onchain, a wrong output is a transfer. The mistake doesn't sit in a log waiting for you. It settles. There's no undo button on a chain.

So "the demo worked" means almost nothing. Of course it worked: you watched it once, on a happy path, with test funds. That tells you the code can succeed. It tells you nothing about the third time, when the RPC times out halfway through, the agent retries, and now you've sent the same payment twice.
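To make the double-payment failure concrete, here is a minimal sketch of one common guard: an idempotency key derived from the payment intent. Everything named here (FlakyLedger, idempotency_key, pay_with_retry, the invoice id) is hypothetical and for illustration only. The ledger is an in-memory stand-in, not a real chain client; on a real chain a similar role is often played by the transaction nonce or a contract-side check.

```python
import hashlib


class FlakyLedger:
    """In-memory stand-in for a payment backend (hypothetical, not a real chain)."""

    def __init__(self, timeouts=1):
        self.settled = {}              # idempotency key -> (recipient, amount)
        self.timeouts_left = timeouts  # how many responses get "lost"

    def transfer(self, key, to, amount):
        # Settle only if this key has never been seen before.
        if key not in self.settled:
            self.settled[key] = (to, amount)
        # Simulate the nasty case: the transfer settled, but the reply timed out.
        if self.timeouts_left > 0:
            self.timeouts_left -= 1
            raise TimeoutError("response lost after settlement")
        return self.settled[key]

    def total_sent(self):
        return sum(amount for _, amount in self.settled.values())


def idempotency_key(intent_id, to, amount):
    """Same intent, same key: a retry can never look like a new payment."""
    raw = f"{intent_id}|{to}|{amount}".encode()
    return hashlib.sha256(raw).hexdigest()


def pay_with_retry(ledger, intent_id, to, amount, attempts=3):
    key = idempotency_key(intent_id, to, amount)
    for _ in range(attempts):
        try:
            return ledger.transfer(key, to, amount)
        except TimeoutError:
            continue  # safe to retry: the key is unchanged
    raise TimeoutError("gave up after retries")


ledger = FlakyLedger(timeouts=1)
receipt = pay_with_retry(ledger, "invoice-42", "0xabc", 100)
print(receipt, ledger.total_sent())
```

The first call settles and then times out; the retry reuses the same key, so the ledger hands back the original record instead of paying again. Without the key, that retry is the second payment.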

That's the gap. Vibe coding optimizes for "it ran once." Money optimizes for "it runs the same way every time, and I can prove it did."

The boring stuff is the actual product

When people ask what's hard about onchain agents, they expect me to say something about the model. It's not the model. The model's the easy part; it's basically a commodity now.

The hard part is everything around it that nobody demos because it's boring. Can the same input produce the same output every time, or does a retry quietly do something different? When a step fails halfway, does the system know where it was, or does it start over and send the same payment twice? When someone asks "what did your agent do at 2am," can you answer with a record, or with a shrug?
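The "does the system know where it was" question can be sketched as a step journal: record each completed step on disk, and on restart skip whatever is already recorded. This is a minimal illustration under stated assumptions. The function run_plan, the step names, and the JSON journal file are all hypothetical, and a crash between running a step and writing the journal is still ambiguous, which is why each step would also need to be safe to repeat.

```python
import json
import os
import tempfile


def run_plan(steps, journal_path):
    """Run (name, action) steps in order, skipping any already journaled."""
    done = []
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            done = json.load(f)
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run; do not repeat it
        action()      # may raise; the journal then stops at the last good step
        done.append(name)
        with open(journal_path, "w") as f:
            json.dump(done, f)
    return done


# Demo: the "swap" step fails once, as if an RPC call timed out.
calls = []
state = {"fail_once": True}


def approve():
    calls.append("approve")


def swap():
    if state["fail_once"]:
        state["fail_once"] = False
        raise ConnectionError("rpc timeout")
    calls.append("swap")


def send():
    calls.append("send")


steps = [("approve", approve), ("swap", swap), ("send", send)]
journal = os.path.join(tempfile.mkdtemp(), "journal.json")

try:
    run_plan(steps, journal)      # first attempt dies at "swap"
except ConnectionError:
    pass
finished = run_plan(steps, journal)  # second attempt resumes at "swap"
print(calls)
```

On the second run, "approve" is skipped because the journal already lists it, so the agent resumes at the failed step instead of starting over.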

None of that shows up in a demo. All of it shows up in week two.

Reproducibility is the thing you're really shipping

If I had to compress it to one line: if you can't reproduce what your agent did, you can't trust it with a wallet. Full stop.

Reproducibility sounds like a testing nicety. Onchain it's the whole product. Trust is the asset you're selling, and trust is just reproducibility wearing a nicer outfit. The moment a user has to take it on faith that your agent moved their funds correctly, you've lost the part that mattered. They don't want a vibe. They want a receipt.

This is why I keep saying the work is logging good enough to prove what ran. Not so you can debug later, though that's nice. So that the execution is a fact you can point at, replay, and verify, instead of a story you tell.
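One way to make a log something you can "point at, replay, and verify" is to chain the entries by hash, so that editing any past entry breaks every hash after it. The sketch below is only an illustration of that idea, not a description of any particular product's logging. The names append_entry and verify and the entry fields are assumptions made for the example.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, command, result):
    """Append a command and its result, linked to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "command": command, "result": result, "prev": prev}
    log.append({**body, "hash": _digest(body)})


def verify(log):
    """Recompute every hash and link; False if anything was altered."""
    prev = GENESIS
    for i, entry in enumerate(log):
        body = {k: entry[k] for k in ("seq", "command", "result", "prev")}
        if entry["seq"] != i or entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


log = []
append_entry(log, {"op": "transfer", "to": "0xabc", "amount": 100}, {"status": "settled"})
append_entry(log, {"op": "transfer", "to": "0xdef", "amount": 25}, {"status": "settled"})

tampered = copy.deepcopy(log)
tampered[0]["command"]["amount"] = 1  # rewrite history after the fact

print(verify(log), verify(tampered))
```

The untouched log verifies; the copy with one edited amount does not. That is the difference between a record and a story: the record fails loudly when someone changes it.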

"Scale" onchain isn't a million users

In normal software, "does it scale" means a million concurrent users. Fine. That matters eventually.

Onchain, the scary scale isn't user count. It's time with real money in the loop. The thing that breaks you isn't ten thousand people. It's the same agent, running for two weeks, against real balances, hitting every flaky edge of every RPC and every chain reorg and every rate limit, and having to do the right thing or the safe thing each time. Week two is the load test. Week two is where vibe coding files for divorce.

[Interactive figure: same task, two agents (vibe-coded vs. hardened). The vibe-coded lane ("ran once, looked fine") tracks incidents and money lost. The hardened lane ("retries instead of settling") tracks incidents and retries. Controls: runs against real money (week two ≈ 1,000) and per-run failure probability (p = 0.50%).]

At 0.50% per run, the demo (one run) is almost always clean. But over 1,000 runs, the odds of at least one settled incident are 99.3% (1 − (1 − p)ⁿ). The gap only shows up at volume. The hardened lane isn't magic: it converts most failures into retries and leaves a small residual.
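The 1 − (1 − p)ⁿ arithmetic is easy to check directly. The sketch below assumes runs fail independently with the same probability, which is the assumption the formula itself makes. The 99% catch rate used for the hardened lane is a made-up number for illustration, not a figure from this article.

```python
def p_at_least_one(p, n):
    """Chance of at least one incident in n independent runs, each failing with probability p."""
    return 1 - (1 - p) ** n


p = 0.005  # 0.50% per-run failure probability

demo = p_at_least_one(p, 1)         # a single run, i.e. the demo
week_two = p_at_least_one(p, 1000)  # roughly a thousand runs

# Hypothetical hardened lane: assume 99% of failures become retries,
# so only 1% of them still settle as incidents.
caught = 0.99
hardened = p_at_least_one(p * (1 - caught), 1000)

print(f"demo:     {demo:.2%}")
print(f"week two: {week_two:.1%}")
print(f"hardened: {hardened:.1%}")
```

One run almost never shows the problem; a thousand runs almost always do. Nothing about the code changed between those two numbers, only the volume.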

That's the bet I'm making with B3OS, the onchain engine I build at B3: deterministic execution so the same input genuinely gives the same output, self-healing retries that reroute instead of blindly re-sending, and command logging detailed enough that you can stand behind every action. Not because it's clever. Because that's the floor for letting an agent touch money, and vibe coding doesn't get you to that floor on its own. It was never trying to.

The honest caveat

Don't read this as "vibe coding bad." It's not. For the prototype, the throwaway, the thing that moves $0, please vibe-code it. Determinism and full command logging have real cost, in build time and in your patience, and a script that moves zero dollars does not need any of it. Reaching for that machinery too early is its own mistake.

The line is the production-onchain boundary. On one side, move fast and break things; nothing that matters breaks. On the other side, the thing that breaks is someone's balance, and "I shipped fast" is not a defense you get to make.

If you're building agents that actually touch funds, I wrote two companions to this on what changes once money's real: propose and dispose on keeping a human in the settlement loop, and what onchain agents need on the execution layer underneath them. And if you care about why "same input, same output" is harder than it sounds once a model is involved, the determinism trap in verifiable inference is the deeper cut.

Vibe-code the demo. Then, before it touches a wallet, ask the only question that matters: if this runs a thousand times, can I prove what it did every single one of them? If the answer is no, you don't have a product yet. You have a week one.