The honest guide to LLM model routing

ai
llm
routing
infrastructure

Not every request you send to an LLM needs your most expensive model. "Auto routing" is the idea that a tiny, fast classifier looks at each prompt and picks the right model (small and cheap for the easy stuff, frontier-grade for the hard stuff) so you keep top-tier quality while cutting cost. The big gateways are shipping it: OpenRouter's Auto Router, Factory's Router, NotDiamond, Martian, and the open-source RouteLLM.

The pitch is seductive: frontier performance, ~25% cheaper, automatically. The reality is more interesting. The academic papers claim up to 85–98% savings, but the most rigorous independent re-evaluation found several commercial routers performing worse than just always using the single best model. The truth is in between, and the engineering details decide which side you land on. This is the honest version.

[Diagram: two example prompts ("reverse a string" and "prove this theorem…") enter the router, a ~20 ms classifier that picks a tier. The router sends each prompt to either the small model (cheap · fast · easy work) or the frontier model ($$$ · reserved for hard work).]

A tiny classifier reads each prompt before generation and sends it to the cheapest model that can still get it right.

Why route at all: the price spread

The whole opportunity exists because models span an enormous cost/quality range. RouterBench found monetary cost varies 2–5× for comparable quality across models. FrugalGPT measured up to a 150× price gap between the cheapest and most expensive APIs (GPT-4 input around $30 vs GPT-J at ~$0.20 per 10M tokens).

So if you can predict, before generation, which model a prompt actually needs, you route the cheap ones to cheap models and reserve the expensive models for prompts that earn them. A router is the component that makes that prediction. The whole game is doing it accurately and fast enough that the decision itself doesn't eat the savings.
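To make the spread concrete, here is a back-of-the-envelope sketch using the two FrugalGPT prices quoted above. It assumes every prompt uses the same number of tokens and that the router itself is free, neither of which holds in production; the function names are mine, not from any paper.

```python
# Prices in USD per 10M input tokens, as quoted from FrugalGPT above.
PRICE_STRONG = 30.00  # GPT-4 input
PRICE_WEAK = 0.20     # GPT-J


def blended_price(frontier_share: float,
                  strong: float = PRICE_STRONG,
                  weak: float = PRICE_WEAK) -> float:
    """Average price when `frontier_share` of tokens go to the strong model."""
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    return frontier_share * strong + (1.0 - frontier_share) * weak


def savings_vs_all_strong(frontier_share: float) -> float:
    """Fraction of spend saved relative to sending everything to the strong model."""
    return 1.0 - blended_price(frontier_share) / PRICE_STRONG


if __name__ == "__main__":
    print(f"price gap: {PRICE_STRONG / PRICE_WEAK:.0f}x")
    for share in (1.0, 0.5, 0.25, 0.0):
        print(f"{share:.0%} to strong -> {savings_vs_all_strong(share):.1%} saved")
```

With a gap this wide, savings track the share of traffic you keep off the strong model almost one-for-one, which is why the accuracy of the prediction matters more than the exact price of the cheap model.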

The dazzling claims

Here's where it gets loud. The published cost savings:

  • FrugalGPT (cascade), financial-headlines task: 98% saved
  • RouteLLM, MT-Bench: 85% saved
  • Avengers-Pro, LLMRouterBench: 32% saved
  • Factory Router, Terminal-Bench 2: 22% saved
  • OpenRouter, LLMRouterBench: +25% cost

Cost savings claimed by routers. The headline 85–98% numbers are real but dataset-specific; on mixed production traffic the honest figure is ~20–25% — and one commercial router did worse than no routing at all.

  • RouteLLM (LMSYS, 2024): best routers hit 95% of GPT-4's quality while cutting cost >85% on MT-Bench (~45% on MMLU, ~35% on GSM8K), routing only 14–26% of queries to GPT-4 depending on the training data. Strong model GPT-4, weak model Mixtral-8x7B.
  • FrugalGPT (2023): matched GPT-4 accuracy at up to 98% cost reduction on a financial-headlines task ($33.10 → $0.60), using a cheap→expensive cascade. Also ~73% on OVERRULING, ~59% on CoQA.
  • Factory Router (2026): 20–25% lower cost per session — "99% of Claude Opus 4.7's pass rate at 20% lower cost" on Terminal-Bench 2, "96% at 25% lower cost" on Legacy-Bench.
  • Vendor marketing, treat accordingly: NotDiamond claims up to 25% better accuracy and up to 10× lower cost; Martian claims up to 98% savings.

Notice the gap between Factory's 20–25% and everyone else's 85–98%. That gap is the whole story.

The plot twist

In 2026, LLMRouterBench re-evaluated routing under one framework — 21 datasets, 33 models, ~400K instances, ~1.8B tokens — and the result is sobering. Several recent routers, including the commercial OpenRouter, did not outperform a simple baseline: always picking the single best fixed model. Specifically:

  • OpenRouter's router scored −24.7% relative to Best-Single — worse than no routing at all.
  • The best router tested (Avengers-Pro) achieved only up to ~32% cost reduction while matching, not beating, the best single model.
  • Binary routers like HybridLLM and FrugalGPT struggled to cut cost without losing quality.
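The "best single model" baseline is easy to compute yourself, and it is the comparison worth running before trusting any router. Below is a minimal sketch, assuming you already have per-prompt, per-model outcomes from an offline eval. The data is a toy table I made up for illustration, not LLMRouterBench results, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    correct: bool
    cost: float  # USD for this model on this prompt


# results[prompt_id][model_name] -> Outcome
Results = dict[str, dict[str, Outcome]]


def score_fixed(results: Results, model: str) -> tuple[float, float]:
    """Accuracy and total cost of always calling one model."""
    rows = [per_model[model] for per_model in results.values()]
    return sum(r.correct for r in rows) / len(rows), sum(r.cost for r in rows)


def best_single(results: Results) -> str:
    """The fixed model with the highest accuracy (ties go to the cheaper one)."""
    models = list(next(iter(results.values())))

    def rank(model: str) -> tuple[float, float]:
        accuracy, cost = score_fixed(results, model)
        return (accuracy, -cost)

    return max(models, key=rank)


def score_router(results: Results, choices: dict[str, str]) -> tuple[float, float]:
    """Accuracy and total cost when `choices` maps each prompt to a model."""
    rows = [results[p][choices[p]] for p in results]
    return sum(r.correct for r in rows) / len(rows), sum(r.cost for r in rows)


# Toy data: invented for illustration only.
TOY_RESULTS: Results = {
    "p1": {"small": Outcome(True, 0.01), "big": Outcome(True, 1.0)},
    "p2": {"small": Outcome(False, 0.01), "big": Outcome(True, 1.0)},
    "p3": {"small": Outcome(True, 0.01), "big": Outcome(True, 1.0)},
    "p4": {"small": Outcome(False, 0.01), "big": Outcome(False, 1.0)},
}
TOY_CHOICES = {"p1": "small", "p2": "big", "p3": "small", "p4": "small"}

if __name__ == "__main__":
    baseline = best_single(TOY_RESULTS)
    print("best single:", baseline, score_fixed(TOY_RESULTS, baseline))
    print("router:     ", score_router(TOY_RESULTS, TOY_CHOICES))
```

In this toy table the router matches the baseline's accuracy at lower cost. Flip the choice for p2 to the small model and it drops below the baseline, which is the shape of the failure described above.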

Why the collapse? The spectacular 85–98% numbers are real but dataset- and model-pair-specific, and they assume a favorable mix of easy queries. The practitioner reality: those headline numbers fall apart when traffic skews toward hard tasks, and below ~1,000 calls/month the absolute savings rarely justify running a router at all.
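The low-volume point is plain arithmetic. Here is a sketch of the break-even calculation; the per-call cost, savings rate, and router overhead in the example are placeholder assumptions chosen only to show the calculation, so plug in your own.

```python
def monthly_net_savings(calls_per_month: int,
                        avg_cost_per_call: float,
                        savings_rate: float,
                        router_overhead: float) -> float:
    """Net USD saved per month after paying for the router itself."""
    gross = calls_per_month * avg_cost_per_call * savings_rate
    return gross - router_overhead


def break_even_calls(avg_cost_per_call: float,
                     savings_rate: float,
                     router_overhead: float) -> float:
    """Monthly call volume at which the router pays for itself."""
    if avg_cost_per_call <= 0 or savings_rate <= 0:
        raise ValueError("cost per call and savings rate must be positive")
    return router_overhead / (avg_cost_per_call * savings_rate)


if __name__ == "__main__":
    # Hypothetical inputs: $0.02 per call, 25% savings, $5/month to run a router.
    print(break_even_calls(0.02, 0.25, 5.0))
    print(monthly_net_savings(100_000, 0.02, 0.25, 5.0))
```

The overhead term should include engineering time and eval upkeep, not just compute, which pushes the real break-even volume higher than the raw numbers suggest.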

So what's the real number?

About 20–25%. That's Factory's verified figure on mixed agentic traffic, and it's the one I'd put my name on. It's not a knock on routing — it's the defensible win. A product that says "we'll save you ~25%, and here's exactly how, honestly" earns more trust than one parroting "90% savings" that evaporates on contact with your real prompts.

How the good ones are built

There are five ways to build a router. Only two are viable on a live, latency-sensitive path.

Lightweight classifier (real-time viable)

A small BERT-class model predicts which tier wins for this prompt. (RouteLLM.)

  • Decision latency: ~10–30 ms
  • New models? Matrix-factorization variant transfers across model pairs; classifier variants need retraining.
  • Verdict: The pragmatic default.

The structural facts that matter:

  • Cascade is fundamentally different — it doesn't predict, it tries and retries. That's how FrugalGPT hits 98%, but it generates with the cheap model first, which can double latency and breaks streaming. Great for batch, bad for chat.
  • LLM-as-judge is too slow for the live path (+1–5 seconds), but it's the standard trick for generating the labels used to train a fast router.
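The try-and-retry shape of a cascade is easiest to see in code. This is a minimal sketch in the spirit of the cheap-to-expensive cascade described above, not FrugalGPT's actual implementation: the models and the scorer are hypothetical stand-ins (FrugalGPT learns its scorer), and the 0.8 threshold is arbitrary.

```python
from typing import Callable, Sequence

# A model takes a prompt and returns (answer, cost_in_usd).
Model = Callable[[str], tuple[str, float]]
# A scorer rates how trustworthy an answer looks, in [0, 1].
Scorer = Callable[[str, str], float]


def cascade(prompt: str,
            models: Sequence[Model],
            scorer: Scorer,
            threshold: float = 0.8) -> tuple[str, float, int]:
    """Try models cheapest-first; stop at the first answer the scorer accepts.

    Returns (answer, total_cost, index_of_model_used).
    """
    if not models:
        raise ValueError("need at least one model")
    total_cost = 0.0
    answer = ""
    for i, model in enumerate(models):
        answer, cost = model(prompt)
        total_cost += cost  # every attempt is paid for, accepted or not
        if scorer(prompt, answer) >= threshold:
            return answer, total_cost, i
    # Nothing cleared the bar: keep the last (strongest) model's answer.
    return answer, total_cost, len(models) - 1


# Hypothetical stand-ins so the sketch runs without any API.
def cheap_model(prompt: str) -> tuple[str, float]:
    return ("olleh" if prompt == "reverse 'hello'" else "not sure", 0.001)


def strong_model(prompt: str) -> tuple[str, float]:
    return ("a careful answer", 0.10)


def confidence(prompt: str, answer: str) -> float:
    return 0.2 if answer == "not sure" else 0.95


if __name__ == "__main__":
    tiers = [cheap_model, strong_model]
    print(cascade("reverse 'hello'", tiers, confidence))
    print(cascade("prove this theorem", tiers, confidence))
```

Note that the hard prompt pays for both calls and waits for both generations in sequence. That is the latency and streaming penalty in miniature: nothing can be streamed to the user until the scorer has accepted a complete answer.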

That leaves the lightweight classifier and embedding+kNN as the real-time options. The latency difference is not subtle:

  • MiniLM-L6 (warm): 9 ms
  • BERT classifier: 30 ms
  • semantic-router (embed): 100 ms
  • LLM-as-judge: 3 s

Decision latency, log scale. A classifier decides in the noise; an LLM-judge router adds seconds. That gap is why the live path uses small models and the big models only label training data offline.

The classifiers are genuinely tiny. RouteLLM's BERT classifier is ~110M params; MiniLM-L6 is 22M (~5× faster than BERT-base); NVIDIA ships an off-the-shelf prompt-task-and-complexity classifier (DeBERTa-v3-base, ~200M, ~94–99% accuracy depending on the dimension) you can use with zero training. And the training data is cheap: RouteLLM trained on ~65K Chatbot Arena battles; a GPT-4-judge augmentation set of ~120K samples cost about $700 to generate. A few GPU-hours, not a research program.

The one knob that matters

Every good router exposes the same core control: a single cost/quality dial. OpenRouter calls it cost_quality_tradeoff — an integer 0–10, default 7 (0 = always the most capable model, 10 = cheapest wins). Here's what that knob actually does to a stream of prompts:

[Interactive dial, shown at 7 on a scale from 0 (always frontier) to 10 (cheapest wins): 36% of prompts go to the frontier model, cost is 43% of all-frontier (57% saved), and quality retained is 97% (frontier-grade).]

Push the dial toward cheap and cost falls fast, but route hard prompts to a small model and quality starts to bleed. The whole game is finding the knee of that curve. (Illustrative model, not live traffic.)
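One plausible way to wire a dial like this onto a classifier is to treat it as a threshold on the classifier's score. The sketch below is my own illustrative mapping, not how OpenRouter implements cost_quality_tradeoff: `p_hard` is an assumed classifier output (the probability that a prompt needs the frontier model), and the linear threshold is an arbitrary choice.

```python
from typing import Sequence


def pick_tier(p_hard: float, tradeoff: int = 7) -> str:
    """Map a classifier score and a 0-10 dial to a tier.

    0 always picks the frontier model, 10 always picks the small one;
    in between, the dial sets how confident the classifier must be
    that the prompt is hard before the frontier model is used.
    """
    if not 0 <= tradeoff <= 10:
        raise ValueError("tradeoff must be an integer from 0 to 10")
    if not 0.0 <= p_hard <= 1.0:
        raise ValueError("p_hard must be between 0 and 1")
    if tradeoff == 10:
        return "small"
    return "frontier" if p_hard >= tradeoff / 10 else "small"


def frontier_share(scores: Sequence[float], tradeoff: int) -> float:
    """Fraction of a batch of prompts that lands on the frontier model."""
    picks = [pick_tier(p, tradeoff) for p in scores]
    return picks.count("frontier") / len(picks)


if __name__ == "__main__":
    batch = [0.1, 0.3, 0.5, 0.75, 0.9]  # made-up classifier scores
    for dial in (0, 3, 7, 10):
        print(dial, frontier_share(batch, dial))
```

Sweeping the dial over a held-out batch and plotting frontier share against measured quality is how you would locate the knee on your own traffic.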

The rest of the good controls follow from honesty: org-level allow/block of the router, transparency (always tell the user which model actually ran — OpenRouter returns it in the response and charges zero markup), and a hard override (naming a model bypasses the router entirely). The best products converge on exactly these.

The failure modes nobody puts on the landing page

  • Routing collapse. As the cost budget rises, routers get lazy and default to the most expensive model even when a cheap one would do. One 2026 paper names it ("Routing Collapse Index") and recovers ~17% on RouterBench by fixing the objective.
  • Adversarial rerouting. This is the scary one. Rerouting LLM Routers shows that "confounder gadgets" — short, query-independent token prefixes — can force routing to the expensive model with near-100% success and no change in output quality, and the attack transfers black-box across routers. If your routing decisions touch billing, that's a cost-inflation attack, and a follow-up (RerouteGuard) is already about defending against it. (The inverse — forcing a downgrade to jailbreak a query — is sometimes claimed, but the paper's own downgrade attack mostly failed; treat it as future work, not a demonstrated result.)
  • Out-of-distribution lock-in. RouteLLM trained on Chatbot Arena data was near-random on MMLU until augmented. Supervised routers are model lock-in: they need retraining when the model pool changes. That's the strongest argument for embedding+kNN routing when your fleet changes often — you edit an index instead of retraining a model.

The meta-lesson: a router needs its own evals. Log which model was chosen, why, and whether the answer was good — or you can't detect misrouting at all.
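A routing log does not need to be elaborate to be useful. Here is a minimal sketch of the three fields named above (which model, why, was the answer good) plus two checks I'd run on them. The record shape and field names are hypothetical, and `answer_ok` is assumed to come from whatever offline grading you already trust.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RoutingRecord:
    prompt_id: str
    chosen_model: str
    router_score: float  # the score the router acted on (the "why")
    answer_ok: bool      # offline judgement of the final answer


def failure_rate_by_model(log: Iterable[RoutingRecord]) -> dict[str, float]:
    """Share of bad answers for each model the router picked."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for record in log:
        totals[record.chosen_model] += 1
        if not record.answer_ok:
            failures[record.chosen_model] += 1
    return {model: failures[model] / totals[model] for model in totals}


def suspect_downgrades(log: Iterable[RoutingRecord],
                       cheap_models: set[str]) -> list[str]:
    """Prompts sent to a cheap model that came back bad: misrouting candidates."""
    return [r.prompt_id for r in log
            if r.chosen_model in cheap_models and not r.answer_ok]
```

A bad answer from a cheap model is only a candidate for misrouting, since the frontier model might have failed too. Replaying those prompts against the stronger tier is what turns the candidate list into an actual misroute rate.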

Running it at the edge

If your gateway is a Cloudflare Worker, the constraints are real but the pattern is clean. Don't bundle a model into the Worker — transformers.js's ONNX WASM binary is 25.9 MiB, which blows past the 25 MiB asset cap and won't even deploy. Instead use the managed bindings: Workers AI embeddings (@cf/baai/bge-small-en-v1.5, 384-dim, 33.4M params, ~free at 10K neurons/day) into Vectorize for nearest-neighbor lookup. That is the embedding+kNN router, natively, in one request — a sub-200 ms decision that costs essentially nothing.
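The routing logic itself is small enough to show in a few lines. This sketch is in Python rather than Worker code and uses in-memory stand-ins throughout: in the setup described above, the embedding would come from Workers AI and the nearest-neighbor search from Vectorize, whereas here the vectors are two-dimensional toys and the index is a plain list. Names and data are hypothetical.

```python
import math
from collections import Counter

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_route(query: Vector, index: list[tuple[Vector, str]], k: int = 3) -> str:
    """Majority tier among the k labeled examples most similar to `query`."""
    if not index:
        raise ValueError("index is empty")
    nearest = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(tier for _, tier in nearest)
    return votes.most_common(1)[0][0]


# Toy index: (embedding of a past prompt, tier that handled it well).
INDEX: list[tuple[Vector, str]] = [
    ([1.0, 0.0], "small"),
    ([0.9, 0.1], "small"),
    ([0.8, 0.3], "small"),
    ([0.0, 1.0], "frontier"),
    ([0.1, 0.9], "frontier"),
    ([0.2, 0.8], "frontier"),
]

if __name__ == "__main__":
    print(knn_route([0.95, 0.05], INDEX))
    print(knn_route([0.05, 0.95], INDEX))
```

Adding or retiring a model means adding or deleting labeled entries in the index; nothing gets retrained. That property is the whole reason to prefer this design when the fleet changes often.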

Where this points

For a sovereign compute network like the one I work on, auto-routing fits unusually well. An open-weight fleet (Llama, Qwen, Gemma, Mistral, DeepSeek across sizes, on operator-owned hardware) is already a natural cost/quality ladder — the exact substrate routing is built to exploit. A fleet that changes constantly favors embedding+kNN routing, where you edit an index instead of retraining. And because routing decisions settle onchain against node stake and payouts, adversarial rerouting stops being a curiosity and becomes a first-class security concern.

The way I'd ship it: not as a default that quietly spends your money, but as a switch you flip on when you want it — opt-in auto-routing on top of an OpenAI-compatible gateway, with the model you'd otherwise call by hand still one override away. The honest positioning writes itself: frontier-quality output, around 25% cheaper, and we'll show you exactly which model ran and why. Not the inflated number. The real one. That turns out to be the more convincing pitch anyway.


Compiled from primary papers and vendor docs, June 2026. Every number here was link-checked and independently fact-checked; where a popular claim didn't survive scrutiny (the 98× vLLM-router speedup is from arXiv 2603.12646, not the paper it's usually cited as; the routing-jailbreak result isn't actually demonstrated) I've said so inline rather than repeat it.