A skill that iterates on a plan until it stops improving

@goodalexander described a harness a friend of his built: it makes a plan, then improves it over and over until the plan can't easily be improved. You hit a top score where further iterations are just noise, and a stronger model can "break the previous ceiling pretty aggressively." That framing is too good not to build, so I turned it into a Claude Code skill you can drop in and use.

The core idea is to stop treating a plan as something you write once and start treating it as something you search for. Generating a plan and judging a plan use different muscles. If you force an explicit critique in between each rewrite, blind editing turns into directed search that climbs toward a target instead of drifting sideways.

The loop

plan        = generate the initial plan
best        = plan
best_score  = score(plan)            # 0-100, against a rubric

repeat:
    candidate = improve(best)        # critique, then targeted rewrite
    s         = score(candidate)
    if s > best_score + MARGIN:      # require a real margin, not noise
        best, best_score = candidate, s
until plateau                        # no real gain over the last few rounds
return best

Four parts do all the work, and the order matters.

The rubric comes first, and it's the ceiling. Before writing a single line of the plan, you write down what "good" means: weighted criteria like goal clarity, completeness, sequencing, feasibility, risks and mitigations, success metrics, and specificity, adapted to the actual task. The loop can only climb as high as the rubric can see. A vague rubric gives you a low, noisy ceiling, so this is where the effort goes. Prefer criteria you can check against something objective ("does every phase have an exit condition?") over pure taste, because objective criteria are harder to game.

The improver critiques before it rewrites. "Make it better" produces lateral drift. Instead: list the specific weaknesses against the rubric, ordered by how much they cost, then write a new version that fixes the top ones while keeping what already scored well.

Scoring needs a margin. Scores are noisy, so you only accept a new best if it beats the old one by a real margin, not by half a point. Otherwise you'll happily "improve" your way into a worse plan. Re-reading the original rubric every few rounds catches the other failure mode, where rewrites quietly start optimizing for the score instead of the goal.

You stop on the plateau. Track the best score over the last few rounds. When it stops moving by more than the margin, you've converged. That's the "iterations are just noise" signal. Keep going past it and you're only shuffling noise around.

Breaking the ceiling

Plain hill-climbing (one critique-and-rewrite per round) is fast and cheap, but it gets stuck in local optima: small edits that can never reach a fundamentally better structure. When it stalls, the skill switches to best-of-N: generate several genuinely different versions of the plan in one round (different framings, not minor variants) score them all, and keep the winner. That's the move that breaks a ceiling incremental edits can't.

This is also where a stronger model earns its keep. Running the improve and score steps on a more capable model is the single biggest lever, because it proposes structurally different rewrites and perceives flaws a weaker scorer is blind to, which raises the ceiling itself. That's exactly the effect @goodalexander pointed at.

Does it actually help?

I tested the skill against plain runs on three plans: an 8-week SaaS launch, a session-cookies-to-JWT auth migration, and a 3-month research study. The skill-run produced a visible trajectory each time, e.g. 48 → 67 → 81 → 89 → 91 → 91, where the last best-of-N round is what pushed past the plateau on the harder two. And the loop reliably surfaced the load-bearing gaps a single pass missed: no rollback path and no token-revocation story in the migration; no success metric and no demand-validation gate in the launch. The plain runs were fine first drafts; the loop's outputs were ones I'd actually ship.

The output

You get the final plan up front, its score with the per-criterion breakdown, the score trajectory so you can see the climb and the plateau, and a short note on the two or three biggest changes from first draft to final, so it's clear the plan was hardened, not just reformatted.

Get it

One command installs it into Claude Code. Paste this into your terminal:

install, one command

curl -fsSL https://seangeng.com/plan-optimizer-skill.zip -o /tmp/plan-optimizer-skill.zip && unzip -o /tmp/plan-optimizer-skill.zip -d ~/.claude/skills/ && rm /tmp/plan-optimizer-skill.zip

That drops the skill into ~/.claude/skills/plan-optimizer/. Restart Claude Code and your agent will run the whole loop whenever you ask it to improve, harden, or stress-test a plan, or just paste a draft and ask "what would make this better?" The plan-optimizer freebie page has the same one-command install plus the full SKILL.md to read before you grab it.

Credit where it's due: the harness concept is @goodalexander's to share. I just packaged it.