---
name: verify-loop
description: "Run a supervised, self-verifying loop on a coding task instead of hand-prompting each step. You give it a goal; it works one task at a time, and after every change it actually verifies the result — runs the app, clicks the flow, or runs the tests — with the check done from a fresh perspective so the agent isn't grading its own homework. Commits on green, tracks progress on disk, and stops itself on done / max iterations / repeated failure. MANDATORY TRIGGERS: 'verify loop', 'run verify-loop', 'loop on this', 'set up a loop for this'. STRONG TRIGGERS (use when tied to a real task): 'keep going until it works', 'build this and check it actually works', 'do this autonomously but safely', 'loop until the tests pass', 'work through this list and verify each one'. Built on the idea that the loop is the easy part — verification is the point."
---

# Verify Loop

Everyone's saying "stop prompting, write loops." True — but a loop with nothing to check its work is just the agent confidently agreeing with itself twenty times in a row. The loop is three lines of bash. The thing that makes it *work* is verification: after every change, can the agent actually run the thing and see that it's right?

This skill runs that loop. You give it a goal. It works one task at a time, and after each task it **verifies for real** — runs the app and exercises the change, or runs the tests — using a fresh perspective so the maker isn't also the marker. It commits on green, keeps its memory on disk (not in a bloating chat), and stops itself before it can run away.

It runs **supervised by default**: it does a task, verifies, commits, then checks in. You can turn it loose (full auto) once you trust the goal — see the last section.

---

## The one rule: nothing is "done" until it's verified

A task is only complete when the agent has *observed* it working — not when the code "looks right," not when the diff compiles. Every iteration ends with a real check, and the check has to be capable of failing. If you can't verify something, that's the first problem to solve, not something to skip. Honesty over optimism: a red check reported plainly beats a green one you can't trust.
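One way to make "capable of failing" concrete is to run the check before and after the change and compare the two results. The sketch below is illustrative only: `red_then_green` and its return labels are hypothetical names, and it assumes the check is a plain callable that returns `True` on pass.

```python
from typing import Callable


def red_then_green(check: Callable[[], bool], apply_change: Callable[[], None]) -> str:
    """Classify a check by running it before and after a change.

    Hypothetical helper: `check` and `apply_change` are stand-ins for
    whatever runs the verification and makes the edit.
    """
    before = check()
    apply_change()
    after = check()
    if not before and after:
        return "verified"       # red before, green after: the check saw the change
    if before and after:
        return "uninformative"  # green both times: the check could not fail here
    return "red"                # failing after the change, whatever it was before
```

A check that is green both before and after says nothing about the change. Treat that result as a missing verification, not a pass.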

---

## Step 0 — Pin down the goal and how you'll verify it

Before any code, get two things straight. Ask the user only if you genuinely can't infer them:

- **What does "done" look like?** Turn the request into a concrete, checkable outcome. "Fix the signup bug" becomes "a new user can register, land on the dashboard, and the row exists in the DB." Vague goals make loops spin forever.
- **How will each change be verified?** Find the strongest check available, in this order of preference:
  1. **Run the product as a user would**: start the app and drive the actual flow (CLI command, API call, or the UI via a browser/desktop tool). This is the gold standard; as a rough rule of thumb it improves the quality of the result by 2–3×.
  2. **Run the tests**: the existing suite, or write a failing test that captures the goal first, then make it pass.
  3. **Run a type-check / lint / build**: the weakest gate; use it only to supplement 1 or 2, never alone.

If the project has no way to verify the kind of change being made, say so and propose adding one (a smoke test, a script, a test harness) before looping.
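Whichever tier you land on, the check should end in a plain pass/fail plus the raw output as evidence. Here is a minimal sketch of that wrapper, assuming the verification can be expressed as a command whose exit code is 0 on success. `run_check`, `CheckResult`, and the default timeout are illustrative choices, not part of this skill.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    evidence: str  # combined stdout/stderr, kept as data for the next attempt


def run_check(command: list[str], timeout_s: float = 300.0) -> CheckResult:
    """Run one verification command; 'pass' means exit code 0 (an assumption)."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung check counts as red, never as green.
        return CheckResult(False, f"timed out after {timeout_s}s")
    except FileNotFoundError as exc:
        # The check itself is missing: that is the first problem to solve.
        return CheckResult(False, f"command not found: {exc}")
    return CheckResult(proc.returncode == 0, (proc.stdout + proc.stderr).strip())
```

Timeouts and missing commands are reported as failures on purpose: a check that could not run has verified nothing.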

---

## Step 1 — Set up the loop's memory (on disk, not in chat)

Create two small files so progress survives a fresh context and every verified commit is recorded as a clean rollback point:

- **`VERIFY_LOOP/GOAL.md`** — the goal, the definition of done, and the verification command(s). This is re-read at the top of every iteration so the agent never drifts from the original intent.
- **`VERIFY_LOOP/PROGRESS.md`** — a living checklist: tasks broken down, each marked `todo` / `doing` / `done` / `blocked`, plus a one-line note on what was verified and the commit hash. Append, don't rewrite history.

Break the goal into the smallest tasks that can each be verified independently. One task per iteration keeps context tight and the git history reversible.
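This skill doesn't prescribe a file format, so the following is only one possible layout: `PROGRESS.md` as an append-only log where the latest entry for a task is its current status. The pipe-separated line format and the function names (`init_loop_files`, `append_entry`, `current_status`) are assumptions made for the sketch.

```python
from pathlib import Path

STATUSES = ("todo", "doing", "done", "blocked")


def append_entry(progress: Path, task: str, status: str,
                 note: str = "", commit: str = "") -> None:
    """Append one log line; earlier lines are never rewritten."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    # Assumed line format; fields must not contain '|'.
    with progress.open("a", encoding="utf-8") as f:
        f.write(f"- {status} | {task} | {note} | {commit}\n")


def init_loop_files(root: Path, goal: str, done_when: str,
                    verify_cmds: list[str], tasks: list[str]) -> Path:
    """Create VERIFY_LOOP/GOAL.md and VERIFY_LOOP/PROGRESS.md under root."""
    loop_dir = root / "VERIFY_LOOP"
    loop_dir.mkdir(parents=True, exist_ok=True)
    goal_lines = ["# Goal", goal, "", "## Done when", done_when, "", "## Verify with"]
    goal_lines += [f"- `{cmd}`" for cmd in verify_cmds]
    (loop_dir / "GOAL.md").write_text("\n".join(goal_lines) + "\n", encoding="utf-8")
    progress = loop_dir / "PROGRESS.md"
    progress.write_text("# Progress\n\n", encoding="utf-8")
    for task in tasks:
        append_entry(progress, task, "todo")
    return progress


def current_status(progress: Path) -> dict[str, str]:
    """Replay the log; the last entry for each task wins."""
    latest: dict[str, str] = {}
    for line in progress.read_text(encoding="utf-8").splitlines():
        if not line.startswith("- "):
            continue
        parts = [p.strip() for p in line[2:].split("|")]
        if len(parts) == 4 and parts[0] in STATUSES:
            latest[parts[1]] = parts[0]
    return latest
```

Keeping the file append-only means a fresh context can rebuild the full picture by replaying it, and nothing about an earlier red or green result is lost.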

---

## Step 2 — Run the loop (one task per iteration)

Repeat until a stop condition fires:

1. **Read** `GOAL.md` and `PROGRESS.md`. If a task is still marked `doing` from a red check, resume it. Otherwise pick the single highest-priority `todo` and mark it `doing`.
2. **Implement** just that task. Smallest change that achieves it — no scope creep into adjacent tasks.
3. **Verify (this is the load-bearing step).** Run the verification from Step 0. Crucially, do the check with **fresh eyes — separate the maker from the checker**: spin up a subagent (Task tool) whose only job is to run the thing and report pass/fail with evidence, or at minimum re-derive the expected result independently before running. Don't let the agent that wrote the code also be the one that waves it through.
   - **Green:** the change does what it should, observed, not assumed.
   - **Red:** capture the actual error/output as data for the next attempt.
4. **On green:** commit (one task = one commit, with a descriptive message), then mark the task `done` in `PROGRESS.md` with what was verified and the commit hash.
5. **On red:** note the failure in `PROGRESS.md`, leave the task `doing`, and loop again with the error in hand. Do **not** commit broken work. If the *same* check fails ~3 times in a row, mark it `blocked` and stop (see guardrails) — repeating the same failure is the loop telling you the spec or approach is wrong.
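The five steps above can be sketched as a small driver. This is a simplified model, not the skill's actual implementation: `implement`, `verify`, and `commit` are hypothetical callables standing in for the agent, the fresh-eyes checker, and git, and status lives in a dict rather than in `PROGRESS.md`.

```python
from typing import Callable, Optional


def run_loop(
    tasks: list[str],
    implement: Callable[[str, Optional[str]], None],  # gets the last error, if any
    verify: Callable[[str], tuple[bool, str]],        # returns (passed, evidence)
    commit: Callable[[str], str],                     # returns a commit hash
    max_iterations: int = 10,
    failure_limit: int = 3,
) -> tuple[dict[str, str], dict[str, str]]:
    """One task per iteration; commit only on green; block on repeated red."""
    status = {task: "todo" for task in tasks}
    commits: dict[str, str] = {}
    last_error: Optional[str] = None
    failures = 0
    for _ in range(max_iterations):
        # Resume a task left 'doing' by a red check, else take the next todo.
        task = next((t for t in tasks if status[t] == "doing"), None)
        if task is None:
            task = next((t for t in tasks if status[t] == "todo"), None)
        if task is None:
            break  # nothing left to do
        status[task] = "doing"
        implement(task, last_error)
        passed, evidence = verify(task)
        if passed:
            commits[task] = commit(task)  # green: commit, then mark done
            status[task] = "done"
            last_error, failures = None, 0
        else:
            last_error = evidence  # red: keep the output as data, no commit
            failures += 1
            if failures >= failure_limit:
                status[task] = "blocked"
                break
    return status, commits
```

Note that `verify` is passed in separately from `implement`. Keeping them as two distinct callables is the code-level version of separating the maker from the checker.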

---

## Step 3 — Guardrails (the part that stops it running away)

Before the first iteration, set hard limits and honour them:

- **Max iterations**: a ceiling (e.g. 10–20). When hit, stop and report, even if unfinished.
- **No-progress detector**: the same error N times in a row (~3), or a longer run of iterations (more than N) in which no task moves to `done` even though the errors differ, means stop. The agent is stuck; a human needs to look.
- **Budget sense**: if you're aware of a token/cost budget, treat it as a ceiling and stop with room to spare. Never assume infinite calls.
- **Never commit red.** Green commits only. The git history must stay a list of known-good states.
- **Stay in scope.** Only touch what the current task needs. No drive-by refactors; they make verification ambiguous.
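The iteration ceiling and the no-progress detector are simple enough to keep as explicit counters. Below is a minimal sketch, assuming each iteration reports whether a task reached `done` and, if not, the error text. The `Guardrails` class, its field names, and the stop-reason strings are hypothetical, and the limits are passed in rather than hard-coded so you can pick your own.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Guardrails:
    max_iterations: int    # hard ceiling on iterations
    same_error_limit: int  # identical error this many times in a row
    stall_limit: int       # iterations in a row with no task moved to done
    iterations: int = field(default=0, init=False)
    _last_error: Optional[str] = field(default=None, init=False)
    _same_error_streak: int = field(default=0, init=False)
    _stalled: int = field(default=0, init=False)

    def record(self, task_done: bool, error: Optional[str] = None) -> Optional[str]:
        """Record one finished iteration; return a stop reason, or None to continue."""
        self.iterations += 1
        if task_done:
            self._last_error, self._same_error_streak, self._stalled = None, 0, 0
        else:
            self._stalled += 1
            if error is not None and error == self._last_error:
                self._same_error_streak += 1
            else:
                self._same_error_streak = 1
            self._last_error = error
        if self._same_error_streak >= self.same_error_limit:
            return "same error repeated"
        if self._stalled >= self.stall_limit:
            return "no task completed recently"
        if self.iterations >= self.max_iterations:
            return "max iterations reached"
        return None
```

Call `record` once at the end of every iteration and stop as soon as it returns a reason. For example, `Guardrails(max_iterations=10, same_error_limit=3, stall_limit=5)` stops on the third identical error in a row.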

---

## Step 4 — Supervised by default: check in

Unless the user has explicitly said "run unattended," **pause after each task** (or a small batch of closely related ones) and report:

```
✅ Task: <what got done>   ·   Verified: <how — "ran signup flow, user reached dashboard">   ·   commit <hash>
Next up: <next task>   ·   <n>/<max> iterations used
Continue? (or tell me to adjust)
```

This protects a less-experienced user, or a high-stakes change, from twenty unattended iterations that all went the wrong way. It's the safe on-ramp.

---

## Step 5 — Final report

When a stop condition fires (done, max iterations, or blocked), summarise:

```
# Verify Loop — <goal>

Status: ✅ Done  |  ⏹ Stopped at limit  |  🚧 Blocked
Iterations: <n>/<max>

## Done & verified
- <task> — verified by <check> — commit <hash>
- ...

## Not done
- <task> — <why: blocked / out of iterations / needs a decision>

## What I'd do next
<one or two concrete next steps, or what's needed to unblock>
```

Lead with what's *verified working*, not just what was attempted. If something's blocked, be specific about why and what would unblock it.

---

## Turning it loose (full auto)

Once you trust the goal and the verification is solid, you can run it unattended: tell the skill "run unattended, max 20 iterations" and it'll skip the per-task check-ins and only stop on a guardrail or completion. Two rules before you do:

- **Only go auto when the verification is strong** (Step 0 tier 1 or 2). Auto-looping on a weak type-check-only gate is how you get twenty confident, wrong commits.
- **Keep the guardrails tight** — a real max-iterations and the no-progress detector are what stand between "ships while you sleep" and "burns the budget while you sleep."
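Those two rules can be written as a pre-flight gate that runs before unattended mode is allowed. This is a sketch under stated assumptions: `CheckTier` mirrors the three tiers from Step 0, and `can_run_unattended` is a hypothetical helper, not an existing setting of this skill.

```python
from enum import IntEnum
from typing import Optional


class CheckTier(IntEnum):
    PRODUCT = 1  # run the product as a user would
    TESTS = 2    # run the tests
    STATIC = 3   # type-check / lint / build


def can_run_unattended(
    tiers: set[CheckTier],
    max_iterations: Optional[int],
    no_progress_limit: Optional[int],
) -> tuple[bool, str]:
    """Allow full auto only with a tier 1 or 2 check and both brakes set."""
    if not tiers or min(tiers) > CheckTier.TESTS:
        return False, "verification too weak: need a tier 1 or tier 2 check"
    if not max_iterations or max_iterations < 1:
        return False, "set a real max-iterations ceiling first"
    if not no_progress_limit or no_progress_limit < 1:
        return False, "set a no-progress limit first"
    return True, "ok"
```

If the gate says no, fall back to supervised mode instead of loosening the rule.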

The loop is the easy part. The verification — and the brakes — are the whole job.