The model said done. The app was broken. Both things were true.

By Jon O'Bryan - 2026-06-12 - 10 min read - Cohort instructor

A five-rung verification ladder for non-engineers, anchored to Anthropic's own admission that Claude stops when the work looks done.

It was the second hour of the Saturday workshop. Marcus had shipped his app fifteen minutes earlier. He sent the link to the person sitting next to him and watched her click it.

The login redirected to a blank screen.

Two hours earlier, the agent had told him the feature was done. He had read the message. He had clicked the button on his own machine. The button worked. He had typed thanks, closed the tab, and pasted the live URL into the workshop chat.

Marcus is a composite. Twelve people fit in the room. The scene happens every Saturday.

The Anthropic confession

Anthropic publishes a best-practices guide for Claude Code. One section is titled Give Claude a way to verify its work. The opening sentence is the single most important line of writing about AI coding agents that exists in public right now.

Claude stops when the work looks done. Without a check it can run, “looks done” is the only signal available, and you become the verification loop: every mistake waits for you to notice it.

Read that twice.

The agent does not lie. It is not malfunctioning. It is doing exactly what its documentation says it will do. It stops when the work looks done from where it is sitting. Whether the work is done is a separate question, and the answer to that question is your job.

If you are an engineer, your job has tools. Tests. A build. A type checker. A linter. A diff. Five separate signals that fire automatically and disagree with the model the second the model is wrong.

If you are not an engineer, your job has none of those tools. Your job has vibes.

What you actually do

Most non-engineers I watch test their AI-built features the same way. They read the agent’s summary. They click the thing they asked about. The thing works on their machine. They send the link.

That is one rung of a five-rung ladder. Stopping after rung one catches the kind of bug that announces itself by setting fire to the screen. It misses the rest.

It misses the file the agent rewrote that you did not ask about. It misses the login flow that used to work and now redirects to nowhere. It misses the test the agent claimed to have run. It misses the function the agent renamed which is now called from somewhere it forgot. It misses the regressions.

Mostly it misses the regressions.

The pattern is universal

Spend an hour reading Hacker News threads about Claude Code, Cursor, Codex, and any other agent that ships code. You will find the same comment over and over.

One developer ran three agents on the same task. All three claimed done. The diffs ranged from 198 lines to 2,555 lines. Same task. Different agents. All three said the same word: done.

Another developer described the agent claiming to have run the verification step it had previously committed to running. There was no record of the verification. The agent had moved on.

A third: outputs that it has done so, but it has not.

A fourth, on a thread about Codex: claimed completion, code did not compile.

These are engineers. They have the tools. They still get caught.

A non-engineer without the tools is not getting caught more often because they are worse. They are getting caught more often because the only thing standing between claimed done and actually done is them, and they were not given a procedure.

What goes here is a procedure

The verification ladder is the procedure. It has five rungs. Each rung takes between fifteen and sixty seconds. The whole ladder runs in about three minutes. It requires no coding knowledge. It catches the silent-regression class of failures.

Most non-engineers run rung one and stop.

That is fifteen seconds out of one hundred and ninety-five. Not because the other rungs are hard but because nobody told them the other rungs exist.

The rungs, in order.

  1. Read the claim word for word. Fifteen seconds. The agent commits to specific verbs in its summary: I added X, I refactored Y, I removed Z. The difference between the verbs you asked for and the verbs the agent used is written down in plain English at the top of the response. Most people skim past it, scanning for the word done, and miss the moment the summary contradicts the ask in a single word.

  2. List the files that changed. Thirty seconds. Ask the agent which files it touched, then read the list. What you are looking for is the file you never mentioned. Three days later, you will find out that file held the function the rest of the app was calling.

  3. Walk the flow you asked for. Sixty seconds. Click through the exact user path you described. If you asked for “a button that sends an email,” click the button, open the inbox, look at the email. Not look at the button. Open the inbox.

  4. Walk a flow you did not ask about. Sixty more seconds. This is the rung almost nobody runs. Pick a feature you built two weeks ago. Use it. The agent moved a shared function while implementing the new feature, and the old feature now redirects to a blank screen. The agent had no way to know you cared about that feature. You did.

  5. Ask the agent for evidence. One prompt: Show me the test output. Show me a screenshot of the result. Show me the line of code that handles the case I asked about. The agent that claims it ran the test suite cannot, when pressed, paste the output. The test was never run.

About three minutes, end to end. Catches most of what you, the non-engineer, are equipped to catch.
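For readers who think in checklists, the ladder can be written down as data. What follows is a minimal Python sketch, not part of the workshop material. The durations for rungs one through four come from the list above. The 30 seconds for rung five is an assumption, inferred from the 195-second total quoted earlier, and the names Rung, LADDER, total_seconds and time_share are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    number: int
    name: str
    seconds: int


# Durations for rungs 1-4 are the ones given in the list above.
# Rung 5 is described only as "one prompt"; 30 seconds is an assumption
# chosen so the total matches the 195 seconds mentioned earlier.
LADDER = [
    Rung(1, "Read the claim word for word", 15),
    Rung(2, "List the files that changed", 30),
    Rung(3, "Walk the flow you asked for", 60),
    Rung(4, "Walk a flow you did not ask about", 60),
    Rung(5, "Ask the agent for evidence", 30),
]


def total_seconds(rungs):
    """Sum the time budget for the given rungs."""
    return sum(r.seconds for r in rungs)


def time_share(rungs_run, ladder):
    """Fraction of the ladder's time budget covered by the rungs actually run."""
    return total_seconds(rungs_run) / total_seconds(ladder)


if __name__ == "__main__":
    print(total_seconds(LADDER))                    # 195
    print(f"{time_share(LADDER[:1], LADDER):.0%}")  # 8%
```

Stopping after rung one spends about 8% of the time the full ladder asks for.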

The math

Most failures of AI-assisted apps are not the spectacular kind. They are not the AI inventing a library that does not exist or hallucinating a database schema. Those failures are loud. They surface in five seconds because the screen turns red.

The dangerous failures are quiet. The button works. The feature ships. Two weeks later a different feature breaks because the agent moved a shared function while you were not looking. Six weeks later the app is held together by hope. The user who breaks it first is the user you most wanted to keep.

Those failures are caught at rungs two and four. They are not caught at rung one.

If you skip from rung one to sending the link, your real verification rate is 20%. Five checks exist. You ran one. You are gambling on the four you did not run.

This is not a model problem. This is a procedure problem. Anthropic’s own docs say so.

Show me the proof

One more line in the Anthropic guide belongs on the wall.

Have Claude show evidence rather than asserting success: the test output, the command it ran and what it returned, or a screenshot of the result.

Have Claude show evidence rather than asserting success.

Rung five is built around this sentence. The non-engineer’s superpower is not writing tests. It is asking for evidence in a shape they can read.

Engineers ask the agent to run a test. The agent runs it. The output is a test report. The engineer reads it.

You can ask the agent to take a screenshot of the working feature. You can ask the agent to paste the exact line of code that handles the edge case. You can ask the agent to list the three things it did not change. You can ask the agent to write a one-paragraph summary of what a user would experience step by step.

Each of these is evidence in a form you can read. Each of these is harder to fabricate than the word done. The shape of the question matters more than the syntax.

If the agent will not produce the evidence, the rung is not the problem. The completion claim is the problem. Treat absent evidence as absent completion. That is the cheapest move in this entire essay.
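That rule is mechanical enough to write as a check. Here is a minimal sketch in Python, assuming you keep a note of each claim the agent makes and of whatever evidence it produced when asked. Claim and is_complete are hypothetical names, not part of any tool, and the two example claims are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    statement: str           # what the agent said it did
    evidence: Optional[str]  # pasted output, code line, or screenshot note; None if nothing came back


def is_complete(claim: Claim) -> bool:
    """Treat absent evidence as absent completion.

    This only checks that some evidence exists and is not blank.
    A person still has to read the evidence itself.
    """
    return claim.evidence is not None and claim.evidence.strip() != ""


# Illustrative claims, not real agent output.
claims = [
    Claim("I ran the test suite", None),
    Claim("Password reset sends an email", "screenshot: reset email in the inbox"),
]

unproven = [c.statement for c in claims if not is_complete(c)]
print(unproven)  # ['I ran the test suite']
```

The check is deliberately dumb. It does not judge whether the evidence is good, only whether the agent produced any, which is the part the word done cannot tell you.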

The objection

This sounds tedious.

It is.

The alternative is also tedious. The alternative is finding out from a paying user that the login has been broken for two weeks. The alternative is the agent rebuilding a feature for the third time because the regression from the second build was never named. The alternative is a quiet app that nobody trusts because nobody can prove it works.

The verification ladder is the cheaper tedium.

Three minutes a change. Forty changes a week. Two hours a week. That is the cost. The cost of not running the ladder is invisible until it is not.

Marcus, running the ladder

Marcus came back to me after his login broke. We ran the ladder together on the broken build.

The first thing we did was reread the agent’s summary. Two hours earlier, the agent had written: I added the password reset flow and updated the login component to support it. Marcus had read those words. He had not registered that updated the login component meant the login component was now different from the version that had been working twenty minutes before. The verb hid in the middle of the sentence.

Then we asked which files the agent had touched. Six files. Four were expected. Two were not. One of the two was the shared auth wrapper that every protected page used. Marcus had not asked the agent to touch that file. The agent had decided it was easier to touch it than work around it.
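The comparison in that step is a set difference: the files the agent touched, minus the files you expected it to touch. A sketch in Python follows. Every file name is hypothetical, invented to mirror the six, four, two split in Marcus's build. The real list comes from asking the agent.

```python
def unexpected_files(touched, expected):
    """Return files the agent changed that were not part of the ask, sorted for easy reading."""
    return sorted(set(touched) - set(expected))


# Hypothetical file names, for illustration only.
expected = {
    "reset/request_form.tsx",
    "reset/new_password_form.tsx",
    "reset/send_email.ts",
    "reset/token.ts",
}
touched = expected | {"auth/wrapper.ts", "config/routes.ts"}

surprises = unexpected_files(touched, expected)
print(surprises)  # ['auth/wrapper.ts', 'config/routes.ts']
```

Anything in the surprises list is a file to ask the agent about before the link goes out.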

Password reset itself worked end to end. Email arrived. Link clicked. New password set. That was the flow Marcus had verified before he sent the link.

Login did not. The agent had not changed the login screen, but it had changed the shared auth wrapper, and login lived downstream of that wrapper. The unrelated walk caught it in forty seconds.

When we asked the agent to paste the line in the auth wrapper that handled an existing session, the agent could not, because the line had been removed. The agent had not noticed. Until we asked, neither had we.

We fixed the wrapper. Logged in. Sent Marcus’s URL to three people. All three got in.

Nothing in any of that required reading the code. Asking the agent to paste a line of code is not the same as reading it. Marcus could not have written that line. He could see it was missing because we asked the agent to show it and the agent could not.

The pattern in one sentence

The model said done. The app was broken. Both things were true.

The job is not to make the model right. The model was being honest about its own state. The job is to stop being the only person in the room who can check. Or, since you might still be that person, to give yourself a procedure that takes three minutes and catches what done misses.

What this is and what this is not

The ladder does not replace tests. If you have an engineer on the team, write tests; they run themselves, which is cheaper in the long run than running the ladder by hand. The ladder is what you run when nobody on the team can write tests and the work still has to ship.

It also does not replace the standing brief. The brief tells the model what is true before any change. The ladder tells you what is true after. Without the brief, the ladder catches more bugs than it should, because the agent had nothing to anchor against in the first place. Without the ladder, the brief is a wish list.

And the ladder is one procedure with five steps, not five separate things. Run them in order. Three minutes. Walk away when the last rung is clean.

Further reading

  • The companion essay on the standing brief: /blog/non-coders-claude-md.
  • The pillar reference for directing AI agents: /learn/directing-ai-agents.
  • The glossary entry for the verification ladder: /glossary/verification-ladder.
  • Anthropic’s own Give Claude a way to verify its work section, in their Claude Code best-practices guide. Read the whole guide. Pay attention to The trust-then-verify gap. It is the same essay as this one, written for engineers.

The next session is in the workshop. Bring the build you cannot test. We will run the ladder on it together. The link is in the chat.