Claude Code closes the PR. Cursor reports "done." The diff looks clean and the agent is confident. Then a user hits the flow and it's broken. The uncomfortable truth behind most of these misses is simple: the agent graded its own homework. Here's the failure mode, why it's so hard to catch, and how to verify an AI agent actually did the task — for real.
It's not malice. It's structure. When the thing that did the work is also the thing that judges it, you don't have a test — you have a confirmation.
Picture a typical agentic verification loop. The agent navigates to the page, fills the form, clicks save. It then looks at the result, reasons "the page changed and there's no error, so it worked," and reports green. No assertion ran. Nothing checked that the new record actually appears, or that the value saved is the value you typed, or that the success state isn't just a stale cached view.
This is the same model that wrote the feature, primed to expect success, interpreting ambiguous pixels in its own favor. It is structurally incapable of being an independent witness to its own work. The result is the worst kind of test outcome: a false green — a pass that's wrong. A red test annoys you; a false green ships the bug and tells you everything is fine.
It gets worse with auth. Many agents never even reach the feature: they stall at the login screen, can't retrieve the OTP or magic link, and quietly "verify" a page they never actually loaded.
Verification you can trust separates the actor from the judge. Nocticas does three things the self-grading loop can't.
You declare post-conditions in plain English, and they compile to real Playwright assertions (expect[]) that the harness runs, not the model. The assertion either holds or it doesn't. "Looks right" never enters the verdict.
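To make the idea concrete, here is a minimal sketch of post-conditions as data that a harness evaluates. It does not show how Nocticas compiles anything, and it uses a plain in-memory page snapshot instead of Playwright so it runs on its own. Every name in it (PageState, PostCondition, runAssertions, the example URL and values) is a hypothetical chosen for illustration.

```typescript
// Hypothetical snapshot of what the browser actually shows after the flow.
interface PageState {
  url: string;
  visibleText: string[];
  fieldValues: Record<string, string>;
}

// A post-condition pairs the plain-English statement with an executable check.
interface PostCondition {
  description: string;
  holds: (state: PageState) => boolean;
}

interface AssertionResult {
  description: string;
  passed: boolean;
}

// The harness evaluates every predicate. The agent's opinion is never an input.
function runAssertions(conds: PostCondition[], state: PageState): AssertionResult[] {
  return conds.map(c => ({ description: c.description, passed: c.holds(state) }));
}

const postConditions: PostCondition[] = [
  {
    description: "the new record appears in the list",
    holds: s => s.visibleText.includes("Ada Lovelace"),
  },
  {
    description: "the saved email is the email that was typed",
    holds: s => s.fieldValues["email"] === "ada@example.com",
  },
];

// A page that changed and shows no error, but silently dropped the email.
const afterSave: PageState = {
  url: "https://app.example.com/contacts",
  visibleText: ["Contacts", "Ada Lovelace"],
  fieldValues: { email: "" },
};

const results = runAssertions(postConditions, afterSave);
const allPassed = results.every(r => r.passed);
```

In this sketch the first check passes and the second fails, so allPassed is false. A self-grading loop looking at the same page would likely have reported green.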
A separate judge — not the agent that drove the run — reviews the evidence and the assertion results before a PASS is allowed to stand. The author proposes; an independent reviewer disposes. That separation is the whole point.
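One way to picture that separation is a verdict gate that refuses self-review and cannot pass without assertion results. This is a rough sketch under assumed names (RunEvidence, JudgeReview, finalVerdict), not a description of the actual judge.

```typescript
type Verdict = "PASS" | "FAIL";

// Hypothetical record of a run: who drove it and what the harness measured.
interface RunEvidence {
  actorId: string;
  assertionResults: boolean[];
  screenshots: string[];
}

// Hypothetical review from whoever inspected the evidence.
interface JudgeReview {
  judgeId: string;
  approves: boolean;
}

function finalVerdict(run: RunEvidence, review: JudgeReview): Verdict {
  // The actor may not judge its own run.
  if (review.judgeId === run.actorId) return "FAIL";
  // A run with no assertions proved nothing, so it cannot pass.
  if (run.assertionResults.length === 0) return "FAIL";
  // Approval alone is not enough: every assertion must also have held.
  const assertionsHold = run.assertionResults.every(Boolean);
  return assertionsHold && review.approves ? "PASS" : "FAIL";
}
```

The design choice worth noticing is that approval and assertions are both required. A confident reviewer cannot override a failed assertion, and a clean assertion run cannot pass without someone other than the author signing off.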
Pin the verified flow to a deterministic step script with no AI in the loop. It can't drift, can't rationalize, can't be flattered into a green. Re-runs are free within your allowance — perfect for a CI gate on every push.
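A deterministic step script can be as simple as a list of typed steps replayed in order, stopping at the first check that fails. The sketch below assumes a hypothetical Step format and Driver interface, and uses an in-memory FakeFormApp where a real browser driver would go, so it runs without any dependencies.

```typescript
// Hypothetical step format: plain data, no model calls anywhere.
type Step =
  | { action: "goto"; url: string }
  | { action: "fill"; selector: string; value: string }
  | { action: "click"; selector: string }
  | { action: "expectText"; selector: string; text: string };

// Hypothetical driver interface; a real one would wrap a browser.
interface Driver {
  goto(url: string): void;
  fill(selector: string, value: string): void;
  click(selector: string): void;
  textOf(selector: string): string;
}

// Runs one step. Only expectText can fail; the others just act.
function runStep(step: Step, driver: Driver): boolean {
  switch (step.action) {
    case "goto": driver.goto(step.url); return true;
    case "fill": driver.fill(step.selector, step.value); return true;
    case "click": driver.click(step.selector); return true;
    case "expectText": return driver.textOf(step.selector) === step.text;
  }
}

// Replays steps in order and reports the index of the first failure (-1 if none).
function replay(steps: Step[], driver: Driver): { ok: boolean; failedAt: number } {
  const failedAt = steps.findIndex(step => !runStep(step, driver));
  return { ok: failedAt === -1, failedAt };
}

// In-memory stand-in for the app under test, for illustration only.
class FakeFormApp implements Driver {
  private fields = new Map<string, string>();
  private saved = "";
  goto(_url: string): void { this.fields.clear(); }
  fill(selector: string, value: string): void { this.fields.set(selector, value); }
  click(selector: string): void {
    if (selector === "#save") this.saved = this.fields.get("#name") ?? "";
  }
  textOf(selector: string): string {
    return selector === "#saved-name" ? this.saved : "";
  }
}

const script: Step[] = [
  { action: "goto", url: "https://app.example.com/contacts/new" },
  { action: "fill", selector: "#name", value: "Ada Lovelace" },
  { action: "click", selector: "#save" },
  { action: "expectText", selector: "#saved-name", text: "Ada Lovelace" },
];

const outcome = replay(script, new FakeFormApp());
```

Because the script is data and the replay is a loop, the same input gives the same result on every push. In a CI job, a failing outcome would typically be turned into a non-zero exit code.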
Tests that can log in. A built-in test inbox catches real OTPs and magic links, so auth-gated flows actually get exercised instead of stalling at the door. You can finally verify the part of your app that's behind the login, which is most of it.
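As a rough illustration of the inbox step, the sketch below picks the newest message sent to a test address after login was triggered and pulls a code or link out of it. The InboxMessage shape, the six-digit code format, and the function names are all assumptions; real apps vary, and this is not the product's inbox API.

```typescript
// Hypothetical message shape for a test inbox.
interface InboxMessage {
  to: string;
  subject: string;
  body: string;
  receivedAtMs: number;
}

// Newest message for this address that arrived at or after sinceMs.
// Filtering by time avoids reusing a stale code from an earlier run.
function latestMessage(inbox: InboxMessage[], to: string, sinceMs: number): InboxMessage | null {
  const fresh = inbox.filter(m => m.to === to && m.receivedAtMs >= sinceMs);
  if (fresh.length === 0) return null;
  return fresh.reduce((a, b) => (b.receivedAtMs > a.receivedAtMs ? b : a));
}

// Assumes the OTP is a standalone six-digit number.
function extractOtp(body: string): string | null {
  const m = body.match(/\b(\d{6})\b/);
  return m?.[1] ?? null;
}

// Assumes the magic link is the first https URL in the body.
function extractMagicLink(body: string): string | null {
  const m = body.match(/https:\/\/[^\s"'<>]+/);
  return m?.[0] ?? null;
}

const inbox: InboxMessage[] = [
  { to: "qa@example.com", subject: "Your code", body: "Your code is 111111.", receivedAtMs: 1000 },
  { to: "qa@example.com", subject: "Your code", body: "Your code is 482913. It expires in 10 minutes.", receivedAtMs: 5000 },
  { to: "qa@example.com", subject: "Sign in", body: "Click https://app.example.com/auth?token=abc123 to sign in.", receivedAtMs: 4000 },
];

const loginTriggeredAtMs = 3000;
const newest = latestMessage(inbox, "qa@example.com", loginTriggeredAtMs);
const otp = newest ? extractOtp(newest.body) : null;
```

Here the code from the earlier run (received at 1000) is ignored because it predates the login attempt, and otp resolves to "482913".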
Weak points, flagged on every run. Already inside a real browser and past the login, Nocticas surfaces runtime exposure on the same harness: force_browse, IDOR, session-survives-logout, rate-limit gaps, missing headers, leaked secrets. It never certifies you "secure", because the absence of a finding isn't safety; it just shows you where to look. Findings never flip your functional verdict.
We're early and we're honest about it. Chromium only today, solo-built, no false greens — we'd rather show you a red you can trust than a green you can't.
Point Nocticas at the flow your agent just "finished" and get a verdict that ran real assertions — not one the model wrote about itself.