verify-ui-ai
AI-based visual confirmation that a problem is resolved — Level 6 (final-resort) testing. Use ONLY when /verify-ui (mechanical computed-styles or screenshot pixel-diff) cannot reach the assertion AND ...
Level 6 (final resort) visual verification. AI subagent + project-scope test-flow skill produces a structured PASS/FAIL on whether a problem is visually resolved.
When to use
ALL of the following must hold. If any one fails, this skill is the wrong tool:
- The change is on a surface L5 cannot reach: typically a <canvas> element with no stable DOM child, where computed styles don’t apply and screenshot pixel-diff is too noisy (anti-aliasing, sub-pixel rendering, in-flight animation).
- L4 (Playwright E2E) is intractable: the surface is multi-camera, zoom/pan stateful, or canvas-driven such that writing a clean spec is genuinely infeasible, not just “harder than usual.”
- The user explicitly wants AI judgment as the verdict, accepting that the result is one-time evidence, not a reproducible test.
If the surface has a DOM element with computed styles → use /verify-ui (L5).
If the flow can be driven by Playwright → write a .spec.ts (L4).
If the change is logic only → use a unit test (L1).
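The routing rules above can be read as a small decision function. The following is a minimal sketch, not part of the skill itself: the `Surface` shape, its field names, and the order in which the checks run are assumptions introduced only for illustration.

```typescript
// Hypothetical description of the change under test; field names are illustrative.
interface Surface {
  hasComputedStyles: boolean;  // a DOM element with computed styles exists
  playwrightDrivable: boolean; // the flow can be driven by a Playwright spec
  logicOnly: boolean;          // the change is logic only
  userWantsAiVerdict: boolean; // user explicitly accepts one-time AI evidence
}

type Route =
  | "L5 /verify-ui"
  | "L4 .spec.ts"
  | "L1 unit test"
  | "L6 /verify-ui-ai"
  | "none";

function chooseLevel(s: Surface): Route {
  if (s.hasComputedStyles) return "L5 /verify-ui";
  if (s.playwrightDrivable) return "L4 .spec.ts";
  if (s.logicOnly) return "L1 unit test";
  // L6 only when every cheaper level is ruled out AND the user opts in.
  return s.userWantsAiVerdict ? "L6 /verify-ui-ai" : "none";
}
```

A `"none"` result means the preconditions for L6 are not all met, so this skill is the wrong tool and the choice should go back to the user.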
Hard rules
- NEVER for CI. AI verdicts are non-deterministic. A passing CI run on Tuesday tells you nothing about Wednesday’s run.
- NEVER use claude -p. Subagent dispatch goes through the Agent tool exclusively. The Agent tool returns a structured result into the parent’s context, respects session permissions and skills, and is observable in the conversation. claude -p is opaque: it produces stdout text the parent must re-parse, starts a fresh process that may not see project skills, and stalls without clean error signaling.
- NEVER inline verdict criteria into ad-hoc prompts. The criteria live in the test-flow skill so they survive across runs and across team members.
Prefer project-scope test-flow skills
The test-flow skill MUST be authored under .claude/skills/test-flow-<topic>/SKILL.md (project-scope, checked into the repo), not under $HOME/.claude/skills/test-flow-<topic>/ (personal-only).
Why this matters:
- The verdict procedure is part of the codebase’s testing strategy. Teammates need to see it, review it, and update it as the UI evolves.
- Personal-only skills create verdict drift — a teammate running the “same” test against a different procedure produces incomparable results.
- Project-scope skills are picked up by the repo’s setup:doc-skill script and symlinked into ~/.claude/skills/ on each developer’s machine.
- The test-flow skill is the durable artifact of an L6 run. Losing it loses the test.
If you find yourself authoring a test-flow skill under $HOME/.claude/skills/, stop and re-author it under the project’s .claude/skills/. Then commit it in the PR alongside the code that motivated it.
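A repo-side check can catch a misplaced skill before review. This is a minimal sketch under stated assumptions: the helper name is hypothetical, the path is expected to be relative to the repository root, and the lowercase kebab-case slug pattern is inferred from the test-flow-canvas-image-parity example rather than stated as a rule anywhere in this document.

```typescript
// Matches .claude/skills/test-flow-<topic>/SKILL.md relative to the repo root.
// The slug pattern (lowercase, digits, hyphens) is an assumption.
const PROJECT_TEST_FLOW = /^\.claude\/skills\/test-flow-[a-z0-9]+(?:-[a-z0-9]+)*\/SKILL\.md$/;

function isProjectScopeTestFlow(repoRelativePath: string): boolean {
  // Normalize Windows separators so the same pattern works on every platform.
  const normalized = repoRelativePath.replace(/\\/g, "/");
  return PROJECT_TEST_FLOW.test(normalized);
}
```

Paths that begin with $HOME/ or ~/ do not match, which is the point: a personal-scope skill is rejected by construction.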
Workflow
Two halves: author the test-flow skill, then dispatch the verification subagent.
Half 1 — author the test-flow skill
A test-flow skill at .claude/skills/test-flow-<topic>/SKILL.md captures:
- What scenario to drive (the exact user-reproduce flow — open template, drop fixture, click button, etc.)
- What to capture (which screenshots, which DOM measurements, which evidence)
- The verdict criteria (specifically: what counts as PASS vs FAIL, tolerance numbers, threshold ratios). Make these mechanical wherever possible. Only the genuinely subjective part remains AI judgment.
- The output format (a JSON-like structured result with named fields the subagent must return)
The skill is per-task, not per-app. A project will accumulate multiple test-flow skills.
Authoring checklist
- Name follows convention: test-flow-<short-topic-slug> (e.g. test-flow-canvas-image-parity).
- Description includes the trigger keywords plus a one-line “use when”. The test-flow skill is triggered by the verification subagent’s prompt, so it must load when the subagent reads its instructions.
- Body is self-contained — the subagent starts fresh with NO conversation history. Everything needed to drive and verdict the test goes in the body.
- Procedure is numbered and concrete — exact selectors, exact URLs, exact viewport sizes, exact fixture paths.
- Verdict criteria are mechanical where possible (tolerance numbers, pixel deltas) and AI-judgment-only where unavoidable.
- Output schema is explicit: what fields the subagent must return (e.g. stageAImageWidth, stageBImageWidth, ratio, verdict, summary, screenshotPaths).
- Skill is committed to the repo (project-scope), not left in $HOME/.claude/skills/.
Use the skill-creator skill’s init_skill.py to scaffold the new test-flow skill, then write its body.
Half 2 — dispatch the verification subagent
After the test-flow skill is written, dispatch a subagent via the Agent tool:
Agent({
  subagent_type: "general-purpose",
  description: "<short description>",
  prompt: `<self-contained brief, see template below>`,
})
The subagent’s prompt must include:
- Goal: one sentence describing what verdict to produce.
- Skills to load: invoke /test-flow-<topic> (the just-authored skill) AND a browser-driving skill: /verify-ui for computed-styles / screenshot capture, or /headless-browser for multi-step interactive flows.
- Inputs: per-run inputs the test-flow skill needs (preview URL, fixture path, viewport size).
- Output contract: match the output schema declared in the test-flow skill.
Subagent prompt template
You are a verification subagent. Produce a structured verdict using the test-flow skill below.
## Goal
{one-sentence verdict goal, e.g. "Determine whether the Stage B (second-canvas) image visually matches the Stage A (first-canvas) image at default landing viewport."}
## Skills to load
- /test-flow-<topic> — the test procedure and verdict criteria. Read this first.
- /verify-ui — primary browser-driving skill (computed-styles + screenshots).
- /headless-browser — fallback if /verify-ui doesn't fit the task shape.
## Inputs
- Preview URL: <resolved URL — pass from the parent>
- Fixture: <path or asset reference>
- Viewport: <e.g. 1440x900>
- Any other per-run knobs the test-flow skill expects
## Output contract
Return a structured result message containing exactly these fields:
{ <list each field from the test-flow skill's output schema> }
Plus a `summary` field with a one-line human-readable verdict.
## Don'ts
- Don't improvise the test procedure — follow /test-flow-<topic> exactly.
- Don't change the verdict tolerance — it is locked in /test-flow-<topic>.
- Don't post anywhere — return the result to me; I (the parent agent) handle posting.
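Filling the template by hand invites drift between runs, so the parent may prefer to generate the brief from per-run inputs. The sketch below is one way to do that, not the skill's prescribed method. The `RunInputs` shape and the function name are hypothetical, and the wording mirrors the template above with lightly adjusted punctuation.

```typescript
// Hypothetical per-run inputs; field names are illustrative.
interface RunInputs {
  topic: string;          // e.g. "canvas-image-parity"
  goal: string;           // one-sentence verdict goal
  previewUrl: string;
  fixture: string;
  viewport: string;       // e.g. "1440x900"
  outputFields: string[]; // copied from the test-flow skill's output schema
}

function buildVerificationPrompt(i: RunInputs): string {
  const skill = `/test-flow-${i.topic}`;
  return [
    "You are a verification subagent. Produce a structured verdict using the test-flow skill below.",
    "## Goal",
    i.goal,
    "## Skills to load",
    `- ${skill}: the test procedure and verdict criteria. Read this first.`,
    "- /verify-ui: primary browser-driving skill (computed-styles + screenshots).",
    "- /headless-browser: fallback if /verify-ui doesn't fit the task shape.",
    "## Inputs",
    `- Preview URL: ${i.previewUrl}`,
    `- Fixture: ${i.fixture}`,
    `- Viewport: ${i.viewport}`,
    "## Output contract",
    "Return a structured result message containing exactly these fields:",
    `{ ${i.outputFields.join(", ")} }`,
    "Plus a `summary` field with a one-line human-readable verdict.",
    "## Don'ts",
    `- Don't improvise the test procedure; follow ${skill} exactly.`,
    `- Don't change the verdict tolerance; it is locked in ${skill}.`,
    "- Don't post anywhere; return the result to the parent agent.",
  ].join("\n");
}
```

Note that the builder takes no verdict criteria as input. Tolerances and thresholds stay in the test-flow skill, which keeps the "never inline criteria" rule enforceable by the function signature.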
After the subagent returns
The parent receives the structured result and decides what to do with it: attach to the PR as evidence, write to a tracked evidence directory, gate a workflow step, etc. The test-flow skill stays on disk for reuse — next time the same test class is needed, the existing skill is invoked without re-authoring.
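Before acting on the result, the parent can check that the subagent actually honored the output contract. The helper below is a hypothetical sketch. It only checks for the presence of fields, and the field names in the usage lines are taken from the example schema in this document, so a real list would come from the test-flow skill in use.

```typescript
// Hypothetical helper: list required fields that are absent from a subagent result.
function missingFields(result: unknown, required: readonly string[]): string[] {
  // A non-object result satisfies nothing, so every field counts as missing.
  if (typeof result !== "object" || result === null) return [...required];
  const record = result as Record<string, unknown>;
  return required.filter((field) => record[field] === undefined);
}

// Usage: field names follow the example schema; adjust per test-flow skill.
const requiredFields = ["stageAImageWidth", "stageBImageWidth", "ratio", "verdict", "summary"];
const gaps = missingFields(
  { ratio: 1.01, verdict: "PASS", summary: "within tolerance" },
  requiredFields,
);
console.log(gaps); // lists the two missing width fields
```

A non-empty list is a reason to treat the run as inconclusive rather than as a PASS, since a verdict without its supporting measurements cannot be audited later.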
Archive results for auditability
Because L6 verdicts are non-repeatable, archive the evidence after every run:
- Screenshots from the run
- The structured verdict (the full output schema, not just PASS/FAIL)
- The exact prompt the subagent ran
- Any internal-state dumps captured during the run
Default destination: the GitHub issue or PR comment that drove the work. Screenshots accumulate quickly across runs; committing them into the repo bloats the working tree over time. Issue/PR comments keep the evidence linked to the conversation that produced it and the repo stays lean. The repo-side alternative — an evidence/ directory or similar — is a valid choice when issue/PR comments aren’t available or when project policy requires evidence to travel with the code. Absent a specific policy, post to the issue/PR comment.
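The four evidence items above can be bundled into a single comment body so that nothing is archived piecemeal. This is a minimal sketch: the `Evidence` shape and function name are hypothetical, and posting the text to an issue or PR is deliberately left to whatever tooling the parent already uses.

```typescript
// Hypothetical evidence bundle for one L6 run.
interface Evidence {
  verdict: Record<string, unknown>; // the full structured result, not just PASS/FAIL
  prompt: string;                   // the exact prompt the subagent ran
  screenshotPaths: string[];
  stateDumps: string[];             // paths or inline dumps captured during the run
}

function formatEvidenceComment(e: Evidence): string {
  // Render a list, or an explicit placeholder so gaps are visible in the archive.
  const bullets = (items: string[]): string[] =>
    items.length > 0 ? items.map((item) => `- ${item}`) : ["- (none)"];
  return [
    "L6 verification evidence (one-time, not reproducible)",
    "",
    "Structured verdict:",
    JSON.stringify(e.verdict, null, 2),
    "",
    "Subagent prompt:",
    e.prompt,
    "",
    "Screenshots:",
    ...bullets(e.screenshotPaths),
    "",
    "Internal-state dumps:",
    ...bullets(e.stateDumps),
  ].join("\n");
}
```

Writing "(none)" instead of omitting an empty section makes a missing state dump obvious to whoever reads the archive later.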
Choosing the browser-driving skill — primary vs fallback
| Skill | Best for | When to fall back |
|---|---|---|
| /verify-ui | Deterministic computed-style checks; cross-stage CSS / layout parity assertions | Cannot drive multi-step UI flows beyond single-page reads |
| /headless-browser | Multi-step interactive flows (drag-drop a file, click → screenshot → click → screenshot); element bounding-rect reads via Playwright CLI | Slightly heavier; only use when /verify-ui can’t reach the test surface |
The test-flow skill should name BOTH so the subagent picks based on the task shape. If /verify-ui reports that it cannot perform the flow, the subagent switches to /headless-browser without re-prompting the parent.
Reusability — the test-flow skill outlives the test
A test-flow skill is not a one-shot scaffold for a single PR. It is a permanent artifact that captures “how to verify this class of behavior in this codebase.” When a similar test is needed later (regression check, repeated verification across PRs), invoke the same test-flow skill — the AI subagent gets the same context and produces consistent verdicts.
Signs that you’re using this pattern correctly:
- The test-flow skill is checked into the project’s .claude/skills/ (project-scope, shared with the team).
- Subsequent invocations DO NOT re-author the skill. They just dispatch a fresh subagent that loads it.
- Updates to the procedure happen by editing the test-flow skill, not by inlining new instructions in the subagent prompt.
Risks and limitations
- Non-deterministic. Same flow, different verdicts across runs. A single PASS is one data point.
- Cost-bearing. Each run spawns a subagent, drives a browser, consumes tokens. Cost is non-trivial even locally; on CI it compounds rapidly and unpredictably.
- Hallucination risk. An AI judge can confidently report PASS on a broken UI if criteria are vague. Tighten criteria with explicit thresholds wherever possible.
- Verdict drift across model versions. Edge cases may judge differently as models change. Lock procedure in the test-flow skill; don’t inline criteria into ad-hoc prompts.
- Not reproducible. A failed run cannot be replayed deterministically. The output schema must include screenshots and internal-state dumps so failures are investigable even when not reproducible.
Example skeleton — what a real test-flow skill looks like
---
name: test-flow-canvas-image-parity
description: Verify the Stage B (second-canvas) image visually matches the Stage A (first-canvas) image at default landing viewport. Use when /verify-ui-ai dispatches a subagent for canvas-image-parity verification.
---
# Test flow: canvas image parity (Stage A vs Stage B)
## Scenario
1. Open <preview URL from inputs> at viewport 1440x900.
2. Open the entry that lands on Stage A (the first canvas).
3. Begin the selection / cropping mode for Stage A.
4. Drop the fixture image (e.g. `<repo>/e2e/fixtures/red-100-fits-canvas.png`) onto the Stage A canvas layer.
5. Capture screenshot A (Stage A with the placed image visible).
6. Advance to Stage B (e.g. click the "commit and open next stage" affordance).
7. Wait for the Stage B canvas to mount and become visible.
8. Capture screenshot B (Stage B with the placed image visible).
## Measurements
- Stage A image width (CSS px): read via a test-only hook on the Stage A layer state plus the Stage A canvas CSS scale.
- Stage B image width (CSS px): read via a test-only hook on the Stage B state plus the Stage B camera zoom and canvas CSS rect.
- ratio = stageBImageWidth / stageAImageWidth.
## Verdict
PASS if ratio ∈ [0.95, 1.05] (±5%). FAIL otherwise.
## Output schema
{
  stageAImageWidth: number,
  stageBImageWidth: number,
  ratio: number,
  delta: number,
  verdict: "PASS" | "FAIL",
  summary: string,
  stageAScreenshot: string (path),
  stageBScreenshot: string (path),
  toolUsed: "verify-ui" | "headless-browser"
}
The example shows the shape; the verification subagent reads this and follows the procedure verbatim. For the broader testing-strategy context, see the project’s Level 6 documentation page (src/content/docs/testing-levels/level-6-ai-based-verification.mdx).
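Because the skeleton's verdict rule is purely numeric, the parent can recompute it from the returned widths as a guard against a hallucinated PASS. The following is a minimal sketch assuming the ±5% tolerance from the skeleton. The function name is hypothetical, and `delta` is left out because the skeleton does not define how it is computed.

```typescript
type Verdict = "PASS" | "FAIL";

interface ParityCheck {
  ratio: number;
  verdict: Verdict;
}

// Recompute ratio = stageB / stageA and apply the [1 - tol, 1 + tol] window.
function judgeParity(
  stageAImageWidth: number,
  stageBImageWidth: number,
  tolerance = 0.05, // ±5%, as in the skeleton's Verdict section
): ParityCheck {
  // A non-positive or non-finite width cannot yield a meaningful ratio: fail closed.
  if (!(stageAImageWidth > 0) || !Number.isFinite(stageAImageWidth) || !Number.isFinite(stageBImageWidth)) {
    return { ratio: Number.NaN, verdict: "FAIL" };
  }
  const ratio = stageBImageWidth / stageAImageWidth;
  const inWindow = ratio >= 1 - tolerance && ratio <= 1 + tolerance;
  return { ratio, verdict: inWindow ? "PASS" : "FAIL" };
}
```

If the recomputed verdict disagrees with the one the subagent reported, the disagreement itself is worth archiving with the run's evidence.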