zudo-test-wisdom

verify-ui-ai

AI-based visual confirmation that a problem is resolved — Level 6 (final-resort) testing. Use ONLY when /verify-ui (mechanical computed-styles or screenshot pixel-diff) cannot reach the assertion AND ...

Level 6 (final resort) visual verification. An AI subagent, following a project-scope test-flow skill, produces a structured PASS/FAIL verdict on whether a problem is visually resolved.

When to use

ALL of the following must hold. If any one fails, this skill is the wrong tool:

  1. The change is on a surface that Level 5 (/verify-ui) cannot reach: typically a <canvas> element with no stable DOM child, where computed styles don’t apply and screenshot pixel-diff is too noisy (anti-aliasing, sub-pixel rendering, in-flight animation).
  2. Level 4 (Playwright E2E) is intractable: the surface is multi-camera, zoom/pan stateful, or canvas-driven such that writing a clean spec is genuinely infeasible, not just “harder than usual.”
  3. The user explicitly wants AI judgment as the verdict, accepting that the result is one-time evidence, not a reproducible test.

If the surface has a DOM element with computed styles → use /verify-ui (L5). If the flow can be driven by Playwright → write a .spec.ts (L4). If the change is logic only → use a unit test (L1).
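The routing rule above can be read as a small decision function. The sketch below is only an illustration: `SurfaceTraits`, `chooseLevel`, and the order in which the deterministic levels are checked are assumptions made for this example, not part of any skill.

```typescript
// Hypothetical description of the surface under test; field names are illustrative.
interface SurfaceTraits {
  logicOnly: boolean;            // the change is logic only, nothing visual to check
  hasStyledDomElement: boolean;  // a DOM element with computed styles exists
  drivableByPlaywright: boolean; // a clean .spec.ts is feasible
  userAcceptsAiVerdict: boolean; // the user explicitly wants AI judgment as the verdict
}

type TestLevel = "L1" | "L4" | "L5" | "L6" | "none";

// Deterministic levels win whenever they apply; L6 is reached only when
// none of them fit AND the user accepts one-time AI evidence.
function chooseLevel(surface: SurfaceTraits): TestLevel {
  if (surface.logicOnly) return "L1";
  if (surface.hasStyledDomElement) return "L5";
  if (surface.drivableByPlaywright) return "L4";
  return surface.userAcceptsAiVerdict ? "L6" : "none";
}
```

A result of `"none"` here stands for the case where this skill is the wrong tool and no cheaper level fits either, which calls for a conversation with the user rather than a test run.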

Hard rules

  • NEVER for CI. AI verdicts are non-deterministic. A passing CI run on Tuesday tells you nothing about Wednesday’s run.
  • NEVER use claude -p. Subagent dispatch goes through the Agent tool exclusively. The Agent tool returns a structured result into the parent’s context, respects session permissions and skills, and is observable in the conversation. claude -p is opaque — it produces stdout text the parent must re-parse, starts a fresh process that may not see project skills, and stalls without clean error signaling.
  • NEVER inline verdict criteria into ad-hoc prompts. The criteria live in the test-flow skill so they survive across runs and across team members.

Prefer project-scope test-flow skills

The test-flow skill MUST be authored under .claude/skills/test-flow-<topic>/SKILL.md (project-scope, checked into the repo), not under $HOME/.claude/skills/test-flow-<topic>/ (personal-only).

Why this matters:

  • The verdict procedure is part of the codebase’s testing strategy. Teammates need to see it, review it, and update it as the UI evolves.
  • Personal-only skills create verdict drift — a teammate running the “same” test against a different procedure produces incomparable results.
  • Project-scope skills are picked up by the repo’s setup:doc-skill script and symlinked into ~/.claude/skills/ on each developer’s machine.
  • The test-flow skill is the durable artifact of an L6 run. Losing it loses the test.

If you find yourself authoring a test-flow skill under $HOME/.claude/skills/, stop and re-author it under the project’s .claude/skills/. Then commit it in the PR alongside the code that motivated it.
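A quick way to catch a misplaced skill is to classify the path it was authored under. The following is a minimal sketch, assuming the caller supplies the repository root and home directory as strings; `classifySkillPath` is a hypothetical helper, and it does not resolve symlinks, so a project skill that has been symlinked into ~/.claude/skills/ would be reported by the location it is read from.

```typescript
type SkillScope = "project" | "personal" | "unknown";

// Normalizes separators and strips trailing slashes so prefixes compare cleanly.
function normalizeDir(p: string): string {
  return p.replace(/\\/g, "/").replace(/\/+$/, "");
}

// Classifies a SKILL.md path by the skills directory it sits under.
// The project check runs first because a repo usually lives inside the home directory.
function classifySkillPath(skillPath: string, repoRoot: string, homeDir: string): SkillScope {
  const p = normalizeDir(skillPath);
  if (p.indexOf(normalizeDir(repoRoot) + "/.claude/skills/") === 0) return "project";
  if (p.indexOf(normalizeDir(homeDir) + "/.claude/skills/") === 0) return "personal";
  return "unknown";
}
```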

Workflow

Two halves: author the test-flow skill, then dispatch the verification subagent.

Half 1 — author the test-flow skill

A test-flow skill at .claude/skills/test-flow-<topic>/SKILL.md captures:

  • What scenario to drive (the exact user-reproduce flow — open template, drop fixture, click button, etc.)
  • What to capture (which screenshots, which DOM measurements, which evidence)
  • The verdict criteria (specifically: what counts as PASS vs FAIL, tolerance numbers, threshold ratios). Make these mechanical wherever possible. Only the genuinely subjective part remains AI judgment.
  • The output format (a JSON-like structured result with named fields the subagent must return)

The skill is per-task, not per-app. A project will accumulate multiple test-flow skills.

Authoring checklist

  • Name follows convention: test-flow-<short-topic-slug> (e.g. test-flow-canvas-image-parity).
  • Description includes the trigger keywords plus a one-line “use when” — the test-flow skill is triggered by the verification subagent’s prompt, so it must load when the subagent reads its instructions.
  • Body is self-contained — the subagent starts fresh with NO conversation history. Everything needed to drive and verdict the test goes in the body.
  • Procedure is numbered and concrete — exact selectors, exact URLs, exact viewport sizes, exact fixture paths.
  • Verdict criteria are mechanical where possible (tolerance numbers, pixel deltas) and AI-judgment-only where unavoidable.
  • Output schema is explicit — what fields the subagent must return (e.g. stageAImageWidth, stageBImageWidth, ratio, verdict, summary, screenshotPaths).
  • Skill is committed to the repo (project-scope), not left in $HOME/.claude/skills/.

Use the skill-creator skill’s init_skill.py to scaffold the new test-flow skill, then write its body.
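Some checklist items are mechanical enough to lint before committing. The sketch below is a hypothetical helper, not part of skill-creator: `SkillDraft` is an in-memory stand-in for a parsed SKILL.md, the lowercase kebab-case slug pattern is an assumption, and the required headings are taken from the example skeleton on this page.

```typescript
// Hypothetical in-memory view of a SKILL.md draft; this is not a real parser.
interface SkillDraft {
  skillName: string;
  description: string;
  body: string;
}

// Returns a list of checklist problems; an empty list means the mechanical checks passed.
function lintTestFlowSkill(draft: SkillDraft): string[] {
  const problems: string[] = [];
  // Assumed slug shape: lowercase words or digits joined by single hyphens.
  if (!/^test-flow-[a-z0-9]+(-[a-z0-9]+)*$/.test(draft.skillName)) {
    problems.push("name must look like test-flow-<short-topic-slug>");
  }
  if (!/use when/i.test(draft.description)) {
    problems.push('description should include a one-line "use when"');
  }
  for (const heading of ["## Scenario", "## Verdict", "## Output schema"]) {
    if (draft.body.indexOf(heading) === -1) {
      problems.push('body is missing "' + heading + '"');
    }
  }
  return problems;
}
```

Items such as "body is self-contained" or "procedure is concrete" still need a human review; the lint only covers what can be checked from the text.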

Half 2 — dispatch the verification subagent

After the test-flow skill is written, dispatch a subagent via the Agent tool:

Agent({
  subagent_type: "general-purpose",
  description: "<short description>",
  prompt: `<self-contained brief — see template below>`,
})

The subagent’s prompt must include:

  • Goal: one sentence describing what verdict to produce.
  • Skills to load: invoke /test-flow-<topic> (the just-authored skill) AND a browser-driving skill — /verify-ui for computed-styles / screenshot capture, or /headless-browser for multi-step interactive flows.
  • Inputs: per-run inputs the test-flow skill needs (preview URL, fixture path, viewport size).
  • Output contract: match the output schema declared in the test-flow skill.

Subagent prompt template

You are a verification subagent. Produce a structured verdict using the test-flow skill below.

## Goal
{one-sentence verdict goal, e.g. "Determine whether the Stage B (second-canvas) image visually matches the Stage A (first-canvas) image at default landing viewport."}

## Skills to load
- /test-flow-<topic>  — the test procedure and verdict criteria. Read this first.
- /verify-ui          — primary browser-driving skill (computed-styles + screenshots).
- /headless-browser   — fallback if /verify-ui doesn't fit the task shape.

## Inputs
- Preview URL: <resolved URL — pass from the parent>
- Fixture: <path or asset reference>
- Viewport: <e.g. 1440x900>
- Any other per-run knobs the test-flow skill expects

## Output contract
Return a structured result message containing exactly these fields:
{ <list each field from the test-flow skill's output schema> }

Plus a `summary` field with a one-line human-readable verdict.

## Don'ts
- Don't improvise the test procedure — follow /test-flow-<topic> exactly.
- Don't change the verdict tolerance — it is locked in /test-flow-<topic>.
- Don't post anywhere — return the result to me; I (the parent agent) handle posting.
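To keep briefs uniform across runs, the parent can fill the template from per-run inputs instead of retyping it. This is a minimal sketch under assumptions: `BriefInputs` and `buildSubagentBrief` are hypothetical names, and the returned string is meant to be passed as the `prompt` of the Agent tool call shown earlier.

```typescript
// Per-run inputs for the brief. All field names here are illustrative.
interface BriefInputs {
  goal: string;           // one-sentence verdict goal
  topic: string;          // the <topic> in /test-flow-<topic>
  previewUrl: string;
  fixture: string;
  viewport: string;       // e.g. "1440x900"
  outputFields: string[]; // copied from the test-flow skill's output schema
}

// Fills the template sections in order and returns the prompt text.
function buildSubagentBrief(inputs: BriefInputs): string {
  const flowSkill = "/test-flow-" + inputs.topic;
  return [
    "You are a verification subagent. Produce a structured verdict using the test-flow skill below.",
    "",
    "## Goal",
    inputs.goal,
    "",
    "## Skills to load",
    "- " + flowSkill + " (the test procedure and verdict criteria; read this first)",
    "- /verify-ui (primary browser-driving skill)",
    "- /headless-browser (fallback if /verify-ui doesn't fit the task shape)",
    "",
    "## Inputs",
    "- Preview URL: " + inputs.previewUrl,
    "- Fixture: " + inputs.fixture,
    "- Viewport: " + inputs.viewport,
    "",
    "## Output contract",
    "Return exactly these fields: " + inputs.outputFields.join(", ") + ", plus a one-line summary.",
    "",
    "## Don'ts",
    "- Don't improvise the test procedure; follow " + flowSkill + " exactly.",
    "- Don't change the verdict tolerance; it is locked in " + flowSkill + ".",
    "- Don't post anywhere; return the result to the parent agent.",
  ].join("\n");
}
```

Note that the verdict criteria never appear in the generated text; the brief only points at the test-flow skill, which is where they live.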

After the subagent returns

The parent receives the structured result and decides what to do with it: attach to the PR as evidence, write to a tracked evidence directory, gate a workflow step, etc. The test-flow skill stays on disk for reuse — next time the same test class is needed, the existing skill is invoked without re-authoring.
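Before the parent attaches or gates on a result, it can check that the result actually has the declared shape. The sketch below assumes the field names from the example skeleton on this page; `ParityVerdict` and `isParityVerdict` are hypothetical names, and a real test-flow skill would dictate its own schema.

```typescript
// Subset of the example output schema that the parent relies on.
interface ParityVerdict {
  stageAImageWidth: number;
  stageBImageWidth: number;
  ratio: number;
  verdict: "PASS" | "FAIL";
  summary: string;
}

// Type guard: true only when every required field is present with the right type.
function isParityVerdict(value: unknown): value is ParityVerdict {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  for (const field of ["stageAImageWidth", "stageBImageWidth", "ratio"]) {
    const n = record[field];
    if (typeof n !== "number" || !isFinite(n)) return false;
  }
  const verdict = record["verdict"];
  return (verdict === "PASS" || verdict === "FAIL") && typeof record["summary"] === "string";
}
```

A result that fails the guard is best treated as "no verdict" rather than as a FAIL, since it says nothing about the UI.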

Archive results for auditability

Because L6 verdicts are non-repeatable, archive the evidence after every run:

  • Screenshots from the run
  • The structured verdict (the full output schema, not just PASS/FAIL)
  • The exact prompt the subagent ran
  • Any internal-state dumps captured during the run

Default destination: the GitHub issue or PR comment that drove the work. Screenshots accumulate quickly across runs; committing them into the repo bloats the working tree over time. Issue/PR comments keep the evidence linked to the conversation that produced it and the repo stays lean. The repo-side alternative — an evidence/ directory or similar — is a valid choice when issue/PR comments aren’t available or when project policy requires evidence to travel with the code. Absent a specific policy, post to the issue/PR comment.
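The four archive items can be assembled into one comment body before posting. This is a sketch only: `EvidenceBundle` and `renderEvidenceComment` are hypothetical names, the layout is an assumption, and the actual posting step is left to whatever tooling the project already uses.

```typescript
// Everything worth archiving from one L6 run. Field names are illustrative.
interface EvidenceBundle {
  verdictJson: string;       // the full structured verdict, serialized
  subagentPrompt: string;    // the exact prompt the subagent ran
  screenshotLinks: string[]; // links or paths to the run's screenshots
  stateDumps: string[];      // internal-state dumps; may be empty
}

// Indents every line by four spaces so multi-line text stays visually grouped.
function indentBlock(text: string): string {
  return text.split("\n").map((line) => "    " + line).join("\n");
}

// Renders the bundle as plain text suitable for an issue or PR comment.
function renderEvidenceComment(bundle: EvidenceBundle): string {
  const sections: string[] = [
    "L6 verification evidence (one-time, not reproducible)",
    "",
    "Structured verdict:",
    indentBlock(bundle.verdictJson),
    "",
    "Subagent prompt:",
    indentBlock(bundle.subagentPrompt),
    "",
    "Screenshots:",
  ];
  for (const link of bundle.screenshotLinks) sections.push("- " + link);
  if (bundle.stateDumps.length > 0) {
    sections.push("", "Internal-state dumps:");
    for (const dump of bundle.stateDumps) sections.push(indentBlock(dump));
  }
  return sections.join("\n");
}
```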

Choosing the browser-driving skill — primary vs fallback

| Skill | Best for | When to fall back |
| --- | --- | --- |
| /verify-ui | Deterministic computed-style checks; cross-stage CSS / layout parity assertions | Cannot drive multi-step UI flows beyond single-page reads |
| /headless-browser | Multi-step interactive flows (drag-drop a file, click → screenshot → click → screenshot); element bounding-rect reads via Playwright CLI | Slightly heavier; only use when /verify-ui can’t reach the test surface |

The test-flow skill should name BOTH so the subagent picks based on the task shape. If /verify-ui returns “cannot perform this flow”, the subagent switches to /headless-browser without re-prompting the parent.
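The subagent makes this switch by following its instructions, not by running code, but the intended order can be modelled as a small fallback loop. In the sketch below, `runWithFallback`, `Attempt`, and the `attempt` callback are hypothetical stand-ins for "try to drive the flow with this skill".

```typescript
type BrowserSkill = "verify-ui" | "headless-browser";

// A driver attempt either yields evidence or reports that it cannot perform the flow.
type Attempt<T> = { ok: true; evidence: T } | { ok: false; reason: string };

// Tries /verify-ui first, then /headless-browser, and records which one succeeded.
function runWithFallback<T>(
  attempt: (skill: BrowserSkill) => Attempt<T>,
): { toolUsed: BrowserSkill; evidence: T } {
  const order: BrowserSkill[] = ["verify-ui", "headless-browser"];
  const reasons: string[] = [];
  for (const skill of order) {
    const outcome = attempt(skill);
    if (outcome.ok) return { toolUsed: skill, evidence: outcome.evidence };
    reasons.push(skill + ": " + outcome.reason);
  }
  // Neither skill could drive the flow; surface both reasons to the parent.
  throw new Error("no browser-driving skill could perform the flow (" + reasons.join("; ") + ")");
}
```

Recording `toolUsed` matches the field of the same name in the example output schema, so the parent can see which path produced the evidence.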

Reusability — the test-flow skill outlives the test

A test-flow skill is not a one-shot scaffold for a single PR. It is a permanent artifact that captures “how to verify this class of behavior in this codebase.” When a similar test is needed later (regression check, repeated verification across PRs), invoke the same test-flow skill: each fresh subagent then works from the same procedure and criteria, which keeps verdicts comparable across runs even though individual runs remain non-deterministic.

Signs that you’re using this pattern correctly:

  • The test-flow skill is checked into the project’s .claude/skills/ (project-scope, shared with the team).
  • Subsequent invocations DO NOT re-author the skill — they just dispatch a fresh subagent that loads it.
  • Updates to the procedure happen by editing the test-flow skill, not by inlining new instructions in the subagent prompt.

Risks and limitations

  • Non-deterministic. Same flow, different verdicts across runs. A single PASS is one data point.
  • Cost-bearing. Each run spawns a subagent, drives a browser, consumes tokens. Cost is non-trivial even locally; on CI it compounds rapidly and unpredictably.
  • Hallucination risk. An AI judge can confidently report PASS on a broken UI if criteria are vague. Tighten criteria with explicit thresholds wherever possible.
  • Verdict drift across model versions. Edge cases may judge differently as models change. Lock procedure in the test-flow skill; don’t inline criteria into ad-hoc prompts.
  • Not reproducible. A failed run cannot be replayed deterministically. The output schema must include screenshots and internal-state dumps so failures are investigable even when not reproducible.

Example skeleton — what a real test-flow skill looks like

---
name: test-flow-canvas-image-parity
description: Verify the Stage B (second-canvas) image visually matches the Stage A (first-canvas) image at default landing viewport. Use when /verify-ui-ai dispatches a subagent for canvas-image-parity verification.
---

# Test flow: canvas image parity (Stage A vs Stage B)

## Scenario
1. Open <preview URL from inputs> at viewport 1440x900.
2. Open the entry that lands on Stage A (the first canvas).
3. Begin the selection / cropping mode for Stage A.
4. Drop the fixture image (e.g. `<repo>/e2e/fixtures/red-100-fits-canvas.png`) onto the Stage A canvas layer.
5. Capture screenshot A (Stage A with the placed image visible).
6. Advance to Stage B (e.g. click the "commit and open next stage" affordance).
7. Wait for the Stage B canvas to mount and become visible.
8. Capture screenshot B (Stage B with the placed image visible).

## Measurements
- Stage A image width (CSS px): read via a test-only hook on the Stage A layer state plus the Stage A canvas CSS scale.
- Stage B image width (CSS px): read via a test-only hook on the Stage B state plus the Stage B camera zoom and canvas CSS rect.
- ratio = stageBImageWidth / stageAImageWidth.

## Verdict
PASS if ratio ∈ [0.95, 1.05] (±5%). FAIL otherwise.

## Output schema
{
  stageAImageWidth: number,
  stageBImageWidth: number,
  ratio: number,
  delta: number,
  verdict: "PASS" | "FAIL",
  summary: string,
  stageAScreenshot: string, // file path
  stageBScreenshot: string, // file path
  toolUsed: "verify-ui" | "headless-browser"
}

The example shows the shape; the verification subagent reads this and follows the procedure verbatim. For the broader testing-strategy context, see the project’s Level 6 documentation page (src/content/docs/testing-levels/level-6-ai-based-verification.mdx).
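The mechanical part of the example's verdict is simple enough to state as code. This is a minimal sketch of the ±5% rule from the skeleton's Verdict section; `judgeParity` is a hypothetical name, and the `delta` field is left out because the skeleton does not define how it is computed.

```typescript
// Inclusive bounds from the example: PASS if ratio is within [0.95, 1.05].
const RATIO_MIN = 0.95;
const RATIO_MAX = 1.05;

// Computes ratio = stageB / stageA and applies the tolerance band.
function judgeParity(stageAImageWidth: number, stageBImageWidth: number) {
  if (!isFinite(stageAImageWidth) || !isFinite(stageBImageWidth) || stageAImageWidth <= 0) {
    throw new Error("widths must be finite and the Stage A width must be positive");
  }
  const ratio = stageBImageWidth / stageAImageWidth;
  const verdict: "PASS" | "FAIL" = ratio >= RATIO_MIN && ratio <= RATIO_MAX ? "PASS" : "FAIL";
  return { stageAImageWidth, stageBImageWidth, ratio, verdict };
}
```

For instance, `judgeParity(100, 104)` yields a ratio of 1.04 and a PASS, while `judgeParity(200, 100)` yields 0.5 and a FAIL. Keeping this step arithmetic leaves only the screenshot comparison to AI judgment.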
