Skip to content

Create an explainer-review skill.#39

Open
jyasskin wants to merge 3 commits into
w3ctag:mainfrom
jyasskin:review-skill
Open

Create an explainer-review skill.#39
jyasskin wants to merge 3 commits into
w3ctag:mainfrom
jyasskin:review-skill

Conversation

@jyasskin

Copy link
Copy Markdown
Member

This is missing a bunch of infrastructure that we ought to use to test and optimize it, but it's working reasonably well to catch common mistakes in the explainers I've run it over. I'd love to get other folks' AI-optimization tips.

@jyasskin

jyasskin commented Apr 29, 2026

Copy link
Copy Markdown
Member Author

One tester ran this over a Google doc, where one of the doc's comments advised the author to add some IDL to the proposed solution. This caused the LLM to incorrectly attribute the "add IDL" advice to the TAG, even though https://www.w3.org/TR/explainer-explainer/#template mentions "Do not include WebIDL in this section." This skill suggests loading https://raw.githubusercontent.com/w3ctag/explainer-explainer/refs/heads/main/index.bs instead, which only includes a bikeshed instruction to include that content, so the LLM probably didn't load our contrary advice. We could fix this by having the skill say to also load the template, but since the template structure is optional, I think we'll be better off ensuring that all guidance in the template is also somewhere in the main document. This is probably also better for human readers.

@marcoscaceres

Copy link
Copy Markdown

I've been experimenting with similar self-review prompts. Some ideas for where this could scale, responding to your invitation for tips.

On the misattribution bug: it points to a structural principle. LLMs are bad at distinguishing "this document says X" from "someone near this document said X." A guardrail that helps: require every recommendation to cite a specific section of a fetched reference doc. If the tool can't point to a source, it can't make the claim. Pair with an explicit marker distinguishing "documented guidance [source, §section]" from "tool's analysis (not a TAG position)."

What if this covered the full review surface? The review templates ask about the Explainer Explainer, Design Principles, and S&P Questionnaire explicitly. But the TAG also evaluates accessibility, i18n, and societal impact during reviews. A more comprehensive self-assessment could help authors address all of those before submitting.

Possible architecture:

  1. Fetch & classify — Read the explainer, fetch reference docs, assess maturity. Coach early drafts; analyze polished ones.
  2. Structural completeness — ≈ your current skill.
  3. Domain detection — Determine which deeper checks apply based on what the explainer describes. A networking API skips i18n. A UI feature triggers the accessibility screener questions. Skip irrelevant domains explicitly.
  4. Question-driven checks — For each relevant domain, ask what the source document prescribes:
  5. Analytical checks (clearly marked as tool's analysis, not TAG guidance) — gap search for existing specs, audience simulation (naive/malicious/third-party hands), implementability, multi-stakeholder signals.
  6. Verdict table — Dimensions, status, action needed. Tiered: blocking / gap / suggestion.

Key guardrails:

  • No claims about TAG behavior ("the TAG typically..." is wrong, as your bug showed)
  • Citation-only: every recommendation must cite a fetched source section
  • Epistemic markers: "documented guidance [R#, §X]" vs. "tool's analysis"
  • Graceful degradation: if URLs are listed explicitly, the skill works as a copy-paste prompt in any LLM

Happy to help iterate on any of this.

@marcoscaceres

Copy link
Copy Markdown

Following up on the architecture I sketched above, here's roughly what I'm thinking as a more complete skill definition. I haven't tested this yet, so treat it as a strawperson for discussion rather than something ready to ship.

The idea is that this could work as a copy-paste prompt in any LLM (the reference URLs are listed explicitly), or as a structured skill in tools that support them.

Explainer Pre-Flight skill (draft)
# Explainer Pre-Flight

Comprehensive self-review for web platform explainers before TAG design review. Covers structural completeness, design principles, security/privacy, accessibility, internationalization, societal impact, and implementability.

## References

DO NOT rely on memorized copies of these documents. ALWAYS fetch them to get the latest version. When citing a section, verify the anchor exists in the copy you fetched.

| # | Document | URL |
|---|----------|-----|
| R1 | Explainer Explainer | https://w3ctag.github.io/explainer-explainer/ |
| R2 | Web Platform Design Principles | https://www.w3.org/TR/design-principles/ |
| R3 | Self-Review Questionnaire: Security and Privacy | https://www.w3.org/TR/security-privacy-questionnaire/ |
| R4 | Ethical Web Principles | https://www.w3.org/TR/ethical-web-principles/ |
| R5 | Societal Impact Questionnaire | https://w3ctag.github.io/societal-impact-questionnaire/ |
| R6 | Accessibility Screener | https://w3ctag.github.io/accessibility-screener/ |
| R7 | Internationalization Best Practices for Spec Developers | https://www.w3.org/TR/international-specs/ |

For the Design Principles (R2), which is large: if you can run subagents, use one to extract relevant sections without consuming your main context.

## Guardrails

These rules apply to ALL output from this skill:

1. **Citation-only.** Every recommendation must cite a specific section of a fetched reference document. If you cannot point to a source, you cannot make the claim.
2. **Epistemic markers.** Label every finding as either:
   - "Documented guidance [R#, §section]" — a requirement from a TAG/W3C document
   - "Tool's analysis" — your reasoning, explicitly not a TAG position
3. **No claims about TAG behavior.** Never say "the TAG typically...", "reviewers often...", or "the TAG will likely..." These predictions are unreliable and undermine credibility. Say what the *documents* say.
4. **Fetch-and-quote.** When citing a reference, quote the relevant passage. If a fetch fails, suppress that check with a note rather than proceeding from memory.

## Workflow

### Step 1: Fetch & Classify

1. Fetch the target explainer.
2. Fetch reference documents R1-R7. Note any fetch failures.
3. Classify the explainer's maturity:
   - **Skeletal:** Missing most expected sections.
   - **Developing:** Sections present but thin or incomplete.
   - **Polished:** Substantive content throughout.

Maturity determines voice: coach skeletal explainers ("when you write this section, address..."), analyze polished ones ("your accessibility section mentions X but the screener also asks about Y because...").

**STOP** if the explainer cannot be fetched or is empty. Report the failure only.

### Step 2: Structural Completeness (R1)

Check the explainer against the Explainer Explainer's key components:
- User need / problem statement
- Why existing technology is insufficient
- Proposed solution with code examples
- Alternatives considered
- Security and Privacy considerations
- Open questions / known limitations

For each: present (quote it), missing (explain what belongs there, cite R1), or weak (quote what's there vs. what R1 asks for).

**STOP** if the explainer is skeletal to the point of being circular ("the feature is needed because it is useful"). Report structural gaps only. Deeper analysis of a phantom helps no one.

### Step 3: Domain Detection

Based on what the explainer describes, determine which Step 4 checks apply:

| Signal in explainer | Triggers |
|---------------------|----------|
| UI elements, forms, input, keyboard, visual rendering | Accessibility (R6) |
| Text, locale, encoding, bidi, non-Latin scripts | i18n (R7) |
| User data, permissions, device/sensor access, state | S&P (R3) |
| Societal-scale impact, surveillance, centralization, environmental | Societal (R4, R5) |
| Any web-facing feature | Design Principles (R2) — always runs |

Report which checks will run and which are skipped (with one-line reason).

### Step 4: Question-Driven Checks

For each triggered domain:

**4a. Design Principles (R2)**
Identify the 3-5 principles most relevant to this proposal. Quote each verbatim. Assess compliance. At minimum always check: "Put user needs first", "It should be safe to visit a web page", "Prefer simple solutions."

**4b. Security & Privacy (R3)**
Identify relevant questions from the questionnaire. Has the explainer answered them (inline or linked)? Flag unanswered questions that clearly apply.

**4c. Accessibility (R6)**
Run the screener questions. If any trigger positive, note that Accessible Platform Architectures (APA) review will likely be needed. Cite relevant WCAG success criteria if identifiable.

**4d. Internationalization (R7)**
Check relevant items: text handling, locale sensitivity, bidi, encoding, character processing. Flag gaps.

**4e. Societal Impact (R4 + R5)**
Check relevant ethical principles (R4) and societal impact questions (R5). Environmental sustainability applies if the feature has energy/resource/network implications.

**4f. Implementability** *(Tool's analysis)*
- Does the proposed solution demonstrate the author understands what building this requires?
- Are there performance, engine architecture, or cross-platform feasibility concerns unacknowledged?
- Do the "alternatives considered" reflect real engineering trade-offs, or just aesthetic preferences?

### Step 5: Gap & Stakeholder Analysis *(Tool's analysis)*

Clearly mark this entire section as analytical, not TAG guidance.

- **Gap search:** Are there existing specs (W3C, WHATWG, IETF, TC39) that already solve part or all of this problem? If found, quote the relevant section. Frame as "these existing specs may be relevant" not "this is already solved."
- **Audience simulation:** What happens when this feature is used by:
  - A developer who doesn't read the docs (naive hands)
  - Someone deliberately trying to abuse it (malicious hands)
  - A third-party script without user-interest alignment
- **Multi-stakeholder signals:** Is there engagement beyond a single vendor? Are browser positions filed? This maps to the TAG review form's "Feedback so far" field.

### Step 6: Verdict Table

Present findings in a summary table:

| Dimension | Status | Source | Action needed |
|-----------|--------|--------|---------------|
| Problem statement | Clear / Vague / Missing | R1 | |
| Existing solutions | Addressed / Gap / Not checked | Analysis | |
| Proposed solution | Clear / Hand-wavy / Missing | R1 | |
| Design Principles | Compliant / Issues / Conflicts | R2 §X | |
| S&P Questionnaire | Answered / Gaps / Missing | R3 | |
| Accessibility | N/A / Addressed / Gaps | R6 | |
| Internationalization | N/A / Addressed / Gaps | R7 | |
| Societal/environmental | N/A / Addressed / Gaps | R4, R5 | |
| Implementability | Demonstrated / Unclear | Analysis | |
| Multi-stakeholder | Broad / Emerging / Single-vendor | Analysis | |
| Alternatives considered | Substantive / Shallow / Missing | R1 | |
| Ready for TAG review | Yes / After [list] / Not yet || |

Tier each finding: **Blocking** (must fix before submitting), **Gap** (should address), **Suggestion** (would strengthen).

## Constraints

- DO NOT claim the TAG will or won't approve a proposal. This tool helps authors prepare; it does not predict outcomes.
- Help authors keep the explainer concise. If something is already well-handled, don't belabor it.
- Express analytical findings (Steps 4f, 5) tentatively. Frame as questions to consider, not verdicts.
- If a reference doc fetch fails, state which check was skipped and why. Do not proceed from memory.

The guardrails section is the part I feel strongest about. The misattribution bug you found is exactly the kind of thing "citation-only" and "no claims about TAG behavior" are designed to prevent. The domain detection (Step 3) is the part I'm least sure about: whether skipping checks is worth the complexity vs. just running everything every time.

Curious what you think, and whether this is roughly the right scope or if it should be narrower for a first pass.

@marcoscaceres

marcoscaceres commented May 6, 2026

Copy link
Copy Markdown

Based on your feedback, I iterated the skill through three rounds and validated against actual TAG reviews. Here's what I found.

Changes since v2

  • Explicit severity criteria (Blocking/Gap/Suggestion with calibration test)
  • Document validation gate (rejects non-explainers)
  • "Closest existing primitive" check — asks whether the explainer argues why extending an existing API was rejected (this is the most common TAG architectural objection in my testing)
  • Interop coercion risk check
  • R4 (Ethical Web Principles) for societal-impact proposals
  • Tone: "the documents ask about X" not "you failed at X"; grouped thematically

Validation against real TAG reviews

Tested against 9 proposals from w3ctag/design-reviews + 1 charter (which the tool correctly rejects):

Proposal TAG review TAG outcome Tool: B/G/S
Topics API #726 Unsatisfied 3/11/3
Document PiP #798 Unsatisfied 3/8/4
Prompt API #1093 Lack of consensus 8/15/0
CPU Performance API #1198 Open (concerns raised) 3/10/3
EyeDropper API #587 Satisfied 0/3/1
scheduler.postTask #338 Satisfied 0/2/1

Calibration holds: TAG-unsatisfied → 3+ Blocking. TAG-satisfied → 0 Blocking, short report.

Caveat: I'm running the tool against the current explainers but comparing against TAG feedback from the time of review. Some explainers have been updated since. This means some "tool misses" might be things the explainer already fixed post-review.

What the tool catches vs. misses

Catches: Structural gaps (R1), S&P questionnaire (R3), priority of constituencies violations (R2 §1.1), "why not extend the existing primitive?" (R2 §new-features), fingerprinting, accessibility screener (R6), interop coercion, ethical/societal concerns (R4).

Misses (inherent): "You're solving the wrong problem," specific architectural alternatives that require platform expertise, cross-feature interactions, OS-level implementability, cultural/contextual concerns.

The boundary: The tool reads the explainer's arguments and evaluates their completeness against reference documents. It catches "your argument is incomplete" but can't reach "you should use a different API entirely." That's the TAG's highest-value contribution and isn't replaceable.

Open questions

  1. Is domain detection (Step 3) worth the complexity? Checks that find nothing produce no output anyway.
  2. Should the "closest existing primitive" check be more aggressive (Blocking instead of Gap when the explainer doesn't address the obvious primitive at all)?
  3. I tested with Claude. Would be great to see how this performs with Gemini — the skill is designed to be model-agnostic so it should work as a copy-paste prompt. If you get different calibration results, that's useful eval data.
Updated skill (v3)
---
name: explainer-preflight
description: Use when reviewing a web platform explainer or proposal before TAG design review. Also use when asked to check an explainer against TAG documents, prepare for a design review submission, or self-assess an explainer's readiness. Not for reviewing finished specs, implementations, or TAG review responses.
---

# Explainer Pre-Flight

Self-review tool for web platform explainers before TAG design review. Walks through TAG reference documents and surfaces gaps: where the explainer is thin or silent on things the TAG documents ask for.

## When NOT to Use

- Reviewing a finished W3C specification (use spec review tooling)
- Reviewing an implementation or code (not a design review tool)
- Responding to TAG review feedback (that's a different workflow)
- Evaluating TAG reviews themselves

## References

ALWAYS fetch these from the internet to get the latest version. Never rely on memorized copies. When linking to anchors, verify that the anchor exists in the copy you just downloaded, and that the content of the section supports the point it's being cited to support. Check citations before presenting them to the user, using a subagent if possible.

| # | Document | URL |
|---|----------|-----|
| R1 | Explainer Explainer | https://w3ctag.github.io/explainer-explainer/ |
| R2 | Web Platform Design Principles | https://www.w3.org/TR/design-principles/ |
| R3 | Self-Review Questionnaire: Security and Privacy | https://www.w3.org/TR/security-privacy-questionnaire/ |
| R4 | Ethical Web Principles | https://www.w3.org/TR/ethical-web-principles/ |
| R5 | Societal Impact Questionnaire | https://w3ctag.github.io/societal-impact-questionnaire/ |
| R6 | Accessibility Screener | https://w3ctag.github.io/accessibility-screener/ |
| R7 | Internationalization Best Practices for Spec Developers | https://www.w3.org/TR/international-specs/ |

For Design Principles (R2), which is large: use a subagent to extract relevant sections if possible.

## Guardrails

1. **Citation-only.** Every finding must cite a specific section of a fetched reference document, verified by subagent (Step 5). No source, no claim.
2. **Epistemic markers.** Label findings: "Documented guidance [R#, §section]" vs. "Tool's analysis."
3. **No claims about TAG behavior.** Never "the TAG typically..." or "reviewers often..." Say what the documents say.
4. **Gaps only.** Do not mention things the explainer handles adequately. Do not compliment. Output is proportional to problems.

## Severity Criteria

Consistent severity is critical. Use these definitions:

- **Blocking:** The TAG review form explicitly requires this (e.g., R1 says "key components" or R3 requires dedicated considerations sections), AND the explainer is completely silent on it. A reviewer cannot proceed without this information.
- **Gap:** A reference document asks a relevant question or states a relevant principle, AND the explainer is thin or silent, but the review could proceed. The TAG would likely raise this but it wouldn't stop the review.
- **Suggestion:** The tool's own analysis identifies something that would strengthen the explainer, or a reference document principle is tangentially relevant. Not something the documents require.

**Calibration test:** If the finding is "section X is entirely missing and R1 lists it as a key component" → Blocking. If "R3 asks question Y and the explainer doesn't answer it but also doesn't claim to" → Gap. If "the explainer could be stronger here" → Suggestion.

## Workflow

### Step 0: Document Validation

Before starting analysis, confirm the target URL is actually an explainer:
- Does it have a problem statement, proposed solution, or similar structure?
- If it appears to be a polyfill README, spec text, changelog, or redirect: **STOP.** Report that the URL is not an explainer. If you can identify the actual explainer (e.g., linked from the repo, from a TAG design review issue, or from an Open UI/WICG page), note its location and ask whether to analyze that instead.

### Step 1: Fetch & Classify

1. Fetch the target explainer.
2. Fetch reference documents R1-R7. Note any fetch failures.
3. Classify maturity: **Skeletal** (missing most sections), **Developing** (thin), **Polished** (substantive throughout).

Maturity determines voice: coach skeletal ("when you write this section, address..."), analyze polished directly.

**STOP** if the explainer cannot be fetched or is empty.

### Step 2: Structural Completeness (R1)

Check against the Explainer Explainer's key components. Report only gaps:
- Missing: explain what belongs there, cite R1.
- Weak: quote what R1 asks for, show where the explainer falls short.

**STOP** if skeletal to the point of being circular. Report structural gaps only.

### Step 3: Domain Detection

| Signal in explainer | Triggers |
|---------------------|----------|
| UI, forms, input, keyboard, visual rendering | Accessibility (R6) |
| Text, locale, encoding, bidi, non-Latin scripts | i18n (R7) |
| User data, permissions, device/sensor access, state | S&P (R3) |
| Societal-scale impact, surveillance, centralization, AI/ML, advertising, content generation | Societal (R4, R5) |
| Any web-facing feature | Design Principles (R2) — always |

Report which checks run and which are skipped (one-line reason).

### Step 4: Question-Driven Checks

For each triggered domain, report only where the explainer is thin or silent:

- **4a. Design Principles (R2):** Surface only principles the explainer doesn't address. Quote each and explain relevance.
- **4b. Security & Privacy (R3):** Flag only unanswered questions that clearly apply.
- **4c. Accessibility (R6):** If screener triggers positive, note APA review needed. Cite WCAG.
- **4d. Internationalization (R7):** Flag only gaps.
- **4e. Societal Impact (R4 + R5):** Flag only gaps.
- **4f. Platform Fit** *(Tool's analysis, informed by R1 and R2):*
  - **Closest existing primitive:** What is the closest existing web platform feature to this proposal? (e.g., `window.open()` for new windows, Popover API for overlay UI, MediaCapabilities for media adaptation). Does the explainer explicitly argue why extending that primitive was rejected? R2 (§prefer-simple-solutions) states "prefer simple solutions" and (§new-features) states "add new capabilities with consideration of existing functionality." If the explainer proposes a new API surface without addressing the obvious existing primitive, flag it. This is the single most common TAG architectural objection.
  - **Interop coercion risk:** If the feature only delivers value when all browsers implement it, and there's no graceful degradation path, sites may coerce users into specific browsers. R2 (§devices-platforms) requires features to work "across a range of platforms." Flag if the explainer doesn't discuss what happens when the feature is absent and whether sites could weaponize feature detection.
  - **Cross-feature interactions:** Does the explainer address how this interacts with other platform features that operate in the same user-facing domain? (e.g., hover behavior interacting with Speculation Rules prefetch-on-hover; new window APIs overlapping with Popover; device APIs overlapping with existing sensor APIs). If the feature shares a user-facing surface with another feature and the interaction is unaddressed, flag it.
  - **Evidence of utility:** R1 (§user-research) encourages user research. Beyond that, does the explainer provide evidence that the proposed solution actually works? Not theoretical scenarios, but concrete developer feedback, usage data, or prototype results demonstrating the design is correct.

### Step 5: Citation Verification

Before presenting findings, verify every citation. Use a subagent if possible.

For each citation:
1. **Anchor exists** in the fetched document.
2. **Content supports the claim** being made.

Remove any citation that fails. Drop any finding with no surviving citation. This is architectural, not optional.

### Step 6: Gaps Report

For each document analyzed, present only gaps:

> **Design Principles (R2):** Your explainer doesn't address [principle] (§anchor), which is relevant because [quoted passage].
>
> **S&P Questionnaire (R3):** Questions N and M are relevant but aren't addressed. [Quote what they ask.]

- Tier findings: **Blocking** / **Gap** / **Suggestion** per the Severity Criteria above.
- End with one line: "Ready for TAG review" or "Address N blocking gaps first."

**Tone for proponents:** This report is meant to help authors prepare, not to gatekeep. Frame findings as "the documents ask about X, and your explainer doesn't address it yet" rather than "your explainer fails to..." The goal is that an author reads this and knows exactly what to add, not that they feel judged.

**Structure for readability:** Group findings thematically (security cluster, structural cluster, platform fit cluster) rather than as a flat numbered list. A report with "8 Blocking" findings feels like a failing grade. A report organized as "here's a security cluster you need to address, and separately here's a structural cluster" feels like actionable guidance.

## Common Mistakes

| Mistake | What happens |
|---------|--------------|
| Analyzing a non-explainer document (polyfill README, spec text, redirect) | Produces 10+ structural "blocking" findings that are just "section missing" — useless noise |
| Citing an anchor without verifying content supports the claim | Misattribution: tool says TAG requires X when the document says the opposite |
| Running from memorized/cached documents | Stale guidance: checks against outdated versions that may have changed |
| Praising what's done well | Verbosity explosion, especially for large documents like Design Principles |
| Claiming "the TAG will..." | Undermines credibility; LLMs guess wrong about reviewer behavior |
| Everything is "Blocking" | Severity inflation makes the report feel hostile; use the calibration test |

## Evaluation

When this skill is modified, compare outputs before and after:

1. Run old version over test explainers (good, mediocre, bad).
2. Run new version over same explainers.
3. Present what changed: gaps added/removed, citation changes, severity shifts.

Use whatever tooling is available (parallel agents, worktrees, diff tools). If none, run sequentially and diff outputs.

Happy to run it on more proposals if you want to kick the tires.

@marcoscaceres

Copy link
Copy Markdown

updated the "v2 skill" above using /superpowers' /writing-skills.

@marcoscaceres

Copy link
Copy Markdown

Iterated further. A few refinements since v3:

  • Finding rollup. Previous versions over-flagged polished explainers because walking every R6 question and every R3 question produced a separate Gap even when the underlying issue was just "no Accessibility section." Now: a missing structural section produces one rolled-up Gap citing R1's key component 6, with the implied downstream questions enumerated inline. Sub-questions are emitted separately only when the section exists but is thin. On the validation set: EyeDropper 9 → 3 findings, Spell Check 14 → 10, interesttarget held at 4 Blocking with Gaps dropping 15 → 6.
  • FAST pointer. When R6 Q1/Q4/Q5 trigger Yes and the UI is substantive, the skill now suggests engaging APA FAST for deeper self-assessment. Marked Suggestion — TAG docs don't mandate FAST, but APA expects it at horizontal review.
  • WG-driven case. Step 0 now checks whether the explainer already links to horizontal-review issues in w3c/a11y-request, w3c/i18n-request, w3cping/privacy-request, w3c/security-request. If yes, the report cites those rather than re-deriving the concerns.
  • Citation cleanup. An ai-skeptic pass caught a fabricated R2 anchor (#support-lower-end-devices doesn't exist; real one is #devices-platforms) and a paraphrased R2 quote. Both fixed. R3 question count tightened from "~22" to exactly 22.
v4 skill
---
name: explainer-preflight
description: Use when reviewing a web platform explainer or proposal before TAG design review, when asked to check an explainer against TAG documents, when preparing for a design review submission, or when self-assessing an explainer's readiness. Also use when the user shares a WICG, Open UI, or explainer URL and asks "is this ready for TAG review", "what would TAG flag", or "what's missing." Not for reviewing finished W3C specifications, implementations, code, TAG review responses, or for evaluating TAG reviews themselves.
---

# Explainer Pre-Flight

Self-review for web platform explainers before they go to TAG design review. Walks the TAG reference documents and reports only gaps — where the explainer is thin or silent on things those documents ask about. A good explainer gets a short report; a thin one gets a longer one. Output is proportional to problems.

The tool catches mechanical gaps (missing sections, unanswered S&P questions, ignored design principles). It does not replicate the architectural wisdom that makes TAG reviews valuable ("should this be part of Popover instead?"). That is the reviewer's job.

## When not to use

- Reviewing a finished W3C specification → use spec-review tooling instead.
- Reviewing an implementation or code → this is a design-review tool, not a code reviewer.
- Responding to TAG review feedback you've already received → different workflow (drafting a response to the TAG, not preparing for one).
- Evaluating a TAG review itself.
- Reviewing a charter, Working Group process doc, or horizontal-review request — these aren't explainers.

## Reference documents

Fetch all seven from the network every run. Never rely on memorized copies — guidance changes.

| # | Document | URL |
|---|----------|-----|
| R1 | Explainer Explainer | https://w3ctag.github.io/explainer-explainer/ |
| R2 | Web Platform Design Principles | https://www.w3.org/TR/design-principles/ |
| R3 | Self-Review Questionnaire: Security and Privacy | https://www.w3.org/TR/security-privacy-questionnaire/ |
| R4 | Ethical Web Principles | https://www.w3.org/TR/ethical-web-principles/ |
| R5 | Societal Impact Questionnaire (Editor's Draft) | https://w3ctag.github.io/societal-impact-questionnaire/ |
| R6 | Accessibility Screener | https://w3ctag.github.io/accessibility-screener/ |
| R7 | Internationalization Best Practices for Spec Developers | https://www.w3.org/TR/international-specs/ |

R2 is large. Dispatch a subagent to extract only the principles relevant to the proposal's domain. R5 is an Editor's Draft; note that when citing.

## Rules for every finding

1. **Cite or drop.** Each finding must quote a specific passage from one of R1–R7 and link the source anchor. If Step 5 (Citation Verification) can't confirm the anchor exists and supports the claim, the finding goes in the bin.
2. **Label the source.** Prefix findings with the document tag: *"R3, §2.16:"*, *"R1, §introduction (key components):"*. For tool-derived analysis (Step 4f), prefix with *"Tool's analysis:"* so the reader knows the documents don't explicitly require it.
3. **Describe what the documents say, not what reviewers will do.** Never *"the TAG typically…"*, *"reviewers often…"*, *"Anne would push back on this."* LLMs guess wrong about human behavior. Stick to what the fetched documents state.
4. **Report gaps only.** Don't list things the explainer handles adequately. Don't compliment. Authors can read praise elsewhere.

## Severity

Three tiers. Apply them consistently — inflation makes the report feel like a failing grade and authors stop reading.

- **Blocking** — the document explicitly requires it *and* the explainer is completely silent. Example: R1 lists "Security and Privacy Considerations" as a key component and the explainer has no such section.
- **Gap** — a document asks a relevant question and the explainer doesn't answer. Example: R3 Question 2.3 applies but isn't addressed. The review could still proceed; TAG would likely raise it.
- **Suggestion** — something the tool's own analysis flags, or a principle that's tangentially relevant. The documents don't mandate it. Example: the explainer could strengthen the alternatives section.

**Calibration test** before writing a finding: can you point to a verbatim requirement in a fetched document? If yes → Blocking or Gap depending on silence vs. partial answer. If no → Suggestion.

## Finding rollup

Structural gaps imply downstream gaps. An explainer with no Accessibility Considerations section at all will, by definition, fail to address R6 Q1, Q5, Q6, etc. Emitting each of those as a separate Gap makes a short-handed explainer look catastrophically broken — it doesn't. Roll them up.

**Rule:** when a structural finding (e.g., "no Accessibility Considerations section" from R1 §introduction key component 6) would force downstream question-level Gaps, emit the structural finding *once* and enumerate the implied downstream questions *inside* it. Do not count each implied question as a separate Gap.

**Worked example** — an explainer with a Privacy section but no Security section, no Accessibility section, and no i18n section:

-**Gap** — R1 §introduction key component 6 + R3 §2.16: no dedicated Security Considerations section, no Accessibility Considerations section, no Internationalization Considerations section. When you write the Accessibility section, R6 Q1 (new visual UI), Q5 (input interpretation change), and Q6 (modality constraint) will apply to this feature and need specific answers.
- ❌ Do not emit: separate Gaps for R3 §2.16, R6 Q1, R6 Q5, R6 Q6, and the R1 missing-section finding. That's five Gaps counting the same underlying problem five ways.

If the explainer *does* have an Accessibility section but it's thin, then R6 questions become separate Gaps — one per unanswered question — because each is an independent authoring failure, not an implied consequence of a missing section.

## Workflow

The six steps are serial. Each has a stop condition.

### Step 0: Confirm the URL is an explainer

Before fetching anything else, sanity-check the target:
- Does it have a problem statement, proposed API/solution, and at least some use cases? That's an explainer.
- If it looks like a polyfill README, a spec document (has "Status of this Document"), a changelog, a redirect stub, or a charter, **STOP**. Report what the URL actually is. If the explainer lives elsewhere (linked from the repo, the TAG design-review issue, or a WICG/Open UI page), point at it and ask whether to analyze that.
- Note the origin. If the explainer is from a formal Working Group (rather than an incubation venue like WICG/Open UI), check whether it already links to horizontal-review issues in `w3c/a11y-request`, `w3c/i18n-request`, `w3cping/privacy-request`, or `w3c/security-request`. If yes, cite those in the report rather than re-deriving the concerns the horizontal-review groups already raised.

Why this matters: running structural checks against a non-explainer produces ten "section missing" findings that are noise, not signal. Gemini did this on the interesttarget polyfill README in v1 testing.

### Step 1: Fetch and classify

1. Fetch the explainer. If it 404s, try one retry with a likely fallback URL (e.g., `raw.githubusercontent.com` for GitHub blob links, the repo's `main` branch if a specific commit is referenced). If the TAG design-review issue for this proposal is accessible, check it — the real explainer URL is usually linked there. Only STOP if no fallback finds a reachable explainer.
2. Fetch R1–R7. Note any fetch failures; don't substitute from memory. If any reference document 403s, try the raw URL or an alternate mirror before marking it unfetched.
3. Classify maturity:
   - **Skeletal** — most R1 key components missing.
   - **Developing** — sections present but thin.
   - **Polished** — substantive content throughout.

Maturity sets the voice, not the rigor. For skeletal: *"when you write this section, address…"*. For polished: *"R3 §2.8 asks X; the explainer addresses only Y."*

### Step 2: Structural completeness against R1

R1's `#introduction` lists six key components (fetch for the current list; at time of writing they are: Discussion Venues, User-Facing Problem, Proposed Approach, Practical Use Cases, Alternatives Considered, and "Accessibility, Internationalization, Privacy, and Security Considerations"). For each:
- **Missing** — flag as **Blocking** and quote what R1 says that section should contain.
- **Present but thin** — flag as **Gap**, quote the R1 requirement, point at the relevant paragraph in the explainer that falls short.

If the explainer is skeletal enough that Step 2 alone would produce 5+ Blocking findings — **STOP** here. Report the structural gaps and recommend the author come back after the explainer is substantive. Running deeper checks on a stub produces noise.

### Step 3: Detect which deeper checks apply

Scan the explainer for signals, then decide which of R3–R7 will run:

| Signal | Triggers |
|--------|----------|
| Any web-facing feature | R2 Design Principles — **always** |
| UI, forms, input, keyboard, visual rendering, audio | R6 Accessibility — **always if feature has user-facing surface** |
| Text, locale, encoding, bidi, non-Latin scripts, dates/numbers | R7 i18n |
| User data, permissions, device/sensor access, storage | R3 S&P |
| Societal-scale impact, surveillance, centralization, AI/ML, advertising | R4 + R5 Societal |

Report which checks ran and which were skipped (one line each). Skipped ≠ hidden — the author should see what was considered.

### Step 4: Walk the applicable reference documents

Run each triggered check and collect findings. Report only gaps.

#### 4a. Design Principles (R2) — always

R2 is large. Use a subagent to extract principles relevant to this proposal. For each relevant principle, check whether the explainer addresses it. Surface only principles the explainer ignores or contradicts. Quote the principle and explain why it applies to this feature.

#### 4b. Security & Privacy (R3)

R3 has 22 numbered questions (section 2). A subagent can check each against the explainer. Flag only questions that (a) clearly apply to this feature and (b) the explainer doesn't answer. Also verify R3 §2.16 ("Does this specification have both 'Security Considerations' and 'Privacy Considerations' sections?"): the explainer should have dedicated Security Considerations *and* Privacy Considerations sections — if either is missing or merged without calling both out, that's a Gap.

#### 4c. Accessibility (R6) — walk every question

R6 is short (7 questions at time of writing). Fetch it and walk each question in order. For each:
1. Answer the question based on the explainer ("Yes", "No", or "Unclear").
2. If the answer is Yes or Unclear, check whether the explainer contains an accessibility discussion that addresses the implication. R1's sixth key component explicitly demands this.
3. If the explainer has no accessibility section at all, emit **ONE rolled-up Gap** per the Finding rollup rule — not seven separate Gaps. The rolled-up finding cites R1 §introduction key component 6, and *enumerates inline* which of Q1–Q7 would apply and what the author needs to address when they write the section. Do not emit a separate Gap per question in this case.
4. If the explainer *has* an accessibility section but it doesn't address a specific question's implication, *that* is a separate Gap — each is an independent authoring failure, not an implied consequence.
5. After Q1–Q7, note whether the explainer acknowledges R6 itself (many authors don't know to file a screener issue — R6's own text says the screener tags APAWG on a filed GitHub issue for horizontal-review preparation). If the explainer doesn't mention filing a screener, that's a **Suggestion**.

Cite R6's question text verbatim and the form-field anchor (`#q1_yes` through `#q7_yes` — R6 doesn't have heading anchors). If a Yes triggers broader WCAG concerns (time-limited UI, keyboard inaccessibility), cite WCAG by section.

**When to point at FAST.** R6 is a short screener, not a full accessibility review. If Q1, Q4, or Q5 answered Yes and the UI is substantive (not a trivial affordance like a color picker's cursor change), note that the explainer's accessibility section should engage the APA FAST checklist for deeper self-assessment: https://w3c.github.io/fast/checklist.html. FAST covers visual rendering, user input, semantics, time-based media, audio, i18n hooks, APIs, protocols, and fallback — roughly 80 checkpoints. Flag this as a **Suggestion** (the TAG documents don't currently require FAST engagement; APA does for horizontal review).

#### 4d. Internationalization (R7)

Only if Step 3 triggered. Walk R7's checklist items relevant to the feature — text direction, locale sensitivity, formatting of numbers/dates/currencies, Unicode handling. Flag only unaddressed items.

#### 4e. Societal Impact (R4 + R5)

Only if Step 3 triggered. R4 is the published Ethical Web Principles; R5 is an Editor's Draft of the Societal Impact Questionnaire — label R5 citations as "ED, subject to change." Focus on principles/questions the explainer doesn't engage with when they clearly apply (e.g., a surveillance-capable API that doesn't discuss R4's privacy principle).

#### 4f. Platform Fit *(Tool's analysis, informed by R1 and R2)*

This is the tool's highest-value check per validation. Four sub-checks, in priority order:

- **Closest existing primitive.** What is the closest existing web platform feature to this proposal? (E.g., `<input type="color">` for color pickers, Popover API for overlay UI, MediaCapabilities for media adaptation.) Does the explainer explicitly argue why extending that primitive was rejected? R2 §new-features ("Add new capabilities with care") says: *"Add new capabilities to the web with consideration of existing functionality and content."* If the explainer proposes a new API surface without addressing the obvious existing primitive, flag it as **Gap** (or **Blocking** if there is no Alternatives Considered section at all). In validation this is the most common TAG architectural objection.

- **Interop coercion risk.** If the feature only delivers value when all browsers implement it, and there's no graceful degradation path, sites may pressure users into specific browsers. R2 §devices-platforms ("Support the full range of devices and platforms") and the priority-of-constituencies principle are relevant. Flag if the explainer doesn't discuss behavior when the feature is absent and whether sites could weaponize feature detection.

- **Cross-feature interactions.** Does the explainer address how this interacts with other platform features sharing the same user-facing surface? Examples: hover-to-preview interacting with Speculation Rules prefetch-on-hover; new window APIs overlapping with Popover; new sensor APIs overlapping with existing Generic Sensor APIs. If it shares a surface and the interaction is unaddressed, flag it.

- **Evidence of utility.** R1 §end-user-need ("Explain the End-User's need") asks authors to ground the proposal in a concrete user need. Does the explainer provide *evidence* that the solution works — concrete developer feedback, origin-trial data, bug links, prototype results? Theoretical scenarios are not evidence. Flag if none is provided.

### Step 5: Citation verification

Before writing the report, verify every citation. Dispatch a subagent (or do it inline if no subagents are available). For each finding:

1. **Does the anchor exist** in the fetched document?
2. **Does the content at that anchor support the claim** being made?

Remove any citation that fails. If a finding has no surviving citation, drop the finding. This is the check that prevents the most damaging failure mode: attributing advice to the TAG that the documents contradict.

### Step 6: Write the report

Structure findings thematically, not as a flat list:

> **Structural (R1):** [Blocking/Gap findings here]
>
> **Security and Privacy (R3):** [findings]
>
> **Accessibility (R6, walking questions 1–7):** [findings, tied to specific questions]
>
> **Platform fit (Tool's analysis):** [findings]
>
> **Skipped checks:** i18n (no text handling), Societal (no surveillance surface) — one line each.

Tone: *"R3 §2.8 asks about X, and the explainer doesn't address it yet"* — not *"your explainer fails to…"*. Authors should finish reading knowing what to add, not feeling judged.

End with one line: **"Ready for TAG review"** or **"N Blocking, M Gap findings — address Blocking items first."** Nothing else.

## Common failure modes

| Failure | Why it hurts | Prevention |
|---------|--------------|------------|
| Analyzing a non-explainer (polyfill README, spec, redirect stub) | Produces structural noise, not findings | Step 0 |
| Citing an anchor without re-reading the content at it | Misattribution — the document may say the opposite | Step 5 |
| Running against memorized/cached documents | Stale guidance; references update | Step 1: fetch every time |
| Praising work done well | Verbosity, especially with R2 | Rule 4: gaps only |
| "The TAG will…" / "Reviewers typically…" | Undermines credibility | Rule 3 |
| Severity inflation (everything Blocking) | Authors disengage | Calibration test before labeling |

## Evaluating changes to this skill

Test cases and assertions live in `evals/evals.json` alongside this skill. Four validated test explainers cover the calibration range:

| Explainer | TAG outcome | Expected skill output |
|-----------|-------------|-----------------------|
| EyeDropper API | Satisfied (#587) | 0 Blocking, ≤4 findings total |
| Spell Check Dictionary | Satisfied with concerns (#1191) | 0 Blocking (structure present), 5–10 Gaps |
| interesttarget (Interest Invokers) | Reopened after initial review (#1058) | 3+ Blocking, detailed platform-fit findings |
| Compression Dictionary Transport README | Not an explainer | Step 0 rejects, no structural findings |

When editing this skill:
1. Snapshot the current version to `SKILL.md.vN-snapshot` before changing anything.
2. Run the snapshot against all four test explainers (save to `evals/iteration-N/old_skill/`).
3. Run the new version against the same four (save to `evals/iteration-N/with_skill/`).
4. Diff the findings. Each change should be explicable in one sentence — *"removed the always-flag-BFCache rule → BFCache Gap dropped for EyeDropper, correctly"*.

If subagents are available, run the eight runs (4 explainers × 2 skill versions) in parallel. If not, run sequentially and diff the output files.

@christianliebel christianliebel self-assigned this Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants