diff --git a/templates/commands/checklist.md b/templates/commands/checklist.md index 2e1b1040af..6a3d18ed37 100644 --- a/templates/commands/checklist.md +++ b/templates/commands/checklist.md @@ -11,20 +11,14 @@ scripts: **NOT for verification/testing**: -- ❌ NOT "Verify the button clicks correctly" - ❌ NOT "Test error handling works" -- ❌ NOT "Confirm the API returns 200" - ❌ NOT checking if code/implementation matches the spec **FOR requirements quality validation**: -- ✅ "Are visual hierarchy requirements defined for all card types?" (completeness) - ✅ "Is 'prominent display' quantified with specific sizing/positioning?" (clarity) -- ✅ "Are hover state requirements consistent across all interactive elements?" (consistency) -- ✅ "Are accessibility requirements defined for keyboard navigation?" (coverage) -- ✅ "Does the spec define what happens when logo image fails to load?" (edge cases) -**Metaphor**: If your spec is code written in English, the checklist is its unit test suite. You're testing whether the requirements are well-written, complete, unambiguous, and ready for implementation - NOT whether the implementation works. +**Metaphor**: If your spec is code written in English, the checklist is its unit test suite - testing whether the requirements are well-written, complete, unambiguous, and ready for implementation, NOT whether the implementation works. ## User Input @@ -37,13 +31,10 @@ You **MUST** consider the user input before proceeding (if not empty). ## Pre-Execution Checks **Check for extension hooks (before checklist generation)**: -- Check if `.specify/extensions.yml` exists in the project root. -- If it exists, read it and look for entries under the `hooks.before_checklist` key +- If `.specify/extensions.yml` exists in the project root, read it and look for entries under the `hooks.before_checklist` key - If the YAML cannot be parsed or is invalid, skip hook checking silently and continue normally -- Filter out hooks where `enabled` is explicitly `false`. Treat hooks without an `enabled` field as enabled by default. -- For each remaining hook, do **not** attempt to interpret or evaluate hook `condition` expressions: - - If the hook has no `condition` field, or it is null/empty, treat the hook as executable - - If the hook defines a non-empty `condition`, skip the hook and leave condition evaluation to the HookExecutor implementation +- Filter out hooks where `enabled` is explicitly `false`; treat hooks without an `enabled` field as enabled by default +- For each remaining hook, do **not** attempt to interpret or evaluate hook `condition` expressions: if the hook has no `condition` field (or it is null/empty), treat it as executable; if it defines a non-empty `condition`, skip it and leave condition evaluation to the HookExecutor implementation - For each executable hook, output the following based on its `optional` flag: - **Optional hook** (`optional: true`): ``` @@ -83,30 +74,24 @@ You **MUST** consider the user input before proceeding (if not empty). - Prefer precision over breadth Generation algorithm: - 1. Extract signals: feature domain keywords (e.g., auth, latency, UX, API), risk indicators ("critical", "must", "compliance"), stakeholder hints ("QA", "review", "security team"), and explicit deliverables ("a11y", "rollback", "contracts"). + 1. Extract signals: feature domain keywords (e.g., auth, latency, UX, API), risk indicators ("critical", "must", "compliance"), stakeholder hints ("QA", "review", "security team"), explicit deliverables ("a11y", "rollback", "contracts"). 2. Cluster signals into candidate focus areas (max 4) ranked by relevance. 3. Identify probable audience & timing (author, reviewer, QA, release) if not explicit. 4. Detect missing dimensions: scope breadth, depth/rigor, risk emphasis, exclusion boundaries, measurable acceptance criteria. - 5. Formulate questions chosen from these archetypes: - - Scope refinement (e.g., "Should this include integration touchpoints with X and Y or stay limited to local module correctness?") - - Risk prioritization (e.g., "Which of these potential risk areas should receive mandatory gating checks?") - - Depth calibration (e.g., "Is this a lightweight pre-commit sanity list or a formal release gate?") - - Audience framing (e.g., "Will this be used by the author only or peers during PR review?") - - Boundary exclusion (e.g., "Should we explicitly exclude performance tuning items this round?") - - Scenario class gap (e.g., "No recovery flows detected—are rollback / partial failure paths in scope?") + 5. Formulate questions from these archetypes: scope refinement, risk prioritization, depth calibration, audience framing, boundary exclusion, scenario class gap (e.g., "No recovery flows detected—are rollback / partial failure paths in scope?"). Question formatting rules: - - If presenting options, generate a compact table with columns: Option | Candidate | Why It Matters - - Limit to A–E options maximum; omit table if a free-form answer is clearer + - If presenting options, generate a compact table with columns Option | Candidate | Why It Matters + - Limit to A–E options maximum; omit the table if a free-form answer is clearer - Never ask the user to restate what they already said - - Avoid speculative categories (no hallucination). If uncertain, ask explicitly: "Confirm whether X belongs in scope." + - Avoid speculative categories (no hallucination) - if uncertain, ask explicitly: "Confirm whether X belongs in scope." Defaults when interaction impossible: - Depth: Standard - - Audience: Reviewer (PR) if code-related; Author otherwise - - Focus: Top 2 relevance clusters + - Audience: Reviewer (PR) if code-related, Author otherwise + - Focus: top 2 relevance clusters - Output the questions (label Q1/Q2/Q3). After answers: if ≥2 scenario classes (Alternate / Exception / Recovery / Non-Functional domain) remain unclear, you MAY ask up to TWO more targeted follow‑ups (Q4/Q5) with a one-line justification each (e.g., "Unresolved recovery path risk"). Do not exceed five total questions. Skip escalation if user explicitly declines more. + Output the questions (label Q1/Q2/Q3). After answers: if ≥2 scenario classes (Alternate / Exception / Recovery / Non-Functional domain) remain unclear, you MAY ask up to TWO more targeted follow-ups (Q4/Q5) with a one-line justification each (e.g., "Unresolved recovery path risk"). Do not exceed five total questions. Skip escalation if user explicitly declines more. 4. **Understand user request**: Combine `$ARGUMENTS` + clarifying answers: - Derive checklist theme (e.g., security, review, deploy, ux) @@ -114,94 +99,39 @@ You **MUST** consider the user input before proceeding (if not empty). - Map focus selections to category scaffolding - Infer any missing context from spec/plan/tasks (do NOT hallucinate) -5. **Load feature context**: Read from FEATURE_DIR: - - spec.md: Feature requirements and scope - - plan.md (if exists): Technical details, dependencies - - tasks.md (if exists): Implementation tasks +5. **Load feature context**: Read from FEATURE_DIR: spec.md (feature requirements and scope), plan.md if exists (technical details, dependencies), tasks.md if exists (implementation tasks). - **Context Loading Strategy**: - - Load only necessary portions relevant to active focus areas (avoid full-file dumping) + Context loading strategy: + - Load only the portions relevant to active focus areas (avoid full-file dumping) - Prefer summarizing long sections into concise scenario/requirement bullets - - Use progressive disclosure: add follow-on retrieval only if gaps detected + - Use progressive disclosure (add follow-on retrieval only if gaps detected) - If source docs are large, generate interim summary items instead of embedding raw text 6. **Generate checklist** - Create "Unit Tests for Requirements": - Create `FEATURE_DIR/checklists/` directory if it doesn't exist - - Generate unique checklist filename: - - Use short, descriptive name based on domain (e.g., `ux.md`, `api.md`, `security.md`) - - Format: `[domain].md` - - File handling behavior: - - If file does NOT exist: Create new file and number items starting from CHK001 - - If file exists: Append new items to existing file, continuing from the last CHK ID (e.g., if last item is CHK015, start new items at CHK016) + - Filename: short, descriptive domain name, format `[domain].md` (e.g., `ux.md`, `api.md`, `security.md`) + - If the file does NOT exist: create it and number items starting from CHK001 + - If it exists: append new items, continuing from the last CHK ID (e.g., if last item is CHK015, start at CHK016) - Never delete or replace existing checklist content - always preserve and append - **CORE PRINCIPLE - Test the Requirements, Not the Implementation**: - Every checklist item MUST evaluate the REQUIREMENTS THEMSELVES for: - - **Completeness**: Are all necessary requirements present? - - **Clarity**: Are requirements unambiguous and specific? - - **Consistency**: Do requirements align with each other? - - **Measurability**: Can requirements be objectively verified? - - **Coverage**: Are all scenarios/edge cases addressed? - - **Category Structure** - Group items by requirement quality dimensions: - - **Requirement Completeness** (Are all necessary requirements documented?) - - **Requirement Clarity** (Are requirements specific and unambiguous?) - - **Requirement Consistency** (Do requirements align without conflicts?) - - **Acceptance Criteria Quality** (Are success criteria measurable?) - - **Scenario Coverage** (Are all flows/cases addressed?) - - **Edge Case Coverage** (Are boundary conditions defined?) - - **Non-Functional Requirements** (Performance, Security, Accessibility, etc. - are they specified?) - - **Dependencies & Assumptions** (Are they documented and validated?) - - **Ambiguities & Conflicts** (What needs clarification?) - - **HOW TO WRITE CHECKLIST ITEMS - "Unit Tests for English"**: - - ❌ **WRONG** (Testing implementation): - - "Verify landing page displays 3 episode cards" - - "Test hover states work on desktop" - - "Confirm logo click navigates home" - - ✅ **CORRECT** (Testing requirements quality): - - "Are the exact number and layout of featured episodes specified?" [Completeness] - - "Is 'prominent display' quantified with specific sizing/positioning?" [Clarity] - - "Are hover state requirements consistent across all interactive elements?" [Consistency] - - "Are keyboard navigation requirements defined for all interactive UI?" [Coverage] - - "Is the fallback behavior specified when logo image fails to load?" [Edge Cases] - - "Are loading states defined for asynchronous episode data?" [Completeness] - - "Does the spec define visual hierarchy for competing UI elements?" [Clarity] - - **ITEM STRUCTURE**: - Each item should follow this pattern: - - Question format asking about requirement quality - - Focus on what's WRITTEN (or not written) in the spec/plan - - Include quality dimension in brackets [Completeness/Clarity/Consistency/etc.] - - Reference spec section `[Spec §X.Y]` when checking existing requirements - - Use `[Gap]` marker when checking for missing requirements - - **EXAMPLES BY QUALITY DIMENSION**: - - Completeness: - - "Are error handling requirements defined for all API failure modes? [Gap]" - - "Are accessibility requirements specified for all interactive elements? [Completeness]" - - "Are mobile breakpoint requirements defined for responsive layouts? [Gap]" - - Clarity: - - "Is 'fast loading' quantified with specific timing thresholds? [Clarity, Spec §NFR-2]" - - "Are 'related episodes' selection criteria explicitly defined? [Clarity, Spec §FR-5]" - - "Is 'prominent' defined with measurable visual properties? [Ambiguity, Spec §FR-4]" + **CORE PRINCIPLE - Test the Requirements, Not the Implementation**: every checklist item MUST evaluate the REQUIREMENTS THEMSELVES for: + - **Completeness** (all necessary requirements present?) + - **Clarity** (unambiguous and specific?) + - **Consistency** (requirements align with each other?) + - **Measurability** (objectively verifiable?) + - **Coverage** (all scenarios/edge cases addressed?) - Consistency: - - "Do navigation requirements align across all pages? [Consistency, Spec §FR-10]" - - "Are card component requirements consistent between landing and detail pages? [Consistency]" + See Anti-Examples below for wrong vs correct item style. - Coverage: - - "Are requirements defined for zero-state scenarios (no episodes)? [Coverage, Edge Case]" - - "Are concurrent user interaction scenarios addressed? [Coverage, Gap]" - - "Are requirements specified for partial data loading failures? [Coverage, Exception Flow]" + **Category Structure** - group items by requirement quality dimension: Requirement Completeness, Requirement Clarity, Requirement Consistency, Acceptance Criteria Quality (success criteria measurable?), Scenario Coverage (all flows/cases addressed?), Edge Case Coverage (boundary conditions defined?), Non-Functional Requirements (performance, security, accessibility, etc. - specified?), Dependencies & Assumptions (documented and validated?), Ambiguities & Conflicts (what needs clarification?). - Measurability: - - "Are visual hierarchy requirements measurable/testable? [Acceptance Criteria, Spec §FR-1]" - - "Can 'balanced visual weight' be objectively verified? [Measurability, Spec §FR-2]" + **ITEM STRUCTURE** - each item: + - Question format asking about requirement quality + - Focus on what's WRITTEN (or not written) in the spec/plan + - Include the quality dimension in brackets [Completeness/Clarity/Consistency/etc.] + - Reference spec section `[Spec §X.Y]` when checking existing requirements + - Use the `[Gap]` marker when checking for missing requirements + - Example: "Is 'fast loading' quantified with specific timing thresholds? [Clarity, Spec §NFR-2]" **Scenario Classification & Coverage** (Requirements Quality Focus): - Check if requirements exist for: Primary, Alternate, Exception/Error, Recovery, Non-Functional scenarios @@ -209,98 +139,32 @@ You **MUST** consider the user input before proceeding (if not empty). - If scenario class missing: "Are [scenario type] requirements intentionally excluded or missing? [Gap]" - Include resilience/rollback when state mutation occurs: "Are rollback requirements defined for migration failures? [Gap]" - **Traceability Requirements**: - - MINIMUM: ≥80% of items MUST include at least one traceability reference - - Each item should reference: spec section `[Spec §X.Y]`, or use markers: `[Gap]`, `[Ambiguity]`, `[Conflict]`, `[Assumption]` - - If no ID system exists: "Is a requirement & acceptance criteria ID scheme established? [Traceability]" - - **Surface & Resolve Issues** (Requirements Quality Problems): - Ask questions about the requirements themselves: - - Ambiguities: "Is the term 'fast' quantified with specific metrics? [Ambiguity, Spec §NFR-1]" - - Conflicts: "Do navigation requirements conflict between §FR-10 and §FR-10a? [Conflict]" - - Assumptions: "Is the assumption of 'always available podcast API' validated? [Assumption]" - - Dependencies: "Are external podcast API requirements documented? [Dependency, Gap]" - - Missing definitions: "Is 'visual hierarchy' defined with measurable criteria? [Gap]" - - **Content Consolidation**: - - Soft cap: If raw candidate items > 40, prioritize by risk/impact - - Merge near-duplicates checking the same requirement aspect - - If >5 low-impact edge cases, create one item: "Are edge cases X, Y, Z addressed in requirements? [Coverage]" - - **🚫 ABSOLUTELY PROHIBITED** - These make it an implementation test, not a requirements test: - - ❌ Any item starting with "Verify", "Test", "Confirm", "Check" + implementation behavior - - ❌ References to code execution, user actions, system behavior - - ❌ "Displays correctly", "works properly", "functions as expected" - - ❌ "Click", "navigate", "render", "load", "execute" - - ❌ Test cases, test plans, QA procedures - - ❌ Implementation details (frameworks, APIs, algorithms) - - **✅ REQUIRED PATTERNS** - These test requirements quality: - - ✅ "Are [requirement type] defined/specified/documented for [scenario]?" - - ✅ "Is [vague term] quantified/clarified with specific criteria?" - - ✅ "Are requirements consistent between [section A] and [section B]?" - - ✅ "Can [requirement] be objectively measured/verified?" - - ✅ "Are [edge cases/scenarios] addressed in requirements?" - - ✅ "Does the spec define [missing aspect]?" + **Traceability Requirements**: MINIMUM ≥80% of items MUST include at least one traceability reference - spec section `[Spec §X.Y]` or markers `[Gap]`, `[Ambiguity]`, `[Conflict]`, `[Assumption]`. If no ID system exists: "Is a requirement & acceptance criteria ID scheme established? [Traceability]" -7. **Structure Reference**: Generate the checklist following the canonical template in `templates/checklist-template.md` for title, meta section, category headings, and ID formatting. If template is unavailable, use: H1 title, purpose/created meta lines, `##` category sections containing `- [ ] CHK### ` lines with globally incrementing IDs starting at CHK001. + **Surface & Resolve Issues** - ask questions about the requirements themselves: ambiguities (vague terms quantified with specific metrics? [Ambiguity]), conflicts (do sections contradict each other? [Conflict]), assumptions (validated? [Assumption]), dependencies (documented? [Dependency, Gap]), missing definitions (defined with measurable criteria? [Gap]). -8. **Report**: Output full path to checklist file, item count, and summarize whether the run created a new file or appended to an existing one. Summarize: - - Focus areas selected - - Depth level - - Actor/timing - - Any explicit user-specified must-have items incorporated + **Content Consolidation**: soft cap - if raw candidate items > 40, prioritize by risk/impact; merge near-duplicates checking the same requirement aspect; if >5 low-impact edge cases, create one item: "Are edge cases X, Y, Z addressed in requirements? [Coverage]" -**Important**: Each `__SPECKIT_COMMAND_CHECKLIST__` command invocation uses a short, descriptive checklist filename and either creates a new file or appends to an existing one. This allows: + **🚫 ABSOLUTELY PROHIBITED** - these make it an implementation test, not a requirements test: any item starting with "Verify", "Test", "Confirm", "Check" + implementation behavior; references to code execution, user actions, system behavior; "Displays correctly", "works properly", "functions as expected"; "Click", "navigate", "render", "load", "execute"; test cases, test plans, QA procedures; implementation details (frameworks, APIs, algorithms). -- Multiple checklists of different types (e.g., `ux.md`, `test.md`, `security.md`) -- Simple, memorable filenames that indicate checklist purpose -- Easy identification and navigation in the `checklists/` folder + **✅ REQUIRED PATTERNS** - these test requirements quality: "Are [requirement type] defined/specified/documented for [scenario]?"; "Is [vague term] quantified/clarified with specific criteria?"; "Are requirements consistent between [section A] and [section B]?"; "Can [requirement] be objectively measured/verified?"; "Are [edge cases/scenarios] addressed in requirements?"; "Does the spec define [missing aspect]?" -To avoid clutter, use descriptive types and clean up obsolete checklists when done. +7. **Structure Reference**: Generate the checklist following the canonical template in `templates/checklist-template.md` for title, meta section, category headings, and ID formatting. If template is unavailable, use: H1 title, purpose/created meta lines, `##` category sections containing `- [ ] CHK### ` lines with globally incrementing IDs starting at CHK001. -## Example Checklist Types & Sample Items +8. **Report**: Output full path to checklist file, item count, and whether the run created a new file or appended to an existing one. Summarize: focus areas selected, depth level, actor/timing, any explicit user-specified must-have items incorporated. -**UX Requirements Quality:** `ux.md` +**Important**: Each `__SPECKIT_COMMAND_CHECKLIST__` command invocation uses a short, descriptive checklist filename and either creates a new file or appends to an existing one, allowing multiple checklists of different types (e.g., `ux.md`, `test.md`, `security.md`) with simple, memorable names that are easy to identify in the `checklists/` folder. To avoid clutter, use descriptive types and clean up obsolete checklists when done. -Sample items (testing the requirements, NOT the implementation): +## Example Checklist Types & Sample Items + +Sample items in every domain test the requirements, NOT the implementation. **UX Requirements Quality** (`ux.md`): - "Are visual hierarchy requirements defined with measurable criteria? [Clarity, Spec §FR-1]" -- "Is the number and positioning of UI elements explicitly specified? [Completeness, Spec §FR-1]" - "Are interaction state requirements (hover, focus, active) consistently defined? [Consistency]" - "Are accessibility requirements specified for all interactive elements? [Coverage, Gap]" - "Is fallback behavior defined when images fail to load? [Edge Case, Gap]" -- "Can 'prominent display' be objectively measured? [Measurability, Spec §FR-4]" - -**API Requirements Quality:** `api.md` - -Sample items: - -- "Are error response formats specified for all failure scenarios? [Completeness]" -- "Are rate limiting requirements quantified with specific thresholds? [Clarity]" -- "Are authentication requirements consistent across all endpoints? [Consistency]" -- "Are retry/timeout requirements defined for external dependencies? [Coverage, Gap]" -- "Is versioning strategy documented in requirements? [Gap]" - -**Performance Requirements Quality:** `performance.md` -Sample items: - -- "Are performance requirements quantified with specific metrics? [Clarity]" -- "Are performance targets defined for all critical user journeys? [Coverage]" -- "Are performance requirements under different load conditions specified? [Completeness]" -- "Can performance requirements be objectively measured? [Measurability]" -- "Are degradation requirements defined for high-load scenarios? [Edge Case, Gap]" - -**Security Requirements Quality:** `security.md` - -Sample items: - -- "Are authentication requirements specified for all protected resources? [Coverage]" -- "Are data protection requirements defined for sensitive information? [Completeness]" -- "Is the threat model documented and requirements aligned to it? [Traceability]" -- "Are security requirements consistent with compliance obligations? [Consistency]" -- "Are security failure/breach response requirements defined? [Gap, Exception Flow]" +Other domains follow the same pattern, e.g., **API** (`api.md`), **Performance** (`performance.md`), **Security** (`security.md`). ## Anti-Examples: What NOT To Do @@ -324,25 +188,15 @@ Sample items: - [ ] CHK006 - Can "visual hierarchy" requirements be objectively measured? [Measurability, Spec §FR-001] ``` -**Key Differences:** - -- Wrong: Tests if the system works correctly -- Correct: Tests if the requirements are written correctly -- Wrong: Verification of behavior -- Correct: Validation of requirement quality -- Wrong: "Does it do X?" -- Correct: "Is X clearly specified?" +**Key Difference**: wrong items verify behavior ("Does it do X?" - does the system work correctly); correct items validate requirement quality ("Is X clearly specified?" - are the requirements written correctly). ## Post-Execution Checks **Check for extension hooks (after checklist generation)**: -Check if `.specify/extensions.yml` exists in the project root. -- If it exists, read it and look for entries under the `hooks.after_checklist` key +- If `.specify/extensions.yml` exists in the project root, read it and look for entries under the `hooks.after_checklist` key - If the YAML cannot be parsed or is invalid, skip hook checking silently and continue normally -- Filter out hooks where `enabled` is explicitly `false`. Treat hooks without an `enabled` field as enabled by default. -- For each remaining hook, do **not** attempt to interpret or evaluate hook `condition` expressions: - - If the hook has no `condition` field, or it is null/empty, treat the hook as executable - - If the hook defines a non-empty `condition`, skip the hook and leave condition evaluation to the HookExecutor implementation +- Filter out hooks where `enabled` is explicitly `false`; treat hooks without an `enabled` field as enabled by default +- For each remaining hook, do **not** attempt to interpret or evaluate hook `condition` expressions: if the hook has no `condition` field (or it is null/empty), treat it as executable; if it defines a non-empty `condition`, skip it and leave condition evaluation to the HookExecutor implementation - For each executable hook, output the following based on its `optional` flag: - **Optional hook** (`optional: true`): ```