stackrox · davdhacs · Apr 27, 2026 · Apr 27, 2026
diff --git a/.ambient/ambient.json b/.ambient/ambient.json
@@ -0,0 +1,6 @@
+{
+  "name": "ACS Triage",
+  "description": "Automated triage for StackRox/ACS JIRA issues with intelligent team assignment using multi-strategy confidence scoring. Analyzes CI failures, vulnerabilities, and flaky tests to generate actionable reports.",
+  "systemPrompt": "You are an **ACS/StackRox Triage Specialist** with deep expertise in analyzing CI failures, security vulnerabilities, and test reliability issues for the StackRox Advanced Cluster Security (ACS) platform.\n\n## Your Role\n\nAnalyze untriaged JIRA issues and generate comprehensive triage reports with team assignment recommendations. You operate in **READ-ONLY mode** - generate reports and recommendations, but never modify JIRA issues automatically.\n\n## Core Capabilities\n\n- **Issue Classification**: Categorize issues as CI_FAILURE, VULNERABILITY, FLAKY_TEST, or UNKNOWN\n- **Root Cause Analysis**: Apply specialized decision trees for each issue type\n- **Team Assignment**: Use multi-strategy approach with confidence scoring (95%-70%)\n- **Report Generation**: Create markdown, HTML, and Slack-ready reports\n- **Domain Expertise**: Understand StackRox architecture, teams, and ownership patterns\n\n## Workspace Structure & File Navigation\n\n**IMPORTANT: Follow these rules to avoid fumbling when looking for files.**\n\n### Standard Workspace Structure\n\n```\n/workspace/sessions/{session-name}/\n├── workflows/\n│   └── acs-triage/               ← Your working directory\n│       ├── .ambient/\n│       │   └── ambient.json      ← ALWAYS at this path\n│       ├── .claude/\n│       │   └── commands/         ← Slash commands\n│       ├── reference/            ← StackRox domain knowledge\n│       │   ├── CODEOWNERS-patterns.md\n│       │   ├── error-signatures.md\n│       │   ├── team-mappings.md\n│       │   ├── vulnerability-decision-tree.md\n│       │   └── flaky-test-patterns.md\n│       └── templates/            ← Report templates\n└── artifacts/                     ← All outputs go here\n    └── acs-triage/\n```\n\n### File Location Rules\n\n**Always at these exact paths:**\n- Workflow config: `.ambient/ambient.json`\n- Commands: `.claude/commands/*.md`\n- Reference docs: `reference/*.md`\n- Templates: `templates/*.md`\n\n**Never search for these 
- use direct paths:**\n```bash\n# ✅ DO: Use known paths directly\nRead .ambient/ambient.json\nRead reference/CODEOWNERS-patterns.md\nRead templates/triage-report.md\n\n# ❌ DON'T: Search for well-known files\nGlob **/ambient.json\nGlob **/CODEOWNERS-patterns.md\n```\n\n### Tool Selection Rules\n\n**Use Read when:**\n- You know the exact file path\n- File is at a standard location\n- You just created the file and know where it is\n\n**Use Glob when:**\n- You genuinely don't know the file location\n- Searching for files by pattern\n- Discovering what files exist\n\n**Use Grep when:**\n- Searching for content within files\n- Finding files containing specific text\n- Code search\n\n## Available Commands\n\n- **/fetch-issues** - Retrieve untriaged JIRA issues from filters 103399 and 95004\n- **/classify** - Categorize issues by type (CI_FAILURE, VULNERABILITY, FLAKY_TEST, UNKNOWN)\n- **/analyze-ci** - Deep analysis of CI failures with error classification and file path extraction\n- **/analyze-vuln** - Apply ProdSec decision tree for vulnerability triage\n- **/analyze-flaky** - Pattern matching and frequency analysis for flaky tests\n- **/assign-team** - Multi-strategy team assignment with confidence scores (95%-70%)\n- **/generate-report** - Create markdown, HTML, and Slack reports\n\n## Workflow Methodology\n\n### Phase 1: Fetch Issues\nQuery JIRA filters for untriaged issues (limit 10-20 within 300s timeout). 
Extract key, summary, description, labels, components, created/updated dates.\n\n### Phase 2: Classify\nDetermine issue type based on labels, summary, and description patterns:\n- VULNERABILITY: CVE-* labels or \"vulnerability\" in summary\n- FLAKY_TEST: \"flaky-test\" label or test name in known patterns\n- CI_FAILURE: \"build-failure\" label or contains stack trace/error log\n- UNKNOWN: None of the above patterns match\n\n### Phase 3: Specialized Analysis\nApply type-specific analysis:\n- **CI Failures**: Extract error messages, stack traces, file paths, error types (GraphQL, panic, timeout, network, etc.)\n- **Vulnerabilities**: Apply 6-step ProdSec decision tree (version support, severity, container applicability, duplicate detection, impact analysis, team assignment)\n- **Flaky Tests**: Match known patterns, analyze frequency (>10/month = High, 3-10 = Medium, <3 = Low)\n\n### Phase 4: Team Assignment\nApply multi-strategy approach with priority order:\n\n1. **CODEOWNERS Match (95% confidence)** - Direct file path pattern matching from `reference/CODEOWNERS-patterns.md`\n2. **Error Signature Match (85-90% confidence)** - Known error patterns from `reference/error-signatures.md`\n3. **Service Ownership Match (80% confidence)** - Component to team mapping from `reference/team-mappings.md`\n4. **Similar Issue History (70-80% confidence)** - JIRA search for resolved issues with same error/component\n5. 
**Test Category Match (70% confidence)** - Test name pattern matching using CODEOWNERS\n\n### Phase 5: Generate Reports\nCreate multiple output formats:\n- **Markdown Report**: Detailed table with all triaged issues, statistics by type/team/confidence\n- **HTML Dashboard**: Interactive report with filters, sorting, stats cards\n- **Slack Summary**: Executive summary with high-confidence recommendations (≥90%)\n\n## StackRox/ACS Domain Knowledge\n\n### Teams\n- **@stackrox/core-workflows** - Central service, core platform, GraphQL, API\n- **@stackrox/sensor-ecosystem** - Sensor, SAC implementation, compliance, admission-control\n- **@stackrox/scanner** - Image scanning, vulnerability detection, scanner-v4\n- **@stackrox/collector** - Network monitoring, eBPF, NetworkFlow\n- **@stackrox/install** - Operator, Helm charts, installation\n- **@stackrox/ui** - UI frontend, React, Cypress tests\n\n### Container/Service Mapping\n- central, main, central-db → @stackrox/core-workflows\n- sensor → @stackrox/sensor-ecosystem\n- scanner, scanner-v4, scanner-db → @stackrox/scanner\n- collector → @stackrox/collector\n- operator → @stackrox/install\n- ui → @stackrox/ui\n\n### Common Error Patterns\n- GraphQL schema validation → @stackrox/core-workflows (90% confidence)\n- panic, FATAL, nil pointer → Extract service name from stack trace (85%)\n- dial tcp, connection refused, deadline exceeded → @stackrox/collector (80%)\n- image pull, scanner, vulnerability detection → @stackrox/scanner (85%)\n- cluster provision, namespace creation → @stackrox/core-workflows (75%)\n\n## Output Locations\n\n**All artifacts are created in:** `artifacts/acs-triage/`\n\n- **issues.json** - Raw issue data from JIRA\n- **triage-report.md** - Detailed markdown report\n- **report.html** - Interactive HTML dashboard\n- **slack-summary.md** - Slack notification template\n\n## Critical Constraints\n\n1. **READ-ONLY MODE**: Generate reports only. Never modify JIRA issues automatically.\n2. 
**Timeout**: Complete analysis within 300 seconds (5 minutes)\n3. **Issue Limit**: Process 10-20 issues per session (prioritize most recent)\n4. **Confidence Threshold**: Highlight recommendations ≥80% confidence\n5. **No BigQuery**: Use JIRA MCP and similar issue search only\n\n## Reference Data Sources\n\nAlways consult these reference files for domain knowledge:\n\n- `reference/CODEOWNERS-patterns.md` - File path → team mappings\n- `reference/error-signatures.md` - Error pattern → team mappings with confidence\n- `reference/team-mappings.md` - Component/service → team ownership\n- `reference/vulnerability-decision-tree.md` - Complete ProdSec workflow\n- `reference/flaky-test-patterns.md` - Known flaky test patterns and thresholds\n\n## Best Practices\n\n1. **Always check reference files first** before making team assignment decisions\n2. **Use highest confidence strategy** that matches (CODEOWNERS > Error Signature > Service Ownership > History > Test Category)\n3. **Document reasoning** in triage reports (why this team, what evidence, confidence level)\n4. **Flag low confidence** (<70%) for manual review\n5. **Preserve context** from issue description for human reviewers\n6. **Batch similar issues** in reports for efficiency\n\n## Error Handling\n\n- **JIRA timeout**: Process what you have, note incomplete in report\n- **Unknown issue type**: Mark as UNKNOWN, include raw description for manual triage\n- **No team match**: Use \"Needs Manual Assignment\" with evidence summary\n- **Duplicate detection**: Search JIRA for similar summaries/CVEs before recommending closure",
+  "startupPrompt": "Greet the user and introduce yourself as an ACS Triage Specialist. Briefly explain that you analyze untriaged StackRox/ACS JIRA issues (CI failures, vulnerabilities, flaky tests) and generate triage reports with intelligent team assignments using confidence scoring. List the available commands (/fetch-issues, /classify, /analyze-ci, /analyze-vuln, /analyze-flaky, /assign-team, /generate-report) and ask what they'd like to work on. Mention that you operate in READ-ONLY mode and provide recommendations without modifying JIRA directly."
+}
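
The Phase 2 classification rules in the `systemPrompt` above can be sketched in Python. This is a minimal illustration, not the workflow's implementation: the function name `classify_issue`, the `known_flaky_tests` parameter, the regex used to detect a "stack trace/error log", and the choice to test the rules in the order they are listed are all assumptions.

```python
import re

# Heuristic stand-in for "contains stack trace/error log". The prompt does not
# define this check, so the regex below is an assumption.
TRACE_RE = re.compile(
    r"panic:|goroutine \d+ \[|Traceback \(most recent call last\)|\bFATAL\b"
)


def classify_issue(summary, description, labels, known_flaky_tests=()):
    """Return VULNERABILITY, FLAKY_TEST, CI_FAILURE, or UNKNOWN."""
    labels_lower = [label.lower() for label in labels]
    text = f"{summary}\n{description}"

    # VULNERABILITY: CVE-* labels or "vulnerability" in the summary
    if any(label.startswith("cve-") for label in labels_lower) \
            or "vulnerability" in summary.lower():
        return "VULNERABILITY"
    # FLAKY_TEST: "flaky-test" label or a known flaky test name in the text
    if "flaky-test" in labels_lower \
            or any(name in text for name in known_flaky_tests):
        return "FLAKY_TEST"
    # CI_FAILURE: "build-failure" label or something that looks like a trace
    if "build-failure" in labels_lower or TRACE_RE.search(text):
        return "CI_FAILURE"
    return "UNKNOWN"
```

Checking the rules in a fixed order makes the result deterministic when an issue matches more than one rule, for example a CVE ticket whose description also contains a stack trace.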
diff --git a/.claude/commands/analyze-ci.md b/.claude/commands/analyze-ci.md
@@ -0,0 +1,149 @@
+# /analyze-ci - Analyze CI Failure
+
+## Purpose
+
+Deep analysis of CI failures with error classification, file path extraction, and error signature matching. This command enriches CI_FAILURE issues with technical details needed for accurate team assignment.
+
+## Prerequisites
+
+- artifacts/acs-triage/issues.json exists and contains issues with type="CI_FAILURE"
+- `/setup` completed to access stackrox-ci-failure-investigator.md
+- `/classify` completed
+
+## Process
+
+1. **Filter CI Failure Issues**
+   - Read artifacts/acs-triage/issues.json
+   - Process only issues where type = "CI_FAILURE"
+
+2. **Extract Failure Information**
+
+   From description and comments, extract:
+
+   a. **Build Metadata**
+      - Build ID (numeric, e.g., 1963388448995807232)
+      - Job name (pull-ci-stackrox-stackrox-*)
+      - PR number
+      - Test name
+
+   b. **Error Messages**
+      - Primary error message (first ERROR/FATAL line)
+      - Full error context (surrounding lines)
+      - Error patterns (panic, FATAL, timeout, etc.)
+
+   c. **Stack Traces**
+      - Goroutine stack traces (if panic)
+      - File paths with line numbers
+      - Function names in call stack
+
+   d. **File Paths**
+      - Extract all file paths mentioned in logs
+      - Example: `central/graphql/resolvers/policies.go:142`
+      - Normalize paths (remove line numbers for matching)
+
+3. **Classify Error Type**
+
+   Check description/comments for patterns:
+
+   **GraphQL Errors** (90% confidence → @stackrox/core-workflows)
+   - "GraphQL schema validation"
+   - "Cannot query field"
+   - "__Schema"
+   - "placeholder Boolean"
+
+   **Service Crashes** (85% confidence, team depends on service)
+   - "panic:"
+   - "FATAL"
+   - "nil pointer dereference"
+   - Extract service name from stack trace
+
+   **Timeout/Performance** (80% confidence → @stackrox/collector)
+   - "deadline exceeded"
+   - "context deadline"
+   - "timeout"
+   - "Timed out after"
+
+   **Network Issues** (80% confidence → @stackrox/collector)
+   - "connection refused"
+   - "dial tcp"
+   - "DNS resolution failed"
+   - "network unreachable"
+
+   **Image/Scanning** (85% confidence → @stackrox/scanner)
+   - "image pull"
+   - "scanner"
+   - "vulnerability detection"
+   - "registry error"
+
+   **Test Infrastructure** (75% confidence → @stackrox/core-workflows)
+   - "cluster provision"
+   - "namespace creation"
+   - "test setup failed"
+
+4. **Load Error Signatures**
+   - Read `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md`
+   - Extract additional error patterns and team mappings
+   - Match against issue description/comments
+
+5. **Check for Known Flaky Patterns**
+   - Cross-reference test name against known flaky tests
+   - If match found, note pattern and historical frequency
+   - A match may warrant reclassifying the issue as FLAKY_TEST
+
+6. **Enrich Issue Object**
+   Add CI-specific fields:
+   ```json
+   {
+     "ci_analysis": {
+       "build_id": "1963388448995807232",
+       "job_name": "pull-ci-stackrox-stackrox-master-e2e-tests",
+       "pr_number": "12345",
+       "test_name": "TestGlobalSearchLatestTag",
+       "error_type": "GraphQL",
+       "error_message": "GraphQL schema validation failed",
+       "file_paths": ["ui/apps/platform/src/queries/policies.ts", "central/graphql/resolvers/policies.go"],
+       "stack_trace_summary": "panic in graphql resolver",
+       "error_signature_match": {
+         "pattern": "GraphQL schema validation",
+         "confidence": 90,
+         "suggested_team": "@stackrox/core-workflows"
+       },
+       "known_flaky": false
+     }
+   }
+   ```
+
+## Output
+
+- **artifacts/acs-triage/issues.json** - Updated with ci_analysis field for CI_FAILURE issues
+
+## Usage Examples
+
+Basic usage:
+```
+/analyze-ci
+```
+
+## Success Criteria
+
+After running this command, you should have:
+- [ ] All CI_FAILURE issues enriched with ci_analysis data
+- [ ] Error types classified
+- [ ] File paths extracted and normalized
+- [ ] Error signatures matched
+- [ ] Known flaky patterns checked
+
+## Next Steps
+
+After CI analysis:
+1. Run `/assign-team` to perform multi-strategy team assignment
+2. Error signature matches yield team assignments at 75-90% confidence, depending on the pattern matched in step 3
+
+## Notes
+
+- Some CI failures may not have clear file paths - use error signatures
+- Panics typically provide the best file path information, via their stack traces
+- Timeout errors often lack specific file paths - use the service name instead
+- Known flaky patterns may suggest reclassifying the issue as FLAKY_TEST
+- Version mismatches from `/classify` don't affect error classification (errors are stable across versions)
+- Build IDs are useful for manual investigation but not used in automated triage
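
Steps 2d and 3 of `/analyze-ci` (path normalization and error type classification) can be sketched in Python as follows. The patterns, confidence values, and teams are copied from the command file. The names `ERROR_SIGNATURES`, `classify_error`, and `normalize_path`, the error type labels other than `GraphQL`, the case-insensitive substring matching, and the rule of keeping the highest-confidence hit are assumptions made for illustration.

```python
import re

# (error type, confidence, team, patterns), taken from step 3.
# A team of None means "derive it from the service named in the stack trace".
ERROR_SIGNATURES = [
    ("GraphQL", 90, "@stackrox/core-workflows",
     ["GraphQL schema validation", "Cannot query field", "__Schema",
      "placeholder Boolean"]),
    ("Service Crash", 85, None,
     ["panic:", "FATAL", "nil pointer dereference"]),
    ("Timeout/Performance", 80, "@stackrox/collector",
     ["deadline exceeded", "context deadline", "timeout", "Timed out after"]),
    ("Network", 80, "@stackrox/collector",
     ["connection refused", "dial tcp", "DNS resolution failed",
      "network unreachable"]),
    ("Image/Scanning", 85, "@stackrox/scanner",
     ["image pull", "scanner", "vulnerability detection", "registry error"]),
    ("Test Infrastructure", 75, "@stackrox/core-workflows",
     ["cluster provision", "namespace creation", "test setup failed"]),
]


def classify_error(text):
    """Return the highest-confidence signature found in text, or None."""
    lowered = text.lower()
    best = None
    for error_type, confidence, team, patterns in ERROR_SIGNATURES:
        # First pattern of this signature that occurs in the text, if any
        hit = next((p for p in patterns if p.lower() in lowered), None)
        if hit and (best is None or confidence > best["confidence"]):
            best = {"error_type": error_type, "pattern": hit,
                    "confidence": confidence, "suggested_team": team}
    return best


def normalize_path(path):
    """Strip a trailing :line or :line:column suffix before matching."""
    return re.sub(r":\d+(?::\d+)?$", "", path)
```

Broad patterns such as "scanner" or "timeout" will match many logs, so a real implementation would probably want to report every hit rather than only the best one.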
diff --git a/.claude/commands/analyze-flaky.md b/.claude/commands/analyze-flaky.md
@@ -0,0 +1,149 @@
+# /analyze-flaky - Analyze Flaky Test
+
+## Purpose
+
+Pattern matching and frequency analysis for flaky tests. Identifies known flaky test patterns, estimates failure frequency from JIRA history, and assigns to test owners.
+
+## Prerequisites
+
+- artifacts/acs-triage/issues.json exists and contains issues with type="FLAKY_TEST"
+- `/setup` completed to access stackrox-ci-failure-investigator.md
+- `/classify` completed
+- JIRA MCP access for historical search
+
+## Process
+
+1. **Filter Flaky Test Issues**
+   - Read artifacts/acs-triage/issues.json
+   - Process only issues where type = "FLAKY_TEST"
+
+2. **Extract Test Information**
+
+   From summary and description, extract:
+   - **Test name**: Full test name (e.g., TestGlobalSearchLatestTag)
+   - **Test file**: File path if available (e.g., tests/e2e/search_test.go)
+   - **Test category**: E2E, integration, unit, etc.
+   - **Failure pattern**: Specific assertion or error that fails
+
+3. **Match Known Flaky Patterns**
+
+   Read `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md` and check for known patterns:
+
+   **Known Flaky Tests:**
+   - **GlobalSearch Latest Tag** → @stackrox/ui (ROX-5355)
+     - Pattern: DNS timing issue
+     - Frequency: High
+
+   - **PolicyFieldsTest Process UID** → @stackrox/core-workflows (ROX-5298)
+     - Pattern: Timing-dependent validation
+     - Frequency: Medium
+
+   - **NetworkFlowTest connections** → @stackrox/collector
+     - Pattern: Network timing
+     - Frequency: High
+
+   - **ImageScanningTest registries** → @stackrox/scanner
+     - Pattern: Registry timing
+     - Frequency: Medium
+
+   - **SACTest SSH Port** → @stackrox/sensor-ecosystem
+     - Pattern: Permission timing
+     - Frequency: Medium
+
+   If test matches known pattern:
+   - Set known_flaky_pattern = true
+   - Use documented team and historical issue reference
+   - Note the root cause from pattern documentation
+
+4. **Estimate Failure Frequency**
+
+   Search JIRA for historical occurrences:
+   - Query: `project = ROX AND summary ~ "TestName" AND created >= -30d AND labels = CI_Failure`
+   - Count results in last 30 days
+
+   **Frequency Classification:**
+   - **High**: >10 occurrences in 30 days
+   - **Medium**: 3-10 occurrences in 30 days
+   - **Low**: <3 occurrences in 30 days
+
+   Note: This is an estimate based on JIRA issues; the actual frequency may be higher, because many failures never produce a ticket.
+
+5. **Assign to Test Owner**
+
+   Priority order for team assignment:
+
+   a. **Use Known Pattern Team** (95% confidence)
+      - If test matches known flaky pattern
+      - Use documented team assignment
+
+   b. **Use CODEOWNERS for Test File** (90% confidence)
+      - If test file path is known
+      - Read `/tmp/triage/stackrox/.github/CODEOWNERS`
+      - Match test file path to team
+
+   c. **Use Test Category** (70% confidence)
+      - E2E tests → @stackrox/ui
+      - Integration tests → Service owner
+      - Unit tests → Component owner
+
+   d. **Fallback to Service Name** (70% confidence)
+      - Extract service from test name
+      - Use service ownership mapping
+
+6. **Enrich Issue Object**
+   Add flaky test-specific fields:
+   ```json
+   {
+     "flaky_analysis": {
+       "test_name": "TestGlobalSearchLatestTag",
+       "test_file": "tests/e2e/search_test.go",
+       "test_category": "e2e",
+       "known_flaky_pattern": true,
+       "pattern_reference": "ROX-5355",
+       "root_cause": "DNS timing issue in GlobalSearch",
+       "failure_frequency": {
+         "count_30d": 12,
+         "classification": "High",
+         "trend": "increasing"
+       },
+       "assigned_team": "@stackrox/ui",
+       "confidence": 95,
+       "assignment_strategy": "known_pattern"
+     }
+   }
+   ```
+
+## Output
+
+- **artifacts/acs-triage/issues.json** - Updated with flaky_analysis field for FLAKY_TEST issues
+
+## Usage Examples
+
+Basic usage:
+```
+/analyze-flaky
+```
+
+## Success Criteria
+
+After running this command, you should have:
+- [ ] All FLAKY_TEST issues enriched with flaky_analysis data
+- [ ] Known patterns matched where applicable
+- [ ] Failure frequency estimated from JIRA history
+- [ ] Test owner assigned with confidence score
+
+## Next Steps
+
+After flaky test analysis:
+1. Run `/assign-team` for final confidence scoring (if no known pattern matched)
+2. High-frequency flaky tests should be prioritized for fixing
+
+## Notes
+
+- Known pattern matches have highest confidence (95%)
+- Frequency estimation is conservative (only counts JIRA issues, not all CI runs)
+- Some flaky tests may not be in known patterns - use CODEOWNERS fallback
+- High-frequency flakes (>10/month) should be fixed, or the test disabled
+- Test file paths from CI logs are most reliable for CODEOWNERS matching
+- Version mismatch from `/classify` affects CODEOWNERS matching confidence
+- Trends (increasing/decreasing frequency) help prioritize fixes
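
Steps 4 and 5 of `/analyze-flaky` can be sketched in Python as follows. The thresholds, the JQL shape, and the confidence values come from the command file. The function names, the `days` parameter, the quote escaping, and every strategy name other than `known_pattern` are hypothetical.

```python
def build_history_jql(test_name, days=30):
    """Build the step 4 JQL query, escaping quotes in the test name."""
    escaped = test_name.replace("\\", "\\\\").replace('"', '\\"')
    return (f'project = ROX AND summary ~ "{escaped}" '
            f'AND created >= -{days}d AND labels = CI_Failure')


def classify_frequency(count_30d):
    """Map a 30-day occurrence count to High / Medium / Low."""
    if count_30d > 10:
        return "High"
    if count_30d >= 3:
        return "Medium"
    return "Low"


# Step 5 priority order with the documented confidence for each strategy.
STRATEGIES = (
    ("known_pattern", 95),
    ("codeowners", 90),
    ("test_category", 70),
    ("service_name", 70),
)


def assign_owner(matches):
    """matches maps a strategy name to a team; the first strategy with a team wins."""
    for strategy, confidence in STRATEGIES:
        team = matches.get(strategy)
        if team:
            return {"assigned_team": team, "confidence": confidence,
                    "assignment_strategy": strategy}
    return None
```

For example, `assign_owner({"codeowners": "@stackrox/ui"})` picks the CODEOWNERS strategy at 90% confidence, while an empty mapping returns `None` so the issue can be flagged for manual assignment.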