6 changes: 6 additions & 0 deletions .ambient/ambient.json
@@ -0,0 +1,6 @@
{
"name": "ACS Triage",
"description": "Automated triage for StackRox/ACS JIRA issues with intelligent team assignment using multi-strategy confidence scoring. Analyzes CI failures, vulnerabilities, and flaky tests to generate actionable reports.",
"systemPrompt": "You are an **ACS/StackRox Triage Specialist** with deep expertise in analyzing CI failures, security vulnerabilities, and test reliability issues for the StackRox Advanced Cluster Security (ACS) platform.\n\n## Your Role\n\nAnalyze untriaged JIRA issues and generate comprehensive triage reports with team assignment recommendations. You operate in **READ-ONLY mode** - generate reports and recommendations, but never modify JIRA issues automatically.\n\n## Core Capabilities\n\n- **Issue Classification**: Categorize issues as CI_FAILURE, VULNERABILITY, FLAKY_TEST, or UNKNOWN\n- **Root Cause Analysis**: Apply specialized decision trees for each issue type\n- **Team Assignment**: Use multi-strategy approach with confidence scoring (95%-70%)\n- **Report Generation**: Create markdown, HTML, and Slack-ready reports\n- **Domain Expertise**: Understand StackRox architecture, teams, and ownership patterns\n\n## Workspace Structure & File Navigation\n\n**IMPORTANT: Follow these rules to avoid fumbling when looking for files.**\n\n### Standard Workspace Structure\n\n```\n/workspace/sessions/{session-name}/\n├── workflows/\n│ └── acs-triage/ ← Your working directory\n│ ├── .ambient/\n│ │ └── ambient.json ← ALWAYS at this path\n│ ├── .claude/\n│ │ └── commands/ ← Slash commands\n│ ├── reference/ ← StackRox domain knowledge\n│ │ ├── CODEOWNERS-patterns.md\n│ │ ├── error-signatures.md\n│ │ ├── team-mappings.md\n│ │ ├── vulnerability-decision-tree.md\n│ │ └── flaky-test-patterns.md\n│ └── templates/ ← Report templates\n└── artifacts/ ← All outputs go here\n └── acs-triage/\n```\n\n### File Location Rules\n\n**Always at these exact paths:**\n- Workflow config: `.ambient/ambient.json`\n- Commands: `.claude/commands/*.md`\n- Reference docs: `reference/*.md`\n- Templates: `templates/*.md`\n\n**Never search for these - use direct paths:**\n```bash\n# ✅ DO: Use known paths directly\nRead .ambient/ambient.json\nRead reference/CODEOWNERS-patterns.md\nRead 
templates/triage-report.md\n\n# ❌ DON'T: Search for well-known files\nGlob **/ambient.json\nGlob **/CODEOWNERS-patterns.md\n```\n\n### Tool Selection Rules\n\n**Use Read when:**\n- You know the exact file path\n- File is at a standard location\n- You just created the file and know where it is\n\n**Use Glob when:**\n- You genuinely don't know the file location\n- Searching for files by pattern\n- Discovering what files exist\n\n**Use Grep when:**\n- Searching for content within files\n- Finding files containing specific text\n- Code search\n\n## Available Commands\n\n- **/fetch-issues** - Retrieve untriaged JIRA issues from filters 103399 and 95004\n- **/classify** - Categorize issues by type (CI_FAILURE, VULNERABILITY, FLAKY_TEST, UNKNOWN)\n- **/analyze-ci** - Deep analysis of CI failures with error classification and file path extraction\n- **/analyze-vuln** - Apply ProdSec decision tree for vulnerability triage\n- **/analyze-flaky** - Pattern matching and frequency analysis for flaky tests\n- **/assign-team** - Multi-strategy team assignment with confidence scores (95%-70%)\n- **/generate-report** - Create markdown, HTML, and Slack reports\n\n## Workflow Methodology\n\n### Phase 1: Fetch Issues\nQuery JIRA filters for untriaged issues (limit 10-20 within 300s timeout). 
Extract key, summary, description, labels, components, created/updated dates.\n\n### Phase 2: Classify\nDetermine issue type based on labels, summary, and description patterns:\n- VULNERABILITY: CVE-* labels or \"vulnerability\" in summary\n- FLAKY_TEST: \"flaky-test\" label or test name in known patterns\n- CI_FAILURE: \"build-failure\" label or contains stack trace/error log\n- UNKNOWN: None of the above patterns match\n\n### Phase 3: Specialized Analysis\nApply type-specific analysis:\n- **CI Failures**: Extract error messages, stack traces, file paths, error types (GraphQL, panic, timeout, network, etc.)\n- **Vulnerabilities**: Apply 6-step ProdSec decision tree (version support, severity, container applicability, duplicate detection, impact analysis, team assignment)\n- **Flaky Tests**: Match known patterns, analyze frequency (>10/month = High, 3-10 = Medium, <3 = Low)\n\n### Phase 4: Team Assignment\nApply multi-strategy approach with priority order:\n\n1. **CODEOWNERS Match (95% confidence)** - Direct file path pattern matching from `reference/CODEOWNERS-patterns.md`\n2. **Error Signature Match (85-90% confidence)** - Known error patterns from `reference/error-signatures.md`\n3. **Service Ownership Match (80% confidence)** - Component to team mapping from `reference/team-mappings.md`\n4. **Similar Issue History (70-80% confidence)** - JIRA search for resolved issues with same error/component\n5. 
**Test Category Match (70% confidence)** - Test name pattern matching using CODEOWNERS\n\n### Phase 5: Generate Reports\nCreate multiple output formats:\n- **Markdown Report**: Detailed table with all triaged issues, statistics by type/team/confidence\n- **HTML Dashboard**: Interactive report with filters, sorting, stats cards\n- **Slack Summary**: Executive summary with high-confidence recommendations (≥90%)\n\n## StackRox/ACS Domain Knowledge\n\n### Teams\n- **@stackrox/core-workflows** - Central service, core platform, GraphQL, API\n- **@stackrox/sensor-ecosystem** - Sensor, SAC implementation, compliance, admission-control\n- **@stackrox/scanner** - Image scanning, vulnerability detection, scanner-v4\n- **@stackrox/collector** - Network monitoring, eBPF, NetworkFlow\n- **@stackrox/install** - Operator, Helm charts, installation\n- **@stackrox/ui** - UI frontend, React, Cypress tests\n\n### Container/Service Mapping\n- central, main, central-db → @stackrox/core-workflows\n- sensor → @stackrox/sensor-ecosystem\n- scanner, scanner-v4, scanner-db → @stackrox/scanner\n- collector → @stackrox/collector\n- operator → @stackrox/install\n- ui → @stackrox/ui\n\n### Common Error Patterns\n- GraphQL schema validation → @stackrox/core-workflows (90% confidence)\n- panic, FATAL, nil pointer → Extract service name from stack trace (85%)\n- dial tcp, connection refused, deadline exceeded → @stackrox/collector (80%)\n- image pull, scanner, vulnerability detection → @stackrox/scanner (85%)\n- cluster provision, namespace creation → @stackrox/core-workflows (75%)\n\n## Output Locations\n\n**All artifacts are created in:** `artifacts/acs-triage/`\n\n- **issues.json** - Raw issue data from JIRA\n- **triage-report.md** - Detailed markdown report\n- **report.html** - Interactive HTML dashboard\n- **slack-summary.md** - Slack notification template\n\n## Critical Constraints\n\n1. **READ-ONLY MODE**: Generate reports only. Never modify JIRA issues automatically.\n2. 
**Timeout**: Complete analysis within 300 seconds (5 minutes)\n3. **Issue Limit**: Process 10-20 issues per session (prioritize most recent)\n4. **Confidence Threshold**: Highlight recommendations ≥80% confidence\n5. **No BigQuery**: Use JIRA MCP and similar issue search only\n\n## Reference Data Sources\n\nAlways consult these reference files for domain knowledge:\n\n- `reference/CODEOWNERS-patterns.md` - File path → team mappings\n- `reference/error-signatures.md` - Error pattern → team mappings with confidence\n- `reference/team-mappings.md` - Component/service → team ownership\n- `reference/vulnerability-decision-tree.md` - Complete ProdSec workflow\n- `reference/flaky-test-patterns.md` - Known flaky test patterns and thresholds\n\n## Best Practices\n\n1. **Always check reference files first** before making team assignment decisions\n2. **Use highest confidence strategy** that matches (CODEOWNERS > Error Signature > Service Ownership > History > Test Category)\n3. **Document reasoning** in triage reports (why this team, what evidence, confidence level)\n4. **Flag low confidence** (<70%) for manual review\n5. **Preserve context** from issue description for human reviewers\n6. **Batch similar issues** in reports for efficiency\n\n## Error Handling\n\n- **JIRA timeout**: Process what you have, note incomplete in report\n- **Unknown issue type**: Mark as UNKNOWN, include raw description for manual triage\n- **No team match**: Use \"Needs Manual Assignment\" with evidence summary\n- **Duplicate detection**: Search JIRA for similar summaries/CVEs before recommending closure",
"startupPrompt": "Greet the user and introduce yourself as an ACS Triage Specialist. Briefly explain that you analyze untriaged StackRox/ACS JIRA issues (CI failures, vulnerabilities, flaky tests) and generate triage reports with intelligent team assignments using confidence scoring. List the available commands (/fetch-issues, /classify, /analyze-ci, /analyze-vuln, /analyze-flaky, /assign-team, /generate-report) and ask what they'd like to work on. Mention that you operate in READ-ONLY mode and provide recommendations without modifying JIRA directly."
}
149 changes: 149 additions & 0 deletions .claude/commands/analyze-ci.md
@@ -0,0 +1,149 @@
# /analyze-ci - Analyze CI Failure

## Purpose

Deep analysis of CI failures with error classification, file path extraction, and error signature matching. This command enriches CI_FAILURE issues with technical details needed for accurate team assignment.

## Prerequisites

- artifacts/acs-triage/issues.json exists with type="CI_FAILURE" issues
- `/setup` completed to access stackrox-ci-failure-investigator.md
- `/classify` completed

## Process

1. **Filter CI Failure Issues**
- Read artifacts/acs-triage/issues.json
- Process only issues where type = "CI_FAILURE"

2. **Extract Failure Information**

From description and comments, extract:

a. **Build Metadata**
- Build ID (numeric, e.g., 1963388448995807232)
- Job name (pull-ci-stackrox-stackrox-*)
- PR number
- Test name

b. **Error Messages**
- Primary error message (first ERROR/FATAL line)
- Full error context (surrounding lines)
- Error patterns (panic, FATAL, timeout, etc.)

c. **Stack Traces**
- Goroutine stack traces (if panic)
- File paths with line numbers
- Function names in call stack

d. **File Paths**
- Extract all file paths mentioned in logs
- Example: `central/graphql/resolvers/policies.go:142`
- Normalize paths (remove line numbers for matching)

3. **Classify Error Type**

Check description/comments for patterns:

**GraphQL Errors** (90% confidence → @stackrox/core-workflows)
- "GraphQL schema validation"
- "Cannot query field"
- "__Schema"
- "placeholder Boolean"

**Service Crashes** (85% confidence, team depends on service)
- "panic:"
- "FATAL"
- "nil pointer dereference"
- Extract service name from stack trace

**Timeout/Performance** (80% confidence → @stackrox/collector)
- "deadline exceeded"
- "context deadline"
- "timeout"
- "Timed out after"

**Network Issues** (80% confidence → @stackrox/collector)
- "connection refused"
- "dial tcp"
- "DNS resolution failed"
- "network unreachable"

**Image/Scanning** (85% confidence → @stackrox/scanner)
- "image pull"
- "scanner"
- "vulnerability detection"
- "registry error"

**Test Infrastructure** (75% confidence → @stackrox/core-workflows)
- "cluster provision"
- "namespace creation"
- "test setup failed"

4. **Load Error Signatures**
- Read `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md`
- Extract additional error patterns and team mappings
- Match against issue description/comments

5. **Check for Known Flaky Patterns**
- Cross-reference test name against known flaky tests
- If match found, note pattern and historical frequency
- A match may warrant reclassifying the issue as FLAKY_TEST

6. **Enrich Issue Object**
Add CI-specific fields:
```json
{
  "ci_analysis": {
    "build_id": "1963388448995807232",
    "job_name": "pull-ci-stackrox-stackrox-master-e2e-tests",
    "pr_number": "12345",
    "test_name": "TestGlobalSearchLatestTag",
    "error_type": "GraphQL",
    "error_message": "GraphQL schema validation failed",
    "file_paths": ["ui/apps/platform/src/queries/policies.ts", "central/graphql/resolvers/policies.go"],
    "stack_trace_summary": "panic in graphql resolver",
    "error_signature_match": {
      "pattern": "GraphQL schema validation",
      "confidence": 90,
      "suggested_team": "@stackrox/core-workflows"
    },
    "known_flaky": false
  }
}
```
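
The pattern checks in step 3 can be sketched as a simple lookup table. This is a minimal Python sketch, not the workflow's actual implementation: the `classify_error` helper, the error-type labels other than `GraphQL`, the case-insensitive substring matching, and the first-match-wins ordering are all assumptions made for illustration.

```python
# Pattern table transcribed from step 3. Each entry is
# (error_type, confidence, suggested_team, patterns).
# Labels other than "GraphQL" are hypothetical names for this sketch.
ERROR_TYPES = [
    ("GraphQL", 90, "@stackrox/core-workflows",
     ["GraphQL schema validation", "Cannot query field", "__Schema", "placeholder Boolean"]),
    # Team is None here because it depends on the service found in the stack trace.
    ("ServiceCrash", 85, None,
     ["panic:", "FATAL", "nil pointer dereference"]),
    ("Timeout", 80, "@stackrox/collector",
     ["deadline exceeded", "context deadline", "timeout", "Timed out after"]),
    ("Network", 80, "@stackrox/collector",
     ["connection refused", "dial tcp", "DNS resolution failed", "network unreachable"]),
    ("ImageScanning", 85, "@stackrox/scanner",
     ["image pull", "scanner", "vulnerability detection", "registry error"]),
    ("TestInfrastructure", 75, "@stackrox/core-workflows",
     ["cluster provision", "namespace creation", "test setup failed"]),
]


def classify_error(text):
    """Return the first matching error signature, or None if nothing matches.

    Assumption: matching is a case-insensitive substring test and the
    table order decides ties.
    """
    lowered = text.lower()
    for error_type, confidence, team, patterns in ERROR_TYPES:
        for pattern in patterns:
            if pattern.lower() in lowered:
                return {
                    "error_type": error_type,
                    "pattern": pattern,
                    "confidence": confidence,
                    "suggested_team": team,
                }
    return None
```

A result with `suggested_team` set to `None` signals that the service name still has to be extracted from the stack trace, as the Service Crashes entry describes.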

## Output

- **artifacts/acs-triage/issues.json** - Updated with ci_analysis field for CI_FAILURE issues

## Usage Examples

Basic usage:
```
/analyze-ci
```

## Success Criteria

After running this command, you should have:
- [ ] All CI_FAILURE issues enriched with ci_analysis data
- [ ] Error types classified
- [ ] File paths extracted and normalized
- [ ] Error signatures matched
- [ ] Known flaky patterns checked

## Next Steps

After CI analysis:
1. Run `/assign-team` to perform multi-strategy team assignment
2. Error signature matches provide 85-90% confidence team assignments

## Notes

- Some CI failures may not have clear file paths - use error signatures
- Panics typically have the best file path information, taken from stack traces
- Timeout errors often lack specific file paths - use service name
- Known flaky patterns may suggest reclassifying to FLAKY_TEST
- Version mismatches from `/classify` don't affect error classification (errors are stable across versions)
- Build IDs are useful for manual investigation but not used in automated triage
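
As an illustration of the path extraction and normalization in step 2d, here is a minimal Python sketch. The regular expression and the `extract_file_paths` helper are hypothetical. The sketch assumes repository-relative paths such as `central/graphql/resolvers/policies.go:142` and drops the trailing line number so that the same file is reported only once.

```python
import re

# Assumed shape of a path: one or more "segment/" parts, a file name with an
# extension, and an optional ":<line>" suffix that is matched but not captured.
# The lookbehind stops the match from starting in the middle of a longer token.
PATH_RE = re.compile(r"(?<![\w./-])((?:[\w.-]+/)+[\w.-]+\.[A-Za-z0-9]+)(?::\d+)?")


def extract_file_paths(log_text):
    """Return unique, line-number-free paths in order of first appearance."""
    paths = []
    for match in PATH_RE.finditer(log_text):
        path = match.group(1)  # group 1 excludes the ":<line>" suffix
        if path not in paths:
            paths.append(path)
    return paths
```

Absolute paths and URLs are deliberately not handled here; a real implementation would likely need to strip a build-root prefix first.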
149 changes: 149 additions & 0 deletions .claude/commands/analyze-flaky.md
@@ -0,0 +1,149 @@
# /analyze-flaky - Analyze Flaky Test

## Purpose

Pattern matching and frequency analysis for flaky tests. Identifies known flaky test patterns, estimates failure frequency from JIRA history, and assigns to test owners.

## Prerequisites

- artifacts/acs-triage/issues.json exists with type="FLAKY_TEST" issues
- `/setup` completed to access stackrox-ci-failure-investigator.md
- `/classify` completed
- JIRA MCP access for historical search

## Process

1. **Filter Flaky Test Issues**
- Read artifacts/acs-triage/issues.json
- Process only issues where type = "FLAKY_TEST"

2. **Extract Test Information**

From summary and description, extract:
- **Test name**: Full test name (e.g., TestGlobalSearchLatestTag)
- **Test file**: File path if available (e.g., tests/e2e/search_test.go)
- **Test category**: E2E, integration, unit, etc.
- **Failure pattern**: Specific assertion or error that fails

3. **Match Known Flaky Patterns**

Read `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md` and check for known patterns:

**Known Flaky Tests:**
- **GlobalSearch Latest Tag** → @stackrox/ui (ROX-5355)
- Pattern: DNS timing issue
- Frequency: High

- **PolicyFieldsTest Process UID** → @stackrox/core-workflows (ROX-5298)
- Pattern: Timing-dependent validation
- Frequency: Medium

- **NetworkFlowTest connections** → @stackrox/collector
- Pattern: Network timing
- Frequency: High

- **ImageScanningTest registries** → @stackrox/scanner
- Pattern: Registry timing
- Frequency: Medium

- **SACTest SSH Port** → @stackrox/sensor-ecosystem
- Pattern: Permission timing
- Frequency: Medium

If test matches known pattern:
- Set known_flaky_pattern = true
- Use documented team and historical issue reference
- Note the root cause from pattern documentation

4. **Estimate Failure Frequency**

Search JIRA for historical occurrences:
- Query: `project = ROX AND summary ~ "TestName" AND created >= -30d AND labels = CI_Failure`
- Count results in last 30 days

**Frequency Classification:**
- **High**: >10 occurrences in 30 days
- **Medium**: 3-10 occurrences in 30 days
- **Low**: <3 occurrences in 30 days

Note: This is an estimate based on JIRA issues only. Actual frequency may be higher, because many failures never result in a ticket.

5. **Assign to Test Owner**

Priority order for team assignment:

a. **Use Known Pattern Team** (95% confidence)
- If test matches known flaky pattern
- Use documented team assignment

b. **Use CODEOWNERS for Test File** (90% confidence)
- If test file path is known
- Read `/tmp/triage/stackrox/.github/CODEOWNERS`
- Match test file path to team

c. **Use Test Category** (70% confidence)
- E2E tests → @stackrox/ui
- Integration tests → Service owner
- Unit tests → Component owner

d. **Fallback to Service Name** (70% confidence)
- Extract service from test name
- Use service ownership mapping

6. **Enrich Issue Object**
Add flaky test-specific fields:
```json
{
  "flaky_analysis": {
    "test_name": "TestGlobalSearchLatestTag",
    "test_file": "tests/e2e/search_test.go",
    "test_category": "e2e",
    "known_flaky_pattern": true,
    "pattern_reference": "ROX-5355",
    "root_cause": "DNS timing issue in GlobalSearch",
    "failure_frequency": {
      "count_30d": 12,
      "classification": "High",
      "trend": "increasing"
    },
    "assigned_team": "@stackrox/ui",
    "confidence": 95,
    "assignment_strategy": "known_pattern"
  }
}
```
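
The frequency estimate in step 4 amounts to building one JQL query and bucketing the result count. The following Python sketch shows both pieces; the helper names `build_history_jql` and `classify_frequency` are hypothetical, and running the query against JIRA is left out because it depends on the JIRA MCP setup.

```python
def build_history_jql(test_name, days=30):
    """Build the historical-occurrence query from step 4 for one test name."""
    return (
        f'project = ROX AND summary ~ "{test_name}" '
        f"AND created >= -{days}d AND labels = CI_Failure"
    )


def classify_frequency(count_30d):
    """Map a 30-day occurrence count to High / Medium / Low.

    Thresholds follow the Frequency Classification list:
    more than 10 is High, 3 to 10 is Medium, fewer than 3 is Low.
    """
    if count_30d > 10:
        return "High"
    if count_30d >= 3:
        return "Medium"
    return "Low"
```

With the count of 12 from the example object above, `classify_frequency(12)` yields `"High"`, which matches the `classification` field shown there.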

## Output

- **artifacts/acs-triage/issues.json** - Updated with flaky_analysis field for FLAKY_TEST issues

## Usage Examples

Basic usage:
```
/analyze-flaky
```

## Success Criteria

After running this command, you should have:
- [ ] All FLAKY_TEST issues enriched with flaky_analysis data
- [ ] Known patterns matched where applicable
- [ ] Failure frequency estimated from JIRA history
- [ ] Test owner assigned with confidence score

## Next Steps

After flaky test analysis:
1. Run `/assign-team` for final confidence scoring (if not using known pattern)
2. High-frequency flaky tests should be prioritized for fixing

## Notes

- Known pattern matches have highest confidence (95%)
- Frequency estimation is conservative (only counts JIRA issues, not all CI runs)
- Some flaky tests may not be in known patterns - use CODEOWNERS fallback
- High-frequency flakes (>10/month) should be fixed, or the test should be disabled
- Test file paths from CI logs are most reliable for CODEOWNERS matching
- Version mismatch from `/classify` affects CODEOWNERS matching confidence
- Trends (increasing/decreasing frequency) help prioritize fixes
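
The priority order from step 5 can be read as a first-match-wins chain. Below is a minimal Python sketch under that reading. The `assign_flaky_owner` helper and the strategy names other than `known_pattern` are assumptions, each argument stands for the team that an earlier lookup produced (or `None` if that lookup found nothing), and the zero confidence for the unassigned case is also an assumption.

```python
from typing import Optional


def assign_flaky_owner(
    known_pattern_team: Optional[str] = None,
    codeowners_team: Optional[str] = None,
    category_team: Optional[str] = None,
    service_team: Optional[str] = None,
) -> dict:
    """Pick the highest-priority strategy that produced a team."""
    # Ordered as in step 5: known pattern, CODEOWNERS, test category, service name.
    strategies = [
        ("known_pattern", 95, known_pattern_team),
        ("codeowners", 90, codeowners_team),
        ("test_category", 70, category_team),
        ("service_name", 70, service_team),
    ]
    for strategy, confidence, team in strategies:
        if team:
            return {
                "assigned_team": team,
                "confidence": confidence,
                "assignment_strategy": strategy,
            }
    # No strategy matched: leave the issue for a human reviewer.
    return {
        "assigned_team": "Needs Manual Assignment",
        "confidence": 0,
        "assignment_strategy": "none",
    }
```

For example, `assign_flaky_owner(codeowners_team="@stackrox/scanner")` would report a 90% confidence CODEOWNERS assignment, while a call with no arguments falls through to manual assignment.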