-
Notifications
You must be signed in to change notification settings - Fork 15
Open
Labels
wontfixThis will not be worked onThis will not be worked on
Description
Overview
Review and fix deployment protection rule processing to prevent deployments from remaining blocked in "waiting" state without clear reason. Current implementation has several failure modes that leave deployments in an indeterminate state.
Problem Statement
- Deployments remain blocked even when rules pass
- No clear error messages when deployment approval fails
- Silent failures in callback URL processing
- Missing timeout handling for agent execution
- No retry logic for failed GitHub API calls
- Exception handling doesn't ensure deployment status is set
Root Causes Identified
Error Handling Gaps
- Exceptions in
process()method returnsuccess=Falsebut don't call callback URL - Failed
review_deployment_protection_rulecalls don't retry or provide fallback - Missing validation of callback_url and environment before API calls
- No timeout mechanism for agent execution that could hang
Deployment Scheduler Issues
- Time-based violations may not properly re-evaluate and approve
- Missing error handling in re-evaluation logic
- No logging for why deployments remain blocked
API Call Reliability
- No retry logic for transient GitHub API failures
- Missing validation of API response before proceeding
- No fallback mechanism when callback URL is invalid
Requirements
Error Handling Improvements
- Always call callback URL even on exceptions (approve with error message or reject appropriately)
- Add timeout wrapper for agent execution (max 30 seconds)
- Validate callback_url and environment before making API calls
- Implement retry logic for GitHub API calls with exponential backoff
- Add comprehensive error logging with deployment context
Deployment Status Guarantees
- Ensure deployment status is always set (approved or rejected)
- Never leave deployment in "waiting" state without action
- Add fallback approval mechanism for critical failures
- Log all deployment status changes with reasoning
Monitoring and Debugging
- Add structured logging for deployment processing lifecycle
- Include deployment_id, environment, and callback_url in all log messages
- Track deployment processing time and identify bottlenecks
- Add metrics for deployment approval/rejection rates
Code Changes
- Update
src/event_processors/deployment_protection_rule.pyerror handling - Add timeout handling in agent execution path
- Implement retry logic in
_approve_deploymentand_reject_deployment - Review and fix
src/tasks/scheduler/deployment_scheduler.pyre-evaluation logic - Add validation utilities for deployment callback URLs
Implementation Notes
- Use existing
execute_with_timeoututility fromsrc/core/utils/timeout.py - Implement retry logic using
retry_with_backofffromsrc/core/utils/retry.py - Add deployment status tracking for observability
- Ensure backward compatibility with existing deployment flows
- Add unit tests for error scenarios and timeout handling
Acceptance Criteria
- Deployments never remain in "waiting" state without action
- All deployment status changes are logged with clear reasoning
- Failed API calls are retried with exponential backoff
- Agent execution has timeout protection
- Error messages clearly indicate why deployment was blocked or approved
- Comprehensive test coverage for error scenarios
- Deployment scheduler properly handles re-evaluation edge cases
Metadata
Metadata
Assignees
Labels
wontfixThis will not be worked onThis will not be worked on