Skip to content

fix: resolve deployment status blocking without clear reason #26

@dkargatzis

Description

@dkargatzis

Overview

Review and fix deployment protection rule processing to prevent deployments from remaining blocked in "waiting" state without clear reason. Current implementation has several failure modes that leave deployments in an indeterminate state.

Problem Statement

  • Deployments remain blocked even when rules pass
  • No clear error messages when deployment approval fails
  • Silent failures in callback URL processing
  • Missing timeout handling for agent execution
  • No retry logic for failed GitHub API calls
  • Exception handling doesn't ensure deployment status is set

Root Causes Identified

Error Handling Gaps

  • Exceptions in process() method return success=False but don't call callback URL
  • Failed review_deployment_protection_rule calls don't retry or provide fallback
  • Missing validation of callback_url and environment before API calls
  • No timeout mechanism for agent execution that could hang

Deployment Scheduler Issues

  • Time-based violations may not properly re-evaluate and approve
  • Missing error handling in re-evaluation logic
  • No logging for why deployments remain blocked

API Call Reliability

  • No retry logic for transient GitHub API failures
  • Missing validation of API response before proceeding
  • No fallback mechanism when callback URL is invalid

Requirements

Error Handling Improvements

  • Always call callback URL even on exceptions (approve with error message or reject appropriately)
  • Add timeout wrapper for agent execution (max 30 seconds)
  • Validate callback_url and environment before making API calls
  • Implement retry logic for GitHub API calls with exponential backoff
  • Add comprehensive error logging with deployment context

Deployment Status Guarantees

  • Ensure deployment status is always set (approved or rejected)
  • Never leave deployment in "waiting" state without action
  • Add fallback approval mechanism for critical failures
  • Log all deployment status changes with reasoning

Monitoring and Debugging

  • Add structured logging for deployment processing lifecycle
  • Include deployment_id, environment, and callback_url in all log messages
  • Track deployment processing time and identify bottlenecks
  • Add metrics for deployment approval/rejection rates

Code Changes

  • Update src/event_processors/deployment_protection_rule.py error handling
  • Add timeout handling in agent execution path
  • Implement retry logic in _approve_deployment and _reject_deployment
  • Review and fix src/tasks/scheduler/deployment_scheduler.py re-evaluation logic
  • Add validation utilities for deployment callback URLs

Implementation Notes

  • Use existing execute_with_timeout utility from src/core/utils/timeout.py
  • Implement retry logic using retry_with_backoff from src/core/utils/retry.py
  • Add deployment status tracking for observability
  • Ensure backward compatibility with existing deployment flows
  • Add unit tests for error scenarios and timeout handling

Acceptance Criteria

  • Deployments never remain in "waiting" state without action
  • All deployment status changes are logged with clear reasoning
  • Failed API calls are retried with exponential backoff
  • Agent execution has timeout protection
  • Error messages clearly indicate why deployment was blocked or approved
  • Comprehensive test coverage for error scenarios
  • Deployment scheduler properly handles re-evaluation edge cases

Metadata

Metadata

Assignees

No one assigned

    Labels

    wontfixThis will not be worked on

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions