Make retry behavior configurable #50

@blindzero

Problem Statement

Retry behavior in Invoke-IdlePlanObject is hardcoded. This is too rigid for some hosts and needs a more flexible approach.

Summary

Make IdLE's retry policy configurable by the host (executor parameters / engine options), while keeping retries safe-by-default (retry only for errors explicitly marked as transient).

This is a follow-up to Issue #11: Execution: safe retries for transient failures (fail-fast).

Motivation

Right now the engine uses hardcoded defaults for retry behavior (attempt count, backoff, delay). Different hosts/environments may need different values (e.g., CI vs. production, slow vs. fast providers), but we must avoid exposing timing/DoS knobs to untrusted workflow definitions.

Goals

  • Allow the host to configure retry behavior per run.
  • Preserve safe-by-default semantics:
    • Retry only when Exception.Data['Idle.IsTransient'] = $true (or equivalent marker).
    • Fail fast for non-transient errors.
  • Validate and normalize options deterministically.
  • Keep the core headless (no host-specific dependencies).
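
As an illustration of the marker-based rule above, a minimal sketch of the transient check could look like this. The helper name Test-IdleTransientError is hypothetical, and treating only a real boolean $true as the marker is an assumption (the issue leaves room for an "equivalent marker").

```powershell
# Hypothetical helper; the name Test-IdleTransientError is not taken from the IdLE codebase.
function Test-IdleTransientError {
    param(
        [Parameter(Mandatory)]
        [System.Exception] $Exception
    )

    # Assumption: only an explicit boolean $true counts as the marker.
    # A missing key or any other value (including the string 'true') means "do not retry".
    $marker = $Exception.Data['Idle.IsTransient']
    return ($marker -is [bool]) -and $marker
}

$transient = [System.Exception]::new('temporary outage')
$transient.Data['Idle.IsTransient'] = $true

$permanent = [System.Exception]::new('invalid credentials')

Test-IdleTransientError -Exception $transient   # True
Test-IdleTransientError -Exception $permanent   # False
```

Anything that is not explicitly marked falls through to the fail-fast path, which keeps the default safe even when a provider forgets to set the marker.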

Non-goals

  • Do not make retry policy configurable via workflow/plan definitions.
  • Do not add provider-specific heuristics for transient detection in this issue.
  • Do not implement per-step/per-step-type override rules (can be a later enhancement).

Proposal

Public API

Add an optional parameter to the public executor(s), e.g.:

  • Invoke-IdlePlan and Invoke-IdlePlanObject:
    • -ExecutionOptions (preferred) or -RetryPolicy

Example shape:

$executionOptions = @{
  RetryPolicy = @{
    MaxAttempts = 3
    InitialDelayMilliseconds = 250
    BackoffFactor = 2.0
    MaxDelayMilliseconds = 5000
    JitterRatio = 0.2
  }
}

Invoke-IdlePlanObject -Plan $plan -Providers $providers -ExecutionOptions $executionOptions

Validation

  • Reject ScriptBlocks in options (same rule as for plan/provider inputs).
  • Validate ranges:
    • MaxAttempts: 1..50
    • InitialDelayMilliseconds: 0..600000
    • BackoffFactor: 1.0..100.0
    • MaxDelayMilliseconds: 0..600000
    • JitterRatio: 0.0..1.0
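
The validation rules above could be sketched roughly as follows. Assert-IdleRetryPolicy is a hypothetical name, and two behaviors are assumptions that go beyond the issue text: unknown keys are rejected, and non-numeric values are rejected rather than coerced.

```powershell
# Hypothetical validator; the name and error messages are illustrative only.
function Assert-IdleRetryPolicy {
    param(
        [Parameter(Mandatory)]
        [hashtable] $RetryPolicy
    )

    # Allowed ranges, copied from the list above.
    $ranges = @{
        MaxAttempts              = @(1, 50)
        InitialDelayMilliseconds = @(0, 600000)
        BackoffFactor            = @(1.0, 100.0)
        MaxDelayMilliseconds     = @(0, 600000)
        JitterRatio              = @(0.0, 1.0)
    }

    foreach ($key in $RetryPolicy.Keys) {
        $value = $RetryPolicy[$key]

        if ($value -is [scriptblock]) {
            throw "RetryPolicy.$key must not be a ScriptBlock."
        }
        # Assumption: unknown keys are an error instead of being ignored.
        if (-not $ranges.ContainsKey($key)) {
            throw "RetryPolicy.$key is not a known setting."
        }
        # Assumption: require a numeric type so strings are not silently coerced.
        if ($value -isnot [int] -and $value -isnot [long] -and
            $value -isnot [double] -and $value -isnot [decimal]) {
            throw "RetryPolicy.$key must be numeric."
        }

        $min, $max = $ranges[$key]
        if ($value -lt $min -or $value -gt $max) {
            throw "RetryPolicy.$key must be between $min and $max (got $value)."
        }
    }
}

# A valid policy passes silently.
Assert-IdleRetryPolicy -RetryPolicy @{ MaxAttempts = 3; JitterRatio = 0.2 }

# An out-of-range value throws.
try { Assert-IdleRetryPolicy -RetryPolicy @{ MaxAttempts = 0 } }
catch { Write-Host "Rejected: $($_.Exception.Message)" }
```

This sketch only inspects the top level of the hashtable. A real implementation would probably reuse whatever recursive ScriptBlock check already guards plan/provider inputs.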

Defaults

If the host does not supply a policy, use the current engine defaults:

  • MaxAttempts=3
  • InitialDelayMilliseconds=250
  • BackoffFactor=2.0
  • MaxDelayMilliseconds=5000
  • JitterRatio=0.2
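
To make the interaction of these values concrete, here is a rough sketch of how a host-supplied policy could be merged with the defaults and turned into a per-attempt delay. Both function names are hypothetical, and the delay formula (exponential backoff, capped, then symmetric jitter) is an assumption. The issue does not specify how the engine combines the fields.

```powershell
# Hypothetical helpers; the delay formula is an assumption, not taken from the engine.
function Resolve-IdleRetryPolicy {
    param([hashtable] $RetryPolicy = @{})

    # Current engine defaults as listed above.
    $resolved = @{
        MaxAttempts              = 3
        InitialDelayMilliseconds = 250
        BackoffFactor            = 2.0
        MaxDelayMilliseconds     = 5000
        JitterRatio              = 0.2
    }
    # Host-supplied values override the defaults key by key.
    foreach ($key in $RetryPolicy.Keys) { $resolved[$key] = $RetryPolicy[$key] }
    return $resolved
}

function Get-IdleRetryDelay {
    param(
        [Parameter(Mandatory)] [hashtable] $Policy,
        [Parameter(Mandatory)] [int] $Attempt   # 1 = delay after the first failed attempt
    )

    # Exponential backoff, capped at MaxDelayMilliseconds.
    $base   = $Policy.InitialDelayMilliseconds * [math]::Pow($Policy.BackoffFactor, $Attempt - 1)
    $capped = [math]::Min($base, $Policy.MaxDelayMilliseconds)

    # Symmetric jitter: scale by a random factor in [1 - ratio, 1 + ratio).
    $jitter = 1 + (((Get-Random -Minimum 0.0 -Maximum 1.0) * 2 - 1) * $Policy.JitterRatio)
    return [int][math]::Round($capped * $jitter)
}

# With jitter disabled the sequence is deterministic.
$policy = Resolve-IdleRetryPolicy -RetryPolicy @{ JitterRatio = 0.0 }
1..3 | ForEach-Object { Get-IdleRetryDelay -Policy $policy -Attempt $_ }   # 250, 500, 1000
```

Note that jitter is applied after the cap here, so a jittered delay can exceed MaxDelayMilliseconds by up to JitterRatio. Whether the cap should be a hard upper bound is a design decision this issue would need to settle.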

Security considerations

  • Options are trusted host input, not workflow input.
  • Workflow and plan definitions remain data-only and must not control retry timing.
  • Transient detection remains marker-based to avoid unsafe retries.

Testing

Add/extend Pester tests:

  • Host can set MaxAttempts=1 and transient errors fail without retry.
  • Host can set MaxAttempts=3 and a transient error retries and succeeds.
  • Invalid values are rejected (range validation).
  • Options with ScriptBlocks are rejected.

Mock Start-Sleep to keep tests fast.
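
A possible shape for the first two test cases, assuming Pester 5. Invoke-WithRetry and New-TransientError are hypothetical stand-ins so the sketch is self-contained. The real tests would call Invoke-IdlePlanObject with -ExecutionOptions instead.

```powershell
# Pester 5 sketch; Invoke-WithRetry is a hypothetical stand-in for the engine's retry loop.
BeforeAll {
    function Invoke-WithRetry {
        param([scriptblock] $Action, [int] $MaxAttempts = 3, [int] $DelayMilliseconds = 250)

        for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
            try {
                return & $Action
            }
            catch {
                # Retry only when the exception carries the transient marker.
                $isTransient = $_.Exception.Data['Idle.IsTransient'] -eq $true
                if (-not $isTransient -or $attempt -eq $MaxAttempts) { throw }
                Start-Sleep -Milliseconds $DelayMilliseconds
            }
        }
    }

    function New-TransientError {
        $ex = [System.InvalidOperationException]::new('temporary outage')
        $ex.Data['Idle.IsTransient'] = $true
        return $ex
    }
}

Describe 'Retry policy' {
    BeforeEach {
        # Replace the real sleep so the tests do not wait.
        Mock Start-Sleep { }
        $script:calls = 0
    }

    It 'does not retry when MaxAttempts is 1' {
        { Invoke-WithRetry -MaxAttempts 1 -Action { $script:calls++; throw (New-TransientError) } } |
            Should -Throw
        $script:calls | Should -Be 1
        Should -Invoke Start-Sleep -Times 0 -Exactly
    }

    It 'retries a transient error and then succeeds' {
        $result = Invoke-WithRetry -MaxAttempts 3 -Action {
            $script:calls++
            if ($script:calls -lt 2) { throw (New-TransientError) }
            'ok'
        }
        $result | Should -Be 'ok'
        Should -Invoke Start-Sleep -Times 1 -Exactly
    }
}
```

Asserting on the Start-Sleep call count checks the retry behavior directly, without depending on wall-clock timing.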

Documentation

  • Update cmdlet reference (platyPS) after public parameter changes.
  • Add a short note in docs about "host-owned execution options" (wherever execution policy is described).

Definition of Done

  • Public API updated with optional execution options.
  • Input validation added (no script blocks, range checks).
  • Unit tests added/updated (Linux + Windows CI green).
  • Docs regenerated/updated as needed.
  • No changes to workflow/plan schema required.
