Description
name: Feature request
about: Deterministic execution, replay, and audit gaps for long-running Semantic Kernel agent workflows
Issue Description
Semantic Kernel is increasingly used to orchestrate multi-step agent workflows that interact with external systems. However, there is currently no deterministic execution model that supports replayability, auditable state transitions, or safe recovery for long-running workflows.
When a workflow partially executes and fails, there is no reliable way to determine:
- Which steps completed
- Which external side effects occurred
- What memory state existed at the time of failure
- How to replay the workflow deterministically
This limits Semantic Kernel adoption in enterprise and regulated environments.
Scenario description
I'm using Semantic Kernel to orchestrate AI-assisted workflows with these characteristics:
- Multi-step planning and execution
- Tool invocation with side effects (APIs, storage, queues)
- Memory reads and writes during execution
- Long-running execution (minutes to hours)
- Possible restarts, retries, or human-in-the-loop pauses
The workflow must be auditable, replayable, and recoverable.
Example code
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4", endpoint, apiKey)
    .Build();
// Ask the planner to turn the goal into a multi-step plan.
var planner = new SequentialPlanner(kernel);
var plan = await planner.CreatePlanAsync("""
    1. Analyze the incoming request
    2. Retrieve customer data
    3. Call external credit service
    4. Persist decision
    5. Notify downstream systems
    """);
// Correlation identifiers made available to every step.
var context = new KernelArguments
{
    ["requestId"] = "REQ-123",
    ["tenantId"] = "TENANT-A"
};
// All steps run back to back; nothing is checkpointed between them.
await kernel.RunAsync(plan, context);
Failure scenario
If the process crashes or restarts after step 3:
- Step 3 may already have triggered an external side effect.
- Memory may have been mutated.
- Steps 4 and 5 may or may not have executed.
- There is no execution record that clearly captures what happened.
Attempting to rerun the plan risks:
- duplicating side effects,
- violating idempotency,
- producing inconsistent results.
Expected behavior
Semantic Kernel should provide primitives that allow:
- Explicit execution boundaries and checkpoints
- Deterministic replay of workflows with identical inputs
- Clear distinction between reasoning steps and side-effecting steps
- Auditable execution history tied to memory state
- Safe resume or compensation strategies after partial failure
Actual behavior
- Execution has no explicit checkpoints.
- Side effects are not tracked or classified.
- Memory mutations are not versioned.
- There is no built-in execution log or replay mechanism.
- Developers must build custom infrastructure around Semantic Kernel to handle these concerns.
Proposed solution
Introduce a first-class execution model for Semantic Kernel.
- Execution step abstraction - Each step should be explicitly modeled:
public record ExecutionStep(
string StepId,
string FunctionName,
ExecutionStepType StepType, // Reasoning | SideEffect | Idempotent
ExecutionStatus Status,
DateTimeOffset Timestamp
);
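The record above references two enums that are not spelled out. A minimal sketch of how they might look: the `ExecutionStepType` members come from the comment in the record, while every `ExecutionStatus` member is an assumption made for illustration, not an existing Semantic Kernel type.

```csharp
using System;

// Usage: describe a completed step that called an external system.
var step = new ExecutionStep(
    StepId: "step-3",
    FunctionName: "CallCreditService", // hypothetical function name
    StepType: ExecutionStepType.SideEffect,
    Status: ExecutionStatus.Completed,
    Timestamp: DateTimeOffset.UtcNow);

Console.WriteLine($"{step.StepId}: {step.StepType} / {step.Status}");

// Members taken from the comment in the proposed record.
public enum ExecutionStepType { Reasoning, SideEffect, Idempotent }

// Hypothetical status values; the proposal does not enumerate them.
public enum ExecutionStatus { Pending, Running, Completed, Failed, Compensated }

public record ExecutionStep(
    string StepId,
    string FunctionName,
    ExecutionStepType StepType,
    ExecutionStatus Status,
    DateTimeOffset Timestamp);
```

A `Compensated` status is included only to show where a compensation strategy could surface in the execution history.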
- Execution checkpoints - Allow workflows to declare checkpoints:
kernel.Options.EnableCheckpoints = true;
kernel.Options.CheckpointInterval = CheckpointInterval.AfterEachStep;
Checkpoints should capture:
- memory snapshot
- step execution state
- inputs and outputs
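To make the checkpoint contents concrete, here is a rough sketch of a checkpoint record and an in-memory store. `Checkpoint`, `InMemoryCheckpointStore`, `Save`, and `Latest` are hypothetical names, and a real implementation would presumably persist to durable storage rather than a dictionary.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: save a checkpoint after step 3, then decide where to resume.
var store = new InMemoryCheckpointStore();
store.Save(new Checkpoint(
    ExecutionId: "exec-123",
    StepId: "step-3",
    Sequence: 3,
    MemorySnapshot: new Dictionary<string, string> { ["creditScore"] = "720" },
    Input: "REQ-123",
    Output: "approved"));

var last = store.Latest("exec-123");
Console.WriteLine(last is null ? "start from the beginning" : $"resume after {last.StepId}");

// Hypothetical shape: memory snapshot, step state, inputs and outputs.
public record Checkpoint(
    string ExecutionId,
    string StepId,
    int Sequence,
    IReadOnlyDictionary<string, string> MemorySnapshot,
    string Input,
    string Output);

public sealed class InMemoryCheckpointStore
{
    private readonly Dictionary<string, List<Checkpoint>> _byExecution = new();

    public void Save(Checkpoint checkpoint)
    {
        if (!_byExecution.TryGetValue(checkpoint.ExecutionId, out var list))
        {
            list = new List<Checkpoint>();
            _byExecution[checkpoint.ExecutionId] = list;
        }
        list.Add(checkpoint);
    }

    // Returns the checkpoint with the highest sequence number, or null if none exist.
    public Checkpoint? Latest(string executionId) =>
        _byExecution.TryGetValue(executionId, out var list) && list.Count > 0
            ? list.OrderBy(c => c.Sequence).Last()
            : null;
}
```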
- Deterministic replay mode - Provide a replay API:
await kernel.ReplayAsync(
executionId: "exec-123",
ReplayMode.Deterministic
);
Replay mode would:
- reuse recorded decisions and tool outputs (when safe),
- avoid re-triggering side effects,
- allow inspection and debugging.
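One way the "reuse recorded outputs" behaviour could work is a journal keyed by step ID: the first run invokes the tool and records the result, and any later run of the same step returns the recorded value instead of calling the tool again. This is only a sketch; `ExecutionJournal` and `RunStepAsync` are hypothetical names, not part of Semantic Kernel.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var journal = new ExecutionJournal();
var calls = 0;

// Stand-in for a side-effecting tool such as the external credit service.
Func<Task<string>> creditCheck = () =>
{
    calls++;
    return Task.FromResult("approved");
};

// Live run: the tool executes and its output is recorded.
var live = await journal.RunStepAsync("step-3", creditCheck);
// Replay: the recorded output is returned and the tool is not invoked.
var replayed = await journal.RunStepAsync("step-3", creditCheck);

Console.WriteLine($"{live} {replayed} calls={calls}"); // prints "approved approved calls=1"

public sealed class ExecutionJournal
{
    private readonly Dictionary<string, string> _recordedOutputs = new();

    public async Task<string> RunStepAsync(string stepId, Func<Task<string>> tool)
    {
        // A recorded output means the step already ran; reuse it.
        if (_recordedOutputs.TryGetValue(stepId, out var recorded))
        {
            return recorded;
        }

        var output = await tool();
        _recordedOutputs[stepId] = output;
        return output;
    }
}
```

The sketch does not cover a crash between the tool call and the journal write; closing that window would likely need idempotency keys on the external call or an intent record written before invoking the tool.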
- Side-effect classification for tools - Allow tool authors to declare behavior:
[KernelFunction(SideEffect = SideEffectType.External)]
public async Task NotifyDownstreamAsync(...) { }
This enables:
- safe retries,
- compensation logic,
- replay without duplication.
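Until something like this exists on `[KernelFunction]`, the classification could be prototyped with a separate custom attribute that a runtime reads via reflection. Everything below is hypothetical: `SideEffectAttribute`, `SideEffects`, and the `None` and `Idempotent` enum members are assumptions; only `SideEffectType.External` appears in the proposal above.

```csharp
#nullable enable
using System;
using System.Reflection;

var method = typeof(NotificationTools).GetMethod(nameof(NotificationTools.NotifyDownstream))!;
var kind = SideEffects.Classify(method);
Console.WriteLine($"{method.Name}: {kind}, safe to retry: {SideEffects.IsSafeToRetry(kind)}");

public enum SideEffectType { None, Idempotent, External }

// Hypothetical marker attribute declared by the tool author.
[AttributeUsage(AttributeTargets.Method)]
public sealed class SideEffectAttribute : Attribute
{
    public SideEffectAttribute(SideEffectType kind) => Kind = kind;
    public SideEffectType Kind { get; }
}

public static class SideEffects
{
    // Unannotated methods are treated as External, the conservative default.
    public static SideEffectType Classify(MethodInfo method) =>
        method.GetCustomAttribute<SideEffectAttribute>()?.Kind ?? SideEffectType.External;

    // Only steps without external effects are retried blindly.
    public static bool IsSafeToRetry(SideEffectType kind) => kind != SideEffectType.External;
}

public static class NotificationTools
{
    [SideEffect(SideEffectType.External)]
    public static void NotifyDownstream(string requestId) { }

    [SideEffect(SideEffectType.None)]
    public static string Summarize(string text) => text;
}
```

Defaulting unannotated functions to `External` is a design choice in this sketch: it trades some unnecessary caution for never retrying an unknown side effect by accident.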
- Versioned memory snapshots - Memory should be versioned per execution step:
var snapshot = kernel.Memory.GetSnapshot(stepId);
This allows:
- forensic audit,
- regulatory review,
- decision explainability.
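As an illustration of what per-step versioning could mean, the sketch below keeps memory in an immutable dictionary and stores a reference to the current version whenever a step commits. `VersionedMemory`, `Set`, and `Commit` are hypothetical names; only `GetSnapshot(stepId)` mirrors the call proposed above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Collections.Immutable;

var memory = new VersionedMemory();
memory.Set("creditScore", "pending");
memory.Commit("step-2");
memory.Set("creditScore", "720");
memory.Commit("step-3");

// Later writes do not alter earlier snapshots.
Console.WriteLine(memory.GetSnapshot("step-2")["creditScore"]); // prints "pending"
Console.WriteLine(memory.GetSnapshot("step-3")["creditScore"]); // prints "720"

public sealed class VersionedMemory
{
    private ImmutableDictionary<string, string> _current =
        ImmutableDictionary<string, string>.Empty;

    private readonly Dictionary<string, ImmutableDictionary<string, string>> _snapshots = new();

    // Each write produces a new immutable version; old versions are untouched.
    public void Set(string key, string value) => _current = _current.SetItem(key, value);

    // Associates the current version with the step that just finished.
    public void Commit(string stepId) => _snapshots[stepId] = _current;

    // Throws KeyNotFoundException if the step was never committed.
    public IReadOnlyDictionary<string, string> GetSnapshot(string stepId) => _snapshots[stepId];
}
```

Because immutable dictionaries share structure between versions, taking a snapshot per step costs a reference rather than a full copy.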
Alternatives considered
- External workflow engines (Durable Functions, Temporal, Camunda) can provide replay and checkpoints, but they do not address Semantic Kernel-specific needs such as plan/step semantics, tool invocation classification, and memory evolution tied to execution steps. Teams still end up building a custom “control layer” around SK.
- Application-level logging is insufficient because it does not provide deterministic replay, memory reconstruction, or safe side-effect handling.