Commit c262cf8
Advanced Workflow Features for vMCP Composition (#2592)
## Overview
Implements advanced workflow features for Virtual MCP Composite Tools, including DAG-based parallel execution, step dependencies, sophisticated error handling, and workflow state management. This completes Phase 2 of the composition work.
**Issue**: Closes #156 (stacklok/stacklok-epics)
## What Changed
### Core Features
#### 1. DAG-Based Parallel Execution
- **New file**: [pkg/vmcp/composer/dag_executor.go](pkg/vmcp/composer/dag_executor.go)
- Implements topological sort using Kahn's algorithm to build execution levels
- Executes independent steps in parallel using `errgroup` for coordination
- Semaphore-based concurrency limiting (default: 10 parallel steps)
- Automatic optimization: steps with no dependencies run concurrently
- Performance improvement: parallel execution reduces workflow time by ~60-70% for independent steps
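The level-building step can be sketched as follows. This is a minimal illustration of Kahn's algorithm, not the code in `dag_executor.go`; the function name `buildLevels` and the `map[string][]string` dependency representation are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// buildLevels groups step IDs into execution levels using Kahn's algorithm.
// deps maps each step ID to the IDs it depends on (hypothetical shape).
// Every step in a level depends only on steps in earlier levels, so all
// steps within one level can run concurrently.
func buildLevels(deps map[string][]string) ([][]string, error) {
	inDegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for id, ds := range deps {
		inDegree[id] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("step %q depends on unknown step %q", id, d)
			}
			dependents[d] = append(dependents[d], id)
		}
	}

	// Steps with no dependencies form the first level.
	var current []string
	for id, n := range inDegree {
		if n == 0 {
			current = append(current, id)
		}
	}

	var levels [][]string
	done := 0
	for len(current) > 0 {
		sort.Strings(current) // deterministic order within a level
		levels = append(levels, current)
		done += len(current)
		var next []string
		for _, id := range current {
			for _, child := range dependents[id] {
				inDegree[child]--
				if inDegree[child] == 0 {
					next = append(next, child)
				}
			}
		}
		current = next
	}
	// Any step never reaching in-degree zero is part of a cycle.
	if done != len(deps) {
		return nil, errors.New("dependency cycle detected")
	}
	return levels, nil
}

func main() {
	levels, err := buildLevels(map[string][]string{
		"fetch_logs":    nil,
		"fetch_metrics": nil,
		"fetch_traces":  nil,
		"correlate":     {"fetch_logs", "fetch_metrics", "fetch_traces"},
		"create_report": {"correlate"},
	})
	fmt.Println(levels, err)
	// [[fetch_logs fetch_metrics fetch_traces] [correlate] [create_report]] <nil>
}
```

Grouping by level rather than emitting a flat order is what makes the parallelism visible: each inner slice is a batch that can be handed to the concurrent executor.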
#### 2. Step Dependencies
- `depends_on` field support in [pkg/vmcp/composer/composer.go:67](pkg/vmcp/composer/composer.go#L67)
- Dependency graph validation with cycle detection using DFS
- Transitive dependencies automatically handled
- Missing dependency validation at workflow definition time
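A rough sketch of the validation described above, using a three-state DFS. The name `validateDeps` and the map-based graph are hypothetical and chosen for illustration; the real validator in `composer.go` may be structured differently.

```go
package main

import "fmt"

// DFS visit states for each step.
const (
	unvisited = iota
	inProgress
	finished
)

// validateDeps rejects references to undefined steps and dependency cycles.
// deps maps each step ID to the IDs listed in its depends_on field.
func validateDeps(deps map[string][]string) error {
	state := make(map[string]int, len(deps))

	var visit func(id string, path []string) error
	visit = func(id string, path []string) error {
		switch state[id] {
		case finished:
			return nil // already proven cycle-free
		case inProgress:
			// Reaching a step that is still on the DFS stack means a cycle.
			return fmt.Errorf("dependency cycle: %v", append(path, id))
		}
		state[id] = inProgress
		for _, d := range deps[id] {
			if _, ok := deps[d]; !ok {
				return fmt.Errorf("step %q depends on unknown step %q", id, d)
			}
			if err := visit(d, append(path, id)); err != nil {
				return err
			}
		}
		state[id] = finished
		return nil
	}

	for id := range deps {
		if err := visit(id, nil); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(validateDeps(map[string][]string{"a": nil, "b": {"a"}}))
	fmt.Println(validateDeps(map[string][]string{"a": {"b"}, "b": {"a"}}))
}
```

Because the DFS follows `depends_on` edges recursively, transitive dependencies are covered without any extra bookkeeping, and a self-reference is caught as a one-step cycle.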
#### 3. Advanced Error Handling
- **Three-level error handling**:
  - Step-level: `on_error.continue_on_error` overrides workflow-level settings
  - Workflow-level: `failure_mode` (abort/continue/best_effort)
  - Automatic: retry with exponential backoff
- **Retry logic** in [pkg/vmcp/composer/workflow_engine.go:311-350](pkg/vmcp/composer/workflow_engine.go#L311-L350):
  - Configurable retry count and initial delay
  - Exponential backoff (2^attempt * initial_delay, max 60x)
  - Safety cap: maximum 10 retries to prevent infinite loops
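The backoff rule can be illustrated with a small sketch. It makes two assumptions that the description does not spell out: that "max 60x" means the delay is capped at 60 times the initial delay, and that `attempt` is 0-based. The names `backoffDelay` and `clampRetries` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxRetries       = 10 // safety cap stated in the description
	maxDelayMultiple = 60 // assumed meaning of "max 60x"
)

// clampRetries bounds a configured retry count to [0, maxRetries].
func clampRetries(n int) int {
	if n < 0 {
		return 0
	}
	if n > maxRetries {
		return maxRetries
	}
	return n
}

// backoffDelay returns 2^attempt * initial, capped at maxDelayMultiple * initial.
// attempt is assumed to be 0-based (0 is the first retry).
func backoffDelay(attempt int, initial time.Duration) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	multiple := maxDelayMultiple
	if attempt < 6 { // 2^6 = 64 would already exceed the 60x cap
		multiple = 1 << attempt
	}
	return time.Duration(multiple) * initial
}

func main() {
	// Mirrors retry_count: 3 and retry_delay: 2s from the example workflow.
	for attempt := 0; attempt < clampRetries(3); attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 2*time.Second))
	}
	// 0 2s
	// 1 4s
	// 2 8s
}
```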
#### 4. Workflow State Management
- **Pluggable state store interface**: [pkg/vmcp/composer/composer.go:191-217](pkg/vmcp/composer/composer.go#L191-L217)
- **In-memory implementation**: [pkg/vmcp/composer/state_store.go](pkg/vmcp/composer/state_store.go)
  - Thread-safe operations with mutex protection
  - Deep copying to prevent external modifications
  - Automatic cleanup of stale workflows (configurable intervals)
  - Ready for future Redis/DB backends
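To make the mutex and deep-copy points concrete, here is a minimal sketch of an in-memory store. The `WorkflowState` fields, the `memoryStore` type, and its method names are all hypothetical; they do not reproduce the interface in `composer.go`. In this sketch cleanup is an explicit call, whereas the description says the real store runs it automatically at a configurable interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// WorkflowState is a hypothetical record of one workflow run.
type WorkflowState struct {
	ID        string
	Status    string
	Completed []string
	UpdatedAt time.Time
}

// clone deep-copies the slice so callers cannot mutate stored state.
func (s WorkflowState) clone() WorkflowState {
	s.Completed = append([]string(nil), s.Completed...)
	return s
}

type memoryStore struct {
	mu     sync.RWMutex
	states map[string]WorkflowState
}

func newMemoryStore() *memoryStore {
	return &memoryStore{states: make(map[string]WorkflowState)}
}

// Save stores a private copy of the state.
func (m *memoryStore) Save(s WorkflowState) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.states[s.ID] = s.clone()
}

// Load returns a copy, so edits by the caller never reach the store.
func (m *memoryStore) Load(id string) (WorkflowState, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	s, ok := m.states[id]
	return s.clone(), ok
}

// Cleanup removes states last updated more than maxAge before now
// and reports how many were removed.
func (m *memoryStore) Cleanup(now time.Time, maxAge time.Duration) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	removed := 0
	for id, s := range m.states {
		if now.Sub(s.UpdatedAt) > maxAge {
			delete(m.states, id)
			removed++
		}
	}
	return removed
}

func main() {
	store := newMemoryStore()
	st := WorkflowState{ID: "wf-1", Status: "running", Completed: []string{"fetch_logs"}, UpdatedAt: time.Now()}
	store.Save(st)
	st.Completed[0] = "tampered" // does not affect the stored copy
	got, _ := store.Load("wf-1")
	fmt.Println(got.Completed)
	fmt.Println(store.Cleanup(time.Now().Add(2*time.Hour), time.Hour))
}
```

Copying on both `Save` and `Load` costs an allocation per call, but it means the lock only has to protect the map itself, not every slice a caller might still hold.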
#### 5. Workflow Lifecycle
- **UUID-based workflow IDs** for unique identification
- **State checkpointing** after each step completion
- **Configurable timeouts** (default: 30 minutes for workflows, 5 minutes for steps)
- **Automatic cleanup** of completed/failed/timed-out workflows
- **Workflow cancellation** support via state store
### Files Added
- `pkg/vmcp/composer/dag_executor.go` - DAG execution engine
- `pkg/vmcp/composer/dag_executor_test.go` - DAG executor unit tests (9 test cases)
- `pkg/vmcp/composer/state_store.go` - In-memory workflow state store
- `pkg/vmcp/composer/state_store_test.go` - State store unit tests (14 test cases)
- `test/e2e/vmcp_workflow_e2e_test.go` - End-to-end workflow tests
- `docs/operator/advanced-workflow-patterns.md` - Comprehensive guide (797 lines)
- `docs/operator/composite-tools-quick-reference.md` - Quick reference (233 lines)
### Files Modified
- `pkg/vmcp/composer/workflow_engine.go` - Integrated DAG executor and state management
- `pkg/vmcp/composer/workflow_engine_test.go` - Added retry and timeout tests
- `pkg/vmcp/composer/composer.go` - Added state store interface and error types
- `pkg/vmcp/composer/workflow_context.go` - Enhanced context management
- `docs/operator/virtualmcpcompositetooldefinition-guide.md` - Updated with advanced features
## Example Usage
### Parallel Incident Investigation Workflow
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPCompositeToolDefinition
metadata:
  name: incident-investigation
spec:
  name: investigate_incident
  steps:
    # Level 1: Parallel data fetching
    - id: fetch_logs
      type: tool
      tool: splunk.fetch_logs
      arguments:
        incident_id: "{{.params.incident_id}}"
    - id: fetch_metrics
      type: tool
      tool: datadog.fetch_metrics
      arguments:
        incident_id: "{{.params.incident_id}}"
    - id: fetch_traces
      type: tool
      tool: jaeger.fetch_traces
      arguments:
        incident_id: "{{.params.incident_id}}"
    # Level 2: Correlation (waits for all Level 1)
    - id: correlate
      type: tool
      tool: analysis.correlate
      depends_on: [fetch_logs, fetch_metrics, fetch_traces]
      arguments:
        logs: "{{.steps.fetch_logs.output}}"
        metrics: "{{.steps.fetch_metrics.output}}"
        traces: "{{.steps.fetch_traces.output}}"
      on_error:
        action: retry
        retry_count: 3
        retry_delay: 2s
    # Level 3: Report creation
    - id: create_report
      type: tool
      tool: jira.create_issue
      depends_on: [correlate]
      arguments:
        title: "Incident {{.params.incident_id}}"
        body: "{{.steps.correlate.output.summary}}"
```
**Performance**: The three Level 1 fetches run concurrently, so that level takes roughly as long as the slowest fetch rather than the sum of all three.
## Test Coverage
### Unit Tests
- ✅ Topological sort (7 test cases covering chains, diamonds, complex DAGs)
- ✅ Cycle detection (3 test cases: direct, indirect, self-reference)
- ✅ Parallel execution verification (timing-based)
- ✅ Dependency ordering enforcement
- ✅ Error handling (abort/continue/best_effort modes)
- ✅ Retry logic with exponential backoff
- ✅ Concurrency limiting with semaphore
- ✅ Context cancellation
- ✅ State store operations (14 comprehensive tests)
- ✅ State store cleanup and concurrency
### Integration & E2E Tests
- ✅ Complex 8-step incident investigation workflow
- ✅ End-to-end parallel execution with mock backends
- ✅ Dependency ordering validation with timing verification
**All tests passing** ✅
## Performance Metrics
From test results:
- **Parallel speedup**: 3 independent 100ms steps complete in ~100ms (not 300ms)
- **Complex workflow**: 8-step workflow completes in ~200ms (vs 400ms sequential)
- **Concurrency control**: Semaphore effectively limits parallel execution
- **Cleanup efficiency**: Stale workflows removed within 2 cleanup cycles
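The speedup figures follow from simple arithmetic: ignoring scheduling overhead and the concurrency limit, a level takes as long as its slowest step, while sequential execution takes the sum of all steps. A small sketch of that estimate (the `Level` type and `estimate` function are hypothetical helpers, not part of the package):

```go
package main

import (
	"fmt"
	"time"
)

// Level holds the durations of the steps in one DAG execution level.
type Level []time.Duration

// estimate returns the ideal parallel time (sum of the slowest step per
// level) and the sequential time (sum of every step).
func estimate(levels []Level) (parallel, sequential time.Duration) {
	for _, level := range levels {
		var slowest time.Duration
		for _, d := range level {
			sequential += d
			if d > slowest {
				slowest = d
			}
		}
		parallel += slowest
	}
	return parallel, sequential
}

func main() {
	ms := time.Millisecond
	// Three independent 100ms steps in a single level.
	p, s := estimate([]Level{{100 * ms, 100 * ms, 100 * ms}})
	fmt.Println(p, s) // 100ms 300ms
}
```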
## Architecture Highlights
1. **Clean Separation**: DAG execution, state management, and workflow orchestration are independent modules
2. **Pluggable Design**: State store interface enables future Redis/PostgreSQL implementations
3. **Safety First**: Multiple safeguards (max steps: 100, max retries: 10, semaphore limits)
4. **Thread Safety**: Proper mutex usage, deep copying, and goroutine management with errgroup
5. **Context Propagation**: Cancellation and timeouts properly propagated through execution stack
6. **Observability**: Comprehensive logging of execution stats, timing, and state metrics
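The semaphore limit mentioned above can be sketched with the standard library alone. Note the swap: the commit uses `errgroup` for coordination, while this example substitutes `sync.WaitGroup` plus a buffered channel to stay dependency-free, and it does not cancel sibling steps on the first error. The function name `runLimited` and the returned `peak` counter are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// runLimited runs all tasks with at most limit running at once.
// It returns one error slot per task and the peak concurrency observed.
func runLimited(tasks []func() error, limit int) (errs []error, peak int) {
	if limit < 1 {
		limit = 1
	}
	errs = make([]error, len(tasks))
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running int
	)
	for i, task := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit tasks are in flight
		go func(i int, task func() error) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			errs[i] = task() // each goroutine writes only its own slot

			mu.Lock()
			running--
			mu.Unlock()
		}(i, task)
	}
	wg.Wait()
	return errs, peak
}

func main() {
	tasks := []func() error{
		func() error { return nil },
		func() error { return errors.New("step failed") },
		func() error { return nil },
	}
	errs, peak := runLimited(tasks, 2)
	fmt.Println(errs, peak <= 2)
}
```

Collecting every error instead of stopping at the first one resembles the `continue` failure mode; an abort-style executor would additionally cancel a shared context when a step fails.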
## Documentation
- **[Advanced Workflow Patterns](docs/operator/advanced-workflow-patterns.md)**: 797-line comprehensive guide covering:
  - Parallel execution with DAG
  - Step dependencies and patterns (diamond, fan-out/fan-in)
  - Error handling strategies with examples
  - State management and lifecycle
  - Performance optimization techniques
  - Best practices and common patterns
- **[Quick Reference](docs/operator/composite-tools-quick-reference.md)**: 233-line guide for rapid development
## Breaking Changes
None. This is a backward-compatible enhancement. Existing workflows without dependencies execute as before.
## Migration Notes
- **State tracking** requires creating a state store: `composer.NewInMemoryStateStore(cleanupInterval, maxAge)`
- **Parallel execution** is automatic for steps without `depends_on` - no migration needed
- **Retry configuration** is opt-in via `on_error.action: retry`
## Future Work (Out of Scope)
- Distributed state store (Redis/PostgreSQL) - interface ready
- Workflow pause/resume
- Step-level timeout configuration
- Conditional branching (marked as Phase 3)

1 parent acf8502 commit c262cf8
14 files changed, +1908 -77 lines changed:
- docs/operator
- pkg/vmcp
  - composer
  - server