From a3868c929f8aac822d01c1e6882d84e4d646ae19 Mon Sep 17 00:00:00 2001 From: Ankur Shrivastava Date: Sat, 25 Apr 2026 22:35:57 +0800 Subject: [PATCH 1/7] docs: update workers howto for ErrSkipTick, zombie cleanup, handler-as-metadata - Add ErrSkipTick to handler return values table with usage example - Document automatic zombie child cleanup in Dynamic Workers section - Add handler-as-metadata reconciler example (config change detection via GetChild().GetHandler() type assertion) - Add WithSkipOnNotAcquired convenience + footgun warning for WithOnNotAcquired error return - Update WorkerInfo methods table with GetHandler, GetChildCount - Note Locker interface compatibility for existing implementations Companion to go-coldbrew/workers#6. --- howto/workers.md | 76 ++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 74 insertions(+), 2 deletions(-) diff --git a/howto/workers.md b/howto/workers.md index 8ad82ea..436f130 100644 --- a/howto/workers.md +++ b/howto/workers.md @@ -169,6 +169,7 @@ The handler receives a `context.Context` for cancellation and a `*WorkerInfo` fo | Return value | Long-running worker (no `Every`) | Periodic worker (with `Every`) | |---|---|---| | `return nil` | Worker stops permanently | Cycle succeeded — next tick fires | +| `return workers.ErrSkipTick` | No effect (not meaningful) | Tick skipped — next tick fires normally | | `return error` | Restarts with backoff (if restart enabled) | Restarts with backoff (if restart enabled) | | `return ctx.Err()` | Clean shutdown | Clean shutdown | | `return workers.ErrDoNotRestart` | Permanent stop | Permanent stop | @@ -177,6 +178,22 @@ The handler receives a `context.Context` for cancellation and a `*WorkerInfo` fo **Periodic workers** run the handler once per tick. Return nil for success (next tick fires normally). Return an error to trigger restart. The `Every` wrapper manages the tick loop — your handler just processes one cycle. +### ErrSkipTick + +Return `workers.ErrSkipTick` from a periodic handler when a tick fails transiently (DB timeout, network blip) and you want to skip it without triggering a full restart. The timer continues and the next tick fires normally: + +```go +func pollDatabase(ctx context.Context, info *workers.WorkerInfo) error { + rows, err := db.QueryContext(ctx, "SELECT ...") + if err != nil { + return workers.ErrSkipTick // skip this tick, try again next interval + } + return processRows(rows) +} +``` + +Without `ErrSkipTick`, you'd have to swallow errors by returning nil and track them internally. `ErrSkipTick` gives the framework visibility into skipped ticks while keeping the timer going. + ### ErrDoNotRestart Return `workers.ErrDoNotRestart` from a handler to signal permanent completion — the supervisor will not restart the worker even though restart is enabled by default. `ChannelWorker` and `BatchChannelWorker` return this automatically when their channel is closed. @@ -408,6 +425,18 @@ middleware.DistributedLock(redisLocker, ) ``` +For the common case of logging and skipping, use `WithSkipOnNotAcquired`: + +```go +middleware.DistributedLock(redisLocker, + middleware.WithSkipOnNotAcquired(func(ctx context.Context, name string) { + log.Info(ctx, "msg", "lock held, skipping", "worker", name) + }), +) +``` + +**Caution:** `WithOnNotAcquired` returns an error. A non-nil return triggers restart for periodic workers. Use `WithSkipOnNotAcquired` or return nil if you want to skip without restart. + The `Locker` interface: ```go @@ -417,6 +446,8 @@ type Locker interface { } ``` +If your lock implementation already has these two methods with matching signatures, it satisfies `Locker` directly — no adapter needed. + Release uses `context.WithoutCancel` so that context cancellation does not prevent lock cleanup. ### Timeout @@ -468,12 +499,14 @@ Every handler receives a `*WorkerInfo` that carries worker metadata and child ma |--------|-------------| | `GetName() string` | Worker name | | `GetAttempt() int` | Restart attempt (0 on first run) | +| `GetHandler() CycleHandler` | The worker's handler — use type assertion for handler-specific state | | `Add(w *Worker) bool` | Add child worker — returns false if name already exists (no-op) | | `Remove(name string)` | Stop child worker by name | -| `GetChildren() []string` | Names of running child workers | +| `GetChildren() []string` | Names of running child workers (stopped children auto-pruned) | | `GetChild(name string) (Worker, bool)` | Look up a child by name (returns a value copy) | +| `GetChildCount() int` | Number of running children (cheaper than `len(GetChildren())`) | -Use `Worker.GetName()` and `Worker.GetHandler()` to inspect a child. +Use `Worker.GetName()`, `Worker.GetHandler()`, `Worker.GetInterval()`, and `Worker.GetRestartOnFail()` to inspect a child. To replace a running worker, call `Remove` then `Add`. This is not atomic — there is a brief window where the worker is not running. @@ -579,6 +612,45 @@ info.Add(workers.NewWorker("solver").HandlerFunc(makeSolver(newCfg))) Note: `Remove` + `Add` is not atomic — there is a brief window where the worker is not running. +**Automatic cleanup:** When a child permanently stops (returns nil with `WithRestart(false)`, returns `ErrDoNotRestart`, or exhausts restart attempts), it is automatically pruned from the children map on the next call to `GetChildren`, `GetChild`, or `GetChildCount`. No manual cleanup needed. + +### Example: Config change detection via handler + +Instead of maintaining a parallel map to track per-worker state (e.g., config versions), store metadata on your `CycleHandler` implementation and inspect it via `GetChild().GetHandler()` type assertion: + +```go +type solverHandler struct { + version int64 + cfg SolverConfig +} + +func (h *solverHandler) RunCycle(ctx context.Context, info *workers.WorkerInfo) error { + return solve(ctx, h.cfg) +} + +func (h *solverHandler) Close() error { return nil } +``` + +In the reconciler, detect config changes without a parallel tracking map: + +```go +for key, desired := range desiredConfigs { + child, exists := info.GetChild(key) + if exists { + if h, ok := child.GetHandler().(*solverHandler); ok && h.version == desired.version { + continue // config unchanged, skip + } + info.Remove(key) // config changed, replace + } + info.Add(workers.NewWorker(key).Handler(&solverHandler{ + version: desired.version, + cfg: desired.cfg, + })) +} +``` + +The handler is an interface value — `GetChild()` returns a copy of the `Worker` struct, but the handler reference is shared. Type assertion gives read access to the original handler's fields. + ### Example: Fixed children on startup A worker that spawns N consumer goroutines when it starts: From 2a218761bc7b7766999f2a6953ec6e90ad4443c9 Mon Sep 17 00:00:00 2001 From: Ankur Shrivastava Date: Sat, 25 Apr 2026 23:29:50 +0800 Subject: [PATCH 2/7] fix: address review comments on workers howto - Add defer rows.Close() to pollDatabase example - Fix WithOnNotAcquired wording (callback returns error, not function) - Rewrite automatic cleanup note with timing caveat - Simplify handler reference sharing explanation - Remove GetChildCount from WorkerInfo table (removed from API) --- howto/workers.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/howto/workers.md b/howto/workers.md index 436f130..69ee55b 100644 --- a/howto/workers.md +++ b/howto/workers.md @@ -188,6 +188,7 @@ func pollDatabase(ctx context.Context, info *workers.WorkerInfo) error { if err != nil { return workers.ErrSkipTick // skip this tick, try again next interval } + defer rows.Close() return processRows(rows) } ``` @@ -435,7 +436,7 @@ middleware.DistributedLock(redisLocker, ) ``` -**Caution:** `WithOnNotAcquired` returns an error. A non-nil return triggers restart for periodic workers. Use `WithSkipOnNotAcquired` or return nil if you want to skip without restart. +**Caution:** If the `WithOnNotAcquired` callback returns a non-nil error, the framework treats it as a cycle failure — for periodic workers, this triggers restart with backoff. Use `WithSkipOnNotAcquired` or return nil from the callback if you want to skip without restart. The `Locker` interface: @@ -504,7 +505,6 @@ Every handler receives a `*WorkerInfo` that carries worker metadata and child ma | `Remove(name string)` | Stop child worker by name | | `GetChildren() []string` | Names of running child workers (stopped children auto-pruned) | | `GetChild(name string) (Worker, bool)` | Look up a child by name (returns a value copy) | -| `GetChildCount() int` | Number of running children (cheaper than `len(GetChildren())`) | Use `Worker.GetName()`, `Worker.GetHandler()`, `Worker.GetInterval()`, and `Worker.GetRestartOnFail()` to inspect a child. @@ -612,7 +612,7 @@ info.Add(workers.NewWorker("solver").HandlerFunc(makeSolver(newCfg))) Note: `Remove` + `Add` is not atomic — there is a brief window where the worker is not running. -**Automatic cleanup:** When a child permanently stops (returns nil with `WithRestart(false)`, returns `ErrDoNotRestart`, or exhausts restart attempts), it is automatically pruned from the children map on the next call to `GetChildren`, `GetChild`, or `GetChildCount`. No manual cleanup needed. +**Automatic cleanup:** When a child permanently stops (see the [return value table](#handler-return-values) for what triggers permanent stop), it is automatically excluded from `GetChildren` and `GetChild`. Suture is the source of truth — no manual cleanup needed. Note that there may be a brief delay between the child stopping and the change being visible, since suture processes stop events asynchronously. ### Example: Config change detection via handler @@ -649,7 +649,7 @@ for key, desired := range desiredConfigs { } ``` -The handler is an interface value — `GetChild()` returns a copy of the `Worker` struct, but the handler reference is shared. Type assertion gives read access to the original handler's fields. +`GetChild()` returns a copy of the `Worker` struct, but the handler is stored as a `CycleHandler` interface — use type assertion to access handler-specific fields for change detection or metadata inspection. ### Example: Fixed children on startup From a65e0052ffb5eb338185b058d231fe6bd88eb4cd Mon Sep 17 00:00:00 2001 From: Ankur Shrivastava Date: Sun, 26 Apr 2026 16:59:16 +0800 Subject: [PATCH 3/7] fix: add GetChildCount back to WorkerInfo table GetChildCount is present in the API (backed by a map, genuinely cheaper than len(GetChildren()) which allocates a sorted slice). --- howto/workers.md | 1 + 1 file changed, 1 insertion(+) diff --git a/howto/workers.md b/howto/workers.md index 69ee55b..4869fc3 100644 --- a/howto/workers.md +++ b/howto/workers.md @@ -505,6 +505,7 @@ Every handler receives a `*WorkerInfo` that carries worker metadata and child ma | `Remove(name string)` | Stop child worker by name | | `GetChildren() []string` | Names of running child workers (stopped children auto-pruned) | | `GetChild(name string) (Worker, bool)` | Look up a child by name (returns a value copy) | +| `GetChildCount() int` | Number of running children (cheaper than `len(GetChildren())`) | Use `Worker.GetName()`, `Worker.GetHandler()`, `Worker.GetInterval()`, and `Worker.GetRestartOnFail()` to inspect a child. From 86878bea214f04a49b0f77af9ac90675d2aff16b Mon Sep 17 00:00:00 2001 From: Ankur Shrivastava Date: Sun, 26 Apr 2026 19:21:35 +0800 Subject: [PATCH 4/7] fix: ErrSkipTick table wording and example ctx handling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Return values table: "No effect" → "Treated like return error" for ErrSkipTick in long-running workers (it's not a no-op) - pollDatabase example: check ctx.Err() before returning ErrSkipTick so context cancellation triggers clean shutdown, not a skip --- howto/workers.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/howto/workers.md b/howto/workers.md index 4869fc3..922083e 100644 --- a/howto/workers.md +++ b/howto/workers.md @@ -169,7 +169,7 @@ The handler receives a `context.Context` for cancellation and a `*WorkerInfo` fo | Return value | Long-running worker (no `Every`) | Periodic worker (with `Every`) | |---|---|---| | `return nil` | Worker stops permanently | Cycle succeeded — next tick fires | -| `return workers.ErrSkipTick` | No effect (not meaningful) | Tick skipped — next tick fires normally | +| `return workers.ErrSkipTick` | Treated like `return error` (not meaningful) | Tick skipped — next tick fires normally | | `return error` | Restarts with backoff (if restart enabled) | Restarts with backoff (if restart enabled) | | `return ctx.Err()` | Clean shutdown | Clean shutdown | | `return workers.ErrDoNotRestart` | Permanent stop | Permanent stop | @@ -186,7 +186,10 @@ Return `workers.ErrSkipTick` from a periodic handler when a tick fails transient func pollDatabase(ctx context.Context, info *workers.WorkerInfo) error { rows, err := db.QueryContext(ctx, "SELECT ...") if err != nil { - return workers.ErrSkipTick // skip this tick, try again next interval + if ctx.Err() != nil { + return ctx.Err() // context cancelled — clean shutdown + } + return workers.ErrSkipTick // transient failure, try again next interval } defer rows.Close() return processRows(rows) From 0a704291b936d56fff8e84295a8e2ae3f831c434 Mon Sep 17 00:00:00 2001 From: Ankur Shrivastava Date: Sun, 26 Apr 2026 20:50:49 +0800 Subject: [PATCH 5/7] docs: recommend keeping restart enabled for periodic workers --- howto/workers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/howto/workers.md b/howto/workers.md index 922083e..6f5d224 100644 --- a/howto/workers.md +++ b/howto/workers.md @@ -218,7 +218,7 @@ func processQueue(ctx context.Context, info *workers.WorkerInfo) error { |--------|-------------|---------| | `HandlerFunc(fn)` | Set handler from a plain function | — | | `Handler(h)` | Set handler from a `CycleHandler` struct | — | -| `WithRestart(false)` | Disable restart (one-shot worker) | `true` (restart with backoff) | +| `WithRestart(false)` | Disable restart (one-shot worker). Periodic workers should generally keep the default; use `ErrSkipTick`/`ErrDoNotRestart` instead. | `true` (restart with backoff) | | `Every(duration)` | Run periodically on a fixed interval | — | | `WithJitter(percent)` | Randomize tick interval by ±percent (requires `Every`) | inherit run-level | | `WithInitialDelay(d)` | Delay first tick (requires `Every`) | — | From d7df265a2094b9e6ed5f52a227358bb2c6fb453c Mon Sep 17 00:00:00 2001 From: Ankur Shrivastava Date: Sun, 26 Apr 2026 21:03:18 +0800 Subject: [PATCH 6/7] fix: consistent suture naming in automatic cleanup note --- howto/workers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/howto/workers.md b/howto/workers.md index 6f5d224..ed9c71f 100644 --- a/howto/workers.md +++ b/howto/workers.md @@ -616,7 +616,7 @@ info.Add(workers.NewWorker("solver").HandlerFunc(makeSolver(newCfg))) Note: `Remove` + `Add` is not atomic — there is a brief window where the worker is not running. -**Automatic cleanup:** When a child permanently stops (see the [return value table](#handler-return-values) for what triggers permanent stop), it is automatically excluded from `GetChildren` and `GetChild`. Suture is the source of truth — no manual cleanup needed. Note that there may be a brief delay between the child stopping and the change being visible, since suture processes stop events asynchronously. +**Automatic cleanup:** When a child permanently stops (see the [return value table](#handler-return-values) for what triggers permanent stop), it is automatically excluded from `GetChildren` and `GetChild`. The underlying [suture] supervisor is the source of truth — no manual cleanup needed. Note that there may be a brief delay between the child stopping and the change being visible, as stop events are processed asynchronously. ### Example: Config change detection via handler From 4dcfe1bb9de3bee2ff007aa424af7bfc5ec41d3f Mon Sep 17 00:00:00 2001 From: Ankur Shrivastava Date: Sun, 26 Apr 2026 22:00:12 +0800 Subject: [PATCH 7/7] docs: document always-sampled worker spans and trace ID in logs --- howto/workers.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/howto/workers.md b/howto/workers.md index ed9c71f..0298041 100644 --- a/howto/workers.md +++ b/howto/workers.md @@ -394,7 +394,9 @@ middleware.Recover(func(name string, v any) { ### Tracing -Creates an OTEL span named `worker::cycle` for each tick. Records errors on the span: +Creates an OTEL span named `worker::cycle` for each tick. Records errors on the span. Worker spans are always sampled regardless of the global sampler — this prevents silent span drops when using `ParentBased(TraceIDRatioBased(...))`, where worker root spans (which have no incoming parent) would otherwise be probabilistically dropped. + +The OTEL trace ID is automatically injected into the log context as `trace` for correlation with your tracing backend. ```go middleware.Tracing()