From 5c252da616317d0be2844749e29f1349fe95564c Mon Sep 17 00:00:00 2001 From: Carly de Frondeville Date: Wed, 13 May 2026 18:50:10 -0700 Subject: [PATCH 1/5] Drop backlog recording rule; consume raw Temporal Cloud metric directly The backlog metric pipeline goes from prometheus-adapter directly to the raw temporal_cloud_v1_approximate_backlog_count series, eliminating the temporal_approximate_backlog_count recording rule. Adapter rule: - seriesQuery filters out temporal_worker_build_id="__unversioned__" so discovery doesn't choke on the 5000+ unversioned series in typical accounts. - metricsQuery sum(...) collapses labels the HPA doesn't select on at query time (instance/job/region/task_priority/temporal_account). - metricsRelistInterval is bumped to 5m to accommodate the ~3-minute embedded-timestamp lag in Temporal Cloud's OpenMetrics emission. WRT example, prometheus-stack-values, and demo README are updated to match. Add docs/scaling-recommendations.md covering the empirically measured reactivity model (steady-state ~3:15 dominated by Cloud aggregation lag), task-queue-unload behavior, scale-from-zero limits, and when to pick KEDA over the metric path. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/README.md | 3 + docs/scaling-recommendations.md | 152 ++++++++++++++++++ examples/wrt-hpa-backlog.yaml | 7 +- internal/demo/README.md | 6 +- .../demo/k8s/prometheus-adapter-values.yaml | 25 ++- .../demo/k8s/prometheus-stack-values.yaml | 19 +-- 6 files changed, 182 insertions(+), 30 deletions(-) create mode 100644 docs/scaling-recommendations.md diff --git a/docs/README.md b/docs/README.md index 30a99e56..1051afce 100644 --- a/docs/README.md +++ b/docs/README.md @@ -39,6 +39,9 @@ See [Migration to Unversioned](migration-to-unversioned.md) for how to migrate b ### [Ownership](manager-identity.md) How the controller gets permission to manage a Worker Deployment, how a human client can take or give back control. 
+### [Scaling Recommendations](scaling-recommendations.md) +Practical reactivity and reliability tradeoffs between HPA + prometheus-adapter and KEDA when scaling Temporal workers per worker-deployment-version. Covers steady-state reactivity (~3:15 via the metric path), task-queue unloading, scale-from-zero limits, and when to pick which tool. + ### [WorkerResourceTemplate](worker-resource-templates.md) How to attach HPAs, PodDisruptionBudgets, and other Kubernetes resources to each active versioned Deployment. Covers the auto-injection model, RBAC setup, webhook TLS, and examples. diff --git a/docs/scaling-recommendations.md b/docs/scaling-recommendations.md new file mode 100644 index 00000000..998f3e34 --- /dev/null +++ b/docs/scaling-recommendations.md @@ -0,0 +1,152 @@ +# Scaling Recommendations: HPA + prometheus-adapter vs KEDA + +This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker-deployment-version on Kubernetes, and recommends which tool fits which workload pattern. + +The `internal/demo/` example wires the HPA path described here. The KEDA path is mentioned for comparison and as a recommendation for workloads that cannot tolerate the HPA path's limits. 
+ +## TL;DR — Pick by workload pattern + +| Workload pattern | Recommendation | +|------------------|----------------| +| Continuous traffic (task queue always loaded) | HPA + prometheus-adapter, scaling on slot utilization + backlog count | +| Idle periods >5 min between work; needs scale-from-zero | KEDA Temporal scaler | +| Required reactivity < ~3 min from first backlog | KEDA Temporal scaler | +| Required reactivity ~3–5 min, queue always loaded | HPA + prometheus-adapter is fine | +| Very large namespaces (N × M HPAs polling) | HPA + prometheus-adapter; KEDA hits the 50 RPS namespace rate limit | + +## The reactivity model for HPA + prometheus-adapter + +For a continuously-loaded task queue, the end-to-end delay from "backlog appears" to "HPA scales up" decomposes as: + +``` +backlog appears at T0 + └─ Temporal Cloud metric pipeline aggregation +~3 min (out of your control) + └─ Prometheus scrape interval +~10 s + └─ HPA poll interval +~15 s + └─ scale-up stabilization window +~your config + └─ first replica added +``` + +**Steady-state reactivity is ≈ 3 minutes 15 seconds + your stabilization window.** Empirically measured against Temporal Cloud, this is dominated by the embedded-timestamp lag in the OpenMetrics `temporal_cloud_v1_approximate_backlog_count` series — the metric values land in Prometheus with timestamps roughly 3 minutes behind wall clock. With `honor_timestamps: true` in the scrape config (the correct setting; preserves source-of-truth timing), this lag flows through to the HPA. + +You cannot improve this with smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules. The Temporal Cloud aggregation pipeline is the floor. + +### Slot utilization is a much faster leading signal + +`temporal_slot_utilization` is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects current state. 
It also rises *before* backlog accumulates — slots saturate first, then queueing starts. So a two-metric HPA with both slot util and backlog gives you fast scale-up via slot util and a backlog-driven backstop. + +The demo HPA uses both. For production scaling we recommend keeping both as well. + +## When backlog metric goes silent + +Two distinct failure modes that look similar in HPA events but have different meanings: + +### Mode 1: adapter-level deregistration (rare) +- Trigger: prometheus-adapter pod restart, or *no* series matching the rule's `seriesQuery` exist in Prometheus. +- Symptom in HPA events: `the server could not find the metric ...`. +- Recovery: up to one `metricsRelistInterval` after data flows again. + +prometheus-adapter periodically asks Prometheus "what series exist in the last `metricsRelistInterval`?" — see the [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md). If the discovery query window is shorter than the embedded-timestamp lag of your source data, the discovery returns empty and the metric name disappears from the External Metrics API. That's why this demo configures `metricsRelistInterval: 5m` — empirically the smallest value that reliably catches the 3-min-stale Temporal Cloud samples. + +### Mode 2: series-level silence (common in low-traffic workloads) +- Trigger: a task queue with no polls or new tasks for >5 minutes. Temporal unloads it from memory and stops emitting `temporal_cloud_v1_approximate_backlog_count` for that specific `(task_queue, build_id, ...)` labelset. Other queues' series continue to emit. +- Symptom in HPA events: `no metrics returned from external metrics API`. The metric *name* is still registered; the HPA's specific label selector just matches zero rows now. +- Recovery: traffic resumes → queue reloads → next emission cycle (~1 min) + 3-min aggregation lag → HPA can read value again. 
+ +In a two-metric HPA configured with slot utilization, this is mostly fine: the HPA reports `ScalingActive=True` based on slot utilization while backlog is unavailable, and rejoins backlog scaling once it returns. We've confirmed this empirically in this demo cluster — the HPA continued scaling correctly on slot utilization through 1000+ backlog `FailedGetExternalMetric` events. + +## Why this demo does not use a backlog recording rule + +A prior version of this demo wrapped the raw Cloud series in a Prometheus recording rule: + +```yaml +- record: temporal_approximate_backlog_count + expr: sum by (...) (temporal_cloud_v1_approximate_backlog_count) +``` + +The rule was originally added to work around a label-formatting issue in an older Temporal Cloud release. With native per-version labels (`temporal_worker_deployment_name`, `temporal_worker_build_id`) now opt-in, the rule no longer earns its keep: + +- **It doesn't reduce reactivity.** The HPA reactivity floor is the upstream Temporal Cloud aggregation lag, not anything the rule could fix. +- **It duplicates the cardinality bill.** Per-`(task_queue, build_id)` labels are already opt-in at the OpenMetrics level *because* of cardinality. Adding a recording rule on top means storing the same high-cardinality series twice. +- **It hides a `sum(...)` that the adapter already does.** prometheus-adapter's `metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})` performs the same collapsing at query time. Pedagogically, "the adapter does the sum" is cleaner than "a recording rule sums first, then the adapter sums again." +- **It does not solve series-level silence.** When the source goes silent (task queue unloaded), the rule output also goes silent eventually (once Prometheus's staleness lookback expires). 
+ +What the recording rule *does* buy is registration stability after operational events: when the source series is sparse-by-timestamp, the rule produces a dense 10-second sample stream that lets the adapter discover with a tight `metricsRelistInterval`. If you find yourself fighting registration flicker on every adapter restart and would rather pay the cardinality cost than tune `metricsRelistInterval`, a recording rule is a reasonable choice. Otherwise, prefer the raw metric. + +In this demo we set `metricsRelistInterval: 5m` and consume the raw metric directly. + +## Why prometheus-adapter cannot do scale-from-zero + +Scale-from-zero on backlog through the metric path requires the metric to exist while there are zero workers. It does not: + +1. Zero workers means no polls. +2. No polls for >5 minutes means the task queue is unloaded from Temporal Cloud's memory. +3. An unloaded queue emits no metric. +4. Adapter discovery returns no series, or HPA queries return no rows. +5. HPA cannot scale up because there's no signal to scale on. + +Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA for ~3 minutes (Temporal Cloud aggregation lag + scrape + poll). By the time the HPA reacts, you've already had 3+ minutes of unprovisioned work. + +KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. No metric pipeline involved. Scale-from-zero in seconds. + +## When KEDA hits its own limits + +KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit: + +``` +FrontendGlobalWorkerDeploymentReadRPS = 50 # per namespace, evenly distributed across frontend instances +``` + +For a namespace with N task queues × M worker-deployment-versions = K HPAs, each KEDA poll uses ~1 API call. 
The polling budget: + +| HPA count | Poll every 30s | Poll every 10s | Poll every 5s | +|-----------|----------------|----------------|---------------| +| 50 | 1.7 RPS (3%) | 5 RPS (10%) | 10 RPS (20%) | +| 250 | 8 RPS (17%) | 25 RPS (50%) | 50 RPS (100%) | +| 1500 | 50 RPS (100%) | exceeds limit | exceeds limit | + +prometheus-adapter has no equivalent per-namespace bottleneck — one OpenMetrics scrape returns all series for the namespace in a single HTTP request, scaling independently of HPA count. + +So for very large namespaces (hundreds–thousands of HPAs) needing fast reactivity, neither path is great: KEDA hits the API rate limit, and the metric path has the 3-min aggregation floor. In practice this is a "talk to your account team" situation. + +## Recommended configuration for the HPA + prometheus-adapter path + +This demo's configuration represents the recommendation, in compact form: + +**Scrape config** (`internal/demo/k8s/prometheus-stack-values.yaml`): +```yaml +- job_name: temporal_cloud + scrape_interval: 10s + honor_timestamps: true + metrics_path: /v1/metrics + params: + labels: + - temporal_worker_deployment_name + - temporal_worker_build_id +``` + +**prometheus-adapter rule** (`internal/demo/k8s/prometheus-adapter-values.yaml`): +```yaml +metricsRelistInterval: 5m # must accommodate Cloud's ~3-min embedded-timestamp lag +rules: + external: + - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}' + metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})' + name: + as: "temporal_cloud_v1_approximate_backlog_count" + resources: + namespaced: false +``` + +The `seriesQuery` filter excludes `__unversioned__` series. Without it, accounts with many unversioned namespaces produce 5000+ series in the discovery response, which slows or breaks adapter discovery. The filter scopes discovery to versioned workloads — exactly the ones HPAs need. 
+ +**HPA template** (`examples/wrt-hpa-backlog.yaml`): two metrics — slot utilization (fast leading signal, scale-up gate) and backlog count (confirming signal, AverageValue target). + +## References + +- [Temporal Cloud OpenMetrics](https://docs.temporal.io/cloud/metrics/openmetrics) — endpoint and opt-in labels +- [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md) — `metrics-relist-interval` and discovery window semantics +- [prometheus-adapter externalmetrics.md](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/externalmetrics.md) — external rules, `namespaced: false` for cluster-scoped metrics +- [Prometheus HTTP API: `/api/v1/series`](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) — series discovery semantics +- [Prometheus scrape config: `honor_timestamps`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) — preserving source timestamps +- [KEDA Temporal scaler](https://keda.sh/docs/latest/scalers/temporal/) — direct API polling alternative diff --git a/examples/wrt-hpa-backlog.yaml b/examples/wrt-hpa-backlog.yaml index acf3b221..dbd25c4d 100644 --- a/examples/wrt-hpa-backlog.yaml +++ b/examples/wrt-hpa-backlog.yaml @@ -61,15 +61,16 @@ spec: value: "750m" # Metric: backlog count — scale up when tasks are queued but not yet picked up. - # temporal_approximate_backlog_count is a recording rule that aggregates - # temporal_cloud_v1_approximate_backlog_count down to the four labels the HPA needs. + # Sourced directly from Temporal Cloud's temporal_cloud_v1_approximate_backlog_count + # series; the prometheus-adapter rule wraps it in sum(...) to collapse labels the HPA + # doesn't select on (instance/job/region/task_priority/temporal_account). # temporal_worker_deployment_name, temporal_worker_build_id, and temporal_namespace # are injected automatically by the controller — do not set them here. 
# temporal_task_queue must be set explicitly to scope the metric to your task queue. - type: External external: metric: - name: temporal_approximate_backlog_count + name: temporal_cloud_v1_approximate_backlog_count selector: matchLabels: temporal_task_queue: "default_helloworld" diff --git a/internal/demo/README.md b/internal/demo/README.md index 66a83091..0ff0869c 100644 --- a/internal/demo/README.md +++ b/internal/demo/README.md @@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/ This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection. -> **Note:** This demo ships a Prometheus recording rule that renames `temporal_cloud_v1_approximate_backlog_count` to `temporal_approximate_backlog_count` and reduces it to the labels the HPA cares about. In principle the HPA can consume the raw Cloud metric directly (set `namespaced: false` on the prometheus-adapter rule so it doesn't auto-inject a `namespace` label filter), but this demo uses the recording rule as a known-working path. +> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a steady-state reactivity of ~3 minutes 15 seconds (dominated by Temporal Cloud's metric pipeline aggregation lag). It cannot do scale-from-zero or sub-3-min reactivity. For those, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model and when to pick which. 
**Step 1 — Create the Temporal Cloud metrics credentials secret.** @@ -302,11 +302,11 @@ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapte ```bash kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9092:9090 & -curl -s 'http://localhost:9092/api/v1/query?query=temporal_approximate_backlog_count' \ +curl -s 'http://localhost:9092/api/v1/query?query=temporal_cloud_v1_approximate_backlog_count' \ | jq '.data.result' ``` -You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, wait 15–30s for the recording rule to evaluate. +You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, verify the Temporal Cloud metrics API key secret is correct and that scrape targets are healthy in the Prometheus UI. **Step 4 — Apply the combined WRT.** ```bash diff --git a/internal/demo/k8s/prometheus-adapter-values.yaml b/internal/demo/k8s/prometheus-adapter-values.yaml index 939f1b67..ac44411f 100644 --- a/internal/demo/k8s/prometheus-adapter-values.yaml +++ b/internal/demo/k8s/prometheus-adapter-values.yaml @@ -29,16 +29,25 @@ rules: namespaced: false # cluster-scoped: HPAs in any k8s namespace can consume this metric # Phase 2: approximate backlog count per worker version (from Temporal Cloud). - # Uses the temporal_approximate_backlog_count recording rule, which reduces the raw - # temporal_cloud_v1_approximate_backlog_count (high cardinality, many label dimensions) - # down to just the four labels the HPA needs. cluster-scoped so HPAs in any namespace - # can consume it. - - seriesQuery: 'temporal_approximate_backlog_count{}' + # Consumes temporal_cloud_v1_approximate_backlog_count directly. The metricsQuery's + # sum(...) collapses labels the HPA's matchLabels don't select on + # (instance/job/region/task_priority/temporal_account). 
+ # + # seriesQuery filter rationale: Temporal Cloud emits this metric for *every* namespace + # in your account, including ones not yet opted in to per-version labels — those carry + # temporal_worker_build_id="__unversioned__" and can dominate cardinality (5000+ series + # per account is typical). The adapter chokes on series-discovery responses that large, + # so we filter discovery to versioned series only. + # + # cluster-scoped so HPAs in any namespace can consume it. + - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}' metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})' name: - as: "temporal_approximate_backlog_count" + as: "temporal_cloud_v1_approximate_backlog_count" resources: namespaced: false # cluster-scoped: HPAs in any namespace can consume this metric -# Must be greater than the Prometheus scrape interval. -metricsRelistInterval: 15s +# Must accommodate Temporal Cloud's embedded-timestamp lag (~3 min) AND have +# margin for emission cadence. 5m is empirically the smallest tested value +# that keeps the metric registered through the 3-min timestamp staleness. +metricsRelistInterval: 5m diff --git a/internal/demo/k8s/prometheus-stack-values.yaml b/internal/demo/k8s/prometheus-stack-values.yaml index a725cea4..892d5ffd 100644 --- a/internal/demo/k8s/prometheus-stack-values.yaml +++ b/internal/demo/k8s/prometheus-stack-values.yaml @@ -11,7 +11,9 @@ # 1. ServiceMonitor — scrapes worker pod metrics (slot gauges) from port 9090 # 2. Temporal Cloud scrape config — scrapes temporal_cloud_v1_approximate_backlog_count # (Phase 2 only; requires a Temporal Cloud metrics API key) -# 3. Recording rules — slot utilization ratio (Phase 1) and backlog count by version (Phase 2) +# 3. Recording rule — slot utilization ratio (Phase 1 only). The backlog count +# is consumed directly from the raw Cloud series via prometheus-adapter; see +# docs/scaling-recommendations.md for the reasoning. # 4. 
prometheus-adapter — see internal/demo/k8s/prometheus-adapter-values.yaml # ─── 1. ServiceMonitor ────────────────────────────────────────────────────── @@ -81,18 +83,3 @@ additionalPrometheusRulesMap: 1 ) - - name: temporal_cloud_backlog - interval: 10s - rules: - # Backlog count per worker version. Temporal Cloud emits - # temporal_worker_deployment_name and temporal_worker_build_id as separate - # labels (opted in via params.labels in the scrape config), so no label - # manipulation is needed — only cardinality reduction via sum by. - # The prometheus-adapter serves this as a cluster-scoped external metric - # (namespaced: false), so HPAs in any namespace can consume it. - - record: temporal_approximate_backlog_count - expr: | - sum by (temporal_worker_deployment_name, temporal_worker_build_id, task_type, temporal_namespace, temporal_task_queue) ( - temporal_cloud_v1_approximate_backlog_count - ) - From fbf4c5742d06b2b341f9191120a867d1d7533bf3 Mon Sep 17 00:00:00 2001 From: Carly de Frondeville Date: Wed, 13 May 2026 19:44:07 -0700 Subject: [PATCH 2/5] docs: correct backlog reactivity numbers; add gateway-stall caveat Initial scaling-recommendations.md framed steady-state HPA reactivity as ~3:15, citing a "Temporal Cloud aggregation lag." That was wrong. The actual sample-age distribution on the OpenMetrics endpoint is: p50 30s (matches ~1/min emission cadence, age oscillates 0-60s) p95 50s p99 ~tail of occasional gateway-wide stalls So typical end-to-end reactivity is ~85s (emission + scrape + HPA poll), not ~3:15. The 3-minute figures came from observations made during the occasional periods when the OpenMetrics gateway returns frozen timestamps across every series in the account simultaneously - those stalls are real but not steady-state. Doc now: - Replaces the 3:15 figure with empirically-derived ~85s typical. - Adds a "Gateway-wide stalls" caveat describing the frozen-timestamp behavior observationally (no speculation about cause). 
- Keeps the metricsRelistInterval: 5m recommendation, now justified by the need to exceed stall duration rather than the misattributed "aggregation lag." - Demo README updated to match. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/scaling-recommendations.md | 22 ++++++++++++++-------- internal/demo/README.md | 2 +- 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/docs/scaling-recommendations.md b/docs/scaling-recommendations.md index 998f3e34..674fb5f6 100644 --- a/docs/scaling-recommendations.md +++ b/docs/scaling-recommendations.md @@ -10,8 +10,8 @@ The `internal/demo/` example wires the HPA path described here. The KEDA path is |------------------|----------------| | Continuous traffic (task queue always loaded) | HPA + prometheus-adapter, scaling on slot utilization + backlog count | | Idle periods >5 min between work; needs scale-from-zero | KEDA Temporal scaler | -| Required reactivity < ~3 min from first backlog | KEDA Temporal scaler | -| Required reactivity ~3–5 min, queue always loaded | HPA + prometheus-adapter is fine | +| Required reactivity < ~60 s from first backlog | KEDA Temporal scaler | +| Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter is fine | | Very large namespaces (N × M HPAs polling) | HPA + prometheus-adapter; KEDA hits the 50 RPS namespace rate limit | ## The reactivity model for HPA + prometheus-adapter @@ -20,16 +20,22 @@ For a continuously-loaded task queue, the end-to-end delay from "backlog appears ``` backlog appears at T0 - └─ Temporal Cloud metric pipeline aggregation +~3 min (out of your control) + └─ Temporal Cloud OpenMetrics emission cadence +~60 s worst-case (~1 sample/minute) └─ Prometheus scrape interval +~10 s └─ HPA poll interval +~15 s └─ scale-up stabilization window +~your config └─ first replica added ``` -**Steady-state reactivity is ≈ 3 minutes 15 seconds + your stabilization window.** Empirically measured against Temporal Cloud, this 
is dominated by the embedded-timestamp lag in the OpenMetrics `temporal_cloud_v1_approximate_backlog_count` series — the metric values land in Prometheus with timestamps roughly 3 minutes behind wall clock. With `honor_timestamps: true` in the scrape config (the correct setting; preserves source-of-truth timing), this lag flows through to the HPA. +**Typical end-to-end reactivity is ≈ 85 seconds + your stabilization window.** Empirically, sample age in Prometheus for a single series follows a sawtooth between 0 and 60 seconds (matching the gateway's ~1/min emission cadence). p50 sample age ≈ 30s, p95 ≈ 50s. The 60-second emission cadence is the inherent floor — smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules cannot improve it because they all consume the same upstream cadence. -You cannot improve this with smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules. The Temporal Cloud aggregation pipeline is the floor. +### Caveat: gateway-wide stalls + +We have observed occasional periods of several minutes during which Temporal Cloud's OpenMetrics gateway returns frozen timestamps for *every* series across the account — backlog series, action counts, error counts, every queue, every namespace. The Prometheus scrape continues to succeed (`up{job="temporal_cloud"}` stays 1, HTTP 200 responses), but the embedded timestamps in the response do not advance. During such a stall, HPAs see the last known value (via Prometheus staleness lookback) until either (a) fresh samples resume, or (b) the staleness window (5 min default) expires and the metric disappears entirely. + +Frequency and duration of these stalls is still being characterized. They are the dominant reactivity-tail risk for the metric path. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA. 
+ +This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected stall, otherwise the metric will deregister during a stall and only re-register on the next adapter discovery cycle after data flows again. ### Slot utilization is a much faster leading signal @@ -46,7 +52,7 @@ Two distinct failure modes that look similar in HPA events but have different me - Symptom in HPA events: `the server could not find the metric ...`. - Recovery: up to one `metricsRelistInterval` after data flows again. -prometheus-adapter periodically asks Prometheus "what series exist in the last `metricsRelistInterval`?" — see the [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md). If the discovery query window is shorter than the embedded-timestamp lag of your source data, the discovery returns empty and the metric name disappears from the External Metrics API. That's why this demo configures `metricsRelistInterval: 5m` — empirically the smallest value that reliably catches the 3-min-stale Temporal Cloud samples. +prometheus-adapter periodically asks Prometheus "what series exist in the last `metricsRelistInterval`?" — see the [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md). If the discovery window is shorter than the longest gateway-wide stall, the discovery returns empty and the metric name disappears from the External Metrics API. The `metricsRelistInterval: 5m` setting buys margin: comfortably longer than typical sample age (~30s p50, ~50s p95) and longer than observed multi-minute gateway stalls so far. ### Mode 2: series-level silence (common in low-traffic workloads) - Trigger: a task queue with no polls or new tasks for >5 minutes. Temporal unloads it from memory and stops emitting `temporal_cloud_v1_approximate_backlog_count` for that specific `(task_queue, build_id, ...)` labelset. 
Other queues' series continue to emit. @@ -66,7 +72,7 @@ A prior version of this demo wrapped the raw Cloud series in a Prometheus record The rule was originally added to work around a label-formatting issue in an older Temporal Cloud release. With native per-version labels (`temporal_worker_deployment_name`, `temporal_worker_build_id`) now opt-in, the rule no longer earns its keep: -- **It doesn't reduce reactivity.** The HPA reactivity floor is the upstream Temporal Cloud aggregation lag, not anything the rule could fix. +- **It doesn't reduce reactivity.** The HPA reactivity floor is the upstream OpenMetrics emission cadence (~60s), not anything the rule could fix. - **It duplicates the cardinality bill.** Per-`(task_queue, build_id)` labels are already opt-in at the OpenMetrics level *because* of cardinality. Adding a recording rule on top means storing the same high-cardinality series twice. - **It hides a `sum(...)` that the adapter already does.** prometheus-adapter's `metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})` performs the same collapsing at query time. Pedagogically, "the adapter does the sum" is cleaner than "a recording rule sums first, then the adapter sums again." - **It does not solve series-level silence.** When the source goes silent (task queue unloaded), the rule output also goes silent eventually (once Prometheus's staleness lookback expires). @@ -85,7 +91,7 @@ Scale-from-zero on backlog through the metric path requires the metric to exist 4. Adapter discovery returns no series, or HPA queries return no rows. 5. HPA cannot scale up because there's no signal to scale on. -Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA for ~3 minutes (Temporal Cloud aggregation lag + scrape + poll). By the time the HPA reacts, you've already had 3+ minutes of unprovisioned work. 
+Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA until the next OpenMetrics emission cycle (~1 minute) plus scrape and HPA poll. By the time the HPA reacts, you've already had ~1+ minute of unprovisioned work. KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. No metric pipeline involved. Scale-from-zero in seconds. diff --git a/internal/demo/README.md b/internal/demo/README.md index 0ff0869c..260adf5e 100644 --- a/internal/demo/README.md +++ b/internal/demo/README.md @@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/ This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection. -> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a steady-state reactivity of ~3 minutes 15 seconds (dominated by Temporal Cloud's metric pipeline aggregation lag). It cannot do scale-from-zero or sub-3-min reactivity. For those, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model and when to pick which. +> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence), with occasional multi-minute stalls observed when the Cloud OpenMetrics gateway returns frozen timestamps. It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model and when to pick which. 
**Step 1 — Create the Temporal Cloud metrics credentials secret.** From 16d5692a05abede235817e0f43d8bbe44584dd9f Mon Sep 17 00:00:00 2001 From: Carly de Frondeville Date: Wed, 13 May 2026 19:47:33 -0700 Subject: [PATCH 3/5] docs: scope gateway-stall caveat to what was directly observed Earlier wording implied multiple stall events ("occasional periods") when we have only directly characterized one such event during this investigation. Reword to describe exactly what was seen, note that frequency is not yet known, and that the behavior is open with the Observability team. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/scaling-recommendations.md | 6 +++--- internal/demo/README.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/scaling-recommendations.md b/docs/scaling-recommendations.md index 674fb5f6..c406b36f 100644 --- a/docs/scaling-recommendations.md +++ b/docs/scaling-recommendations.md @@ -31,11 +31,11 @@ backlog appears at T0 ### Caveat: gateway-wide stalls -We have observed occasional periods of several minutes during which Temporal Cloud's OpenMetrics gateway returns frozen timestamps for *every* series across the account — backlog series, action counts, error counts, every queue, every namespace. The Prometheus scrape continues to succeed (`up{job="temporal_cloud"}` stays 1, HTTP 200 responses), but the embedded timestamps in the response do not advance. During such a stall, HPAs see the last known value (via Prometheus staleness lookback) until either (a) fresh samples resume, or (b) the staleness window (5 min default) expires and the metric disappears entirely. +During our investigation we observed one period of several minutes during which Temporal Cloud's OpenMetrics endpoint returned frozen timestamps for *every* series across the account — backlog series, action counts, error counts, every queue, every namespace, all showing the exact same staleness simultaneously (e.g. 
all ~30 visible series reading 239s old, identical to the second). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — only the embedded timestamps in the response body did not advance. During such a period, HPAs see the last known value (via Prometheus staleness lookback) until either (a) fresh samples resume, or (b) the staleness window (5 min default) expires and the metric disappears entirely. -Frequency and duration of these stalls is still being characterized. They are the dominant reactivity-tail risk for the metric path. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA. +We have only directly characterized this once, so frequency and typical duration are not yet known. The behavior is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA. -This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected stall, otherwise the metric will deregister during a stall and only re-register on the next adapter discovery cycle after data flows again. +This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected gap so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after data flows again. ### Slot utilization is a much faster leading signal diff --git a/internal/demo/README.md b/internal/demo/README.md index 260adf5e..7fedfab9 100644 --- a/internal/demo/README.md +++ b/internal/demo/README.md @@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/ This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection. -> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. 
It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence), with occasional multi-minute stalls observed when the Cloud OpenMetrics gateway returns frozen timestamps. It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model and when to pick which. +> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence). It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model, when to pick which, and a caveat about a multi-minute gateway-wide staleness pattern we observed once during testing. **Step 1 — Create the Temporal Cloud metrics credentials secret.** From 26e953b6c7764b510888b58ec861036beddffaba Mon Sep 17 00:00:00 2001 From: Carly de Frondeville Date: Wed, 13 May 2026 20:08:37 -0700 Subject: [PATCH 4/5] docs: reframe gateway caveat as delivery delay (samples backfill) Verified directly: across a 3-hour window including one of the observed "stall" events, every gap between consecutive sample timestamps in Prometheus's storage is exactly 60 seconds. So the OpenMetrics endpoint isn't dropping or freezing emissions - it's delivering them late, in bursts after a delay, with their original minute-aligned timestamps. The retrospective record looks complete (good for dashboards), but live HPA consumers see the delay as real staleness because they query the latest available timestamp at decision time. 
Reframe the caveat in the scaling doc and demo README accordingly. Also note we observed two such delay events in ~2 hours of close observation - frequency in normal operation is still open with the Observability team. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/scaling-recommendations.md | 12 ++++++++---- internal/demo/README.md | 2 +- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/docs/scaling-recommendations.md b/docs/scaling-recommendations.md index c406b36f..3bb572e1 100644 --- a/docs/scaling-recommendations.md +++ b/docs/scaling-recommendations.md @@ -29,13 +29,17 @@ backlog appears at T0 **Typical end-to-end reactivity is ≈ 85 seconds + your stabilization window.** Empirically, sample age in Prometheus for a single series follows a sawtooth between 0 and 60 seconds (matching the gateway's ~1/min emission cadence). p50 sample age ≈ 30s, p95 ≈ 50s. The 60-second emission cadence is the inherent floor — smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules cannot improve it because they all consume the same upstream cadence. -### Caveat: gateway-wide stalls +### Caveat: gateway delivery delay -During our investigation we observed one period of several minutes during which Temporal Cloud's OpenMetrics endpoint returned frozen timestamps for *every* series across the account — backlog series, action counts, error counts, every queue, every namespace, all showing the exact same staleness simultaneously (e.g. all ~30 visible series reading 239s old, identical to the second). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — only the embedded timestamps in the response body did not advance. During such a period, HPAs see the last known value (via Prometheus staleness lookback) until either (a) fresh samples resume, or (b) the staleness window (5 min default) expires and the metric disappears entirely. 
+During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing. -We have only directly characterized this once, so frequency and typical duration are not yet known. The behavior is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA. +Once the delay resolved, the gateway delivered the missing samples with their original minute-aligned timestamps in a burst, so Prometheus's storage ends up with a complete 1/minute series in retrospect. We verified this directly: across a 3-hour window covering one such delay event, every gap between consecutive sample timestamps was exactly 60 seconds, no exceptions. -This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected gap so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after data flows again. +The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in. + +We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). 
Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA. + +This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes. ### Slot utilization is a much faster leading signal diff --git a/internal/demo/README.md b/internal/demo/README.md index 7fedfab9..7d17a6f0 100644 --- a/internal/demo/README.md +++ b/internal/demo/README.md @@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/ This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection. -> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence). It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model, when to pick which, and a caveat about a multi-minute gateway-wide staleness pattern we observed once during testing. +> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence). It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. 
See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model, when to pick which, and a caveat about an account-wide OpenMetrics delivery-delay pattern we observed during testing (retrospectively backfilled, but real for live HPA queries). **Step 1 — Create the Temporal Cloud metrics credentials secret.** From 4f72016eca508e42e8da720793a91fea5041fbcf Mon Sep 17 00:00:00 2001 From: Carly de Frondeville Date: Wed, 27 May 2026 17:28:17 -0700 Subject: [PATCH 5/5] Apply suggestions from code review Co-authored-by: Jay Pipes Co-authored-by: Stefan Richter --- docs/scaling-recommendations.md | 53 +++++++++++++++++++++++---------- 1 file changed, 37 insertions(+), 16 deletions(-) diff --git a/docs/scaling-recommendations.md b/docs/scaling-recommendations.md index 3bb572e1..01091e45 100644 --- a/docs/scaling-recommendations.md +++ b/docs/scaling-recommendations.md @@ -1,20 +1,30 @@ -# Scaling Recommendations: HPA + prometheus-adapter vs KEDA +# Scaling Recommendations -This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker-deployment-version on Kubernetes, and recommends which tool fits which workload pattern. +This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker deployment version on Kubernetes, and recommends which tool fits which workload pattern. The `internal/demo/` example wires the HPA path described here. The KEDA path is mentioned for comparison and as a recommendation for workloads that cannot tolerate the HPA path's limits. -## TL;DR — Pick by workload pattern +## TL;DR + +We recommend choosing a scaler approach that aligns with the workload pattern your application exhibits. 
| Workload pattern | Recommendation | |------------------|----------------| -| Continuous traffic (task queue always loaded) | HPA + prometheus-adapter, scaling on slot utilization + backlog count | -| Idle periods >5 min between work; needs scale-from-zero | KEDA Temporal scaler | +| Continuous traffic (task queue always loaded) | HPA | +| Idle periods >5 min between work OR needs scale-from-zero | KEDA Temporal scaler | | Required reactivity < ~60 s from first backlog | KEDA Temporal scaler | -| Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter is fine | -| Very large namespaces (N × M HPAs polling) | HPA + prometheus-adapter; KEDA hits the 50 RPS namespace rate limit | +| Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter | +| 1000s of task queues and worker deployment versions | HPA + prometheus-adapter | + +## HPA scaling signal + +This section describes the signal used by HPA + prometheus-adapter to adjust the count of workers in a Kubernetes Deployment managed by the Temporal Worker Controller. + +HPA + prometheus-adapter consumes two metrics. + +`temporal_cloud_v1_approximate_backlog_count` (or just "backlog") measures the number of pending tasks on a task queue that are waiting for a poller (a worker) to pick them up and process them. -## The reactivity model for HPA + prometheus-adapter +`temporal_slot_utilization` (or just "slot util") is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects the current state of a particular worker. This metric rises *before* backlog accumulates: slots saturate first, then queueing starts.
For a continuously-loaded task queue, the end-to-end delay from "backlog appears" to "HPA scales up" decomposes as: @@ -67,7 +77,7 @@ In a two-metric HPA configured with slot utilization, this is mostly fine: the H ## Why this demo does not use a backlog recording rule -A prior version of this demo wrapped the raw Cloud series in a Prometheus recording rule: +A prior version of this demo wrapped the raw Temporal Cloud series in a Prometheus recording rule: ```yaml - record: temporal_approximate_backlog_count @@ -84,10 +94,20 @@ The rule was originally added to work around a label-formatting issue in an olde What the recording rule *does* buy is registration stability after operational events: when the source series is sparse-by-timestamp, the rule produces a dense 10-second sample stream that lets the adapter discover with a tight `metricsRelistInterval`. If you find yourself fighting registration flicker on every adapter restart and would rather pay the cardinality cost than tune `metricsRelistInterval`, a recording rule is a reasonable choice. Otherwise, prefer the raw metric. In this demo we set `metricsRelistInterval: 5m` and consume the raw metric directly. +## HPA strengths -## Why prometheus-adapter cannot do scale-from-zero +Because the HPA path gathers all series for the namespace with a single OpenMetrics scrape (one HTTP request), it scales independently of HPA count. That single request is more efficient than KEDA's Temporal API-based approach and will not run into Temporal API rate limiting (see the section below on [KEDA limitations](#keda-limitations)). -Scale-from-zero on backlog through the metric path requires the metric to exist while there are zero workers. It does not: +HPA + prometheus-adapter configured to look at both slot util and backlog provides fast scale-up via slot util and a backlog-driven backstop to prevent overly reactive replica count adjustment.
+## HPA limitations + +This section describes two known limitations of HPA + prometheus-adapter. + +Temporal Cloud's OpenMetrics endpoint may sometimes return the same embedded timestamps on repeated scrapes for every series across the account simultaneously (backlog series, action counts, error counts, every queue, every namespace). This delay in returning fresh metrics data can reduce the speed at which HPA + prometheus-adapter scales the replica count out or in for a worker deployment version. As a result, HPA + prometheus-adapter may not be a good solution if your workload cannot tolerate occasional multi-minute scaling pauses. + +> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister; otherwise re-registration waits up to one more relist cycle after delivery resumes. + +HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.
+Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA until the next OpenMetrics emission cycle (~1 minute). By the time the HPA reacts, you've already had ~1+ minute of unprovisioned work. + +## KEDA strengths -KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. No metric pipeline involved. Scale-from-zero in seconds. +KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. This allows KEDA to scale Temporal workers from zero. -## When KEDA hits its own limits +## KEDA limitations KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit: @@ -115,9 +137,8 @@ For a namespace with N task queues × M worker-deployment-versions = K HPAs, eac | 250 | 8 RPS (17%) | 25 RPS (50%) | 50 RPS (100%) | | 1500 | 50 RPS (100%) | exceeds limit | exceeds limit | -prometheus-adapter has no equivalent per-namespace bottleneck — one OpenMetrics scrape returns all series for the namespace in a single HTTP request, scaling independently of HPA count. -So for very large namespaces (hundreds–thousands of HPAs) needing fast reactivity, neither path is great: KEDA hits the API rate limit, and the metric path has the 3-min aggregation floor. In practice this is a "talk to your account team" situation. +If you are using KEDA with Temporal Cloud and hitting the API rate limit described above, you will need to contact your Temporal Cloud account team to discuss increasing the rate limits. ## Recommended configuration for the HPA + prometheus-adapter path