temporalio · carlydf · May 14, 2026
@@ -39,6 +39,9 @@ See [Migration to Unversioned](migration-to-unversioned.md) for how to migrate b
 ### [Ownership](manager-identity.md)
 How the controller gets permission to manage a Worker Deployment, how a human client can take or give back control.
 
+### [Scaling Recommendations](scaling-recommendations.md)
+Practical reactivity and reliability tradeoffs between HPA + prometheus-adapter and KEDA when scaling Temporal workers per worker-deployment-version. Covers steady-state reactivity (~85 s typical via the metric path), task-queue unloading, scale-from-zero limits, and when to pick which tool.
+
 ### [WorkerResourceTemplate](worker-resource-templates.md)
 How to attach HPAs, PodDisruptionBudgets, and other Kubernetes resources to each active versioned Deployment. Covers the auto-injection model, RBAC setup, webhook TLS, and examples.
 

@@ -0,0 +1,183 @@
+# Scaling Recommendations
+
+This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker deployment version on Kubernetes, and recommends which tool fits which workload pattern.
+
+The `internal/demo/` example wires the HPA path described here. The KEDA path is mentioned for comparison and as a recommendation for workloads that cannot tolerate the HPA path's limits.
+
+## TL;DR
+
+Choose the scaling approach that matches your application's workload pattern.
+
+| Workload pattern | Recommendation |
+|------------------|----------------|
+| Continuous traffic (task queue always loaded) | HPA |
+| Idle periods >5 min between work OR needs scale-from-zero | KEDA Temporal scaler |
+| Required reactivity < ~60 s from first backlog | KEDA Temporal scaler |
+| Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter |
+| 1000s of task queues and worker deployment versions | HPA + prometheus-adapter |
+
+## HPA scaling signal
+
+This section describes the signals that HPA + prometheus-adapter uses to adjust the number of worker replicas in a Kubernetes Deployment managed by the Temporal Worker Controller.
+
+The HPA acts on two metrics, both collected by Prometheus.
+
+`temporal_cloud_v1_approximate_backlog_count` (or just "backlog") is a measurement of the number of pending tasks on a particular task queue that are waiting for a poller (a worker) to pull that task and process it.
+
+`temporal_slot_utilization` (or just "slot util") is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects the current state of a particular worker. This metric rises *before* backlog accumulates: slots saturate first, then queueing starts.
+
+For a continuously-loaded task queue, the end-to-end delay from "backlog appears" to "HPA scales up" decomposes as:
+
+```
+backlog appears at T0
+  └─ Temporal Cloud OpenMetrics emission cadence    +~60 s worst-case  (~1 sample/minute)
+       └─ Prometheus scrape interval                 +~10 s
+            └─ HPA poll interval                     +~15 s
+                 └─ scale-up stabilization window    +~your config
+                      └─ first replica added
+```
+
+**Typical end-to-end reactivity is ≈ 85 seconds + your stabilization window.** Empirically, sample age in Prometheus for a single series follows a sawtooth between 0 and 60 seconds (matching the gateway's ~1/min emission cadence). p50 sample age ≈ 30s, p95 ≈ 50s. The 60-second emission cadence is the inherent floor — smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules cannot improve it because they all consume the same upstream cadence.
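
The breakdown above can be expressed as a small budget calculation. This is only a sketch: the three delay values are the worst-case figures from the diagram, and the 30-second stabilization window in the usage lines is a hypothetical value, not a recommendation.

```python
# Worst-case delay contributions in seconds, copied from the breakdown above.
PIPELINE_DELAYS_S = {
    "temporal_cloud_emission": 60,  # ~1 sample/minute
    "prometheus_scrape": 10,
    "hpa_poll": 15,
}


def reactivity_seconds(stabilization_window_s: float = 0.0) -> float:
    """Delay from 'backlog appears' to 'first replica added', in seconds."""
    return sum(PIPELINE_DELAYS_S.values()) + stabilization_window_s


if __name__ == "__main__":
    print(reactivity_seconds())      # pipeline only, no stabilization window
    print(reactivity_seconds(30.0))  # hypothetical 30 s scale-up stabilization window
```

Because the emission cadence is the largest term, shrinking the scrape or poll intervals changes the total only marginally.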
+
+### Caveat: gateway delivery delay
+
+During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing.
+
+Once the delay resolved, the gateway delivered the missing samples with their original minute-aligned timestamps in a burst, so Prometheus's storage ends up with a complete 1/minute series in retrospect. We verified this directly: across a 3-hour window covering one such delay event, every gap between consecutive sample timestamps was exactly 60 seconds, no exceptions.
+
+The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in.
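
The difference between "complete in retrospect" and "stale at decision time" can be illustrated with a toy timeline. Everything here is hypothetical (the `Sample` type, the delivery times, the burst at t = 290 s); it models the behavior described above rather than any real Prometheus or HPA API.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    timestamp_s: int     # embedded, minute-aligned sample timestamp
    delivered_at_s: int  # when a scrape first returned this sample


def latest_visible_age(samples: list[Sample], now_s: int) -> Optional[int]:
    """Age of the newest sample already delivered at now_s (what a live query sees)."""
    visible = [s.timestamp_s for s in samples if s.delivered_at_s <= now_s]
    return now_s - max(visible) if visible else None


def stored_gaps(samples: list[Sample]) -> set[int]:
    """Gaps between consecutive stored timestamps, viewed after the fact."""
    stamps = sorted(s.timestamp_s for s in samples)
    return {b - a for a, b in zip(stamps, stamps[1:])}


# Hypothetical timeline: the samples stamped 120..240 s are held back and
# delivered together in a burst at t = 290 s.
SAMPLES = [
    Sample(0, 5), Sample(60, 65),
    Sample(120, 290), Sample(180, 290), Sample(240, 290),
    Sample(300, 305),
]

if __name__ == "__main__":
    print(latest_visible_age(SAMPLES, 280))  # during the delay
    print(latest_visible_age(SAMPLES, 295))  # just after the burst
    print(stored_gaps(SAMPLES))              # retrospective view
```

In this timeline a query at t = 280 s sees a sample that is 220 seconds old, even though the stored series later shows an unbroken 60-second spacing.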
+
+We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA.
+
+This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
+
+### Slot utilization is a much faster leading signal
+
+`temporal_slot_utilization` is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects current state. It also rises *before* backlog accumulates — slots saturate first, then queueing starts. So a two-metric HPA with both slot util and backlog gives you fast scale-up via slot util and a backlog-driven backstop.
+
+The demo HPA uses both. For production scaling we recommend keeping both as well.
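
A rough sketch of how a two-metric HPA combines its signals: each metric produces its own replica proposal and the HPA acts on the largest one. The formulas follow the Kubernetes HPA algorithm as documented upstream, but the sketch ignores tolerance, readiness handling, and stabilization, and all input numbers are hypothetical.

```python
import math


def desired_from_ratio(current_replicas: int, current_value: float, target_value: float) -> int:
    """Per-metric proposal: ceil(currentReplicas * currentValue / targetValue)."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_from_average_value(total_value: float, target_average: float) -> int:
    """Proposal for an AverageValue target: ceil(total / per-replica target)."""
    return math.ceil(total_value / target_average)


def hpa_desired_replicas(proposals: list[int]) -> int:
    """With several metrics, the largest proposal wins."""
    return max(proposals)


if __name__ == "__main__":
    # Hypothetical state: 4 replicas, slots 90% busy against a 75% target,
    # and 30 backlogged tasks against a target of 10 per replica.
    slot_proposal = desired_from_ratio(4, 0.9, 0.75)
    backlog_proposal = desired_from_average_value(30, 10)
    print(slot_proposal, backlog_proposal,
          hpa_desired_replicas([slot_proposal, backlog_proposal]))
```

Here slot utilization asks for more replicas than the backlog does, so it drives the scale-up; with a deep backlog the roles reverse and backlog acts as the backstop.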
+
+## When backlog metric goes silent
+
+Two distinct failure modes look similar in HPA events but have different meanings:
+
+### Mode 1: adapter-level deregistration (rare)
+- Trigger: prometheus-adapter pod restart, or *no* series matching the rule's `seriesQuery` exist in Prometheus.
+- Symptom in HPA events: `the server could not find the metric ...`.
+- Recovery: up to one `metricsRelistInterval` after data flows again.
+
+prometheus-adapter periodically asks Prometheus "what series exist in the last `metricsRelistInterval`?" — see the [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md). If the discovery window is shorter than the longest gateway-wide stall, the discovery returns empty and the metric name disappears from the External Metrics API. The `metricsRelistInterval: 5m` setting buys margin: comfortably longer than typical sample age (~30s p50, ~50s p95) and longer than observed multi-minute gateway stalls so far.
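
As a simplified model of the discovery window described above (not the adapter's actual code), a series stays registered only while its newest sample falls inside the window. The function name is hypothetical; the 239-second age is the stall observed earlier in this document.

```python
def series_discoverable(newest_sample_age_s: float, relist_interval_s: float) -> bool:
    """True if the newest sample is still inside the discovery window."""
    return newest_sample_age_s <= relist_interval_s


if __name__ == "__main__":
    observed_stall_s = 239  # sample age seen during the gateway delay above
    print(series_discoverable(observed_stall_s, 60))   # tight 1-minute window
    print(series_discoverable(observed_stall_s, 300))  # metricsRelistInterval: 5m
```

Under this model a 1-minute window would have dropped the metric during the observed stall, while the 5-minute setting keeps it registered.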
+
+### Mode 2: series-level silence (common in low-traffic workloads)
+- Trigger: a task queue with no polls or new tasks for >5 minutes. Temporal unloads it from memory and stops emitting `temporal_cloud_v1_approximate_backlog_count` for that specific `(task_queue, build_id, ...)` labelset. Other queues' series continue to emit.
+- Symptom in HPA events: `no metrics returned from external metrics API`. The metric *name* is still registered; the HPA's specific label selector just matches zero rows now.
+- Recovery: traffic resumes → queue reloads → next emission cycle (~1 min) + 3-min aggregation lag → HPA can read value again.
+
+In a two-metric HPA configured with slot utilization, this is mostly fine: the HPA reports `ScalingActive=True` based on slot utilization while backlog is unavailable, and rejoins backlog scaling once it returns. We've confirmed this empirically in this demo cluster — the HPA continued scaling correctly on slot utilization through 1000+ backlog `FailedGetExternalMetric` events.
+
+## Why this demo does not use a backlog recording rule
+
+A prior version of this demo wrapped the raw Temporal Cloud series in a Prometheus recording rule:
+
+```yaml
+- record: temporal_approximate_backlog_count
+  expr: sum by (...) (temporal_cloud_v1_approximate_backlog_count)
+```
+
+The rule was originally added to work around a label-formatting issue in an older Temporal Cloud release. With native per-version labels (`temporal_worker_deployment_name`, `temporal_worker_build_id`) now opt-in, the rule no longer earns its keep:
+
+- **It doesn't improve reactivity.** The HPA reactivity floor is the upstream OpenMetrics emission cadence (~60s), not anything the rule could fix.
+- **It duplicates the cardinality bill.** Per-`(task_queue, build_id)` labels are already opt-in at the OpenMetrics level *because* of cardinality. Adding a recording rule on top means storing the same high-cardinality series twice.
+- **It hides a `sum(...)` that the adapter already does.** prometheus-adapter's `metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})` performs the same collapsing at query time. Pedagogically, "the adapter does the sum" is cleaner than "a recording rule sums first, then the adapter sums again."
+- **It does not solve series-level silence.** When the source goes silent (task queue unloaded), the rule output also goes silent eventually (once Prometheus's staleness lookback expires).
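
The query-time collapsing described in the list above can be sketched in a few lines. This is an illustration of what `sum(<<.Series>>{<<.LabelMatchers>>})` does conceptually, not the adapter's implementation; the series values and label values are made up.

```python
from typing import Optional

# Hypothetical raw series: same queue and build, split by labels the HPA
# does not select on (region, task_priority).
SERIES = [
    {"labels": {"temporal_task_queue": "orders", "temporal_worker_build_id": "v1",
                "region": "us-east-1", "task_priority": "1"}, "value": 4.0},
    {"labels": {"temporal_task_queue": "orders", "temporal_worker_build_id": "v1",
                "region": "us-east-1", "task_priority": "2"}, "value": 6.0},
    {"labels": {"temporal_task_queue": "orders", "temporal_worker_build_id": "v2",
                "region": "us-east-1", "task_priority": "1"}, "value": 3.0},
]


def adapter_sum(series: list[dict], matchers: dict[str, str]) -> Optional[float]:
    """Sum every series whose labels satisfy all matchers; None if nothing matches."""
    matched = [s["value"] for s in series
               if all(s["labels"].get(k) == v for k, v in matchers.items())]
    return sum(matched) if matched else None


if __name__ == "__main__":
    print(adapter_sum(SERIES, {"temporal_task_queue": "orders",
                               "temporal_worker_build_id": "v1"}))
    print(adapter_sum(SERIES, {"temporal_task_queue": "orders",
                               "temporal_worker_build_id": "v3"}))
```

Returning `None` for an empty match mirrors the series-level silence case: the sum produces no rows rather than a zero.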
+
+What the recording rule *does* buy is registration stability after operational events: when the source series is sparse-by-timestamp, the rule produces a dense 10-second sample stream that lets the adapter discover with a tight `metricsRelistInterval`. If you find yourself fighting registration flicker on every adapter restart and would rather pay the cardinality cost than tune `metricsRelistInterval`, a recording rule is a reasonable choice. Otherwise, prefer the raw metric.
+
+In this demo we set `metricsRelistInterval: 5m` and consume the raw metric directly.
+
+## HPA strengths
+
+Because the HPA path gathers all series for the namespace with a single OpenMetrics scrape (one HTTP request), it scales independently of namespace count. That single request is more efficient than KEDA's Temporal API-based approach and will not run into Temporal API rate limits (see [KEDA limitations](#keda-limitations)).
+
+HPA + prometheus-adapter configured with both slot util and backlog provides fast scale-up via slot util and a backlog-driven backstop to prevent overly reactive replica count adjustment.
+
+## HPA limitations
+
+This section describes two known limitations of HPA + prometheus-adapter.
+
+Temporal Cloud's OpenMetrics endpoint may sometimes return the same embedded timestamps on repeated scrapes for every series across the account simultaneously (backlog series, action counts, error counts, every queue, every namespace). This delay in returning fresh metrics data can slow how quickly HPA + prometheus-adapter scales the replica count of a worker deployment version out or in. HPA + prometheus-adapter may therefore not be a good fit if your workload cannot tolerate occasional multi-minute scaling pauses.
+
+> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
+
+HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.
+
+1. Zero workers means no polls.
+2. No polls for >5 minutes means the task queue is unloaded from Temporal Cloud's memory.
+3. An unloaded queue emits no metric.
+4. Adapter discovery returns no series, or HPA queries return no rows.
+5. HPA cannot scale up because there's no signal to scale on.
+In addition to the "first worker start" problem, for customers using Temporal Cloud, if there are no polling workers for a task queue for more than 5 minutes, Temporal Cloud will unload the task queue from memory. Unloaded task queues do not emit metrics, and therefore the signal that HPA uses to scale up will not be present.
+
+Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA until the next OpenMetrics emission cycle (~1 minute). By the time the HPA reacts, you've already had ~1+ minute of unprovisioned work.
+
+## KEDA strengths
+
+KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. This allows KEDA to scale Temporal workers from zero.
+
+## KEDA limitations
+
+KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit:
+
+```
+FrontendGlobalWorkerDeploymentReadRPS = 50  # per namespace, evenly distributed across frontend instances
+```
+
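+For a namespace with N task queues × M worker-deployment-versions = K HPAs, each KEDA polling cycle makes ~1 API call per HPA, so the sustained request rate is K divided by the polling interval. The polling budget against the 50 RPS limit: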
+
+| HPA count | Poll every 30s | Poll every 10s | Poll every 5s |
+|-----------|----------------|----------------|---------------|
+| 50        | 1.7 RPS (3%)   | 5 RPS (10%)    | 10 RPS (20%)  |
+| 250       | 8 RPS (17%)    | 25 RPS (50%)   | 50 RPS (100%) |
+| 1500      | 50 RPS (100%)  | exceeds limit  | exceeds limit |
+
+
+If you are using KEDA with Temporal Cloud and hitting the API rate limit described above, you will need to contact your Temporal Cloud account team to discuss increasing the rate limits.
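
The polling budget table above can be reproduced with a short calculation. This sketch assumes exactly one API call per HPA per poll, as stated above, and uses the 50 RPS limit from this section; the function names are illustrative.

```python
RATE_LIMIT_RPS = 50.0  # FrontendGlobalWorkerDeploymentReadRPS, per namespace


def polling_rps(hpa_count: int, poll_interval_s: float) -> float:
    """Sustained request rate if every HPA costs one API call per poll."""
    return hpa_count / poll_interval_s


def budget_fraction(hpa_count: int, poll_interval_s: float,
                    limit_rps: float = RATE_LIMIT_RPS) -> float:
    """Share of the rate limit consumed; values above 1.0 exceed the limit."""
    return polling_rps(hpa_count, poll_interval_s) / limit_rps


def max_hpas(poll_interval_s: float, limit_rps: float = RATE_LIMIT_RPS) -> int:
    """Largest HPA count that fits within the limit at this polling interval."""
    return int(limit_rps * poll_interval_s)


if __name__ == "__main__":
    for count in (50, 250, 1500):
        for interval in (30, 10, 5):
            print(count, interval,
                  round(polling_rps(count, interval), 1),
                  f"{budget_fraction(count, interval):.0%}")
```

`max_hpas` gives the break-even point directly, which is useful when choosing a polling interval for a known number of worker deployment versions.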
+
+## Recommended configuration for the HPA + prometheus-adapter path
+
+This demo's configuration represents the recommendation, in compact form:
+
+**Scrape config** (`internal/demo/k8s/prometheus-stack-values.yaml`):
+```yaml
+- job_name: temporal_cloud
+  scrape_interval: 10s
+  honor_timestamps: true
+  metrics_path: /v1/metrics
+  params:
+    labels:
+      - temporal_worker_deployment_name
+      - temporal_worker_build_id
+```
+
+**prometheus-adapter rule** (`internal/demo/k8s/prometheus-adapter-values.yaml`):
+```yaml
+metricsRelistInterval: 5m   # must accommodate Cloud's ~3-min embedded-timestamp lag
+rules:
+  external:
+    - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
+      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
+      name:
+        as: "temporal_cloud_v1_approximate_backlog_count"
+      resources:
+        namespaced: false
+```
+
+The `seriesQuery` filter excludes `__unversioned__` series. Without it, accounts with many unversioned namespaces produce 5000+ series in the discovery response, which slows or breaks adapter discovery. The filter scopes discovery to versioned workloads — exactly the ones HPAs need.
+
+**HPA template** (`examples/wrt-hpa-backlog.yaml`): two metrics — slot utilization (fast leading signal, scale-up gate) and backlog count (confirming signal, AverageValue target).
+
+## References
+
+- [Temporal Cloud OpenMetrics](https://docs.temporal.io/cloud/metrics/openmetrics) — endpoint and opt-in labels
+- [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md) — `metrics-relist-interval` and discovery window semantics
+- [prometheus-adapter externalmetrics.md](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/externalmetrics.md) — external rules, `namespaced: false` for cluster-scoped metrics
+- [Prometheus HTTP API: `/api/v1/series`](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) — series discovery semantics
+- [Prometheus scrape config: `honor_timestamps`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) — preserving source timestamps
+- [KEDA Temporal scaler](https://keda.sh/docs/latest/scalers/temporal/) — direct API polling alternative
@@ -61,15 +61,16 @@ spec:
               value: "750m"
 
         # Metric: backlog count — scale up when tasks are queued but not yet picked up.
-        # temporal_approximate_backlog_count is a recording rule that aggregates
-        # temporal_cloud_v1_approximate_backlog_count down to the four labels the HPA needs.
+        # Sourced directly from Temporal Cloud's temporal_cloud_v1_approximate_backlog_count
+        # series; the prometheus-adapter rule wraps it in sum(...) to collapse labels the HPA
+        # doesn't select on (instance/job/region/task_priority/temporal_account).
         # temporal_worker_deployment_name, temporal_worker_build_id, and temporal_namespace
         # are injected automatically by the controller — do not set them here.
         # temporal_task_queue must be set explicitly to scope the metric to your task queue.
         - type: External
           external:
             metric:
-              name: temporal_approximate_backlog_count
+              name: temporal_cloud_v1_approximate_backlog_count
               selector:
                 matchLabels:
                   temporal_task_queue: "default_helloworld"

@@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/
 
 This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection.
 
-> **Note:** This demo ships a Prometheus recording rule that renames `temporal_cloud_v1_approximate_backlog_count` to `temporal_approximate_backlog_count` and reduces it to the labels the HPA cares about. In principle the HPA can consume the raw Cloud metric directly (set `namespaced: false` on the prometheus-adapter rule so it doesn't auto-inject a `namespace` label filter), but this demo uses the recording rule as a known-working path.
+> **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence). It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model, when to pick which, and a caveat about an account-wide OpenMetrics delivery-delay pattern we observed during testing (retrospectively backfilled, but real for live HPA queries).
 
 **Step 1 — Create the Temporal Cloud metrics credentials secret.**
 
@@ -302,11 +302,11 @@ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapte
 
 ```bash
 kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9092:9090 &
-curl -s 'http://localhost:9092/api/v1/query?query=temporal_approximate_backlog_count' \
+curl -s 'http://localhost:9092/api/v1/query?query=temporal_cloud_v1_approximate_backlog_count' \
   | jq '.data.result'
 ```
 
-You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, wait 15–30s for the recording rule to evaluate.
+You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, verify the Temporal Cloud metrics API key secret is correct and that scrape targets are healthy in the Prometheus UI.
 
 **Step 4 — Apply the combined WRT.**
 ```bash