Draft
3 changes: 3 additions & 0 deletions docs/README.md
@@ -39,6 +39,9 @@ See [Migration to Unversioned](migration-to-unversioned.md) for how to migrate b
### [Ownership](manager-identity.md)
How the controller gets permission to manage a Worker Deployment, how a human client can take or give back control.

### [Scaling Recommendations](scaling-recommendations.md)
Contributor

Recommend adding this new doc to the list here.

Practical reactivity and reliability tradeoffs between HPA + prometheus-adapter and KEDA when scaling Temporal workers per worker-deployment-version. Covers steady-state reactivity (~3:15 via the metric path), task-queue unloading, scale-from-zero limits, and when to pick which tool.

### [WorkerResourceTemplate](worker-resource-templates.md)
How to attach HPAs, PodDisruptionBudgets, and other Kubernetes resources to each active versioned Deployment. Covers the auto-injection model, RBAC setup, webhook TLS, and examples.

183 changes: 183 additions & 0 deletions docs/scaling-recommendations.md
@@ -0,0 +1,183 @@
# Scaling Recommendations

This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker deployment version on Kubernetes, and recommends which tool fits which workload pattern.

The `internal/demo/` example wires the HPA path described here. The KEDA path is mentioned for comparison and as a recommendation for workloads that cannot tolerate the HPA path's limits.

## TL;DR

Choose the scaler that matches the workload pattern your application exhibits.

| Workload pattern | Recommendation |
|------------------|----------------|
| Continuous traffic (task queue always loaded) | HPA |
| Idle periods >5 min between work OR needs scale-from-zero | KEDA Temporal scaler |
| Required reactivity < ~60 s from first backlog | KEDA Temporal scaler |
| Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter |
| 1000s of task queues and worker deployment versions | HPA + prometheus-adapter |

## HPA scaling signal

This section describes the signal that HPA + prometheus-adapter uses to adjust the number of workers in a Kubernetes Deployment managed by the Temporal Worker Controller.

The HPA reads two metrics through prometheus-adapter.

`temporal_cloud_v1_approximate_backlog_count` (or just "backlog") measures the number of pending tasks on a particular task queue that are waiting for a poller (a worker) to pick them up and process them.

`temporal_slot_utilization` (or just "slot util") is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects the current state of a particular worker. This metric rises *before* backlog accumulates: slots saturate first, then queueing starts.

For a continuously-loaded task queue, the end-to-end delay from "backlog appears" to "HPA scales up" decomposes as:

```
backlog appears at T0
└─ Temporal Cloud OpenMetrics emission cadence +~60 s worst-case (~1 sample/minute)
└─ Prometheus scrape interval +~10 s
└─ HPA poll interval +~15 s
└─ scale-up stabilization window +~your config
└─ first replica added
```
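
As a rough illustration, the stages in the breakdown above can be summed directly. This is a minimal Python sketch and not part of the demo: `STAGES_S` and `worst_case_delay_s` are hypothetical names, the stage values are copied from the breakdown, and the stabilization window argument stands in for whatever your HPA's scale-up behavior is configured with (0 is an assumed placeholder, not a recommendation).

```python
# Delay stages from the breakdown above, in seconds.
STAGES_S = {
    "openmetrics_emission": 60,  # worst case, ~1 sample/minute
    "prometheus_scrape": 10,
    "hpa_poll": 15,
}


def worst_case_delay_s(stabilization_window_s: int = 0) -> int:
    """Sum the pipeline stages plus the configured stabilization window."""
    return sum(STAGES_S.values()) + stabilization_window_s


print(worst_case_delay_s())    # 85
print(worst_case_delay_s(30))  # 115
```

Because the emission stage dominates the total, shortening the scrape or poll stages changes the result by a few seconds at most.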

**Typical end-to-end reactivity is ≈ 85 seconds + your stabilization window.** Empirically, sample age in Prometheus for a single series follows a sawtooth between 0 and 60 seconds (matching the gateway's ~1/min emission cadence). p50 sample age ≈ 30s, p95 ≈ 50s. The 60-second emission cadence is the inherent floor — smaller scrape intervals, tighter `metricsRelistInterval`, or recording rules cannot improve it because they all consume the same upstream cadence.
Contributor

What is "your stabilization window"? That is not defined anywhere and this reads like Claude has seen the phrase "stabilization window" in other docs that it scraped from the Internet that described auto-scaling algorithms but doesn't actually understand what "stabilization window" means.

Also, "Typical end-to-end reactivity" doesn't make sense here and sounds like a term Claude either made up or has hallucinated-adopted from the term "end-to-end reactivity" from frontend software development patterns.

Collaborator Author
@carlydf, May 27, 2026

Sorry about any Claude-isms, the reason this is in Draft mode is because I hadn't done a full pass over it yet. I care deeply about not putting PRs in front of people that I haven't reviewed myself and don't necessarily endorse any of this until it's out of draft mode.

Claude probably came up with this after seeing a bunch of grafana screenshots that I sent it.


### Caveat: gateway delivery delay

During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing.

this reads odd given that the project is from the same people as Temporal Cloud - can this be reworded?


Once the delay resolved, the gateway delivered the missing samples with their original minute-aligned timestamps in a burst, so Prometheus's storage ends up with a complete 1/minute series in retrospect. We verified this directly: across a 3-hour window covering one such delay event, every gap between consecutive sample timestamps was exactly 60 seconds, no exceptions.

The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in.

We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA.

this sounds like sharing internal sausage making - which as a customer I am not sure what to take away from

Contributor

Right, which is why I recommended in my review that this entire section be removed :)


This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.
Comment on lines +42 to +52
Contributor

Recommend removing this section. I've moved some of the content and reworded it in the ## HPA limitations section below.


### Slot utilization is a much faster leading signal

`temporal_slot_utilization` is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects current state. It also rises *before* backlog accumulates — slots saturate first, then queueing starts. So a two-metric HPA with both slot util and backlog gives you fast scale-up via slot util and a backlog-driven backstop.

The demo HPA uses both. For production scaling we recommend keeping both as well.
Comment on lines +54 to +58
Contributor

Recommend removing this. I've pulled some of the content into the proposed new ## HPA strengths and ## HPA scaling signal sections


## When backlog metric goes silent

Two distinct failure modes look similar in HPA events but have different meanings:

### Mode 1: adapter-level deregistration (rare)
- Trigger: prometheus-adapter pod restart, or *no* series matching the rule's `seriesQuery` exist in Prometheus.
- Symptom in HPA events: `the server could not find the metric ...`.
- Recovery: up to one `metricsRelistInterval` after data flows again.

prometheus-adapter periodically asks Prometheus "what series exist in the last `metricsRelistInterval`?" — see the [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md). If the discovery window is shorter than the longest gateway-wide stall, the discovery returns empty and the metric name disappears from the External Metrics API. The `metricsRelistInterval: 5m` setting buys margin: comfortably longer than typical sample age (~30s p50, ~50s p95) and longer than observed multi-minute gateway stalls so far.
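
The relationship between the relist window and a stall can be sketched as a simple comparison. This is a deliberately simplified model, not prometheus-adapter's actual discovery code: it assumes a series is discoverable only while its newest sample falls inside the relist window. The helper name `metric_stays_registered` is hypothetical, and the 239-second figure is the staleness reported for the stall described earlier in this document.

```python
def metric_stays_registered(newest_sample_age_s: float, relist_interval_s: float) -> bool:
    """Simplified model: discovery finds the series only if its newest
    sample falls inside the relist window."""
    return newest_sample_age_s <= relist_interval_s


OBSERVED_STALL_S = 239  # staleness seen during the observed stall

print(metric_stays_registered(OBSERVED_STALL_S, relist_interval_s=300))  # True
print(metric_stays_registered(OBSERVED_STALL_S, relist_interval_s=60))   # False
```

Under this model a 5-minute window rides out the observed stall, while a 1-minute window would drop the metric name until the next relist after delivery resumes.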

### Mode 2: series-level silence (common in low-traffic workloads)
- Trigger: a task queue with no polls or new tasks for >5 minutes. Temporal unloads it from memory and stops emitting `temporal_cloud_v1_approximate_backlog_count` for that specific `(task_queue, build_id, ...)` labelset. Other queues' series continue to emit.
- Symptom in HPA events: `no metrics returned from external metrics API`. The metric *name* is still registered; the HPA's specific label selector just matches zero rows now.
- Recovery: traffic resumes → queue reloads → next emission cycle (~1 min) + 3-min aggregation lag → HPA can read value again.

In a two-metric HPA configured with slot utilization, this is mostly fine: the HPA reports `ScalingActive=True` based on slot utilization while backlog is unavailable, and rejoins backlog scaling once it returns. We've confirmed this empirically in this demo cluster — the HPA continued scaling correctly on slot utilization through 1000+ backlog `FailedGetExternalMetric` events.

## Why this demo does not use a backlog recording rule

which demo? "this" is an unclear reference

Contributor

I recommended removing this entire section


A prior version of this demo wrapped the raw Temporal Cloud series in a Prometheus recording rule:

```yaml
- record: temporal_approximate_backlog_count
  expr: sum by (...) (temporal_cloud_v1_approximate_backlog_count)
```

The rule was originally added to work around a label-formatting issue in an older Temporal Cloud release. With native per-version labels (`temporal_worker_deployment_name`, `temporal_worker_build_id`) now opt-in, the rule no longer earns its keep:

- **It doesn't improve reactivity.** The HPA reactivity floor is the upstream OpenMetrics emission cadence (~60s), not anything the rule could fix.
- **It duplicates the cardinality bill.** Per-`(task_queue, build_id)` labels are already opt-in at the OpenMetrics level *because* of cardinality. Adding a recording rule on top means storing the same high-cardinality series twice.
- **It hides a `sum(...)` that the adapter already does.** prometheus-adapter's `metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})` performs the same collapsing at query time. Pedagogically, "the adapter does the sum" is cleaner than "a recording rule sums first, then the adapter sums again."
- **It does not solve series-level silence.** When the source goes silent (task queue unloaded), the rule output also goes silent eventually (once Prometheus's staleness lookback expires).
Comment on lines +78 to +92
Contributor

Remove this. It's not necessary and likely will just confuse the reader.


What the recording rule *does* buy is registration stability after operational events: when the source series is sparse-by-timestamp, the rule produces a dense 10-second sample stream that lets the adapter discover with a tight `metricsRelistInterval`. If you find yourself fighting registration flicker on every adapter restart and would rather pay the cardinality cost than tune `metricsRelistInterval`, a recording rule is a reasonable choice. Otherwise, prefer the raw metric.

In this demo we set `metricsRelistInterval: 5m` and consume the raw metric directly.

## HPA strengths

Because a single OpenMetrics scrape gathers all series for the namespace in one HTTP request, the HPA approach scales independently of namespace count. That single request is more efficient than KEDA's Temporal API-based approach and does not run into Temporal API rate limits (see the section below on [KEDA limitations](#keda-limitations)).

HPA + prometheus-adapter, configured to use both slot util and backlog, provides fast scale-up via slot util and a backlog-driven backstop that prevents overly reactive replica count adjustment.

## HPA limitations

This section describes two known limitations of HPA + prometheus-adapter.

Temporal Cloud's OpenMetrics endpoint may sometimes return the same embedded timestamps on repeated scrapes for every series across the account simultaneously (backlog series, action counts, error counts, every queue, every namespace). This delay in returning fresh metrics data can slow how quickly HPA + prometheus-adapter scales the replica count of a worker deployment version out or in. HPA + prometheus-adapter may therefore not be a good fit if your workload cannot tolerate occasional multi-minute scaling pauses.

> **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes.

HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue.

1. Zero workers means no polls.
2. No polls for >5 minutes means the task queue is unloaded from Temporal Cloud's memory.
3. An unloaded queue emits no metric.
4. Adapter discovery returns no series, or HPA queries return no rows.
5. HPA cannot scale up because there's no signal to scale on.
Comment on lines +112 to +116
Contributor

Suggested change
1. Zero workers means no polls.
2. No polls for >5 minutes means the task queue is unloaded from Temporal Cloud's memory.
3. An unloaded queue emits no metric.
4. Adapter discovery returns no series, or HPA queries return no rows.
5. HPA cannot scale up because there's no signal to scale on.
In addition to the "first worker start" problem, for customers using Temporal Cloud, if there are no polling workers for a task queue for more than 5 minutes, Temporal Cloud will unload the task queue from memory. Unloaded task queues do not emit metrics, and therefore the signal that HPA uses to scale up will not be present.


Isn't the same problem also seen with OSS?

Contributor

not sure. @carlydf do you know the answer to this?

Collaborator Author

yes, same problem for OSS and Cloud here, Claude hallucinating. (PR in draft mode because I had not fully reviewed yet, didn't intend to waste your energy on review until it was ready!)


Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA until the next OpenMetrics emission cycle (~1 minute). By the time the HPA reacts, you've already had ~1+ minute of unprovisioned work.

## KEDA strengths

KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. This allows KEDA to scale Temporal workers from zero.

## KEDA limitations

KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit:
Collaborator Author

Other KEDA limitation (for now) #355

KEDA also does not actually work with TWC until #286 is closed. Luckily we have an open community PR #351 that will add support for the KEDA temporal trigger, which I think can be merged soon. I just reviewed it.


```
FrontendGlobalWorkerDeploymentReadRPS = 50 # per namespace, evenly distributed across frontend instances
```

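For a namespace with N task queues × M worker-deployment-versions = K HPAs, each poll for one HPA uses ~1 API call, so polling every T seconds costs roughly K / T requests per second. The polling budget against the 50 RPS limit: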

| HPA count | Poll every 30s | Poll every 10s | Poll every 5s |
|-----------|----------------|----------------|---------------|
| 50 | 1.7 RPS (3%) | 5 RPS (10%) | 10 RPS (20%) |
| 250 | 8 RPS (17%) | 25 RPS (50%) | 50 RPS (100%) |
| 1500 | 50 RPS (100%) | exceeds limit | exceeds limit |
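
The table values follow from simple division, which this short Python sketch reproduces for other counts and intervals. It assumes exactly one API call per scaler per poll, as stated above; `polling_budget` is a hypothetical helper name, and `RATE_LIMIT_RPS` is the per-namespace limit quoted above.

```python
RATE_LIMIT_RPS = 50  # FrontendGlobalWorkerDeploymentReadRPS, per namespace


def polling_budget(scaler_count: int, poll_interval_s: float) -> tuple[float, float]:
    """Return (requests per second, fraction of the rate limit used),
    assuming one API call per scaler per poll."""
    rps = scaler_count / poll_interval_s
    return rps, rps / RATE_LIMIT_RPS


for count in (50, 250, 1500):
    for interval in (30, 10, 5):
        rps, share = polling_budget(count, interval)
        status = "exceeds limit" if share > 1 else f"{share:.0%}"
        print(f"{count:>5} scalers @ {interval:>2}s: {rps:6.1f} RPS ({status})")
```

Any other traffic that counts against the same limit would reduce the headroom further, so treat the computed share as a lower bound on real usage.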


If you are using KEDA with Temporal Cloud and hitting the API rate limit described above, you will need to contact your Temporal Cloud account team to discuss increasing the rate limits.

## Recommended configuration for the HPA + prometheus-adapter path

This demo's configuration represents the recommendation, in compact form:

**Scrape config** (`internal/demo/k8s/prometheus-stack-values.yaml`):
```yaml
- job_name: temporal_cloud
  scrape_interval: 10s
  honor_timestamps: true
  metrics_path: /v1/metrics
  params:
    labels:
      - temporal_worker_deployment_name
      - temporal_worker_build_id
```

**prometheus-adapter rule** (`internal/demo/k8s/prometheus-adapter-values.yaml`):
```yaml
metricsRelistInterval: 5m  # must accommodate Cloud's ~3-min embedded-timestamp lag
rules:
  external:
    - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}'
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
      name:
        as: "temporal_cloud_v1_approximate_backlog_count"
      resources:
        namespaced: false
```

The `seriesQuery` filter excludes `__unversioned__` series. Without it, accounts with many unversioned namespaces produce 5000+ series in the discovery response, which slows or breaks adapter discovery. The filter scopes discovery to versioned workloads — exactly the ones HPAs need.

**HPA template** (`examples/wrt-hpa-backlog.yaml`): two metrics — slot utilization (fast leading signal, scale-up gate) and backlog count (confirming signal, AverageValue target).

## References

- [Temporal Cloud OpenMetrics](https://docs.temporal.io/cloud/metrics/openmetrics) — endpoint and opt-in labels
- [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md) — `metrics-relist-interval` and discovery window semantics
- [prometheus-adapter externalmetrics.md](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/externalmetrics.md) — external rules, `namespaced: false` for cluster-scoped metrics
- [Prometheus HTTP API: `/api/v1/series`](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) — series discovery semantics
- [Prometheus scrape config: `honor_timestamps`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) — preserving source timestamps
- [KEDA Temporal scaler](https://keda.sh/docs/latest/scalers/temporal/) — direct API polling alternative
7 changes: 4 additions & 3 deletions examples/wrt-hpa-backlog.yaml
@@ -61,15 +61,16 @@ spec:
value: "750m"

# Metric: backlog count — scale up when tasks are queued but not yet picked up.
- # temporal_approximate_backlog_count is a recording rule that aggregates
- # temporal_cloud_v1_approximate_backlog_count down to the four labels the HPA needs.
+ # Sourced directly from Temporal Cloud's temporal_cloud_v1_approximate_backlog_count
+ # series; the prometheus-adapter rule wraps it in sum(...) to collapse labels the HPA
+ # doesn't select on (instance/job/region/task_priority/temporal_account).
# temporal_worker_deployment_name, temporal_worker_build_id, and temporal_namespace
# are injected automatically by the controller — do not set them here.
# temporal_task_queue must be set explicitly to scope the metric to your task queue.
- type: External
external:
metric:
- name: temporal_approximate_backlog_count
+ name: temporal_cloud_v1_approximate_backlog_count
selector:
matchLabels:
temporal_task_queue: "default_helloworld"
6 changes: 3 additions & 3 deletions internal/demo/README.md
@@ -268,7 +268,7 @@ You'll also need to [opt-in](https://docs.temporal.io/cloud/metrics/openmetrics/

This requires a **metrics API key** — a separate credential from the namespace API key used for the worker connection.

- > **Note:** This demo ships a Prometheus recording rule that renames `temporal_cloud_v1_approximate_backlog_count` to `temporal_approximate_backlog_count` and reduces it to the labels the HPA cares about. In principle the HPA can consume the raw Cloud metric directly (set `namespaced: false` on the prometheus-adapter rule so it doesn't auto-inject a `namespace` label filter), but this demo uses the recording rule as a known-working path.
+ > **Picking a scaling tool for your workload:** This demo uses the HPA + prometheus-adapter path. It works well for continuously-loaded task queues and has a typical end-to-end reactivity of ~85 seconds (dominated by Temporal Cloud's ~1/minute OpenMetrics emission cadence). It cannot do scale-from-zero. For sub-60s reactivity or scale-from-zero, use the KEDA Temporal scaler. See [docs/scaling-recommendations.md](../../docs/scaling-recommendations.md) for the full reactivity model, when to pick which, and a caveat about an account-wide OpenMetrics delivery-delay pattern we observed during testing (retrospectively backfilled, but real for live HPA queries).

**Step 1 — Create the Temporal Cloud metrics credentials secret.**

@@ -302,11 +302,11 @@ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapte

```bash
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9092:9090 &
- curl -s 'http://localhost:9092/api/v1/query?query=temporal_approximate_backlog_count' \
+ curl -s 'http://localhost:9092/api/v1/query?query=temporal_cloud_v1_approximate_backlog_count' \
| jq '.data.result'
```

- You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, wait 15–30s for the recording rule to evaluate.
+ You should see results with `temporal_worker_deployment_name` and `temporal_worker_build_id` labels. If the result is empty, verify the Temporal Cloud metrics API key secret is correct and that scrape targets are healthy in the Prometheus UI.

**Step 4 — Apply the combined WRT.**
```bash