Drop backlog recording rule; consume raw temporal_cloud_v1_approximate_backlog_count #324
base: main
Changes from all commits
5c252da
fbf4c57
16d5692
26e953b
4f72016
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,183 @@ | ||||||||||||||
| # Scaling Recommendations | ||||||||||||||
|
|
||||||||||||||
| This document describes practical reactivity and reliability tradeoffs when scaling Temporal workers per worker deployment version on Kubernetes, and recommends which tool fits which workload pattern. | ||||||||||||||
|
|
||||||||||||||
| The `internal/demo/` example wires the HPA path described here. The KEDA path is mentioned for comparison and as a recommendation for workloads that cannot tolerate the HPA path's limits. | ||||||||||||||
|
|
||||||||||||||
| ## TL;DR | ||||||||||||||
|
|
||||||||||||||
| We recommend choosing a scaler approach that aligns with the workload pattern your application exhibits. | ||||||||||||||
|
|
||||||||||||||
| | Workload pattern | Recommendation | | ||||||||||||||
| |------------------|----------------| | ||||||||||||||
| | Continuous traffic (task queue always loaded) | HPA | | ||||||||||||||
| | Idle periods >5 min between work OR needs scale-from-zero | KEDA Temporal scaler | | ||||||||||||||
| | Required reactivity < ~60 s from first backlog | KEDA Temporal scaler | | ||||||||||||||
| | Required reactivity ~90 s typical, tolerant of occasional multi-minute stalls | HPA + prometheus-adapter | | ||||||||||||||
| | 1000s of task queues and worker deployment versions | HPA + prometheus-adapter | | ||||||||||||||
|
|
||||||||||||||
| ## HPA scaling signal | ||||||||||||||
|
|
||||||||||||||
| This section describes the signals that HPA + prometheus-adapter use to adjust the number of worker replicas in a Kubernetes Deployment managed by the Temporal Worker Controller. | ||||||||||||||
|
|
||||||||||||||
| HPA + prometheus-adapter consume two metrics, both of which are scraped by Prometheus. | ||||||||||||||
|
|
||||||||||||||
| `temporal_cloud_v1_approximate_backlog_count` (or just "backlog") is a measurement of the number of pending tasks on a particular task queue that are waiting for a poller (a worker) to pull that task and process it. | ||||||||||||||
|
|
||||||||||||||
| `temporal_slot_utilization` (or just "slot util") is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects the current state of a particular worker. This metric rises *before* backlog accumulates: slots saturate first, then queueing starts. | ||||||||||||||
|
|
||||||||||||||
| For a continuously-loaded task queue, the end-to-end delay from "backlog appears" to "HPA scales up" decomposes as: | ||||||||||||||
|
|
||||||||||||||
| ``` | ||||||||||||||
| backlog appears at T0 | ||||||||||||||
| └─ Temporal Cloud OpenMetrics emission cadence +~60 s worst-case (~1 sample/minute) | ||||||||||||||
| └─ Prometheus scrape interval +~10 s | ||||||||||||||
| └─ HPA poll interval +~15 s | ||||||||||||||
| └─ scale-up stabilization window +~your config | ||||||||||||||
| └─ first replica added | ||||||||||||||
| ``` | ||||||||||||||
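The stage delays in the breakdown above can be added up mechanically. The sketch below is illustrative only: the dictionary keys and the `worst_case_delay_s` helper are hypothetical names, and the stage values are the approximate figures from the breakdown, not measured guarantees.

```python
# Approximate per-stage delays (seconds), taken from the breakdown above.
STAGES_S = {
    "cloud_emission_cadence": 60,      # ~1 sample/minute, worst case
    "prometheus_scrape_interval": 10,
    "hpa_poll_interval": 15,
}


def worst_case_delay_s(stabilization_window_s: int = 0) -> int:
    """Sum the pipeline stages and add the HPA scale-up stabilization window."""
    return sum(STAGES_S.values()) + stabilization_window_s


if __name__ == "__main__":
    print(worst_case_delay_s())    # no stabilization window
    print(worst_case_delay_s(30))  # hypothetical 30 s stabilization window
```

The stabilization window is passed in as a parameter because it is the only term here that is set in the HPA spec rather than fixed by the metrics pipeline.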
|
|
||||||||||||||
| **Summing the per-stage delays above gives roughly 85 seconds (60 + 10 + 15) from the moment backlog appears to the HPA's first scale-up decision, plus any scale-up stabilization window configured on the HPA (`behavior.scaleUp.stabilizationWindowSeconds`, which defaults to 0).** Empirically, sample age in Prometheus for a single series follows a sawtooth between 0 and 60 seconds (matching the gateway's ~1/min emission cadence), with p50 sample age ≈ 30 s and p95 ≈ 50 s. The 60-second emission cadence is the inherent floor: smaller scrape intervals, a tighter `metricsRelistInterval`, or recording rules cannot improve on it because they all consume the same upstream cadence. | ||||||||||||||
|
Contributor
What is "your stabilization window"? That is not defined anywhere and this reads like Claude has seen the phrase "stabilization window" in other docs that it scraped from the Internet that described auto-scaling algorithms but doesn't actually understand what "stabilization window" means. Also, "Typical end-to-end reactivity" doesn't make sense here and sounds like a term Claude either made up or has hallucinated-adopted from the term "end-to-end reactivity" from frontend software development patterns.
Collaborator
Author
Sorry about any Claude-isms, the reason this is in Draft mode is because I hadn't done a full pass over it yet. I care deeply about not putting PRs in front of people that I haven't reviewed myself and don't necessarily endorse any of this until it's out of draft mode. Claude probably came up with this after seeing a bunch of grafana screenshots that I sent it. |
||||||||||||||
|
|
||||||||||||||
| ### Caveat: gateway delivery delay | ||||||||||||||
|
|
||||||||||||||
| During our investigation we observed periods of several minutes during which Temporal Cloud's OpenMetrics endpoint returned the same embedded timestamps on repeated scrapes for *every* series across the account simultaneously — backlog series, action counts, error counts, every queue, every namespace, all showing identical staleness to the second (e.g. all ~30 visible series reading 239s old at once). The Prometheus scrape continued to succeed (`up{job="temporal_cloud"}` stayed 1, HTTP 200 responses) — the response body simply repeated already-known samples instead of advancing. | ||||||||||||||
|
this reads odd given that the project is from the same people as Temporal Cloud - can this be reworded? |
||||||||||||||
|
|
||||||||||||||
| Once the delay resolved, the gateway delivered the missing samples with their original minute-aligned timestamps in a burst, so Prometheus's storage ends up with a complete 1/minute series in retrospect. We verified this directly: across a 3-hour window covering one such delay event, every gap between consecutive sample timestamps was exactly 60 seconds, no exceptions. | ||||||||||||||
|
|
||||||||||||||
| The retrospective completeness is helpful for dashboards and post-hoc analysis, but it does **not** help an HPA, which queries the *latest available* value at decision time. During a delivery delay, the latest available sample is the one from before the delay started. The HPA sees real staleness even though the underlying record will eventually be filled in. | ||||||||||||||
|
|
||||||||||||||
| We have only directly characterized this pattern during one investigation session (seeing it twice in ~2 hours of close observation). Frequency in normal operation is not yet known and is open with Temporal's Observability team. If your workload cannot tolerate occasional multi-minute scaling pauses, prefer KEDA. | ||||||||||||||
|
this sounds like sharing internal sausage making - which as a customer I am not sure what to take away from
Contributor
Right, which is why I recommended in my review that this entire section be removed :) |
||||||||||||||
|
|
||||||||||||||
| This is also why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes. | ||||||||||||||
|
Comment on lines
+42
to
+52
Contributor
Recommend removing this section. I've moved some of the content and reworded it in the |
||||||||||||||
|
|
||||||||||||||
| ### Slot utilization is a much faster leading signal | ||||||||||||||
|
|
||||||||||||||
| `temporal_slot_utilization` is emitted directly by worker pods (no Temporal Cloud aggregation), scraped at the ServiceMonitor interval (~10–30 s), and reflects current state. It also rises *before* backlog accumulates — slots saturate first, then queueing starts. So a two-metric HPA with both slot util and backlog gives you fast scale-up via slot util and a backlog-driven backstop. | ||||||||||||||
|
|
||||||||||||||
| The demo HPA uses both. For production scaling we recommend keeping both as well. | ||||||||||||||
|
Comment on lines
+54
to
+58
Contributor
Recommend removing this. I've pulled some of the content into the proposed new |
||||||||||||||
|
|
||||||||||||||
| ## When backlog metric goes silent | ||||||||||||||
|
|
||||||||||||||
| Two distinct failure modes look similar in HPA events but have different meanings: | ||||||||||||||
|
|
||||||||||||||
| ### Mode 1: adapter-level deregistration (rare) | ||||||||||||||
| - Trigger: prometheus-adapter pod restart, or *no* series matching the rule's `seriesQuery` exist in Prometheus. | ||||||||||||||
| - Symptom in HPA events: `the server could not find the metric ...`. | ||||||||||||||
| - Recovery: up to one `metricsRelistInterval` after data flows again. | ||||||||||||||
|
|
||||||||||||||
| prometheus-adapter periodically asks Prometheus "what series exist in the last `metricsRelistInterval`?" — see the [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md). If the discovery window is shorter than the longest gateway-wide stall, the discovery returns empty and the metric name disappears from the External Metrics API. The `metricsRelistInterval: 5m` setting buys margin: comfortably longer than typical sample age (~30s p50, ~50s p95) and longer than observed multi-minute gateway stalls so far. | ||||||||||||||
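The margin argument above can be checked with a small comparison. This is a sketch that assumes, following the description above, that a series stays discoverable as long as its newest sample is younger than `metricsRelistInterval`; the `series_discoverable` helper is a hypothetical name, and the 239 s value is the staleness example quoted earlier in this document.

```python
def series_discoverable(sample_age_s: float, relist_interval_s: float) -> bool:
    """True if the newest sample still falls inside the discovery window."""
    return sample_age_s < relist_interval_s


RELIST_1M_S = 60.0   # a tight relist interval, for comparison
RELIST_5M_S = 300.0  # the recommended metricsRelistInterval: 5m

if __name__ == "__main__":
    for label, age_s in [("p50", 30), ("p95", 50), ("observed stall", 239)]:
        print(
            f"{label}: age={age_s}s "
            f"1m window={series_discoverable(age_s, RELIST_1M_S)} "
            f"5m window={series_discoverable(age_s, RELIST_5M_S)}"
        )
```

Under this assumption a 1-minute window covers typical sample ages but not the observed multi-minute stall, while a 5-minute window covers both.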
|
|
||||||||||||||
| ### Mode 2: series-level silence (common in low-traffic workloads) | ||||||||||||||
| - Trigger: a task queue with no polls or new tasks for >5 minutes. Temporal unloads it from memory and stops emitting `temporal_cloud_v1_approximate_backlog_count` for that specific `(task_queue, build_id, ...)` labelset. Other queues' series continue to emit. | ||||||||||||||
| - Symptom in HPA events: `no metrics returned from external metrics API`. The metric *name* is still registered; the HPA's specific label selector just matches zero rows now. | ||||||||||||||
| - Recovery: traffic resumes → queue reloads → next emission cycle (~1 min) + 3-min aggregation lag → HPA can read value again. | ||||||||||||||
|
|
||||||||||||||
| In a two-metric HPA configured with slot utilization, this is mostly fine: the HPA reports `ScalingActive=True` based on slot utilization while backlog is unavailable, and rejoins backlog scaling once it returns. We've confirmed this empirically in this demo cluster — the HPA continued scaling correctly on slot utilization through 1000+ backlog `FailedGetExternalMetric` events. | ||||||||||||||
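The behavior described above follows from how a multi-metric HPA combines its metrics: it computes a proposed replica count per metric and takes the largest, and it skips a scale-down while any metric is unavailable. The sketch below models that in simplified form. It ignores the HPA tolerance band and min/max replica bounds, assumes slot utilization is expressed as a ratio, and uses hypothetical function names and target values.

```python
import math
from typing import Optional


def desired_from_slot_util(current_replicas: int, util: float, target_util: float) -> int:
    """Ratio-style metric: scale the current replica count by util / target."""
    return math.ceil(current_replicas * util / target_util)


def desired_from_backlog(backlog: float, target_per_replica: float) -> int:
    """AverageValue-style metric: total backlog divided by per-replica target."""
    return math.ceil(backlog / target_per_replica)


def desired_replicas(
    current_replicas: int,
    util: float,
    target_util: float,
    backlog: Optional[float],  # None models a missing backlog metric
    target_per_replica: float,
) -> int:
    proposals = [desired_from_slot_util(current_replicas, util, target_util)]
    if backlog is not None:
        proposals.append(desired_from_backlog(backlog, target_per_replica))
    proposal = max(proposals)
    # With a metric missing, scale-up is still allowed but scale-down is skipped.
    if backlog is None and proposal < current_replicas:
        return current_replicas
    return proposal


if __name__ == "__main__":
    print(desired_replicas(4, 0.9, 0.75, None, 10))  # backlog silent, slots busy
    print(desired_replicas(4, 0.9, 0.75, 120, 10))   # backlog dominates
    print(desired_replicas(4, 0.5, 0.75, None, 10))  # backlog silent, slots idle
```

In this model a silent backlog metric never blocks a scale-up driven by slot utilization; it only prevents scale-down until the backlog value returns.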
|
|
||||||||||||||
| ## Why this demo does not use a backlog recording rule | ||||||||||||||
|
which demo? "this" is an unclear reference
Contributor
I recommended removing this entire section |
||||||||||||||
|
|
||||||||||||||
| A prior version of this demo wrapped the raw Temporal Cloud series in a Prometheus recording rule: | ||||||||||||||
|
|
||||||||||||||
| ```yaml | ||||||||||||||
| - record: temporal_approximate_backlog_count | ||||||||||||||
| expr: sum by (...) (temporal_cloud_v1_approximate_backlog_count) | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| The rule was originally added to work around a label-formatting issue in an older Temporal Cloud release. With native per-version labels (`temporal_worker_deployment_name`, `temporal_worker_build_id`) now opt-in, the rule no longer earns its keep: | ||||||||||||||
|
|
||||||||||||||
| - **It doesn't improve reactivity.** The HPA reactivity floor is the upstream OpenMetrics emission cadence (~60s), not anything the rule could fix. | ||||||||||||||
| - **It duplicates the cardinality bill.** Per-`(task_queue, build_id)` labels are already opt-in at the OpenMetrics level *because* of cardinality. Adding a recording rule on top means storing the same high-cardinality series twice. | ||||||||||||||
| - **It hides a `sum(...)` that the adapter already does.** prometheus-adapter's `metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})` performs the same collapsing at query time. Pedagogically, "the adapter does the sum" is cleaner than "a recording rule sums first, then the adapter sums again." | ||||||||||||||
| - **It does not solve series-level silence.** When the source goes silent (task queue unloaded), the rule output also goes silent eventually (once Prometheus's staleness lookback expires). | ||||||||||||||
|
Comment on lines
+78
to
+92
Contributor
Remove this. It's not necessary and likely will just confuse the reader. |
||||||||||||||
|
|
||||||||||||||
| What the recording rule *does* buy is registration stability after operational events: when the source series is sparse-by-timestamp, the rule produces a dense 10-second sample stream that lets the adapter discover with a tight `metricsRelistInterval`. If you find yourself fighting registration flicker on every adapter restart and would rather pay the cardinality cost than tune `metricsRelistInterval`, a recording rule is a reasonable choice. Otherwise, prefer the raw metric. | ||||||||||||||
|
|
||||||||||||||
| In this demo we set `metricsRelistInterval: 5m` and consume the raw metric directly. | ||||||||||||||
| ## HPA strengths | ||||||||||||||
|
|
||||||||||||||
|
carlydf marked this conversation as resolved.
|
||||||||||||||
| Because the HPA path relies on a single OpenMetrics scrape that gathers all series for the namespace in one HTTP request, it scales independently of namespace count. That single request is more efficient than KEDA's Temporal API-based approach and does not run into Temporal API rate limits (see the section below on [KEDA limitations](#keda-limitations)). | ||||||||||||||
|
|
||||||||||||||
| HPA + prometheus adapter configured to look at both slot util and backlog provides fast scale-up via slot util and a backlog-driven backstop to prevent overly reactive replica count adjustment. | ||||||||||||||
| ## HPA limitations | ||||||||||||||
|
|
||||||||||||||
| This section describes two known limitations for HPA + prometheus adapter. | ||||||||||||||
|
|
||||||||||||||
| Temporal Cloud's OpenMetrics endpoint may sometimes return the same embedded timestamps on repeated scrapes for every series across the account simultaneously (backlog series, action counts, error counts, every queue, every namespace). This delay in returning fresh metrics data can slow how quickly HPA + prometheus-adapter scales the replica count for a worker deployment version out or in. HPA + prometheus-adapter may therefore not be a good fit if your workload cannot tolerate occasional multi-minute scaling pauses. | ||||||||||||||
|
|
||||||||||||||
| > **Note**: This is why `metricsRelistInterval: 5m` is the recommended setting: the discovery window must comfortably exceed the longest expected delay so the metric does not deregister, otherwise re-registration waits up to one more relist cycle after delivery resumes. | ||||||||||||||
|
|
||||||||||||||
| HPA cannot scale your Worker Deployment from zero because the signal for scaling does not yet exist. The signal for scaling is the backlog metric for the task queue associated with the workers in the Worker Deployment. This metric will not exist until there is at least one worker polling the task queue. | ||||||||||||||
|
|
||||||||||||||
| 1. Zero workers means no polls. | ||||||||||||||
| 2. No polls for >5 minutes means the task queue is unloaded from the Temporal server's memory. | ||||||||||||||
| 3. An unloaded queue emits no metric. | ||||||||||||||
| 4. Adapter discovery returns no series, or HPA queries return no rows. | ||||||||||||||
| 5. HPA cannot scale up because there's no signal to scale on. | ||||||||||||||
|
Comment on lines
+112
to
+116
Contributor
Suggested change
Isn't the same problem also seen with OSS?
Contributor
not sure. @carlydf do you know the answer to this?
Collaborator
Author
yes, same problem for OSS and Cloud here, Claude hallucinating. (PR in draft mode because I had not fully reviewed yet, didn't intend to waste your energy on review until it was ready!) |
||||||||||||||
|
|
||||||||||||||
| Submitting a workflow does load the task queue back into memory, but the metric still won't reach the HPA until the next OpenMetrics emission cycle (~1 minute). By the time the HPA reacts, you've already had ~1+ minute of unprovisioned work. | ||||||||||||||
|
|
||||||||||||||
| ## KEDA strengths | ||||||||||||||
|
|
||||||||||||||
| KEDA's Temporal scaler calls `DescribeTaskQueue(stats=true)` (or `DescribeWorkerDeploymentVersion`), which loads the queue synchronously and returns the backlog directly. This allows KEDA to scale Temporal workers from zero. | ||||||||||||||
|
|
||||||||||||||
| ## KEDA limitations | ||||||||||||||
|
|
||||||||||||||
| KEDA bypasses the metric pipeline but uses Temporal API calls, which are subject to a per-namespace rate limit: | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| ``` | ||||||||||||||
| FrontendGlobalWorkerDeploymentReadRPS = 50 # per namespace, evenly distributed across frontend instances | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| For a namespace with N task queues × M worker-deployment-versions = K HPAs, each KEDA poll uses ~1 API call. The polling budget: | ||||||||||||||
|
|
||||||||||||||
| | HPA count | Poll every 30s | Poll every 10s | Poll every 5s | | ||||||||||||||
| |-----------|----------------|----------------|---------------| | ||||||||||||||
| | 50 | 1.7 RPS (3%) | 5 RPS (10%) | 10 RPS (20%) | | ||||||||||||||
| | 250 | 8 RPS (17%) | 25 RPS (50%) | 50 RPS (100%) | | ||||||||||||||
| | 1500 | 50 RPS (100%) | exceeds limit | exceeds limit | | ||||||||||||||
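The table above follows from dividing the number of scalers by the poll interval and comparing the result with the 50 RPS limit. The sketch below reproduces that arithmetic, assuming one API call per poll as stated above; the helper names are hypothetical.

```python
RATE_LIMIT_RPS = 50.0  # FrontendGlobalWorkerDeploymentReadRPS, per namespace


def polling_rps(scaler_count: int, poll_interval_s: float) -> float:
    """Steady-state API calls per second if each scaler makes one call per poll."""
    return scaler_count / poll_interval_s


def budget_share(scaler_count: int, poll_interval_s: float) -> float:
    """Fraction of the per-namespace rate limit consumed by polling."""
    return polling_rps(scaler_count, poll_interval_s) / RATE_LIMIT_RPS


def max_scalers(poll_interval_s: float) -> int:
    """Largest scaler count that stays within the limit at this poll interval."""
    return int(RATE_LIMIT_RPS * poll_interval_s)


if __name__ == "__main__":
    for count in (50, 250, 1500):
        for interval_s in (30, 10, 5):
            share = budget_share(count, interval_s)
            status = "exceeds limit" if share > 1 else f"{share:.0%}"
            print(f"{count} scalers @ {interval_s}s: "
                  f"{polling_rps(count, interval_s):.1f} RPS ({status})")
```

`max_scalers` gives the break-even point for a given poll interval, which is a quick way to see whether a planned deployment size fits inside the limit before adopting KEDA.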
|
|
||||||||||||||
|
|
||||||||||||||
| If you are using KEDA with Temporal Cloud and hitting the API rate limit described above, you will need to contact your Temporal Cloud account team to discuss increasing the rate limits. | ||||||||||||||
|
|
||||||||||||||
| ## Recommended configuration for the HPA + prometheus-adapter path | ||||||||||||||
|
|
||||||||||||||
| This demo's configuration represents the recommendation, in compact form: | ||||||||||||||
|
|
||||||||||||||
| **Scrape config** (`internal/demo/k8s/prometheus-stack-values.yaml`): | ||||||||||||||
| ```yaml | ||||||||||||||
| - job_name: temporal_cloud | ||||||||||||||
| scrape_interval: 10s | ||||||||||||||
| honor_timestamps: true | ||||||||||||||
| metrics_path: /v1/metrics | ||||||||||||||
| params: | ||||||||||||||
| labels: | ||||||||||||||
| - temporal_worker_deployment_name | ||||||||||||||
| - temporal_worker_build_id | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| **prometheus-adapter rule** (`internal/demo/k8s/prometheus-adapter-values.yaml`): | ||||||||||||||
| ```yaml | ||||||||||||||
| metricsRelistInterval: 5m # must accommodate Cloud's ~3-min embedded-timestamp lag | ||||||||||||||
| rules: | ||||||||||||||
| external: | ||||||||||||||
| - seriesQuery: 'temporal_cloud_v1_approximate_backlog_count{temporal_worker_build_id!="__unversioned__"}' | ||||||||||||||
| metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})' | ||||||||||||||
| name: | ||||||||||||||
| as: "temporal_cloud_v1_approximate_backlog_count" | ||||||||||||||
| resources: | ||||||||||||||
| namespaced: false | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| The `seriesQuery` filter excludes `__unversioned__` series. Without it, accounts with many unversioned namespaces produce 5000+ series in the discovery response, which slows or breaks adapter discovery. The filter scopes discovery to versioned workloads — exactly the ones HPAs need. | ||||||||||||||
|
|
||||||||||||||
| **HPA template** (`examples/wrt-hpa-backlog.yaml`): two metrics — slot utilization (fast leading signal, scale-up gate) and backlog count (confirming signal, AverageValue target). | ||||||||||||||
|
|
||||||||||||||
| ## References | ||||||||||||||
|
|
||||||||||||||
| - [Temporal Cloud OpenMetrics](https://docs.temporal.io/cloud/metrics/openmetrics) — endpoint and opt-in labels | ||||||||||||||
| - [prometheus-adapter README](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/README.md) — `metrics-relist-interval` and discovery window semantics | ||||||||||||||
| - [prometheus-adapter externalmetrics.md](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/externalmetrics.md) — external rules, `namespaced: false` for cluster-scoped metrics | ||||||||||||||
| - [Prometheus HTTP API: `/api/v1/series`](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) — series discovery semantics | ||||||||||||||
| - [Prometheus scrape config: `honor_timestamps`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) — preserving source timestamps | ||||||||||||||
| - [KEDA Temporal scaler](https://keda.sh/docs/latest/scalers/temporal/) — direct API polling alternative | ||||||||||||||
Recommend adding this new doc to the list here.