From faa6b3d371d4a3f299e7ca2c53c1ad73faf8e2da Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Wed, 13 May 2026 20:39:54 -0700 Subject: [PATCH 1/8] Start OOM doc draft --- .../pages/troubleshoot/memory-management.adoc | 53 +++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 modules/sql/pages/troubleshoot/memory-management.adoc diff --git a/modules/sql/pages/troubleshoot/memory-management.adoc b/modules/sql/pages/troubleshoot/memory-management.adoc new file mode 100644 index 000000000..7cf7d6791 --- /dev/null +++ b/modules/sql/pages/troubleshoot/memory-management.adoc @@ -0,0 +1,53 @@ += Troubleshoot memory-related query cancellations +:description: Recover from query cancellations triggered by Redpanda SQL's automatic out-of-memory protection. +:page-topic-type: how-to + +// TODO: SME — confirm page title and nav label. Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad. +// Options: +// "Troubleshoot memory-related query cancellations" (current; matches Troubleshoot section voice) +// "Recover from OOM cancellation" (concise; uses internal term) +// Keep "Memory management" (matches current nav label but doesn't signal action) + +Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its configured limit. If your application sees the following error, your queries have hit this protection: + +[source,text] +---- +cancelled due to OOM prevention +---- + +// TODO: SME — confirm the exact client-facing error envelope. The string above is the error reason raised internally by the engine. Clients connecting through `psql` or a PostgreSQL driver typically receive it wrapped in a PostgreSQL error message. Confirm: +// - Is a SQLSTATE code set on this error? If so, which one? +// - Does the message reach the client verbatim, or is the wording different? + +Only queries running on the affected node at the time of reclamation are cancelled. 
Other nodes in the cluster continue to serve queries. The node resumes accepting new queries immediately after reclamation completes, so in most cases you can retry the failed query and it succeeds. + +== If the error keeps happening + +If queries are repeatedly cancelled with this error, the workload is consistently pressing a node against its memory limit. + +// TODO: SME — runbook depth. Confirm which of the following actions to recommend, and in what order. Suggested guidance to validate: +// - Reduce query concurrency on the affected workload. +// - Simplify the query — narrow the scan range, add filters, reduce parallel CTEs. +// - Scale up the cluster. +// Also confirm: is there a heuristic for choosing among them (for example, look at oxla_process_memory_total over time)? + +== Why this happens + +Redpanda SQL monitors each node's resident memory usage and triggers a brief reclamation phase when the node approaches its memory limit. During reclamation, the node cancels its running queries and frees memory so it can keep serving new queries. The protection runs on each node independently and is always on. There is no configuration option to enable, disable, or tune it. + +// TODO: SME — confirm whether `memory.max` and `memory.max_non_query` are exposed through the BYOC layer at GA. Per OXLA-9109, the configurable threshold was descoped before ship. If neither is exposed to users (even via support), this section stands as-is. If either is reachable (for example via a support-only path), note it here so users understand what controls exist. + +== Monitor memory usage + +Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: + +[cols="1,3"] +|=== +| Metric | Description + +| `oxla_process_memory_total` +| Process Resident Set Size (RSS) in bytes, reported per node. +|=== + +// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented. 
+ From e314dc71a0dba4caf4352c6b0d6825756c408d2b Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:22:42 -0700 Subject: [PATCH 2/8] Review pass --- modules/ROOT/nav.adoc | 2 +- .../pages/troubleshoot/memory-management.adoc | 16 ++++++++++------ 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index 416cb430a..a1c5d36e3 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -361,7 +361,7 @@ *** xref:sql:manage/manage-access.adoc[Manage access] ** xref:sql:troubleshoot/index.adoc[Troubleshoot] *** xref:sql:troubleshoot/degraded-state-handling.adoc[] -*** xref:sql:troubleshoot/memory-management.adoc[Memory Management] +*** xref:sql:troubleshoot/memory-management.adoc[OOM cancellations] * xref:develop:index.adoc[Develop] ** xref:develop:kafka-clients.adoc[] diff --git a/modules/sql/pages/troubleshoot/memory-management.adoc b/modules/sql/pages/troubleshoot/memory-management.adoc index 7cf7d6791..9694abc2b 100644 --- a/modules/sql/pages/troubleshoot/memory-management.adoc +++ b/modules/sql/pages/troubleshoot/memory-management.adoc @@ -1,6 +1,7 @@ = Troubleshoot memory-related query cancellations -:description: Recover from query cancellations triggered by Redpanda SQL's automatic out-of-memory protection. +:description: Recover from query cancellations triggered by Redpanda SQL's automatic OOM prevention. :page-topic-type: how-to +:personas: platform_admin, data_engineer // TODO: SME — confirm page title and nav label. Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad. // Options: @@ -8,7 +9,7 @@ // "Recover from OOM cancellation" (concise; uses internal term) // Keep "Memory management" (matches current nav label but doesn't signal action) -Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its configured limit. 
If your application sees the following error, your queries have hit this protection: +Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its memory limit. If your application sees the following error, your queries have hit this protection: [source,text] ---- @@ -23,7 +24,7 @@ Only queries running on the affected node at the time of reclamation are cancell == If the error keeps happening -If queries are repeatedly cancelled with this error, the workload is consistently pressing a node against its memory limit. +If queries are repeatedly cancelled with this error, the workload is consistently approaching the node's memory limit. // TODO: SME — runbook depth. Confirm which of the following actions to recommend, and in what order. Suggested guidance to validate: // - Reduce query concurrency on the affected workload. @@ -35,8 +36,6 @@ If queries are repeatedly cancelled with this error, the workload is consistentl Redpanda SQL monitors each node's resident memory usage and triggers a brief reclamation phase when the node approaches its memory limit. During reclamation, the node cancels its running queries and frees memory so it can keep serving new queries. The protection runs on each node independently and is always on. There is no configuration option to enable, disable, or tune it. -// TODO: SME — confirm whether `memory.max` and `memory.max_non_query` are exposed through the BYOC layer at GA. Per OXLA-9109, the configurable threshold was descoped before ship. If neither is exposed to users (even via support), this section stands as-is. If either is reachable (for example via a support-only path), note it here so users understand what controls exist. 
- == Monitor memory usage Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: @@ -49,5 +48,10 @@ Use the following Prometheus gauge to track each node's resident memory and watc | Process Resident Set Size (RSS) in bytes, reported per node. |=== -// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented. +// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented and add a cross-link from "Suggested reading" below to that page. + +== Suggested reading + +* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. +* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. From f3c7ea25a239046208b0d55e650e6957db2ee0f4 Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:26:45 -0700 Subject: [PATCH 3/8] Capitalization --- modules/ROOT/nav.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index a1c5d36e3..0762ff198 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -361,7 +361,7 @@ *** xref:sql:manage/manage-access.adoc[Manage access] ** xref:sql:troubleshoot/index.adoc[Troubleshoot] *** xref:sql:troubleshoot/degraded-state-handling.adoc[] -*** xref:sql:troubleshoot/memory-management.adoc[OOM cancellations] +*** xref:sql:troubleshoot/memory-management.adoc[OOM Cancellations] * xref:develop:index.adoc[Develop] ** xref:develop:kafka-clients.adoc[] From 63d6c8f5a0333b02fb4877c10aedffd63f65090e Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:28:21 -0700 Subject: [PATCH 4/8] Rename file --- modules/ROOT/nav.adoc | 2 +- .../{memory-management.adoc => oom-cancellations.adoc} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename modules/sql/pages/troubleshoot/{memory-management.adoc => 
oom-cancellations.adoc} (100%) diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index 0762ff198..e77a08d36 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -361,7 +361,7 @@ *** xref:sql:manage/manage-access.adoc[Manage access] ** xref:sql:troubleshoot/index.adoc[Troubleshoot] *** xref:sql:troubleshoot/degraded-state-handling.adoc[] -*** xref:sql:troubleshoot/memory-management.adoc[OOM Cancellations] +*** xref:sql:troubleshoot/oom-cancellations.adoc[OOM Cancellations] * xref:develop:index.adoc[Develop] ** xref:develop:kafka-clients.adoc[] diff --git a/modules/sql/pages/troubleshoot/memory-management.adoc b/modules/sql/pages/troubleshoot/oom-cancellations.adoc similarity index 100% rename from modules/sql/pages/troubleshoot/memory-management.adoc rename to modules/sql/pages/troubleshoot/oom-cancellations.adoc From c063882dce63bae1bfec7afe5d1be062163b04da Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:54:08 -0700 Subject: [PATCH 5/8] Update page attributes --- .../sql/pages/troubleshoot/oom-cancellations.adoc | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/modules/sql/pages/troubleshoot/oom-cancellations.adoc b/modules/sql/pages/troubleshoot/oom-cancellations.adoc index 9694abc2b..7298c3078 100644 --- a/modules/sql/pages/troubleshoot/oom-cancellations.adoc +++ b/modules/sql/pages/troubleshoot/oom-cancellations.adoc @@ -1,7 +1,10 @@ -= Troubleshoot memory-related query cancellations += Troubleshoot Memory-related Query Cancellations :description: Recover from query cancellations triggered by Redpanda SQL's automatic OOM prevention. 
-:page-topic-type: how-to +:page-topic-type: troubleshooting :personas: platform_admin, data_engineer +:learning-objective-1: Identify when query cancellations are caused by OOM prevention +:learning-objective-2: Recover from OOM-cancelled queries +:learning-objective-3: Monitor node memory usage to anticipate cancellations // TODO: SME — confirm page title and nav label. Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad. // Options: @@ -22,6 +25,12 @@ cancelled due to OOM prevention Only queries running on the affected node at the time of reclamation are cancelled. Other nodes in the cluster continue to serve queries. The node resumes accepting new queries immediately after reclamation completes, so in most cases you can retry the failed query and it succeeds. +Use this page to: + +* [ ] {learning-objective-1} +* [ ] {learning-objective-2} +* [ ] {learning-objective-3} + == If the error keeps happening If queries are repeatedly cancelled with this error, the workload is consistently approaching the node's memory limit. 
From acb83379419c84bb070280f00b714bedecc68223 Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Wed, 20 May 2026 22:05:34 -0700 Subject: [PATCH 6/8] Rename and structure doc per SME feedback --- modules/ROOT/nav.adoc | 2 +- .../pages/troubleshoot/oom-cancellations.adoc | 66 -------------- .../troubleshoot/query-out-of-memory.adoc | 86 +++++++++++++++++++ 3 files changed, 87 insertions(+), 67 deletions(-) delete mode 100644 modules/sql/pages/troubleshoot/oom-cancellations.adoc create mode 100644 modules/sql/pages/troubleshoot/query-out-of-memory.adoc diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index e77a08d36..9d539820e 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -361,7 +361,7 @@ *** xref:sql:manage/manage-access.adoc[Manage access] ** xref:sql:troubleshoot/index.adoc[Troubleshoot] *** xref:sql:troubleshoot/degraded-state-handling.adoc[] -*** xref:sql:troubleshoot/oom-cancellations.adoc[OOM Cancellations] +*** xref:sql:troubleshoot/query-out-of-memory.adoc[Query Out-of-Memory Errors] * xref:develop:index.adoc[Develop] ** xref:develop:kafka-clients.adoc[] diff --git a/modules/sql/pages/troubleshoot/oom-cancellations.adoc b/modules/sql/pages/troubleshoot/oom-cancellations.adoc deleted file mode 100644 index 7298c3078..000000000 --- a/modules/sql/pages/troubleshoot/oom-cancellations.adoc +++ /dev/null @@ -1,66 +0,0 @@ -= Troubleshoot Memory-related Query Cancellations -:description: Recover from query cancellations triggered by Redpanda SQL's automatic OOM prevention. -:page-topic-type: troubleshooting -:personas: platform_admin, data_engineer -:learning-objective-1: Identify when query cancellations are caused by OOM prevention -:learning-objective-2: Recover from OOM-cancelled queries -:learning-objective-3: Monitor node memory usage to anticipate cancellations - -// TODO: SME — confirm page title and nav label. Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad. 
-// Options: -// "Troubleshoot memory-related query cancellations" (current; matches Troubleshoot section voice) -// "Recover from OOM cancellation" (concise; uses internal term) -// Keep "Memory management" (matches current nav label but doesn't signal action) - -Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its memory limit. If your application sees the following error, your queries have hit this protection: - -[source,text] ----- -cancelled due to OOM prevention ----- - -// TODO: SME — confirm the exact client-facing error envelope. The string above is the error reason raised internally by the engine. Clients connecting through `psql` or a PostgreSQL driver typically receive it wrapped in a PostgreSQL error message. Confirm: -// - Is a SQLSTATE code set on this error? If so, which one? -// - Does the message reach the client verbatim, or is the wording different? - -Only queries running on the affected node at the time of reclamation are cancelled. Other nodes in the cluster continue to serve queries. The node resumes accepting new queries immediately after reclamation completes, so in most cases you can retry the failed query and it succeeds. - -Use this page to: - -* [ ] {learning-objective-1} -* [ ] {learning-objective-2} -* [ ] {learning-objective-3} - -== If the error keeps happening - -If queries are repeatedly cancelled with this error, the workload is consistently approaching the node's memory limit. - -// TODO: SME — runbook depth. Confirm which of the following actions to recommend, and in what order. Suggested guidance to validate: -// - Reduce query concurrency on the affected workload. -// - Simplify the query — narrow the scan range, add filters, reduce parallel CTEs. -// - Scale up the cluster. -// Also confirm: is there a heuristic for choosing among them (for example, look at oxla_process_memory_total over time)? 
- -== Why this happens - -Redpanda SQL monitors each node's resident memory usage and triggers a brief reclamation phase when the node approaches its memory limit. During reclamation, the node cancels its running queries and frees memory so it can keep serving new queries. The protection runs on each node independently and is always on. There is no configuration option to enable, disable, or tune it. - -== Monitor memory usage - -Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: - -[cols="1,3"] -|=== -| Metric | Description - -| `oxla_process_memory_total` -| Process Resident Set Size (RSS) in bytes, reported per node. -|=== - -// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented and add a cross-link from "Suggested reading" below to that page. - -== Suggested reading - -* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. -* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. - diff --git a/modules/sql/pages/troubleshoot/query-out-of-memory.adoc b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc new file mode 100644 index 000000000..7e8bb5966 --- /dev/null +++ b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc @@ -0,0 +1,86 @@ += Troubleshoot Query Out-of-Memory Errors +:description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution. 
+:page-topic-type: troubleshooting +:personas: platform_admin, data_engineer +:learning-objective-1: Identify when a query was cancelled because it ran out of memory +:learning-objective-2: Recover from a `Query out of memory` error and reduce its frequency +:learning-objective-3: Monitor node memory usage to anticipate memory pressure + +If a Redpanda SQL query exhausts the memory available to it, the engine cancels the query and returns an error to the client: + +// TODO: SME — the OOM error currently surfaces with SQLSTATE XX000 +// (InternalError, the fall-through default in session.cpp). That's a generic +// "internal error" class, not a memory-class code like 53200 (out_of_memory) +// or 53400 (configuration_limit_exceeded). Confirm whether this is intentional +// or a bug to be fixed — if the SQLSTATE changes, add the +// verbose-mode example. +[source,text] +---- +ERROR: Query Out of Memory! +---- + +Canceling the query frees its memory and allows the engine to continue serving other queries. This is a normal protection mechanism and is not a sign of cluster failure. + +Use this page to: + +* [ ] {learning-objective-1} +* [ ] {learning-objective-2} +* [ ] {learning-objective-3} + +== How Redpanda SQL uses memory + +// TODO: SME — rewrite this section per the PR 584 review thread. +// Placeholder below is suggested draft from the review. +// +// Goal of the section: explain enough about RP SQL's memory model that a +// user reading this troubleshooting page understands *why* a Query out of +// Memory error can happen even on large clusters, and what shapes their +// query / workload to make it more or less likely. + +Redpanda SQL queries can read very large input sources (many terabytes). However, the result set and any intermediate results produced by operations such as joins and aggregations must fit into the aggregate memory available across all nodes in the cluster. 
All concurrently running queries contribute to total memory consumption, so a single query can hit the node memory limit because of pressure from other queries running at the same time. + +== Recover from the error + +When a single query fails with `Query Out of Memory`, retry it. Canceling the query frees its memory, so the next attempt often succeeds, especially if other concurrent queries have completed in the meantime. + +If the same query keeps failing, the query itself is too memory-hungry for the current cluster size, or too many other queries are competing for memory at the same time. Reduce the query's memory footprint or reduce concurrent load: + +* Reduce concurrency. ++ +Run fewer queries in parallel against the cluster. Other queries running at the same time contribute to the total memory pressure. +* Simplify the query. ++ +Narrow the scan range with tighter `WHERE` filters, reduce the number of `JOIN`s, or break a large aggregation into smaller ones. Operations that materialize wide intermediate results (joins, sorts, distinct aggregations) drive memory consumption the most. +* Scale the cluster. ++ +Add SQL nodes to increase the aggregate memory available to queries. See xref:sql:get-started/deploy-sql-cluster.adoc#scale-redpanda-sql[Scale Redpanda SQL]. + +// TODO: SME — confirm the recovery order above and whether a heuristic +// exists for choosing among them (for example, watching +// `oxla_process_memory_total` over time before deciding to scale). + +== Monitor memory usage + +Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: + +[cols="1,3"] +|=== +| Metric | Description + +| `oxla_process_memory_total` +| Process Resident Set Size (RSS) in bytes, reported per node. +|=== + +// TODO: Once the Redpanda SQL metrics catalog is finalized, replace this +// inline table with a cross-link to the metrics page.
+ +== If you see `cancelled due to OOM prevention` instead + +The `cancelled due to OOM prevention` error is a separate case. Redpanda SQL's engine includes an overseer that monitors overall node memory independently of per-query accounting. When the overseer detects that the untracked memory pool has grown unexpectedly, it cancels running queries on the affected node to keep the engine operational. + +This condition is rare and almost always indicates a bug in memory accounting or an unexpected workload pattern. Collect the cluster logs from around the time of the error and contact https://support.redpanda.com/hc/en-us/requests/new[Redpanda Support^]. + +== Suggested reading + +* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. +* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. From cf137d3c2cab8d1bdafbf961ac176a8bf4c86112 Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Fri, 22 May 2026 13:57:20 -0700 Subject: [PATCH 7/8] Apply suggestions from review --- .../troubleshoot/query-out-of-memory.adoc | 41 ++++++------------- 1 file changed, 12 insertions(+), 29 deletions(-) diff --git a/modules/sql/pages/troubleshoot/query-out-of-memory.adoc b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc index 7e8bb5966..32b0bb801 100644 --- a/modules/sql/pages/troubleshoot/query-out-of-memory.adoc +++ b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc @@ -2,19 +2,13 @@ :description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution. 
:page-topic-type: troubleshooting :personas: platform_admin, data_engineer -:learning-objective-1: Identify when a query was cancelled because it ran out of memory -:learning-objective-2: Recover from a `Query out of memory` error and reduce its frequency +:learning-objective-1: Identify when a query was canceled because it ran out of memory +:learning-objective-2: Recover from a query out-of-memory error and reduce its frequency :learning-objective-3: Monitor node memory usage to anticipate memory pressure If a Redpanda SQL query exhausts the memory available to it, the engine cancels the query and returns an error to the client: -// TODO: SME — the OOM error currently surfaces with SQLSTATE XX000 -// (InternalError, the fall-through default in session.cpp). That's a generic -// "internal error" class, not a memory-class code like 53200 (out_of_memory) -// or 53400 (configuration_limit_exceeded). Confirm whether this is intentional -// or a bug to be fixed — if the SQLSTATE changes, add the -// verbose-mode example. -[source,text] +[source,bash] ---- ERROR: Query Out of Memory! ---- @@ -29,15 +23,11 @@ Use this page to: == How Redpanda SQL uses memory -// TODO: SME — rewrite this section per the PR 584 review thread. -// Placeholder below is suggested draft from the review. -// -// Goal of the section: explain enough about RP SQL's memory model that a -// user reading this troubleshooting page understands *why* a Query out of -// Memory error can happen even on large clusters, and what shapes their -// query / workload to make it more or less likely. +Redpanda SQL queries can read very large input sources (many terabytes), and simple operations such as filtering or projection process input incrementally with a small, roughly constant memory footprint. In those cases the engine can process far more data than fits in RAM. -Redpanda SQL queries can read very large input sources (many terabytes). 
However, the result set and any intermediate results produced by operations such as joins and aggregations must fit into the aggregate memory available across all nodes in the cluster. All concurrently running queries contribute to total memory consumption, so a single query can hit the node memory limit because of pressure from other queries running at the same time. +Memory pressure comes from operations that materialize intermediate state: hash tables for `JOIN` and `GROUP BY`, heaps for `ORDER BY` and top-K, and network buffers between nodes. All of this intermediate state, along with the final result set, must fit into the aggregate memory available across the cluster. The engine does not spill intermediate state to disk, so a query that builds intermediate structures larger than available memory is canceled rather than slowed. + +All concurrently running queries contribute to total memory consumption, so a single query can hit the node memory limit because of pressure from other queries running at the same time. == Recover from the error @@ -50,31 +40,24 @@ If the same query keeps failing, the query itself is too memory-hungry for the c Run fewer queries in parallel against the cluster. Other queries running at the same time contribute to the total memory pressure. * Simplify the query. + -Narrow the scan range with tighter `WHERE` filters, reduce the number of `JOIN`s, or break a large aggregation into smaller ones. Operations that materialize wide intermediate results (joins, sorts, distinct aggregations) drive memory consumption the most. +Narrow the scan range with tighter `WHERE` filters, reduce the number of joins, or break a large aggregation into smaller ones. Operations that materialize wide intermediate results (joins, sorts, distinct aggregations) drive memory consumption the most. * Scale the cluster. + Add SQL nodes to increase the aggregate memory available to queries. 
See xref:sql:get-started/deploy-sql-cluster.adoc#scale-redpanda-sql[Scale Redpanda SQL]. -// TODO: SME — confirm the recovery order above and whether a heuristic -// exists for choosing among them (for example, watching -// `oxla_process_memory_total` over time before deciding to scale). - == Monitor memory usage -Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: +Use the following Prometheus gauge to track memory consumed by the SQL workload and watch for sustained growth toward the node's limit: [cols="1,3"] |=== | Metric | Description -| `oxla_process_memory_total` -| Process Resident Set Size (RSS) in bytes, reported per node. +| `query_memory_consumption_total` +| Memory consumed by queries on the node, in bytes. Use this metric to monitor workload memory usage. Unlike `oxla_process_memory_total`, which includes cached files and can sit near the limit even when the node is idle, `query_memory_consumption_total` reflects only memory attributed to query execution. |=== -// TODO: Once the Redpanda SQL metrics catalog is finalized, replace this -// inline table with a cross-link to the metrics page. - -== If you see `cancelled due to OOM prevention` instead +== OOM prevention cancellations The `cancelled due to OOM prevention` error is a separate case. Redpanda SQL's engine includes an overseer that monitors overall node memory independently of per-query accounting. When the overseer detects that the untracked memory pool has grown unexpectedly, it cancels running queries on the affected node to keep the engine operational. 
From f23c4518f0a7a6a60f03fd6e8bfd21572387f113 Mon Sep 17 00:00:00 2001 From: Kat Batuigas <36839689+kbatuigas@users.noreply.github.com> Date: Fri, 22 May 2026 17:05:30 -0700 Subject: [PATCH 8/8] Apply suggestions from code review Co-authored-by: Michele Cyran --- modules/sql/pages/troubleshoot/query-out-of-memory.adoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/modules/sql/pages/troubleshoot/query-out-of-memory.adoc b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc index 32b0bb801..6fbb81033 100644 --- a/modules/sql/pages/troubleshoot/query-out-of-memory.adoc +++ b/modules/sql/pages/troubleshoot/query-out-of-memory.adoc @@ -1,5 +1,5 @@ = Troubleshoot Query Out-of-Memory Errors -:description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution. +:description: Recover from query out-of-memory errors in Redpanda SQL and understand the memory limits that govern query execution. :page-topic-type: troubleshooting :personas: platform_admin, data_engineer :learning-objective-1: Identify when a query was canceled because it ran out of memory @@ -8,7 +8,7 @@ If a Redpanda SQL query exhausts the memory available to it, the engine cancels the query and returns an error to the client: -[source,bash] +[source,sql] ---- ERROR: Query Out of Memory! ---- @@ -65,5 +65,5 @@ This condition is rare and almost always indicates a bug in memory accounting or == Suggested reading -* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. -* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. +* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS] +* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]
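The "Recover from the error" guidance in this series (retry the query, since canceling it frees its memory) could be sketched on the client side roughly as follows. This is a minimal illustration, not part of the patch: `run_with_oom_retry`, `is_query_oom`, and the backoff values are hypothetical names and defaults, and the sketch assumes the client driver raises an exception whose message contains the `Query Out of Memory` text shown on the page. The series does not confirm a SQLSTATE for this error, so the check matches on the message only.

```python
import time

# Message text taken from the error shown on the page; matching on it is an
# assumption about how the client driver surfaces the error.
OOM_MARKER = "query out of memory"


def is_query_oom(error):
    """Return True if the exception message looks like a query OOM error."""
    return OOM_MARKER in str(error).lower()


def run_with_oom_retry(run_query, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff on query OOM errors.

    run_query is any zero-argument callable that executes the query and
    returns its result. Errors that are not query OOM errors are re-raised
    immediately; the last OOM error is re-raised once attempts are exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as error:
            if not is_query_oom(error) or attempt == max_attempts:
                raise
            # Back off so that concurrent queries have time to finish and
            # release memory before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Passing the query as a callable keeps the sketch independent of any particular PostgreSQL driver. If the same query still fails after the final attempt, the page's other options (reduce concurrency, simplify the query, scale the cluster) apply rather than further retries.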