SQL: OOM by kbatuigas · Pull Request #584 · redpanda-data/cloud-docs

kbatuigas · 2026-05-14T03:43:40Z

Description

This pull request adds a new troubleshooting guide focused on handling memory-related query cancellations in Redpanda SQL. The page explains the automatic out-of-memory (OOM) protection mechanism, describes the client-facing error, and gives actionable steps for users to recover from or prevent repeated cancellations. It also provides guidance on monitoring memory usage and includes several TODOs for subject matter expert (SME) validation.

Key additions:

Troubleshooting documentation:

Added a new page, memory-management.adoc, that explains how Redpanda SQL cancels queries when a node approaches its memory limit and how users can recover from or prevent these cancellations.
Provided actionable recommendations for users experiencing repeated OOM cancellations, including reducing query concurrency, simplifying queries, or scaling up the cluster.
Documented how to monitor node memory usage using the oxla_process_memory_total Prometheus metric.

Guidance for further validation:

Included several SME-directed TODOs to confirm error message details, recommended runbook steps, and configuration options for memory limits.

Resolves https://github.com/redpanda-data/documentation-private/issues/
Review deadline: 21 May

Page previews

Redpanda SQL > Troubleshoot > nav: OOM Cancellations / page title: Troubleshoot Memory-related Query Cancellations

Checks

New feature
Content gap
Support Follow-up
Small fix (typos, links, copyedits, etc)

netlify · 2026-05-14T03:43:44Z

✅ Deploy Preview for rp-cloud ready!

Name	Link
🔨 Latest commit	`8ab585b`
🔍 Latest deploy log	https://app.netlify.com/projects/rp-cloud/deploys/6a0e92ba9dd13e0008db9f0c
😎 Deploy Preview	https://deploy-preview-584--rp-cloud.netlify.app

coderabbitai · 2026-05-14T03:43:46Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Greketrotny · 2026-05-20T09:43:02Z

+
+[source,text]
+----
+cancelled due to OOM prevention


The "cancelled due to OOM prevention" error is a sibling of the primary user-facing one: Query Out of Memory.
Query Out of Memory is reported when a particular query exhausts all memory resources and has to be cancelled. This is normal behavior: the engine counts the allocated memory and cancels the query to keep itself from entering an unexpected state or a deadlock. With this error, the advice is to retry the query, or to cancel or wait for other concurrently running tasks to finish before retrying. I feel like this page is describing this case, but with the wrong error message.
The thing is, the engine doesn't track all allocations, so it doesn't have full control over the allocated memory. This is where the "cancelled due to OOM prevention" error comes in.

The OOM prevention mechanism is an overseer. It covers that gap by monitoring overall memory usage in an external, independent way. It's more of an emergency handler, which quickly frees reclaimable resources to stay operational. However, triggering it means either that the untracked pool grew unexpectedly or that there is a serious problem with memory tracking, and it should almost always result in a bug report from the client, with access to the logs. This, I suspect, is more of a runbook/customer support scenario.

I don't know exactly what should be visible in the public documentation, but I feel like this page blends two problems, and there are two parts to describe. The first should definitely be visible to the user, with an explanation of why it happens. The second (the emergency one) is more of an incident scenario. Maybe it should be in the docs too, but on a different page.
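The retry advice for Query Out of Memory in the comment above could be sketched on the client side roughly as follows. This is a minimal sketch and not Redpanda SQL client code: `run_query` is a hypothetical callable that executes one query, and the `OOM_MARKER` substring is an assumption, because the exact client-facing message text is still being confirmed in this thread.

```python
import time

# Assumed substring of the client-facing error; the exact text is unconfirmed.
OOM_MARKER = "query out of memory"


def is_query_oom(error: Exception) -> bool:
    """Return True if the error message looks like a Query Out of Memory error."""
    return OOM_MARKER in str(error).lower()


def run_with_retry(run_query, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff on query OOM errors.

    run_query is a hypothetical zero-argument callable that executes one query.
    Any other error, or an OOM error on the final attempt, is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return run_query()
        except Exception as error:
            is_last_attempt = attempt == attempts - 1
            if is_last_attempt or not is_query_oom(error):
                raise
            # Back off so that concurrently running queries can finish
            # and release memory before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The backoff gives other concurrent queries time to finish, which matches the "wait for other tasks before retrying" advice. It would not be appropriate for the "cancelled due to OOM prevention" case, which the comment treats as a bug-report scenario.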


mattschumpert · 2026-05-20T20:56:28Z

+
+[source,text]
+----
+cancelled due to OOM prevention


Right, let's focus on the user-facing error @kbatuigas

mattschumpert · 2026-05-20T20:57:42Z

+
+[source,text]
+----
+cancelled due to OOM prevention


@kbatuigas where exactly does the user see this error? This doesn't appear to be a complete example.

What does it actually look like in the psql client? Let's show a real-world example.

Is this a standard Postgres error code?

I did some digging with Claude and updated the page with what I found: https://github.com/redpanda-data/cloud-docs/pull/584/changes#diff-05dde76066e0f81b1b4af9298c747919bd040631aa9a72704836954902d8a59fR19. Does that look right, @Greketrotny?

A wrong metric (https://github.com/redpanda-data/cloud-docs/pull/584/changes#r3279031995), but everything else looks good.

mattschumpert · 2026-05-20T21:02:14Z

+//   "Recover from OOM cancellation" (concise; uses internal term)
+//   Keep "Memory management" (matches current nav label but doesn't signal action)
+
+Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its memory limit. If your application sees the following error, your queries have hit this protection:


@kbatuigas here I think we need to explain the overall memory-limit principles of RP SQL so users understand why they might be seeing this kind of error (especially if we don't have a separate scale guide, let's give them a hint here).

Something like: "While RP SQL queries can process very large input sources (many TB), RP SQL query results (and intermediate results created by operations like joins and aggregations) must fit into the aggregate memory available to all nodes in the cluster (as reported by this metric ...._). All concurrently running queries contribute to total memory consumption, and any one query can cause the node memory limits to be hit, depending on the other concurrent queries ..."

I think @Greketrotny should draft/update this ^^^

kbatuigas · 2026-05-21T05:34:28Z

@Greketrotny @mattschumpert I restructured based on your comments: changed the page title to Troubleshoot Query Out-of-Memory Errors, put in a placeholder section for "How Redpanda SQL uses memory" based on Matt's suggestion (Grzegorz, please edit), and kept a short "cancelled due to OOM prevention" section at the end: https://deploy-preview-584--rp-cloud.netlify.app/redpanda-cloud/sql/troubleshoot/query-out-of-memory/. I'm not sure whether we'd start a new doc entirely. Do we have enough end-user-facing content for it?

Greketrotny · 2026-05-21T06:13:44Z

+
+// TODO: SME — confirm the recovery order above and whether a heuristic
+// exists for choosing among them (for example, watching
+// `oxla_process_memory_total` over time before deciding to scale).


No, oxla_process_memory_total can constantly be just below the limit, even when idle, because this metric also includes the memory allocated for cached files. To monitor the memory used specifically by the workload/queries, use query_memory_consumption_total.
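To make the distinction between the two metrics concrete, here is a minimal sketch that reads both from a Prometheus text-format scrape. The metric names come from this thread. The sample values are made up, the units are not stated here, and the parser assumes the standard `name{labels} value [timestamp]` line layout.

```python
def parse_sample(line):
    """Parse one Prometheus text-format line into (metric_name, value).

    Returns None for blank lines and for # HELP / # TYPE comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if "{" in line:
        # Labelled sample: the name ends at "{", the value follows the last "}".
        name = line.split("{", 1)[0]
        tail = line.rsplit("}", 1)[1]
    else:
        name, tail = line.split(None, 1)
    # The first token after the name/labels is the value; a timestamp may follow.
    return name, float(tail.split()[0])


def sum_metric(exposition, name):
    """Sum every sample of the metric called `name` in a scrape body."""
    total = 0.0
    for line in exposition.splitlines():
        sample = parse_sample(line)
        if sample is not None and sample[0] == name:
            total += sample[1]
    return total


# Made-up scrape body: process memory sits near the limit because it includes
# cached files, while query memory reflects only the running workload.
SCRAPE = """\
# TYPE oxla_process_memory_total gauge
oxla_process_memory_total 15500000000
# TYPE query_memory_consumption_total gauge
query_memory_consumption_total 1500000000
"""

process_memory = sum_metric(SCRAPE, "oxla_process_memory_total")
query_memory = sum_metric(SCRAPE, "query_memory_consumption_total")
```

In this made-up scrape the process metric looks alarming while the query metric shows that the workload itself is using little memory, which is why the query metric is the better signal for deciding whether to reduce concurrency or scale.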

Greketrotny · 2026-05-21T06:15:26Z

+// (InternalError, the fall-through default in session.cpp). That's a generic
+// "internal error" class, not a memory-class code like 53200 (out_of_memory)
+// or 53400 (configuration_limit_exceeded). Confirm whether this is intentional
+// or a bug to be fixed — if the SQLSTATE changes, add the


Yes, this is probably a bug.
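Until that SQLSTATE is fixed, a client that wants to detect memory-related failures cannot rely on the code alone. The sketch below shows one hedged way to classify an error. The codes 53200 and 53400 are the PostgreSQL codes named in the diff comment. Using XX000 (PostgreSQL's internal_error) as the generic code and matching on message substrings are assumptions for illustration, not confirmed Redpanda SQL behavior.

```python
# PostgreSQL "insufficient resources" codes named in the diff comment.
MEMORY_SQLSTATES = {
    "53200": "out_of_memory",
    "53400": "configuration_limit_exceeded",
}

# PostgreSQL's generic internal_error code; assumed to be what the
# InternalError fall-through currently reports.
GENERIC_SQLSTATE = "XX000"

# Assumed message fragments; the exact client-facing text is unconfirmed.
MESSAGE_MARKERS = ("cancelled due to oom prevention", "query out of memory")


def classify_error(sqlstate, message):
    """Classify an error as 'memory', 'memory-by-message', or 'other'."""
    if sqlstate in MEMORY_SQLSTATES:
        return "memory"
    lowered = message.lower()
    if sqlstate == GENERIC_SQLSTATE and any(m in lowered for m in MESSAGE_MARKERS):
        # Fallback while the error is reported under the generic class.
        return "memory-by-message"
    return "other"
```

Matching on message text is brittle, so a client would want to drop the fallback once the error carries a memory-class SQLSTATE.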

Greketrotny · 2026-05-21T12:36:08Z

The description of the errors, distinctions, consequences, and mitigations looks good to me. What's maybe missing is why all of this exists.

The engine currently must fit all intermediate data and calculations in RAM (hash maps for JOIN and GROUP BY, ORDER BY/TOP K heaps, network buffers), and there is no spilling implemented. Also, as @mattschumpert said, this doesn't mean that the whole data set must fit into the engine's memory: simple operations have a small, constant footprint, so the engine can process an amount of data vastly greater than the available RAM.

I'm sure an LLM will compose a nice description from the threads here. I hope that level of detail is sufficient for the public docs.
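The difference between a constant-footprint operation and one that must hold intermediate state can be illustrated with a toy example. This is not how the engine is implemented. It only shows that a plain count keeps one value no matter how many rows stream through, while a GROUP BY style aggregation keeps one hash-map entry per distinct key.

```python
def streaming_count(rows):
    """Count rows while holding a single integer, regardless of input size."""
    count = 0
    for _ in rows:
        count += 1
    return count


def group_by_count(keys):
    """Count rows per key; the hash map grows with the number of distinct keys."""
    groups = {}
    for key in keys:
        groups[key] = groups.get(key, 0) + 1
    return groups


# Generators stand in for a large input that is never fully materialized.
total_rows = streaming_count(i for i in range(100_000))
per_key = group_by_count(i % 1_000 for i in range(100_000))

# One integer of state for the count, 1,000 entries for the grouping.
state_sizes = (1, len(per_key))
```

With no spilling, it is the size of that intermediate state, not the size of the input, that has to fit in memory alongside every other concurrently running query.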

kbatuigas force-pushed the rp-sql branch from e051360 to 1b9d587 Compare May 19, 2026 03:26

Start OOM doc draft

7e0ff93

kbatuigas force-pushed the DOC-2000-document-feature-oom-prevention branch from fc9e91c to 7e0ff93 Compare May 19, 2026 03:30

Review pass

76f8348

kbatuigas changed the title ~~Start OOM doc draft~~ SQL: OOM May 20, 2026

kbatuigas marked this pull request as ready for review May 20, 2026 00:23

kbatuigas requested a review from a team as a code owner May 20, 2026 00:23

kbatuigas added 3 commits May 19, 2026 17:26

Capitalization

5bb8511

Rename file

f7291d2

Update page attributes

f63c3df

kbatuigas requested a review from mattschumpert May 20, 2026 04:02

Greketrotny reviewed May 20, 2026

mattschumpert approved these changes May 20, 2026

Rename and structure doc per SME feedback

8ab585b

kbatuigas requested a review from Greketrotny May 21, 2026 05:34

Greketrotny reviewed May 21, 2026
