SQL: OOM #584
Open: kbatuigas wants to merge 6 commits into rp-sql from DOC-2000-document-feature-oom-prevention
+87
−1
Commits:
7e0ff93 Start OOM doc draft (kbatuigas)
76f8348 Review pass (kbatuigas)
5bb8511 Capitalization (kbatuigas)
f7291d2 Rename file (kbatuigas)
f63c3df Update page attributes (kbatuigas)
8ab585b Rename and structure doc per SME feedback (kbatuigas)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| = Troubleshoot Query Out-of-Memory Errors | ||
| :description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution. | ||
| :page-topic-type: troubleshooting | ||
| :personas: platform_admin, data_engineer | ||
| :learning-objective-1: Identify when a query was canceled because it ran out of memory | ||
| :learning-objective-2: Recover from a `Query Out of Memory` error and reduce its frequency | ||
| :learning-objective-3: Monitor node memory usage to anticipate memory pressure | ||
|
|
||
| If a Redpanda SQL query exhausts the memory available to it, the engine cancels the query and returns an error to the client: | ||
|
|
||
| // TODO: SME — the OOM error currently surfaces with SQLSTATE XX000 | ||
| // (InternalError, the fall-through default in session.cpp). That's a generic | ||
| // "internal error" class, not a memory-class code like 53200 (out_of_memory) | ||
| // or 53400 (configuration_limit_exceeded). Confirm whether this is intentional | ||
| // or a bug to be fixed — if the SQLSTATE changes, add the | ||
| // verbose-mode example. | ||
| [source,text] | ||
| ---- | ||
| ERROR: Query Out of Memory! | ||
| ---- | ||
|
|
||
| Canceling the query frees its memory and allows the engine to continue serving other queries. This is a normal protection mechanism and is not a sign of cluster failure. | ||
|
|
||
| Use this page to: | ||
|
|
||
| * [ ] {learning-objective-1} | ||
| * [ ] {learning-objective-2} | ||
| * [ ] {learning-objective-3} | ||
|
|
||
| == How Redpanda SQL uses memory | ||
|
|
||
| // TODO: SME — rewrite this section per the PR 584 review thread. | ||
| // Placeholder below is suggested draft from the review. | ||
| // | ||
| // Goal of the section: explain enough about RP SQL's memory model that a | ||
| // user reading this troubleshooting page understands *why* a Query Out of | ||
| // Memory error can happen even on large clusters, and what about their | ||
| // query or workload makes it more or less likely. | ||
|
|
||
| Redpanda SQL queries can read very large input sources (many terabytes). However, the result set and any intermediate results produced by operations such as joins and aggregations must fit into the aggregate memory available across all nodes in the cluster. All concurrently running queries contribute to total memory consumption, so a query that would fit on its own can still hit the node memory limit when other queries run at the same time. | ||
|
|
||
| == Recover from the error | ||
|
|
||
| When a single query fails with `Query Out of Memory`, retry it. Canceling the failed query frees its memory, so the next attempt often succeeds, especially if other concurrent queries have completed in the meantime. | ||
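The retry advice can be sketched as a small client-side wrapper. This is a minimal illustration, not part of Redpanda SQL: `run_with_oom_retry`, its parameters, and the backoff schedule are hypothetical, and it assumes the client library raises an exception whose message contains the `Query Out of Memory` text shown earlier on this page.

```python
import time


def run_with_oom_retry(run_query, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call run_query() and retry only when the error text reports Query Out of Memory.

    run_query is any zero-argument callable that executes the query with your
    client library (hypothetical here). Other errors are re-raised immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            # Assumption: the server's error message is carried in str(exc).
            is_oom = "query out of memory" in str(exc).lower()
            if not is_oom or attempt == max_attempts:
                raise
            # Wait longer after each failure so concurrent queries can finish.
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Waiting between attempts gives other queries time to complete and release memory. If every attempt fails, the last error propagates to the caller so that the failure is not hidden.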
|
|
||
| If the same query keeps failing, either the query itself needs more memory than the current cluster can provide, or too many other queries are competing for memory at the same time. Reduce the query's memory footprint or reduce concurrent load: | ||
|
|
||
| * Reduce concurrency. | ||
| + | ||
| Run fewer queries in parallel against the cluster. Other queries running at the same time contribute to the total memory pressure. | ||
| * Simplify the query. | ||
| + | ||
| Narrow the scan range with tighter `WHERE` filters, reduce the number of `JOIN`s, or break a large aggregation into smaller ones. Operations that materialize wide intermediate results (joins, sorts, distinct aggregations) drive memory consumption the most. | ||
| * Scale the cluster. | ||
| + | ||
| Add SQL nodes to increase the aggregate memory available to queries. See xref:sql:get-started/deploy-sql-cluster.adoc#scale-redpanda-sql[Scale Redpanda SQL]. | ||
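One way to reduce concurrency from the client side is to cap how many queries an application submits at once. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`. The helper name `run_queries_bounded` and the default of two parallel queries are illustrative assumptions, and each entry in `query_fns` stands in for whatever call your client library uses to run a query.

```python
from concurrent.futures import ThreadPoolExecutor


def run_queries_bounded(query_fns, max_parallel=2):
    """Run zero-argument query callables with at most max_parallel in flight.

    Results are returned in the same order as query_fns. If a query raises,
    the exception is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(fn) for fn in query_fns]
        return [future.result() for future in futures]
```

This limits only the queries issued by one application process. Queries from other clients still contribute to the cluster's total memory pressure.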
|
|
||
| // TODO: SME — confirm the recovery order above and whether a heuristic | ||
| // exists for choosing among them (for example, watching | ||
| // `oxla_process_memory_total` over time before deciding to scale). | ||
|
Review comment: No,
||
|
|
||
| == Monitor memory usage | ||
|
|
||
| Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: | ||
|
|
||
| [cols="1,3"] | ||
| |=== | ||
| | Metric | Description | ||
|
|
||
| | `oxla_process_memory_total` | ||
| | Process Resident Set Size (RSS) in bytes, reported per node. | ||
| |=== | ||
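As a rough illustration of watching this gauge, the sketch below parses Prometheus text-format output and lists the series whose value is at or above a chosen fraction of a memory limit. The function names, the 80% default, and `limit_bytes` are assumptions: the limit is a value you supply yourself, not something read from Redpanda SQL, and any label names in your scrape output may differ from those in your own examples.

```python
def parse_gauge(exposition_text, metric="oxla_process_memory_total"):
    """Return {series: value} for one metric from Prometheus text-format output.

    Simplified parser: skips comment lines and assumes the sample value
    follows the last closing brace of the label set.
    """
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(metric + "{"):
            end = line.rindex("}")
            series, rest = line[: end + 1], line[end + 1:]
        elif line.startswith(metric + " "):
            series, rest = metric, line[len(metric):]
        else:
            continue  # a different metric
        # The first token after the series is the value; a timestamp may follow.
        samples[series] = float(rest.split()[0])
    return samples


def nodes_over_threshold(samples, limit_bytes, fraction=0.8):
    """List series whose RSS is at or above fraction * limit_bytes."""
    return sorted(s for s, value in samples.items() if value >= fraction * limit_bytes)
```

In practice, a Prometheus alerting rule on the same gauge is likely a better fit than ad hoc parsing. The sketch only shows the comparison being made.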
|
|
||
| // TODO: Once the Redpanda SQL metrics catalog is finalized, replace this | ||
| // inline table with a cross-link to the metrics page. | ||
|
|
||
| == If you see `cancelled due to OOM prevention` instead | ||
|
|
||
| The `cancelled due to OOM prevention` error is a separate case. Redpanda SQL's engine includes an overseer that monitors overall node memory independently of per-query accounting. When the overseer detects that the untracked memory pool has grown unexpectedly, it cancels running queries on the affected node to keep the engine operational. | ||
|
|
||
| This condition is rare and almost always indicates a bug in memory accounting or an unexpected workload pattern. Collect the cluster logs from around the time of the error and contact https://support.redpanda.com/hc/en-us/requests/new[Redpanda Support^]. | ||
|
|
||
| == Suggested reading | ||
|
|
||
| * xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. | ||
| * xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. | ||
yes, this is probably a bug