fix: correctness bugs from reliability audit#220
Conversation
C1 (Critical): Fix evaluate_condition — empty result set now returns false
When a SQL condition query returns zero rows (e.g., SELECT id FROM queue
LIMIT 1 on an empty table), the condition now correctly evaluates as false.
Previously, the code fell through to is_truthy() on the envelope JSON object
{"rows":[], "row_count":0} which returned true (non-empty object).
This affected df.if() and df.loop() conditions that relied on 'no rows = false'.
H1 (High): Add recursion depth limit (MAX_GRAPH_DEPTH = 256)
validate_recursive() now rejects graphs exceeding 256 levels of nesting.
This prevents stack overflow from deeply nested operator chains
(e.g., 10,000+ levels of ~> composed in a loop).
H2 (High): Add node count limit (MAX_GRAPH_NODES = 10,000)
insert_nodes() now tracks a running count and rejects graphs with more
than 10,000 nodes. This prevents unbounded INSERT storms, OOM, and
transaction log bloat from extremely large graphs.
H4 (High): Add fallible Durofut::try_from_json()
Added a safe deserialization method that returns Result<Self, String>
instead of panicking. Production code paths that handle potentially
corrupted data should use this instead of from_json().
H7 (High): Add 30-second connection timeout to connect_as_user()
Per-user SQL connections now have a 30-second timeout via
tokio::time::timeout. Previously, a stalled TCP connection could hold
a semaphore permit indefinitely, potentially exhausting all execution
slots and causing a deadlock-like condition.
Add pg_test and unit tests that validate the following fixes don't regress:
C1 - evaluate_condition with empty rows:
- test_evaluate_condition_empty_rows_is_false: verifies {"rows":[]}
evaluates as false (was previously true due to non-empty object check)
- test_evaluate_condition_single_true_row: baseline truthy case
- test_evaluate_condition_single_false_row: baseline falsy case
- test_evaluate_condition_zero_count_is_falsy: numeric zero in row
H1 - Recursion depth limit:
- test_validate_rejects_graph_exceeding_depth_limit: builds a chain of
MAX_GRAPH_DEPTH+1 nested THEN nodes and verifies validate_recursive()
returns an error mentioning the depth limit
- test_validate_accepts_graph_within_depth_limit: 10-level chain passes
H2 - Node count limit:
- test_start_rejects_graph_exceeding_node_count: constructs a graph
with >MAX_GRAPH_NODES nodes via chained ~> operators and verifies
df.start() rejects it
H4 - Fallible try_from_json:
- test_try_from_json_invalid_json_returns_error: garbage input -> Err
- test_try_from_json_valid_json_succeeds: valid Durofut JSON -> Ok
- test_try_from_json_corrupted_node_type_returns_error: valid JSON but
not a valid Durofut structure -> Err
Also adds empty-rows test cases to the existing evaluate_condition_json_rows
unit test in types.rs.
There was a problem hiding this comment.
Pull request overview
This PR addresses several correctness and reliability findings in pg_durable by tightening workflow-graph validation, hardening JSON parsing, and improving connection behavior, with added regression/unit tests to prevent regressions.
Changes:
- Fix condition evaluation so an empty SQL result set (
{"rows":[]}) is treated asfalse. - Add guardrails against pathological workflow graphs via a maximum nesting depth and maximum node count.
- Add a fallible
DurofutJSON deserializer and a per-user connection timeout inconnect_as_user(), plus new tests.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/types.rs |
Adds graph limits/constants, condition empty-rows handling, connect_as_user() timeout, Durofut::try_from_json(), and depth validation + unit tests. |
src/dsl.rs |
Enforces MAX_GRAPH_NODES during recursive insertion via a shared node counter. |
src/lib.rs |
Adds regression tests covering the audit findings (condition truthiness, depth limit, node-count limit, try_from_json). |
| fn test_start_rejects_graph_exceeding_node_count() { | ||
| use crate::types::MAX_GRAPH_NODES; | ||
|
|
||
| // Build a wide graph that exceeds MAX_GRAPH_NODES by chaining sequences. | ||
| // Each ~> produces a THEN node + right SQL node, so N sequences = ~2N nodes. | ||
| // We need MAX_GRAPH_NODES/2 + 1 sequences to exceed the limit. | ||
| let needed = MAX_GRAPH_NODES / 2 + 1; | ||
| let mut expr = String::from("'SELECT 1'"); | ||
| for _ in 0..needed { | ||
| expr.push_str(" ~> 'SELECT 1'"); | ||
| } | ||
|
|
||
| // This should fail at df.start() with a node count error. | ||
| // Use SPI to call df.start and expect an error. | ||
| let sql = format!("SELECT df.start({expr})"); | ||
| let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| { | ||
| Spi::get_one::<String>(&sql).ok() | ||
| })); | ||
|
|
||
| // pgrx::error! causes a panic that gets caught here | ||
| assert!( | ||
| result.is_err(), | ||
| "df.start() should reject graph exceeding node count limit" | ||
| ); | ||
| } |
There was a problem hiding this comment.
Good catch. Reworked the test to construct a shallow-but-wide graph using a JOIN node with MAX_GRAPH_NODES extra_nodes entries (stays at depth 1). This ensures we specifically hit the node-count limit without triggering the depth check first. Also moved the node-count enforcement into validate_recursive_inner itself (see comment 2), so the test now calls validate_recursive() directly on the wide struct.
| /// Validate a Durofut node and all its children have valid node_types. | ||
| /// Used during insertion in df.start() to catch invalid nested nodes. | ||
| pub fn validate_recursive(&self) -> Result<(), String> { | ||
| self.validate_recursive_inner(0) | ||
| } |
There was a problem hiding this comment.
Agreed — added a &mut usize node counter to validate_recursive_inner so both depth and node-count are enforced in a single pass. Oversized graphs now fail fast during validation without needing a second traversal in insert_nodes. The insert_nodes counter remains as a defense-in-depth safeguard.
- Add mutable node counter to validate_recursive_inner so both depth and node-count are checked in a single pass (fail-fast). - Rewrite node-count test to use a shallow-but-wide JOIN graph that exceeds MAX_GRAPH_NODES without hitting MAX_GRAPH_DEPTH. - Add unit tests for node-count validation in src/types.rs.
- Add mutable node counter to validate_recursive_inner so both depth and node-count are checked in a single pass (fail-fast). - Rewrite node-count test to use a shallow-but-wide JOIN graph that exceeds MAX_GRAPH_NODES without hitting MAX_GRAPH_DEPTH. - Add unit tests for node-count validation in src/types.rs.
Summary
Fixes 5 correctness bugs identified during a deep reliability audit of the pg_durable codebase. Each fix includes regression tests to prevent recurrence.
C1 (Critical): Empty result set now evaluates as
falsein conditionsBug: When a SQL condition query returned zero rows (e.g.,
SELECT id FROM queue LIMIT 1on an empty table), the condition incorrectly evaluated astrue. The code fell through tois_truthy()on the envelope JSON{"rows":[], "row_count":0}, which returnedtruebecause it's a non-empty object.Impact:
df.if()anddf.loop()conditions relying on "no rows = false" were silently broken. A while-loop checkingSELECT id FROM work_queue LIMIT 1would loop forever even when empty.Fix: Explicitly return
falsewhenrowsis present and empty (src/types.rs:287).H1 (High): Recursion depth limit (MAX_GRAPH_DEPTH = 256)
Bug:
validate_recursive()andinsert_nodes()recursed without a depth limit. A graph with 10,000+ levels of~>would stack-overflow the PostgreSQL backend.Fix: Added depth tracking to
validate_recursive()with a 256-level limit. Clear error message on rejection.H2 (High): Node count limit (MAX_GRAPH_NODES = 10,000)
Bug: No upper bound on nodes per instance. A user could submit a graph with millions of nodes, causing OOM, transaction log bloat, and SPI exhaustion.
Fix: Added a running counter to
insert_nodes()that rejects graphs exceeding 10,000 nodes.H4 (High): Fallible
Durofut::try_from_json()Bug:
from_json()used.expect()which would panic (crash the worker) on corrupted JSON data.Fix: Added
try_from_json()returningResult<Self, String>for use in production paths. Existingfrom_json()retained for test convenience with documentation noting it's test-only.H7 (High): 30-second connection timeout on
connect_as_user()Bug:
PgConnection::connect_with()had no timeout. A stalled TCP connection could hold a semaphore permit indefinitely, exhausting all user-execution slots.Fix: Wrapped the connect call with
tokio::time::timeout(30s). Clear error message distinguishes timeout from auth failure.Tests Added
test_evaluate_condition_empty_rows_is_false{"rows":[]}\u2192falsetest_evaluate_condition_single_true_rowtest_evaluate_condition_single_false_rowtest_evaluate_condition_zero_count_is_falsytest_validate_rejects_graph_exceeding_depth_limittest_validate_accepts_graph_within_depth_limittest_start_rejects_graph_exceeding_node_counttest_try_from_json_invalid_json_returns_errortest_try_from_json_valid_json_succeedstest_try_from_json_corrupted_node_type_returns_errorPlus 2 unit tests in
types.rs::testsfor empty-rows + depth-limit boundary.Verification
cargo checkpasses (no new warnings)cargo test --lib types::— 31/31 unit tests passcargo check --tests— full test compilation clean