fix: correctness bugs from reliability audit by iemejia · Pull Request #220 · microsoft/pg_durable

iemejia · 2026-06-10T11:20:46Z

Summary

Fixes 5 correctness bugs identified during a deep reliability audit of the pg_durable codebase. Each fix includes regression tests to prevent recurrence.

C1 (Critical): Empty result set now evaluates as `false` in conditions

Bug: When a SQL condition query returned zero rows (e.g., SELECT id FROM queue LIMIT 1 on an empty table), the condition incorrectly evaluated as true. The code fell through to is_truthy() on the envelope JSON {"rows":[], "row_count":0}, which returned true because it's a non-empty object.

Impact: df.if() and df.loop() conditions relying on "no rows = false" were silently broken. A while-loop checking SELECT id FROM work_queue LIMIT 1 would loop forever even when empty.

Fix: Explicitly return false when rows is present and empty (src/types.rs:287).

H1 (High): Recursion depth limit (MAX_GRAPH_DEPTH = 256)

Bug: validate_recursive() and insert_nodes() recursed without a depth limit. A graph with 10,000+ levels of ~> would stack-overflow the PostgreSQL backend.

Fix: Added depth tracking to validate_recursive() with a 256-level limit. Clear error message on rejection.

H2 (High): Node count limit (MAX_GRAPH_NODES = 10,000)

Bug: No upper bound on nodes per instance. A user could submit a graph with millions of nodes, causing OOM, transaction log bloat, and SPI exhaustion.

Fix: Added a running counter to insert_nodes() that rejects graphs exceeding 10,000 nodes.

H4 (High): Fallible `Durofut::try_from_json()`

Bug: from_json() used .expect() which would panic (crash the worker) on corrupted JSON data.

Fix: Added try_from_json() returning Result<Self, String> for use in production paths. Existing from_json() retained for test convenience with documentation noting it's test-only.

H7 (High): 30-second connection timeout on `connect_as_user()`

Bug: PgConnection::connect_with() had no timeout. A stalled TCP connection could hold a semaphore permit indefinitely, exhausting all user-execution slots.

Fix: Wrapped the connect call with tokio::time::timeout(30s). Clear error message distinguishes timeout from auth failure.

Tests Added

Bug	Test	Validates
C1	`test_evaluate_condition_empty_rows_is_false`	`{"rows":[]}` \u2192 `false`
C1	`test_evaluate_condition_single_true_row`	Baseline truthy
C1	`test_evaluate_condition_single_false_row`	Baseline falsy
C1	`test_evaluate_condition_zero_count_is_falsy`	Numeric zero \u2192 falsy
H1	`test_validate_rejects_graph_exceeding_depth_limit`	257-deep chain rejected
H1	`test_validate_accepts_graph_within_depth_limit`	10-deep chain accepted
H2	`test_start_rejects_graph_exceeding_node_count`	>10K nodes rejected
H4	`test_try_from_json_invalid_json_returns_error`	Bad JSON \u2192 Err
H4	`test_try_from_json_valid_json_succeeds`	Good JSON \u2192 Ok
H4	`test_try_from_json_corrupted_node_type_returns_error`	Wrong structure \u2192 Err

Plus 2 unit tests in types.rs::tests for empty-rows + depth-limit boundary.

Verification

cargo check passes (no new warnings)
cargo test --lib types:: — 31/31 unit tests pass
cargo check --tests — full test compilation clean

C1 (Critical): Fix evaluate_condition — empty result set now returns false When a SQL condition query returns zero rows (e.g., SELECT id FROM queue LIMIT 1 on an empty table), the condition now correctly evaluates as false. Previously, the code fell through to is_truthy() on the envelope JSON object {"rows":[], "row_count":0} which returned true (non-empty object). This affected df.if() and df.loop() conditions that relied on 'no rows = false'. H1 (High): Add recursion depth limit (MAX_GRAPH_DEPTH = 256) validate_recursive() now rejects graphs exceeding 256 levels of nesting. This prevents stack overflow from deeply nested operator chains (e.g., 10,000+ levels of ~> composed in a loop). H2 (High): Add node count limit (MAX_GRAPH_NODES = 10,000) insert_nodes() now tracks a running count and rejects graphs with more than 10,000 nodes. This prevents unbounded INSERT storms, OOM, and transaction log bloat from extremely large graphs. H4 (High): Add fallible Durofut::try_from_json() Added a safe deserialization method that returns Result<Self, String> instead of panicking. Production code paths that handle potentially corrupted data should use this instead of from_json(). H7 (High): Add 30-second connection timeout to connect_as_user() Per-user SQL connections now have a 30-second timeout via tokio::time::timeout. Previously, a stalled TCP connection could hold a semaphore permit indefinitely, potentially exhausting all execution slots and causing a deadlock-like condition.

Add pg_test and unit tests that validate the following fixes don't regress: C1 - evaluate_condition with empty rows: - test_evaluate_condition_empty_rows_is_false: verifies {"rows":[]} evaluates as false (was previously true due to non-empty object check) - test_evaluate_condition_single_true_row: baseline truthy case - test_evaluate_condition_single_false_row: baseline falsy case - test_evaluate_condition_zero_count_is_falsy: numeric zero in row H1 - Recursion depth limit: - test_validate_rejects_graph_exceeding_depth_limit: builds a chain of MAX_GRAPH_DEPTH+1 nested THEN nodes and verifies validate_recursive() returns an error mentioning the depth limit - test_validate_accepts_graph_within_depth_limit: 10-level chain passes H2 - Node count limit: - test_start_rejects_graph_exceeding_node_count: constructs a graph with >MAX_GRAPH_NODES nodes via chained ~> operators and verifies df.start() rejects it H4 - Fallible try_from_json: - test_try_from_json_invalid_json_returns_error: garbage input -> Err - test_try_from_json_valid_json_succeeds: valid Durofut JSON -> Ok - test_try_from_json_corrupted_node_type_returns_error: valid JSON but not a valid Durofut structure -> Err Also adds empty-rows test cases to the existing evaluate_condition_json_rows unit test in types.rs.

Copilot

Pull request overview

This PR addresses several correctness and reliability findings in pg_durable by tightening workflow-graph validation, hardening JSON parsing, and improving connection behavior, with added regression/unit tests to prevent regressions.

Changes:

Fix condition evaluation so an empty SQL result set ({"rows":[]}) is treated as false.
Add guardrails against pathological workflow graphs via a maximum nesting depth and maximum node count.
Add a fallible Durofut JSON deserializer and a per-user connection timeout in connect_as_user(), plus new tests.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`src/types.rs`	Adds graph limits/constants, condition empty-rows handling, `connect_as_user()` timeout, `Durofut::try_from_json()`, and depth validation + unit tests.
`src/dsl.rs`	Enforces `MAX_GRAPH_NODES` during recursive insertion via a shared node counter.
`src/lib.rs`	Adds regression tests covering the audit findings (condition truthiness, depth limit, node-count limit, try_from_json).

iemejia · 2026-06-11T10:58:27Z

+    fn test_start_rejects_graph_exceeding_node_count() {
+        use crate::types::MAX_GRAPH_NODES;
+
+        // Build a wide graph that exceeds MAX_GRAPH_NODES by chaining sequences.
+        // Each ~> produces a THEN node + right SQL node, so N sequences = ~2N nodes.
+        // We need MAX_GRAPH_NODES/2 + 1 sequences to exceed the limit.
+        let needed = MAX_GRAPH_NODES / 2 + 1;
+        let mut expr = String::from("'SELECT 1'");
+        for _ in 0..needed {
+            expr.push_str(" ~> 'SELECT 1'");
+        }
+
+        // This should fail at df.start() with a node count error.
+        // Use SPI to call df.start and expect an error.
+        let sql = format!("SELECT df.start({expr})");
+        let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
+            Spi::get_one::<String>(&sql).ok()
+        }));
+
+        // pgrx::error! causes a panic that gets caught here
+        assert!(
+            result.is_err(),
+            "df.start() should reject graph exceeding node count limit"
+        );
+    }


Good catch. Reworked the test to construct a shallow-but-wide graph using a JOIN node with MAX_GRAPH_NODES extra_nodes entries (stays at depth 1). This ensures we specifically hit the node-count limit without triggering the depth check first. Also moved the node-count enforcement into validate_recursive_inner itself (see comment 2), so the test now calls validate_recursive() directly on the wide struct.

iemejia · 2026-06-11T10:58:27Z

    /// Validate a Durofut node and all its children have valid node_types.
    /// Used during insertion in df.start() to catch invalid nested nodes.
    pub fn validate_recursive(&self) -> Result<(), String> {
+        self.validate_recursive_inner(0)
+    }


Agreed — added a &mut usize node counter to validate_recursive_inner so both depth and node-count are enforced in a single pass. Oversized graphs now fail fast during validation without needing a second traversal in insert_nodes. The insert_nodes counter remains as a defense-in-depth safeguard.

- Add mutable node counter to validate_recursive_inner so both depth and node-count are checked in a single pass (fail-fast). - Rewrite node-count test to use a shallow-but-wide JOIN graph that exceeds MAX_GRAPH_NODES without hitting MAX_GRAPH_DEPTH. - Add unit tests for node-count validation in src/types.rs.

iemejia added 3 commits June 10, 2026 12:30

style: cargo fmt

fc967e7

pinodeca requested a review from Copilot June 11, 2026 01:02

Copilot started reviewing on behalf of pinodeca June 11, 2026 01:02 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

iemejia added 2 commits June 11, 2026 12:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: correctness bugs from reliability audit#220

fix: correctness bugs from reliability audit#220
iemejia wants to merge 5 commits into
microsoft:mainfrom
iemejia:fix/correctness-bugs-from-audit

iemejia commented Jun 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

iemejia Jun 11, 2026

Uh oh!

iemejia Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iemejia commented Jun 10, 2026

Summary

C1 (Critical): Empty result set now evaluates as false in conditions

H1 (High): Recursion depth limit (MAX_GRAPH_DEPTH = 256)

H2 (High): Node count limit (MAX_GRAPH_NODES = 10,000)

H4 (High): Fallible Durofut::try_from_json()

H7 (High): 30-second connection timeout on connect_as_user()

Tests Added

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

iemejia Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

iemejia Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

C1 (Critical): Empty result set now evaluates as `false` in conditions

H4 (High): Fallible `Durofut::try_from_json()`

H7 (High): 30-second connection timeout on `connect_as_user()`