Skip to content

fix: correctness bugs from reliability audit#220

Open
iemejia wants to merge 5 commits into
microsoft:mainfrom
iemejia:fix/correctness-bugs-from-audit
Open

fix: correctness bugs from reliability audit#220
iemejia wants to merge 5 commits into
microsoft:mainfrom
iemejia:fix/correctness-bugs-from-audit

Conversation

@iemejia

@iemejia iemejia commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes 5 correctness bugs identified during a deep reliability audit of the pg_durable codebase. Each fix includes regression tests to prevent recurrence.

C1 (Critical): Empty result set now evaluates as false in conditions

Bug: When a SQL condition query returned zero rows (e.g., SELECT id FROM queue LIMIT 1 on an empty table), the condition incorrectly evaluated as true. The code fell through to is_truthy() on the envelope JSON {"rows":[], "row_count":0}, which returned true because it's a non-empty object.

Impact: df.if() and df.loop() conditions relying on "no rows = false" were silently broken. A while-loop checking SELECT id FROM work_queue LIMIT 1 would loop forever even when empty.

Fix: Explicitly return false when rows is present and empty (src/types.rs:287).

H1 (High): Recursion depth limit (MAX_GRAPH_DEPTH = 256)

Bug: validate_recursive() and insert_nodes() recursed without a depth limit. A graph with 10,000+ levels of ~> would stack-overflow the PostgreSQL backend.

Fix: Added depth tracking to validate_recursive() with a 256-level limit. Clear error message on rejection.

H2 (High): Node count limit (MAX_GRAPH_NODES = 10,000)

Bug: No upper bound on nodes per instance. A user could submit a graph with millions of nodes, causing OOM, transaction log bloat, and SPI exhaustion.

Fix: Added a running counter to insert_nodes() that rejects graphs exceeding 10,000 nodes.

H4 (High): Fallible Durofut::try_from_json()

Bug: from_json() used .expect() which would panic (crash the worker) on corrupted JSON data.

Fix: Added try_from_json() returning Result<Self, String> for use in production paths. Existing from_json() retained for test convenience with documentation noting it's test-only.

H7 (High): 30-second connection timeout on connect_as_user()

Bug: PgConnection::connect_with() had no timeout. A stalled TCP connection could hold a semaphore permit indefinitely, exhausting all user-execution slots.

Fix: Wrapped the connect call with tokio::time::timeout(30s). Clear error message distinguishes timeout from auth failure.

Tests Added

Bug Test Validates
C1 test_evaluate_condition_empty_rows_is_false {"rows":[]} \u2192 false
C1 test_evaluate_condition_single_true_row Baseline truthy
C1 test_evaluate_condition_single_false_row Baseline falsy
C1 test_evaluate_condition_zero_count_is_falsy Numeric zero \u2192 falsy
H1 test_validate_rejects_graph_exceeding_depth_limit 257-deep chain rejected
H1 test_validate_accepts_graph_within_depth_limit 10-deep chain accepted
H2 test_start_rejects_graph_exceeding_node_count >10K nodes rejected
H4 test_try_from_json_invalid_json_returns_error Bad JSON \u2192 Err
H4 test_try_from_json_valid_json_succeeds Good JSON \u2192 Ok
H4 test_try_from_json_corrupted_node_type_returns_error Wrong structure \u2192 Err

Plus 2 unit tests in types.rs::tests for empty-rows + depth-limit boundary.

Verification

  • cargo check passes (no new warnings)
  • cargo test --lib types:: — 31/31 unit tests pass
  • cargo check --tests — full test compilation clean

iemejia added 3 commits June 10, 2026 12:30
C1 (Critical): Fix evaluate_condition — empty result set now returns false

When a SQL condition query returns zero rows (e.g., SELECT id FROM queue
LIMIT 1 on an empty table), the condition now correctly evaluates as false.
Previously, the code fell through to is_truthy() on the envelope JSON object
{"rows":[], "row_count":0} which returned true (non-empty object).

This affected df.if() and df.loop() conditions that relied on 'no rows = false'.

H1 (High): Add recursion depth limit (MAX_GRAPH_DEPTH = 256)

validate_recursive() now rejects graphs exceeding 256 levels of nesting.
This prevents stack overflow from deeply nested operator chains
(e.g., 10,000+ levels of ~> composed in a loop).

H2 (High): Add node count limit (MAX_GRAPH_NODES = 10,000)

insert_nodes() now tracks a running count and rejects graphs with more
than 10,000 nodes. This prevents unbounded INSERT storms, OOM, and
transaction log bloat from extremely large graphs.

H4 (High): Add fallible Durofut::try_from_json()

Added a safe deserialization method that returns Result<Self, String>
instead of panicking. Production code paths that handle potentially
corrupted data should use this instead of from_json().

H7 (High): Add 30-second connection timeout to connect_as_user()

Per-user SQL connections now have a 30-second timeout via
tokio::time::timeout. Previously, a stalled TCP connection could hold
a semaphore permit indefinitely, potentially exhausting all execution
slots and causing a deadlock-like condition.
Add pg_test and unit tests that validate the following fixes don't regress:

C1 - evaluate_condition with empty rows:
  - test_evaluate_condition_empty_rows_is_false: verifies {"rows":[]}
    evaluates as false (was previously true due to non-empty object check)
  - test_evaluate_condition_single_true_row: baseline truthy case
  - test_evaluate_condition_single_false_row: baseline falsy case
  - test_evaluate_condition_zero_count_is_falsy: numeric zero in row

H1 - Recursion depth limit:
  - test_validate_rejects_graph_exceeding_depth_limit: builds a chain of
    MAX_GRAPH_DEPTH+1 nested THEN nodes and verifies validate_recursive()
    returns an error mentioning the depth limit
  - test_validate_accepts_graph_within_depth_limit: 10-level chain passes

H2 - Node count limit:
  - test_start_rejects_graph_exceeding_node_count: constructs a graph
    with >MAX_GRAPH_NODES nodes via chained ~> operators and verifies
    df.start() rejects it

H4 - Fallible try_from_json:
  - test_try_from_json_invalid_json_returns_error: garbage input -> Err
  - test_try_from_json_valid_json_succeeds: valid Durofut JSON -> Ok
  - test_try_from_json_corrupted_node_type_returns_error: valid JSON but
    not a valid Durofut structure -> Err

Also adds empty-rows test cases to the existing evaluate_condition_json_rows
unit test in types.rs.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses several correctness and reliability findings in pg_durable by tightening workflow-graph validation, hardening JSON parsing, and improving connection behavior, with added regression/unit tests to prevent regressions.

Changes:

  • Fix condition evaluation so an empty SQL result set ({"rows":[]}) is treated as false.
  • Add guardrails against pathological workflow graphs via a maximum nesting depth and maximum node count.
  • Add a fallible Durofut JSON deserializer and a per-user connection timeout in connect_as_user(), plus new tests.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/types.rs Adds graph limits/constants, condition empty-rows handling, connect_as_user() timeout, Durofut::try_from_json(), and depth validation + unit tests.
src/dsl.rs Enforces MAX_GRAPH_NODES during recursive insertion via a shared node counter.
src/lib.rs Adds regression tests covering the audit findings (condition truthiness, depth limit, node-count limit, try_from_json).

Comment thread src/lib.rs Outdated
Comment on lines +2646 to +2670
fn test_start_rejects_graph_exceeding_node_count() {
use crate::types::MAX_GRAPH_NODES;

// Build a wide graph that exceeds MAX_GRAPH_NODES by chaining sequences.
// Each ~> produces a THEN node + right SQL node, so N sequences = ~2N nodes.
// We need MAX_GRAPH_NODES/2 + 1 sequences to exceed the limit.
let needed = MAX_GRAPH_NODES / 2 + 1;
let mut expr = String::from("'SELECT 1'");
for _ in 0..needed {
expr.push_str(" ~> 'SELECT 1'");
}

// This should fail at df.start() with a node count error.
// Use SPI to call df.start and expect an error.
let sql = format!("SELECT df.start({expr})");
let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
Spi::get_one::<String>(&sql).ok()
}));

// pgrx::error! causes a panic that gets caught here
assert!(
result.is_err(),
"df.start() should reject graph exceeding node count limit"
);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Reworked the test to construct a shallow-but-wide graph using a JOIN node with MAX_GRAPH_NODES extra_nodes entries (stays at depth 1). This ensures we specifically hit the node-count limit without triggering the depth check first. Also moved the node-count enforcement into validate_recursive_inner itself (see comment 2), so the test now calls validate_recursive() directly on the wide struct.

Comment thread src/types.rs
Comment on lines 988 to +992
/// Validate a Durofut node and all its children have valid node_types.
/// Used during insertion in df.start() to catch invalid nested nodes.
pub fn validate_recursive(&self) -> Result<(), String> {
self.validate_recursive_inner(0)
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — added a &mut usize node counter to validate_recursive_inner so both depth and node-count are enforced in a single pass. Oversized graphs now fail fast during validation without needing a second traversal in insert_nodes. The insert_nodes counter remains as a defense-in-depth safeguard.

iemejia added 2 commits June 11, 2026 12:59
- Add mutable node counter to validate_recursive_inner so both depth
  and node-count are checked in a single pass (fail-fast).
- Rewrite node-count test to use a shallow-but-wide JOIN graph that
  exceeds MAX_GRAPH_NODES without hitting MAX_GRAPH_DEPTH.
- Add unit tests for node-count validation in src/types.rs.
- Add mutable node counter to validate_recursive_inner so both depth
  and node-count are checked in a single pass (fail-fast).
- Rewrite node-count test to use a shallow-but-wide JOIN graph that
  exceeds MAX_GRAPH_NODES without hitting MAX_GRAPH_DEPTH.
- Add unit tests for node-count validation in src/types.rs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants