Skip to content

fix(workers): commit database writes that were silently rolled back#50

Merged
ademboukabes merged 7 commits into
mainfrom
fix/worker-db-commits
Jun 1, 2026
Merged

fix(workers): commit database writes that were silently rolled back#50
ademboukabes merged 7 commits into
mainfrom
fix/worker-db-commits

Conversation

@edaywalid
Copy link
Copy Markdown
Collaborator

@edaywalid edaywalid commented Jun 1, 2026

Problem

The background workers open their database connection with engine.connect(),
which in SQLAlchemy 2.0 async does not autocommit. None of them ever issued
a commit, so every write they made was rolled back when the connection closed.
Only the request path (get_db) and the upload-group worker used
engine.begin(), which commits.

Confirmed against SQLAlchemy 2.0.47 with Postgres:

  • A write on an engine.connect() connection does not persist without an
    explicit commit().
  • After any statement has run (autobegin), calling conn.begin() raises
    InvalidRequestError. This is why the single-face path silently failed:
    create_processing_job opened the transaction first, then
    SingleFaceMatchService tried conn.begin() again and the error was
    swallowed by a broad except.
  • Writes made after a begin() block (the match status update and
    notification in SingleFaceMatchService) were left in a fresh, uncommitted
    transaction.

Impact: photos were processed but their status/visibility, detected faces, and
match records were never saved; audit events were dropped entirely; devices
flagged for invalid push tokens were never marked.

Changes

  • photo-worker: each message runs in one self._conn.begin() transaction,
    so the processing job, detected faces, photo status/visibility, and the
    single-face match commit (or roll back) together. Messages are delivered one
    at a time, so the shared connection is only ever used by one handler.
  • face-match: SingleFaceMatchService no longer opens its own transaction.
    It runs inside the worker's transaction, which removes the nested-begin()
    crash and the lost post-block writes.
  • audit-worker: commit after each persisted event, roll back on failure.
  • notification-worker: DeviceInvalidationStore now writes in its own
    engine.begin() transaction, one per token. This commits the write and stops
    sharing a single connection across the worker's concurrent handlers.

Testing

  • make lint (ruff) - passes
  • make check_type (mypy, 136 files) - passes
  • pytest app/worker/photo_worker/tests - 7 passed, 5 deselected. The deselected
    tests reach a real NatsClient.connect() and need a live NATS broker, which
    isn't available in my sandbox; they don't touch the changed code. The 7 that
    ran cover the no-faces, single-face, and group paths through the new
    per-message transaction.

edaywalid added 3 commits June 1, 2026 01:01
The photo worker opened its connection with engine.connect(), which does
not autocommit, and never issued a commit. Every write in the message
handler (processing job status, detected faces, photo status/visibility,
and the single-face match) was rolled back when the connection closed, so
photos were processed but their results never persisted.

Wrap each message in a single transaction via self._conn.begin(). Messages
are delivered one at a time, so the shared connection is only ever used by
one handler at a time.

SingleFaceMatchService previously managed its own transaction with
self.conn.begin(). That nested begin() raised once create_processing_job
had already opened the connection's transaction, and any writes it made
after the block were left uncommitted. It now runs inside the worker's
transaction instead.
The audit worker ran record_event on a connection opened with
engine.connect() and never committed, so every audit row was discarded
when the connection eventually closed. Commit after each event and roll
back on failure. The worker processes events one at a time, so the shared
connection is safe to commit per message.
DeviceInvalidationStore wrote through a connection opened with
engine.connect() and never committed, so devices flagged for an invalid
push token were never persisted. The worker also runs handlers
concurrently against that single shared connection, which is not safe.

Give each write its own transaction with engine.begin(). One transaction
per token keeps a failure on one token from aborting the others, and
removes the shared connection from the worker entrypoint.
Copy link
Copy Markdown
Collaborator

@ademboukabes ademboukabes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection handling in photo-worker and audit-worker introduced two critical stability issues:

1. Database Disconnects
By initializing self._conn once in run_worker(), the connection is kept alive forever. If PostgreSQL restarts, the worker is stuck looping on OperationalError because the dead socket is never renewed.

2. Silent Crashes (PendingRollbackError)
In photo_worker, SQL errors (like ForeignKeyViolationError) were swallowed by try...except blocks in helpers like _create_job. Because the exception didn't bubble up, the async with self._conn.begin() block tried to COMMIT an aborted transaction, causing a fatal PendingRollbackError.


I suggest applying the following structural fixes:

  1. pool_pre_ping=True (database.py): Add this to the async engine creation. It auto-recovers dead connections upon network drops.
  2. Per-Message Transactions (photo_worker / audit_worker): Remove the global connection self._conn entirely. Use async with engine.begin() as conn: inside handle_message instead.
  3. Error Propagation (photo_worker/main.py): Remove try...except inside SQL helpers (like _create_job) so invalid transactions trigger an automatic ROLLBACK, avoiding PendingRollbackError. Wrap the main logic in handle_message in a try...except to catch and log the failure without crashing the worker.

edaywalid added 3 commits June 1, 2026 11:41
Without pre-ping, a pooled connection that died while idle (for example after
a Postgres restart or an idle-timeout cut by a proxy) is handed out as-is and
the next query fails with a stale-socket error. pool_pre_ping validates the
connection on checkout and transparently replaces a dead one, which the workers
rely on now that they check out a fresh connection per message.
Follow-up to the commit fix, addressing review feedback.

The worker held a single connection for its whole lifetime, so a Postgres
restart left it looping on a dead socket. It now opens a fresh connection and
transaction per message with engine.begin(), which commits on success and rolls
back on failure; combined with pool_pre_ping the connection is revalidated on
checkout.

The SQL helpers (_create_job, _update_job) and the per-face insert used to
catch and swallow DB errors. Inside one transaction that is unsafe: the first
failed statement aborts the whole transaction, and continuing would either hit
"current transaction is aborted" on the next statement or raise
PendingRollbackError at commit. Those swallows are removed so a DB error
propagates and the transaction rolls back cleanly. The per-message handler in
run_worker wraps everything in try/except and logs, so one bad message no
longer tears down the subscription.
Replace the long-lived connection with a fresh engine.begin() per event, for
the same reasons as the photo worker: auto-commit/rollback per message and
recovery from dropped connections via pool_pre_ping. Removes the manual
start/stop connection lifecycle.
@edaywalid
Copy link
Copy Markdown
Collaborator Author

Good catches, both of these are real. Addressed in 3c12a6a, c0258aa, 33f6bd5.

1. Dead connections — agreed. The workers held one connection for the whole process, so a Postgres restart left them on a dead socket. Both photo and audit workers now open a fresh connection + transaction per message with engine.begin() (matching the upload-group worker), and I enabled pool_pre_ping=True on the engine so a stale connection is revalidated/replaced on checkout.

2. PendingRollbackError — agreed, and it's exactly the trap with the previous version: a swallowed DB error aborts the transaction, then the wrapping block tries to COMMIT and blows up. Removed the try/except swallows in _create_job, _update_job, and the per-face insert so a DB error propagates and the transaction rolls back cleanly. The per-message handler in run_worker wraps the whole thing in try/except and logs, so one bad message doesn't kill the subscription.

Net effect per message: fresh connection → engine.begin() → on any DB error, propagate → automatic ROLLBACK → logged at the top level → worker keeps running.

Checks: make lint and make check_type pass; photo-worker tests 7 passed / 5 deselected (the 5 need a live NATS broker, which I can't run in my sandbox).

One thing I left as a follow-up rather than fold in here: the worker still publishes its NATS audit/cleanup events from inside the transaction, so a publish can precede a commit. Moving those to publish-after-commit (outbox) is worth doing but felt out of scope for this fix — happy to do it in a separate PR if you'd prefer.

@ademboukabes ademboukabes merged commit 5991b79 into main Jun 1, 2026
1 check passed
@ademboukabes ademboukabes deleted the fix/worker-db-commits branch June 1, 2026 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants