fix(workers): commit database writes that were silently rolled back by edaywalid · Pull Request #50 · MicroClub-USTHB/multAI-back

edaywalid · 2026-06-01T00:34:58Z

Problem

The background workers open their database connection with engine.connect(),
which in SQLAlchemy 2.0 async does not autocommit. None of them ever issued
a commit, so every write they made was rolled back when the connection closed.
Only the request path (get_db) and the upload-group worker used
engine.begin(), which commits.

Confirmed against SQLAlchemy 2.0.47 with Postgres:

A write on an engine.connect() connection does not persist without an
explicit commit().
After any statement has run (autobegin), calling conn.begin() raises
InvalidRequestError. This is why the single-face path silently failed:
create_processing_job opened the transaction first, then
SingleFaceMatchService tried conn.begin() again and the error was
swallowed by a broad except.
Writes made after a begin() block (the match status update and
notification in SingleFaceMatchService) were left in a fresh, uncommitted
transaction.

Impact: photos were processed but their status/visibility, detected faces, and
match records were never saved; audit events were dropped entirely; devices
flagged for invalid push tokens were never marked.

Changes

photo-worker: each message runs in one self._conn.begin() transaction,
so the processing job, detected faces, photo status/visibility, and the
single-face match commit (or roll back) together. Messages are delivered one
at a time, so the shared connection is only ever used by one handler.
face-match: SingleFaceMatchService no longer opens its own transaction.
It runs inside the worker's transaction, which removes the nested-begin()
crash and the lost post-block writes.
audit-worker: commit after each persisted event, roll back on failure.
notification-worker: DeviceInvalidationStore now writes in its own
engine.begin() transaction, one per token. This commits the write and stops
sharing a single connection across the worker's concurrent handlers.

Testing

make lint (ruff) - passes
make check_type (mypy, 136 files) - passes
pytest app/worker/photo_worker/tests - 7 passed, 5 deselected. The deselected
tests reach a real NatsClient.connect() and need a live NATS broker, which
isn't available in my sandbox; they don't touch the changed code. The 7 that
ran cover the no-faces, single-face, and group paths through the new
per-message transaction.

The photo worker opened its connection with engine.connect(), which does not autocommit, and never issued a commit. Every write in the message handler (processing job status, detected faces, photo status/visibility, and the single-face match) was rolled back when the connection closed, so photos were processed but their results never persisted. Wrap each message in a single transaction via self._conn.begin(). Messages are delivered one at a time, so the shared connection is only ever used by one handler at a time. SingleFaceMatchService previously managed its own transaction with self.conn.begin(). That nested begin() raised once create_processing_job had already opened the connection's transaction, and any writes it made after the block were left uncommitted. It now runs inside the worker's transaction instead.

The audit worker ran record_event on a connection opened with engine.connect() and never committed, so every audit row was discarded when the connection eventually closed. Commit after each event and roll back on failure. The worker processes events one at a time, so the shared connection is safe to commit per message.

DeviceInvalidationStore wrote through a connection opened with engine.connect() and never committed, so devices flagged for an invalid push token were never persisted. The worker also runs handlers concurrently against that single shared connection, which is not safe. Give each write its own transaction with engine.begin(). One transaction per token keeps a failure on one token from aborting the others, and removes the shared connection from the worker entrypoint.

ademboukabes

The connection handling in photo-worker and audit-worker introduced two critical stability issues:

1. Database Disconnects
By initializing self._conn once in run_worker(), the connection is kept alive forever. If PostgreSQL restarts, the worker is stuck looping on OperationalError because the dead socket is never renewed.

2. Silent Crashes (PendingRollbackError)
In photo_worker, SQL errors (like ForeignKeyViolationError) were swallowed by try...except blocks in helpers like _create_job. Because the exception didn't bubble up, the async with self._conn.begin() block tried to COMMIT an aborted transaction, causing a fatal PendingRollbackError.

I suggest applying the following structural fixes:

pool_pre_ping=True (database.py): Add this to the async engine creation. It auto-recovers dead connections upon network drops.
Per-Message Transactions (photo_worker / audit_worker): Remove the global connection self._conn entirely. Use async with engine.begin() as conn: inside handle_message instead.
Error Propagation (photo_worker/main.py): Remove try...except inside SQL helpers (like _create_job) so invalid transactions trigger an automatic ROLLBACK, avoiding PendingRollbackError. Wrap the main logic in handle_message in a try...except to catch and log the failure without crashing the worker.

Without pre-ping, a pooled connection that died while idle (for example after a Postgres restart or an idle-timeout cut by a proxy) is handed out as-is and the next query fails with a stale-socket error. pool_pre_ping validates the connection on checkout and transparently replaces a dead one, which the workers rely on now that they check out a fresh connection per message.

Follow-up to the commit fix, addressing review feedback. The worker held a single connection for its whole lifetime, so a Postgres restart left it looping on a dead socket. It now opens a fresh connection and transaction per message with engine.begin(), which commits on success and rolls back on failure; combined with pool_pre_ping the connection is revalidated on checkout. The SQL helpers (_create_job, _update_job) and the per-face insert used to catch and swallow DB errors. Inside one transaction that is unsafe: the first failed statement aborts the whole transaction, and continuing would either hit "current transaction is aborted" on the next statement or raise PendingRollbackError at commit. Those swallows are removed so a DB error propagates and the transaction rolls back cleanly. The per-message handler in run_worker wraps everything in try/except and logs, so one bad message no longer tears down the subscription.

Replace the long-lived connection with a fresh engine.begin() per event, for the same reasons as the photo worker: auto-commit/rollback per message and recovery from dropped connections via pool_pre_ping. Removes the manual start/stop connection lifecycle.

edaywalid · 2026-06-01T10:44:56Z

Good catches, both of these are real. Addressed in 3c12a6a, c0258aa, 33f6bd5.

1. Dead connections — agreed. The workers held one connection for the whole process, so a Postgres restart left them on a dead socket. Both photo and audit workers now open a fresh connection + transaction per message with engine.begin() (matching the upload-group worker), and I enabled pool_pre_ping=True on the engine so a stale connection is revalidated/replaced on checkout.

2. PendingRollbackError — agreed, and it's exactly the trap with the previous version: a swallowed DB error aborts the transaction, then the wrapping block tries to COMMIT and blows up. Removed the try/except swallows in _create_job, _update_job, and the per-face insert so a DB error propagates and the transaction rolls back cleanly. The per-message handler in run_worker wraps the whole thing in try/except and logs, so one bad message doesn't kill the subscription.

Net effect per message: fresh connection → engine.begin() → on any DB error, propagate → automatic ROLLBACK → logged at the top level → worker keeps running.

Checks: make lint and make check_type pass; photo-worker tests 7 passed / 5 deselected (the 5 need a live NATS broker, which I can't run in my sandbox).

One thing I left as a follow-up rather than fold in here: the worker still publishes its NATS audit/cleanup events from inside the transaction, so a publish can precede a commit. Moving those to publish-after-commit (outbox) is worth doing but felt out of scope for this fix — happy to do it in a separate PR if you'd prefer.

…id RuntimeError

edaywalid added 3 commits June 1, 2026 01:01

ademboukabes reviewed Jun 1, 2026

View reviewed changes

edaywalid added 3 commits June 1, 2026 11:41

fix(photo-worker): use RedisClient.init instead of constructor to avo…

bec23c6

…id RuntimeError

ademboukabes merged commit 5991b79 into main Jun 1, 2026
1 check passed

ademboukabes deleted the fix/worker-db-commits branch June 1, 2026 10:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(workers): commit database writes that were silently rolled back#50

fix(workers): commit database writes that were silently rolled back#50
ademboukabes merged 7 commits into
mainfrom
fix/worker-db-commits

edaywalid commented Jun 1, 2026 •

edited

Loading

Uh oh!

ademboukabes left a comment

Uh oh!

edaywalid commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

edaywalid commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

Testing

Uh oh!

ademboukabes left a comment

Choose a reason for hiding this comment

I suggest applying the following structural fixes:

Uh oh!

edaywalid commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

edaywalid commented Jun 1, 2026 •

edited

Loading