Skip to content

feat(queue): reconnect commands/publisher path on Redis outage#34

Open
abnegate wants to merge 1 commit into
mainfrom
feat/queue-publisher-reconnect
Open

feat(queue): reconnect commands/publisher path on Redis outage#34
abnegate wants to merge 1 commit into
mainfrom
feat/queue-publisher-reconnect

Conversation

@abnegate

Copy link
Copy Markdown
Member

What

The Broker\Redis receive loop already survives a Redis/Dragonfly restart: on RedisException/RedisClusterException it closes the dead connection, backs off, and reconnects on the next poll. The commands connection — which backs enqueue, commit, reject, retry, and getQueueSize — had no equivalent handling. A pooled connection that died between checkout and reuse threw straight through, dropping the publish or ack.

This wraps those operations in a new executeCommand() helper that mirrors receive()'s bounded reconnect (close + jittered exponential backoff, reusing the existing setReconnectCallback/setReconnectSuccessCallback hooks), but retries inline since publishing/acks have no outer poll loop. Capped at COMMAND_MAX_ATTEMPTS (3); a non-connection error (e.g. a RuntimeException) propagates immediately without retrying.

Why

Appwrite Cloud's queue publisher is a pooled Broker\Redis used purely for enqueue. Today a Dragonfly failover silently drops enqueues on stale pooled connections — the cache adapter already reconnects transparently, the queue publisher does not. This closes that gap at the source.

Tests

Added publisher/ack-path cases to RedisReconnectCallbackTest (unit tier, in-memory Connection doubles, no live Redis):

  • testEnqueueReconnectsAndSucceeds — fail once, reconnect, succeed; success callback fires with attempt count
  • testEnqueueThrowsAfterExhaustingReconnects — 3 attempts, 2 reconnects, then rethrow
  • testCommitReconnectsAndSucceeds — ack path reconnects
  • testEnqueueDoesNotRetryNonConnectionErrorsRuntimeException propagates, no reconnect

3 of the 4 fail on main and pass with the change. composer test (38 tests), Pint, and PHPStan (level 5) all green locally.

🤖 Generated with Claude Code

The Broker receive loop already survives a Redis/Dragonfly restart by
closing the dead connection and reconnecting on the next poll. The
commands connection backing enqueue, commit, reject, retry and
getQueueSize had no such handling: a connection that died between
checkout and reuse threw straight through, dropping the publish or ack.

Wrap those operations in executeCommand(), which mirrors receive()'s
bounded reconnect (close + jittered exponential backoff, reusing the
reconnect callbacks) but retries inline since there is no outer loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 25, 2026 06:30

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a executeCommand() helper to Broker\Redis that mirrors the existing receive() reconnect logic (close + jittered exponential backoff, up to COMMAND_MAX_ATTEMPTS = 3) for the commands connection, protecting enqueue, commit, reject, retry, and getQueueSize from silent drops on stale pooled connections.

  • enqueue, getQueueSize, and the retry() inner pop are each single-operation closures and reconnect correctly.
  • commit() and reject() each bundle four Redis commands in one closure; if the connection drops after increment/decrement has already executed at the Redis level, the retry re-runs the whole closure and double-fires those counter mutations, corrupting queue stats.
  • Four new unit tests exercise publisher reconnect, reconnect exhaustion, commit reconnect, and non-Redis error propagation, but all connection doubles throw before the first command succeeds, so mid-closure partial execution is not covered.

Confidence Score: 3/5

The enqueue and single-command paths are safe to merge; the commit/reject multi-command closures can silently double-count stats counters on a mid-closure Redis failure.

The commit() and reject() closures wrap four Redis commands — including increment and decrement — in a single retriable unit. If Redis acknowledges the counter mutation but the connection drops before the next command, the retry re-issues the entire closure, double-firing the counter. This corrupts success/failed/processing stats permanently and silently, with no existing test able to catch it.

packages/queue/src/Queue/Broker/Redis.php — specifically the commit() and reject() closures passed to executeCommand().

Important Files Changed

Filename Overview
packages/queue/src/Queue/Broker/Redis.php Adds executeCommand() helper with bounded retry/backoff for the commands connection. Single-command wrappers are safe, but the multi-step commit() and reject() closures are non-idempotent: a mid-closure connection drop causes stats counters to double-fire on retry.
packages/queue/tests/Queue/E2E/Adapter/RedisReconnectCallbackTest.php Four new unit tests cover enqueue reconnect/exhaustion, commit reconnect, and non-Redis error propagation. All test doubles only exercise pre-first-command failure; mid-closure failure (the scenario where stats can double-count) is not covered.

Reviews (1): Last reviewed commit: "feat(queue): reconnect commands/publishe..." | Re-trigger Greptile

Comment on lines +108 to +113
$this->executeCommand($queue, function () use ($queue, $pid): void {
$this->commands->remove("{$queue->namespace}.jobs.{$queue->name}.{$pid}");
$this->commands->increment("{$queue->namespace}.stats.{$queue->name}.success");
$this->commands->listRemove("{$queue->namespace}.processing.{$queue->name}", $pid);
$this->commands->decrement("{$queue->namespace}.stats.{$queue->name}.processing");
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Retry re-executes partially-completed closures, corrupting stats counters

If the connection drops after remove() has already been acknowledged by Redis but before increment() is sent, executeCommand() retries the entire closure. On the second run remove() is a safe no-op, but increment("{…}.success") and decrement("{…}.processing") both fire a second time — permanently over-counting successes and under-counting processing. The same applies to reject() (increment("{…}.failed") / decrement("{…}.processing")).

The root cause is that four non-idempotent Redis commands are batched into one closure that the outer retry loop treats as atomic. Wrapping each individual command in its own executeCommand() call would avoid re-running already-successful counter mutations on reconnect. Alternatively, a Redis pipeline/MULTI-EXEC transaction would make the whole block genuinely atomic at the Redis level.

Comment on lines +120 to +141

// COMMAND_MAX_ATTEMPTS (3): 3 pushes, 2 reconnects between them.
$this->assertSame(3, $connection->pushAttempts);
$this->assertSame(2, $reconnects);
}

public function testCommitReconnectsAndSucceeds(): void
{
$queue = new Queue('ack-reconnect');
$connection = new FailOnceAckConnection();
$broker = new RedisBroker($connection, $connection);
$reconnects = 0;

$broker->setReconnectCallback(function () use (&$reconnects): void {
$reconnects++;
});

$broker->commit($queue, new Message(['pid' => 'job-1', 'queue' => 'ack-reconnect', 'timestamp' => 0, 'payload' => []]));

$this->assertSame(2, $connection->removeAttempts);
$this->assertSame(1, $connection->closeCalls);
$this->assertSame(1, $reconnects);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 testCommitReconnectsAndSucceeds only covers pre-first-command failure, masking the mid-closure double-count bug

FailOnceAckConnection.remove() throws on the very first call, so the entire closure never partially succeeds. The test therefore cannot detect that increment and decrement run twice when the connection drops between commands. Adding a connection double that lets remove() succeed but throws on increment() would expose the stats double-counting described in the production code comment above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants