Dead Letter Queue, Retries for Failed Async Jobs, & Dead Letter Admin Methods #52

@Foxcapades

Problem

Async jobs frequently fail due to intermittent network hiccups. Currently, a failed async job is never retried; it is dead and must be deleted manually.

Proposal

This proposal takes a multi-part approach to work around the issue and resolve most network-related job failures.

  1. The async platform should be configurable to re-queue and re-attempt an async
    job n times before giving up on that job.
  2. There should be a dead-letter queue where jobs that failed n times go.
  3. The job metadata should include a try-count value that starts at zero and is
    incremented with every attempt to execute that job.
  4. When a job fails n times its message will be pushed to the dead-letter queue.
  5. New admin methods will be added to view and pop messages from the dead-letter queue.

1 Configurable Retries

A new option will be added to the async platform that controls how many times a job may be retried before it is considered permanently failed. The job handler will read this option, and on failure it will either retry the job directly or requeue it onto the RabbitMQ queue from which it was popped.
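The decision the job handler makes on failure could look roughly like the sketch below. All names here (`FailureAction`, `onJobFailure`, `maxTries`) are hypothetical, and the sketch assumes that n counts total attempts, so a job whose try count has reached the limit goes to the dead-letter queue instead of being requeued.

```kotlin
// Hypothetical names throughout; none of these exist in the async platform today.

/** What the job handler should do with a job that just failed. */
enum class FailureAction { REQUEUE, DEAD_LETTER }

/**
 * Decides the fate of a failed job.
 *
 * @param tryCount number of attempts made so far, including the one that just failed
 * @param maxTries configured number of attempts allowed before giving up (assumed meaning of "n")
 */
fun onJobFailure(tryCount: Int, maxTries: Int): FailureAction {
  require(maxTries >= 1) { "maxTries must be at least 1" }
  // Requeue while attempts remain; otherwise the job is permanently failed.
  return if (tryCount < maxTries) FailureAction.REQUEUE else FailureAction.DEAD_LETTER
}
```

With `maxTries = 3`, the first and second failures would requeue the job and the third would send it to the dead-letter queue. Whether `maxTries` comes from a per-queue or per-job-type setting is left open, as in the decision points.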

Decision Points

  • Should this be a per-queue or per-job-type configuration value?
  • Should retries be synchronous, or should failed jobs be pushed back onto the queue?

2 & 4 Dead Letter Queue

A new queue should be added to RabbitMQ for jobs that have failed the configured maximum number of times. Jobs that are considered permanently failed will have their message pushed to this dead-letter queue.

Decision Points

  • Should there be a single dead-letter queue, or should a dead-letter queue be set up for every job-type queue?

3 Job Metadata

The job metadata/message JSON currently does not allow arbitrary fields. It will need to be updated to allow a try-count field, which will default to 0. Each time a job is attempted, the try-count value will be incremented. If the job fails, the updated job metadata/message will be pushed to the back of the queue from which it was originally popped.
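As a rough illustration of the try-count field, the sketch below models the message as a Kotlin data class. `JobMessage`, `jobId`, and `nextAttempt` are hypothetical names, and the real metadata carries other fields that are not shown; only the default of 0 and the increment per attempt come from the proposal.

```kotlin
// Hypothetical message shape; the real job metadata has other fields not shown here.
data class JobMessage(
  val jobId: String,
  // Defaults to 0 so a message created without the field starts with no attempts recorded.
  val tryCount: Int = 0,
)

/** Returns a copy of the message with the try count incremented for the attempt being made. */
fun JobMessage.nextAttempt(): JobMessage = copy(tryCount = tryCount + 1)
```

Because `nextAttempt` returns a copy, the handler would requeue the returned message on failure rather than the one it originally popped.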

5 Dead-Letter Queue Operations

There should be at least two new methods added to the RabbitMQ wrapper and AsyncPlatform facade.

The first will list all messages on the dead-letter queue, pushing each one right back onto the queue after it has been read.

The second will peek at the next message and pass it to a callback, which returns a flag indicating whether the message should be popped from the queue.

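AsyncPlatform.nextDeadLetter { message ->
  // Do something with the message, then decide its fate.
  // The lambda's last expression is its result:
  // true pops the message from the dead-letter queue, false leaves it in place.
  shouldPop
}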
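To make the intended semantics of the two methods concrete, here is a minimal in-memory sketch. It is not the RabbitMQ wrapper: `InMemoryDeadLetters` and `listDeadLetters` are hypothetical names, messages are plain strings, and the `Boolean` returned by `nextDeadLetter` (false when the queue is empty) is an assumption that the proposal does not specify.

```kotlin
// In-memory stand-in for the dead-letter queue; the real methods would talk to RabbitMQ.
class InMemoryDeadLetters(initial: List<String> = emptyList()) {
  private val queue = ArrayDeque(initial)

  /** Reads every message and pushes it right back, so contents and order are unchanged. */
  fun listDeadLetters(): List<String> {
    val seen = mutableListOf<String>()
    // One full rotation: each message is popped from the front and pushed onto the back.
    repeat(queue.size) {
      val message = queue.removeFirst()
      seen += message
      queue.addLast(message)
    }
    return seen
  }

  /**
   * Peeks at the next message and hands it to [handler].
   * The message is popped only if the handler returns true.
   * Returns false when the queue is empty, true otherwise (assumed behavior).
   */
  fun nextDeadLetter(handler: (String) -> Boolean): Boolean {
    val message = queue.firstOrNull() ?: return false
    if (handler(message)) queue.removeFirst()
    return true
  }
}
```

For example, with the messages `"a"` and `"b"` queued, calling `nextDeadLetter { false }` leaves both in place, while `nextDeadLetter { it == "a" }` pops `"a"` so that a later listing shows only `"b"`.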
