Description
Problem
Async jobs frequently fail because of intermittent network hiccups. Currently, a failed async job is never retried: the job is dead and must be deleted manually.
Proposal
This proposal takes a multi-part approach that should work around the issue and resolve most network-related job failures.
1. The async platform should be configurable to re-queue and re-attempt an async job n times before giving up on that job.
2. There should be a dead-letter queue where jobs that have failed n times go.
3. The job metadata should include a try-count value that starts at zero and is incremented with every attempt to execute that job.
4. When a job fails n times, its message will be pushed to the dead-letter queue.
5. New admin methods will be added to view and pop messages from the dead-letter queue.
1 Configurable Retries
A new option will be added to the async platform that controls how many times a job may be retried before it is considered permanently failed. The job handler will read this option; on failure, it will either retry the job itself or requeue the job onto the RabbitMQ queue from which it was popped.
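As a rough sketch of how such an option might drive the handler's decision, the following assumes that n counts total attempts and that a per-job-type override falls back to a platform-wide default. `RetryPolicy`, `Action`, and every method name here are hypothetical and not part of the existing platform.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical retry settings; all names are illustrative only. */
final class RetryPolicy {
    /** What the job handler should do with a job that just failed. */
    enum Action { REQUEUE, DEAD_LETTER }

    private final int defaultMaxAttempts;
    private final Map<String, Integer> perJobType;

    RetryPolicy(int defaultMaxAttempts, Map<String, Integer> perJobType) {
        if (defaultMaxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.defaultMaxAttempts = defaultMaxAttempts;
        this.perJobType = new HashMap<>(perJobType); // defensive copy
    }

    /** Per-job-type limit if one is configured, otherwise the default. */
    int maxAttemptsFor(String jobType) {
        return perJobType.getOrDefault(jobType, defaultMaxAttempts);
    }

    /**
     * Assumption: tryCount is the number of attempts made so far,
     * including the attempt that just failed.
     */
    Action onFailure(String jobType, int tryCount) {
        return tryCount < maxAttemptsFor(jobType)
                ? Action.REQUEUE
                : Action.DEAD_LETTER;
    }
}
```

If the value ends up configured per queue rather than per job type, the same shape would apply with the queue name as the lookup key.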
Decision Points
- Should this be a per-queue or per-job-type configuration value?
- Should retries be synchronous, or should failed jobs be pushed back onto the queue?
2 & 4 Dead-Letter Queue
A new queue should be added to RabbitMQ for jobs that have failed the configured max number of times. Jobs that are now considered permanently failed will have their message pushed to the dead-letter queue.
Decision Points
- Should there be a single dead-letter queue, or should a dead-letter queue be set up for every job-type queue?
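One option worth weighing here is RabbitMQ's built-in dead-lettering: a queue declared with the `x-dead-letter-exchange` argument (and optionally `x-dead-letter-routing-key`) has messages republished by the broker when a consumer rejects them without requeueing. The sketch below only builds that argument map for the per-job-type variant. The `.dead-letter` suffix, the class name, and the method names are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; the naming scheme is an assumption, not an existing convention. */
final class DeadLetterNaming {
    /** One dead-letter queue per job-type queue, derived from the work queue name. */
    static String deadLetterQueueFor(String workQueue) {
        return workQueue + ".dead-letter";
    }

    /**
     * Arguments that could be passed as the `arguments` parameter when
     * declaring the work queue (for example via Channel.queueDeclare in the
     * RabbitMQ Java client). The two keys are standard RabbitMQ queue arguments.
     */
    static Map<String, Object> workQueueArguments(String deadLetterExchange, String workQueue) {
        Map<String, Object> args = new HashMap<>();
        args.put("x-dead-letter-exchange", deadLetterExchange);
        args.put("x-dead-letter-routing-key", deadLetterQueueFor(workQueue));
        return args;
    }
}
```

For a single shared dead-letter queue, the routing key would be one fixed value instead of a name derived per work queue.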
3 Job Metadata
The job metadata/message JSON does not allow for arbitrary fields. It will need to be updated to allow for a try-count field, which will default to 0. When a job is attempted, the try-count value will be incremented. If the job fails, the updated job metadata/message will be pushed to the back of the queue from which it was originally popped.
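The flow described above can be sketched with in-memory queues standing in for RabbitMQ. This is a minimal illustration under stated assumptions, not the platform's implementation: `TrackedJobMessage`, `Job`, and `RetryingHandler` are hypothetical names, and the limit is assumed to count total attempts.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical job message carrying the proposed try-count field. */
final class TrackedJobMessage {
    final String jobId;
    final int tryCount; // defaults to 0 for a freshly submitted job

    TrackedJobMessage(String jobId, int tryCount) {
        this.jobId = jobId;
        this.tryCount = tryCount;
    }

    /** Copy of this message with the try count incremented by one. */
    TrackedJobMessage nextAttempt() {
        return new TrackedJobMessage(jobId, tryCount + 1);
    }
}

/** Hypothetical unit of work; may throw on failure. */
interface Job {
    void run(TrackedJobMessage message) throws Exception;
}

/** In-memory stand-in for the job handler and its two queues. */
final class RetryingHandler {
    final Deque<TrackedJobMessage> workQueue = new ArrayDeque<>();
    final Deque<TrackedJobMessage> deadLetters = new ArrayDeque<>();
    private final int maxAttempts;

    RetryingHandler(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Pops and runs one message; returns false when the work queue is empty. */
    boolean handleNext(Job job) {
        TrackedJobMessage popped = workQueue.pollFirst();
        if (popped == null) {
            return false;
        }
        // Increment the try count when the job is attempted.
        TrackedJobMessage attempt = popped.nextAttempt();
        try {
            job.run(attempt);
        } catch (Exception e) {
            if (attempt.tryCount < maxAttempts) {
                workQueue.addLast(attempt);   // back of the original queue
            } else {
                deadLetters.addLast(attempt); // permanently failed
            }
        }
        return true;
    }
}
```

Because the incremented message is what gets requeued, the try count survives across consumers without any shared state outside the message itself.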
5 Dead-Letter Queue Operations
There should be at least two new methods added to the RabbitMQ wrapper and AsyncPlatform facade.
The first will list all messages on the dead letter queue and push them right back onto the queue after they have been read.
The second will peek at the next message and pass it to a callback, which returns a flag indicating whether the message should be popped from the queue.
AsyncPlatform.nextDeadLetter { message ->
    // Do something
    if (shouldPop)
        return true
    else
        return false
}
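To make the semantics of the two operations concrete, here is a small in-memory sketch in Java. It is not the RabbitMQ wrapper itself: `DeadLetterAdmin`, `listDeadLetters`, and the `Predicate`-based signature for `nextDeadLetter` are assumptions for illustration, and messages are plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical in-memory model of the dead-letter admin operations. */
final class DeadLetterAdmin {
    private final Deque<String> deadLetters = new ArrayDeque<>();

    void push(String message) {
        deadLetters.addLast(message);
    }

    /** Reads every message and pushes it straight back, preserving order. */
    List<String> listDeadLetters() {
        List<String> seen = new ArrayList<>();
        int count = deadLetters.size();
        for (int i = 0; i < count; i++) {
            String message = deadLetters.pollFirst();
            seen.add(message);
            deadLetters.addLast(message);
        }
        return seen;
    }

    /**
     * Peeks at the next message and hands it to the callback. The message is
     * removed only if the callback returns true. Returns whether it was removed.
     */
    boolean nextDeadLetter(Predicate<String> shouldPop) {
        String head = deadLetters.peekFirst();
        if (head == null) {
            return false; // nothing to inspect
        }
        if (shouldPop.test(head)) {
            deadLetters.pollFirst();
            return true;
        }
        return false;
    }
}
```

Against a real broker, the list operation would need care so that reading and republishing messages does not reorder or duplicate them if the admin call is interrupted partway through.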