Downgrade Hangfire to 1.8.22 and Hangfire.Mongo to 1.12.2 #876
pmachapman merged 1 commit into main
Enkidu93
left a comment
Thanks for figuring this out, Peter! If this is a bug in the library, have you noticed other folks hitting this same problem online? Is there an issue or ticket we can create with the maintainers?
@Enkidu93 reviewed 4 files and all commit messages, and made 1 comment.
ddaspit
left a comment
If you have enough information, it would be good to submit an issue to the repo. I'm guessing you tried version 1.13.1. It looks like it had a fix for some issues related to state history.
@ddaspit reviewed 4 files and all commit messages, and made 1 comment.
pmachapman
left a comment
> I'm guessing you tried version 1.13.1. It looks like it had a fix for some issues related to state history.
Yes, sadly this did not help.
> If this is a bug in the library, have you noticed other folks hitting this same problem online? Is there an issue or ticket we can create with the maintainers?
I couldn't find anyone else with the issue. To proceed (assuming this PR fixes the bug in prod as it did for me locally), I would need to spend a reasonable chunk of time actually isolating the bug. I think it could be one of three broad areas:
- A logic issue in Hangfire.Mongo.
- A bug with how we are using MongoDB Atlas, where the StateHistory collection is not kept in sync across shards for the latest state history documents.
- A bug with how we are using Hangfire, where we have two pods accessing the same Hangfire database at the same time.
I think I will need to create a test harness and see if I can replicate the bug outside of Serval and go from there.
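To make the second hypothesis concrete, here is a minimal toy sketch in Python. It is an in-memory model only, not Hangfire, Hangfire.Mongo, or MongoDB code: the `Replica` class, the `should_abort` helper, and the job id are hypothetical, and the rule that any state other than "Processing" aborts the job is an assumption that has not been confirmed in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical store holding the state history for each job id."""
    history: dict = field(default_factory=dict)

    def append(self, job_id, state):
        # Record a new state history entry for the job.
        self.history.setdefault(job_id, []).append(state)

    def latest_state(self, job_id):
        # Return the most recent state this replica knows about, if any.
        states = self.history.get(job_id, [])
        return states[-1] if states else None


def should_abort(observed_state):
    # Assumption: a running job is aborted whenever the observed
    # state is anything other than "Processing".
    return observed_state != "Processing"


primary = Replica()
lagging = Replica()

# Both replicas have seen the job being enqueued.
for replica in (primary, lagging):
    replica.append("job-1", "Enqueued")

# The move to Processing has only reached the primary so far.
primary.append("job-1", "Processing")

fresh = primary.latest_state("job-1")
stale = lagging.latest_state("job-1")
print(fresh, should_abort(fresh))
print(stale, should_abort(stale))
```

If the real behaviour resembles this model, a read served by a lagging node would be enough to cancel a healthy job, which is the kind of condition a standalone test harness would need to reproduce.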
ddaspit
left a comment
It would probably be good to spend at least some time isolating the issue. I don't want to get stuck on an old version of Hangfire indefinitely. You should timebox your investigation.
pmachapman
left a comment
> It would probably be good to spend at least some time isolating the issue. I don't want to get stuck on an old version of Hangfire indefinitely. You should timebox your investigation.
Sounds good - will do.
In my testing, it appears that the `OperationCanceledException` instances that are cancelling jobs are coming from Hangfire directly, when reading the JobState.

It looks like a bug in the implementation for the new `StateHistory` collection in Hangfire.Mongo reading the incorrect state (perhaps because it is reading from a shard in the MongoDB Atlas cluster that does not yet have the latest StateHistory document for the jobId?), so I have downgraded to the version of Hangfire and Hangfire.Mongo before this change was made.

In my testing, this stopped the jobs being cancelled.

The main downside of this PR is that the `serval_jobs` and `machine_jobs` collections will need to be dropped on internal QA and external QA when this PR is deployed.