Fix: add exclusive lock retry to create_repo_if_not_exists and unify lock timeout #7729
Open
jsfrerot wants to merge 1 commit into
…lock timeout Introduce RESTIC_LOCK_RETRIES (default 720 × 5 s = 60 min) so that create_repo_if_not_exists, remove_snapshots, and run_with_lock_retry all honour the same configurable wait budget when a concurrent prune holds the exclusive Restic lock. Also adds the new attribute to the FireEdge datastore configuration form.
Problem
When a backup job runs and a concurrent restic forget --prune (triggered by old image deletion) holds the exclusive Restic repository lock, subsequent VM backups fail immediately.
This happens because create_repo_if_not_exists runs restic stats, which needs to acquire a lock, gets exit code 11 (lock contention), and has no retry logic, so it raises immediately. remove_snapshots and run_with_lock_retry already handle lock contention via retry loops (fixed in #7403 / #7404), but create_repo_if_not_exists was missed.
With RESTIC_PRUNE_MAX_UNUSED=0 (the default), a prune can take 50+ minutes on large repositories, so a 60-minute wait budget is needed.

Root cause
The shell script in create_repo_if_not_exists:
    restic stats || restic init
When restic stats exits 11 (locked), the shell runs restic init, which exits non-zero with "config already exists". The Ruby code sees a non-zero exit but not exit code 11, so the existing "retry on code 11" pattern cannot work. The script must be changed to propagate exit code 11 explicitly.

Fix
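As an illustration of the root cause and of the kind of restructuring this fix calls for, the sketch below reproduces the exit-code masking without restic. It is not the script from this PR: fake_stats and fake_init are hypothetical stand-ins that return 11 (locked) and 1 ("config already exists"), and the restructured script is only one possible way to propagate exit code 11.

```ruby
require 'open3'

# Hypothetical stand-ins for `restic stats` (locked, exit 11) and
# `restic init` (fails with "config already exists", exit 1).
STUBS = <<~'SH'
  fake_stats() { return 11; }
  fake_init()  { return 1; }
SH

# Original pattern: the || chain replaces exit code 11 with init's exit code.
MASKING = STUBS + "fake_stats || fake_init\n"

# One possible restructuring (an assumption, not the PR's exact script):
# capture the stats exit code and pass 11 through before trying init.
PROPAGATING = STUBS + <<~'SH'
  fake_stats
  rc=$?
  [ "$rc" -eq 0 ] && exit 0
  [ "$rc" -eq 11 ] && exit 11
  fake_init
SH

# Run a script with sh and return its exit code.
def exit_code(script)
  _out, status = Open3.capture2e('sh', '-c', script)
  status.exitstatus
end

masked = exit_code(MASKING)
propagated = exit_code(PROPAGATING)
puts "masking: #{masked}, propagating: #{propagated}"
# prints "masking: 1, propagating: 11"
```

With the first script the caller only ever sees exit code 1, so a check for code 11 never matches; with the second, the lock-contention code reaches the caller unchanged.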
src/datastore_mad/remotes/restic/restic.rb
- Read a new RESTIC_LOCK_RETRIES datastore attribute (default 720, i.e. 720 × 5 s = 60 min) in initialize, replacing the hard-coded 60 in remove_snapshots and run_with_lock_retry.
- Restructure the shell script so exit code 11 from restic stats is propagated.
- Add the same retry loop used elsewhere (rc.code == 11 → sleep → retry) to create_repo_if_not_exists.
src/fireedge/src/modules/constants/translates.js: add ResticLockRetries / ResticLockRetriesConcept i18n keys.
src/fireedge/src/modules/components/Forms/Datastore/CreateForm/Steps/ConfigurationAttributes/Fields/restic.js: add RESTIC_LOCK_RETRIES numeric field to the datastore configuration form.

New datastore attribute
RESTIC_LOCK_RETRIES (default: 720)

Testing
Verified on OpenNebula 6.10.1 with restic 0.17.3 (frontend) / 0.18.0 (backup server):
restic forget --prune --max-unused 0 held an exclusive lock for ~54 minutes.

Relation to existing issues
- remove_snapshots retry (broken regex → rc.code==11)
- pull_disks / pull_other retry via run_with_lock_retry
- create_repo_if_not_exists retry (never had any) + unified timeout
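
The retry pattern named in the Fix section (rc.code == 11 → sleep → retry, bounded by RESTIC_LOCK_RETRIES) could look roughly like the sketch below. This is not the code from restic.rb: Result, lock_retries, with_lock_retry, and the injectable sleeper are hypothetical names chosen for illustration, and the 5-second interval is taken from the "720 × 5 s" figure in the description.

```ruby
# Result is a hypothetical stand-in for whatever a command runner returns.
Result = Struct.new(:code, :stdout)

LOCK_EXIT_CODE = 11          # lock contention exit code, per the PR text
DEFAULT_LOCK_RETRIES = 720   # default stated in the PR
RETRY_INTERVAL = 5           # seconds; 720 * 5 s = 3600 s = 60 min

# Read the retry budget from a hash of datastore attributes (assumed shape),
# falling back to the default when the attribute is missing or not positive.
def lock_retries(ds_attrs)
  value = ds_attrs['RESTIC_LOCK_RETRIES'].to_i
  value > 0 ? value : DEFAULT_LOCK_RETRIES
end

# Run the block once, then retry up to `retries` times while it reports
# lock contention, sleeping `interval` seconds before each retry.
def with_lock_retry(retries, interval: RETRY_INTERVAL, sleeper: Kernel.method(:sleep))
  rc = yield
  retries.times do
    break unless rc.code == LOCK_EXIT_CODE

    sleeper.call(interval)
    rc = yield
  end
  rc
end

# Usage: a simulated command that is locked twice and then succeeds.
# The sleeper is replaced by a recorder so the example finishes instantly.
codes = [11, 11, 0]
slept = []
rc = with_lock_retry(lock_retries({}), sleeper: ->(s) { slept << s }) do
  Result.new(codes.shift, '')
end
puts rc.code        # prints 0
puts slept.inspect  # prints [5, 5]
```

Injecting the sleeper keeps the loop testable without waiting; if the budget runs out, the last result (still code 11) is returned so the caller can raise as before.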