Fix: add exclusive lock retry to create_repo_if_not_exists and unify lock timeout #7729

Open
jsfrerot wants to merge 1 commit into OpenNebula:master from jsfrerot:fix/restic-create-repo-lock-retry

jsfrerot commented Jun 3, 2026

Problem

When a backup job runs and a concurrent restic forget --prune (triggered by old image deletion) holds the exclusive Restic repository lock, subsequent VM backups fail immediately with:

unable to create lock in backend: repository is already locked exclusively

This happens because create_repo_if_not_exists runs restic stats, which must acquire a repository lock. While the exclusive lock is held, restic stats fails with exit code 11 (lock contention), and because create_repo_if_not_exists has no retry logic, it raises immediately.

remove_snapshots and run_with_lock_retry already handle lock contention via retry loops (fixed in #7403 / #7404), but create_repo_if_not_exists was missed.

With RESTIC_PRUNE_MAX_UNUSED=0 (the default), a prune can take 50+ minutes on large repositories, so a 60-minute wait budget is needed.

Root cause

The shell script in create_repo_if_not_exists:

restic stats || restic init

When restic stats exits 11 (locked), the shell runs restic init, which exits non-zero with "config already exists". The Ruby code sees a non-zero exit but not exit code 11, so the existing "retry on code 11" pattern cannot work. The script must be changed to propagate exit code 11 explicitly.
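
The masking can be reproduced without a Restic repository. The following is a minimal sketch, not code from restic.rb: fake_stats and fake_init are hypothetical shell functions that only mimic the exit codes described above (11 for a locked repository, and an assumed 1 for "config already exists"), and exit_code is a helper introduced for illustration.

```ruby
require 'open3'

# Stand-ins for the real commands (assumptions for illustration only):
#   fake_stats returns 11, like restic stats on a locked repository
#   fake_init  returns 1, standing in for restic init failing with
#              "config already exists" (the exact code is assumed)
STUBS = 'fake_stats() { return 11; }; fake_init() { return 1; }; '

# Runs a POSIX shell script with the stubs defined and returns its exit code.
def exit_code(script)
  _out, _err, status = Open3.capture3('sh', '-c', STUBS + script)
  status.exitstatus
end

# Original shape: the || fallback runs and its exit code replaces 11.
masked = exit_code('fake_stats || fake_init')

# Shape that propagates 11 before the fallback can run.
propagated = exit_code(
  'fake_stats || { ec=$?; if [ "$ec" -eq 11 ]; then exit 11; fi; fake_init; }'
)

puts "original script exits #{masked}, propagating script exits #{propagated}"
```

With these stand-ins, the caller only ever sees the fallback's exit code in the first case, which is why a check for code 11 on the Ruby side never fires.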

Fix

src/datastore_mad/remotes/restic/restic.rb

  1. Read a new RESTIC_LOCK_RETRIES datastore attribute (default 720, i.e. 720 × 5 s = 60 min) in initialize, replacing the hard-coded 60 in remove_snapshots and run_with_lock_retry.

  2. Restructure the shell script so exit code 11 from restic stats is propagated:

    restic stats || { ec=$?; if [ "$ec" -eq 11 ]; then exit 11; fi; restic init; }

  3. Add the same retry loop used elsewhere (rc.code == 11 → sleep → retry) to create_repo_if_not_exists.
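
The retry pattern from step 3 can be sketched roughly as follows. This is not the code in restic.rb: retry_while_locked, its keyword arguments, and the injectable sleeper are hypothetical names chosen for illustration, and the way retries are counted (one initial attempt plus up to `retries` further attempts) is an assumption.

```ruby
LOCKED_EXIT_CODE = 11 # exit code the PR reports for lock contention

# Hypothetical helper, not the actual restic.rb method. Calls the block until
# it returns something other than LOCKED_EXIT_CODE or the retry budget is
# spent. Returns [last_exit_code, number_of_attempts].
def retry_while_locked(retries:, interval: 5, sleeper: ->(s) { sleep(s) })
  attempts = 0
  loop do
    code = yield
    attempts += 1
    return [code, attempts] unless code == LOCKED_EXIT_CODE
    return [code, attempts] if attempts > retries # budget exhausted

    sleeper.call(interval)
  end
end

# Simulated command: locked twice, then succeeds. The no-op sleeper keeps the
# example from actually waiting.
codes = [11, 11, 0]
result = retry_while_locked(retries: 720, sleeper: ->(_s) {}) { codes.shift }
puts result.inspect
```

Passing the sleep as a lambda is only a convenience for exercising the loop quickly; a real caller would let it default to sleep.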

src/fireedge/src/modules/constants/translates.js — add ResticLockRetries / ResticLockRetriesConcept i18n keys.

src/fireedge/src/modules/components/Forms/Datastore/CreateForm/Steps/ConfigurationAttributes/Fields/restic.js — add RESTIC_LOCK_RETRIES numeric field to the datastore configuration form.

New datastore attribute

| Attribute | Default | Description |
|---|---|---|
| RESTIC_LOCK_RETRIES | 720 | Number of retries (5 s each) when waiting for an exclusive lock. Default gives a 60-minute wait budget. |
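
As a rough illustration of how the attribute maps to a wait budget, the sketch below parses a raw attribute value and converts it to minutes. The helper names lock_retries and wait_budget_minutes are hypothetical, and the fallback to the default on missing, non-numeric, or negative input is an assumption, not behaviour confirmed by the PR.

```ruby
DEFAULT_LOCK_RETRIES = 720   # default stated in the PR
RETRY_INTERVAL_SECONDS = 5   # sleep between retries stated in the PR

# Hypothetical parser: returns the default when the raw value is missing,
# not an integer, or negative (assumed validation, for illustration only).
def lock_retries(raw)
  value = Integer(raw.to_s.strip, 10)
  value.negative? ? DEFAULT_LOCK_RETRIES : value
rescue ArgumentError
  DEFAULT_LOCK_RETRIES
end

# Converts a retry count into the total wait budget in minutes.
def wait_budget_minutes(retries)
  retries * RETRY_INTERVAL_SECONDS / 60.0
end

puts wait_budget_minutes(lock_retries(nil))   # default: 720 retries
puts wait_budget_minutes(lock_retries('60'))  # the previously hard-coded 60
```

Assuming the same 5 s interval, the previously hard-coded 60 retries correspond to a 5-minute budget, compared with 60 minutes for the new default.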

Testing

Verified on OpenNebula 6.10.1 with restic 0.17.3 (frontend) / 0.18.0 (backup server):

  • A daily backup job with 5 VMs was failing every day because restic forget --prune --max-unused 0 held an exclusive lock for ~54 minutes.
  • After this fix, all 5 VMs backed up successfully in both a manual re-run and the following scheduled run.

Relation to existing issues

| Issue | Scope | Status |
|---|---|---|
| #7403 | remove_snapshots retry (broken regex → rc.code==11) | Fixed |
| #7404 | pull_disks / pull_other retry via run_with_lock_retry | Fixed |
| This PR | create_repo_if_not_exists retry (never had any) + unified timeout | New |

…lock timeout

Introduce RESTIC_LOCK_RETRIES (default 720 × 5 s = 60 min) so that
create_repo_if_not_exists, remove_snapshots, and run_with_lock_retry
all honour the same configurable wait budget when a concurrent prune
holds the exclusive Restic lock.  Also adds the new attribute to the
FireEdge datastore configuration form.