Fix: add exclusive lock retry to create_repo_if_not_exists and unify lock timeout by jsfrerot · Pull Request #7729 · OpenNebula/one

jsfrerot · 2026-06-03T19:54:16Z

Problem

When a backup job runs and a concurrent restic forget --prune (triggered by old image deletion) holds the exclusive Restic repository lock, subsequent VM backups fail immediately with:

unable to create lock in backend: repository is already locked exclusively

This happens because create_repo_if_not_exists runs restic stats, which must acquire a repository lock. While the exclusive lock is held, restic stats exits with code 11 (lock contention), and because create_repo_if_not_exists has no retry logic, it raises immediately.

remove_snapshots and run_with_lock_retry already handle lock contention via retry loops (fixed in #7403 / #7404), but create_repo_if_not_exists was missed.

With RESTIC_PRUNE_MAX_UNUSED=0 (the default), a prune can take 50+ minutes on large repositories, so a 60-minute wait budget is needed.

Root cause

The shell script in create_repo_if_not_exists:

restic stats || restic init

When restic stats exits 11 (locked), the shell runs restic init, which exits non-zero with "config already exists". The Ruby code sees a non-zero exit but not exit code 11, so the existing "retry on code 11" pattern cannot work. The script must be changed to propagate exit code 11 explicitly.
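The masking can be reproduced without restic. The following is a minimal sketch, assuming a POSIX `sh` is available; the subshells `(exit 11)` and `(exit 1)` are stand-ins for `restic stats` hitting a locked repository and `restic init` failing, and `exit_code_of` is a hypothetical helper, not part of restic.rb.

```ruby
require 'open3'

# Stand-ins (assumptions, not real restic calls):
#   (exit 11) plays "restic stats" failing on an exclusive lock
#   (exit 1)  plays "restic init" failing because the config already exists
ORIGINAL_SCRIPT = '(exit 11) || (exit 1)'

# Hypothetical helper: run a script through sh and return its exit code.
def exit_code_of(script)
  _out, _err, status = Open3.capture3('sh', '-c', script)
  status.exitstatus
end

masked_code = exit_code_of(ORIGINAL_SCRIPT)
puts "exit code seen by the caller: #{masked_code}"
```

The caller only sees the exit code of the right-hand command, so a check for code 11 never matches.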

Fix

src/datastore_mad/remotes/restic/restic.rb

Read a new RESTIC_LOCK_RETRIES datastore attribute (default 720, i.e. 720 × 5 s = 60 min) in initialize, replacing the hard-coded 60 in remove_snapshots and run_with_lock_retry.

Restructure the shell script so exit code 11 from restic stats is propagated:

restic stats || { ec=$?; if [ "$ec" -eq 11 ]; then exit 11; fi; restic init; }

Add the same retry loop used elsewhere (rc.code == 11 → sleep → retry) to create_repo_if_not_exists.
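The restructured script and the retry loop described above can be sketched together as follows. This is an illustration under stated assumptions rather than the code in restic.rb: `with_lock_retry` and `run_fixed_script` are hypothetical names, `STATS_EC` is a made-up environment variable that plays the exit code of `restic stats`, and `true` stands in for a successful `restic init`.

```ruby
require 'open3'

LOCKED_EXIT_CODE = 11 # lock-contention exit code, as described above

# Hypothetical helper: call the block until it stops reporting lock
# contention or the retry budget is used up. The block returns an exit code.
def with_lock_retry(retries:, interval:)
  attempts = 0
  loop do
    code = yield
    attempts += 1
    return code unless code == LOCKED_EXIT_CODE
    return code if attempts > retries # budget exhausted, still locked
    sleep interval
  end
end

# Same shape as the restructured script, with stand-ins:
#   (exit "$STATS_EC") plays "restic stats", true plays "restic init".
FIXED_SCRIPT =
  '(exit "$STATS_EC") || { ec=$?; if [ "$ec" -eq 11 ]; then exit 11; fi; true; }'

# Hypothetical helper: run the script with a simulated stats exit code.
def run_fixed_script(stats_exit_code)
  env = { 'STATS_EC' => stats_exit_code.to_s }
  _out, _err, status = Open3.capture3(env, 'sh', '-c', FIXED_SCRIPT)
  status.exitstatus
end

# Simulate a lock that is released after two failed attempts.
stats_results = [11, 11, 0]
final_code = with_lock_retry(retries: 5, interval: 0.01) do
  run_fixed_script(stats_results.shift)
end
puts "final exit code: #{final_code}"
```

Because code 11 now reaches the caller unchanged, the loop can tell "locked, try again" apart from every other outcome; any other failure of the stats stand-in still falls through to the init stand-in.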

src/fireedge/src/modules/constants/translates.js: add ResticLockRetries / ResticLockRetriesConcept i18n keys.

src/fireedge/src/modules/components/Forms/Datastore/CreateForm/Steps/ConfigurationAttributes/Fields/restic.js: add RESTIC_LOCK_RETRIES numeric field to the datastore configuration form.

New datastore attribute

| Attribute | Default | Description |
| --- | --- | --- |
| `RESTIC_LOCK_RETRIES` | `720` | Number of retries (5 s each) when waiting for an exclusive lock. Default gives a 60-minute wait budget. |
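The relation between the retry count and the wait budget is simple arithmetic, sketched below. The helper names are hypothetical, the attributes are modelled as a plain Hash rather than the driver's real datastore object, and the fallback to the default for missing or invalid values is an assumption, not confirmed behaviour of the patch.

```ruby
RETRY_INTERVAL_SECONDS = 5   # per-retry sleep stated in the table above
DEFAULT_LOCK_RETRIES   = 720 # default stated in the table above

# Hypothetical helper: read RESTIC_LOCK_RETRIES from a Hash of datastore
# attributes; fall back to the default when it is unset or not a positive
# integer (assumed behaviour).
def lock_retries(attributes)
  value = Integer(attributes['RESTIC_LOCK_RETRIES'], exception: false)
  value && value.positive? ? value : DEFAULT_LOCK_RETRIES
end

# Total time spent waiting if every retry hits the lock, in minutes.
def wait_budget_minutes(retries)
  retries * RETRY_INTERVAL_SECONDS / 60.0
end

puts wait_budget_minutes(lock_retries({}))                              # 60.0
puts wait_budget_minutes(lock_retries('RESTIC_LOCK_RETRIES' => '1440')) # 120.0
```

Doubling the attribute to 1440 would therefore give a two-hour budget for repositories where a prune runs longer than an hour.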

Testing

Verified on OpenNebula 6.10.1 with restic 0.17.3 (frontend) / 0.18.0 (backup server):

A daily backup job with 5 VMs was failing every day because restic forget --prune --max-unused 0 held an exclusive lock for ~54 minutes.
After this fix, all 5 VMs backed up successfully in both a manual re-run and the following scheduled run.

Relation to existing issues

| Issue | Scope | Status |
| --- | --- | --- |
| #7403 | `remove_snapshots` retry (broken regex → rc.code==11) | Fixed |
| #7404 | `pull_disks` / `pull_other` retry via `run_with_lock_retry` | Fixed |
| This PR | `create_repo_if_not_exists` retry (never had any) + unified timeout | New |

…lock timeout

Introduce RESTIC_LOCK_RETRIES (default 720 × 5 s = 60 min) so that create_repo_if_not_exists, remove_snapshots, and run_with_lock_retry all honour the same configurable wait budget when a concurrent prune holds the exclusive Restic lock. Also adds the new attribute to the FireEdge datastore configuration form.

jsfrerot wants to merge 1 commit into OpenNebula:master from jsfrerot:fix/restic-create-repo-lock-retry
