Skip to content

fix(linstor): pre-flight check destination is a LINSTOR satellite before live migration#13077

Open
jmsperu wants to merge 1 commit intoapache:mainfrom
jmsperu:feature/linstor-migration-preflight
Open

fix(linstor): pre-flight check destination is a LINSTOR satellite before live migration#13077
jmsperu wants to merge 1 commit intoapache:mainfrom
jmsperu:feature/linstor-migration-preflight

Conversation

@jmsperu
Copy link
Copy Markdown
Contributor

@jmsperu jmsperu commented Apr 28, 2026

Summary

LinstorDataMotionStrategy.copyAsync (added in #12830 / Jan 29 2026) calls LinstorUtil.createResource on the destination pool's controller without first verifying the destination KVM host is registered as a LINSTOR satellite there. Two failure modes follow:

  1. Wrong-node placement. The resource group's auto-placement filter picks a different registered satellite that is NOT the migration destination. The resource silently lands on the wrong node. The subsequent migrate then fails because the destination KVM host has no DRBD device for the resource.
  2. Opaque auto-place failure. The auto-placement filter has no candidates and the LINSTOR API returns a generic error. Operator has to correlate the migration failure with an unrelated controller log entry to understand what went wrong.

We've hit (1) on host-5 → host-6 migrations where host-6 was not registered as a LINSTOR satellite. The migration logs were unhelpful.

Change

Adds verifyDestinationIsLinstorSatellite() called at the top of copyAsync. For each LINSTOR-typed pool in volumeDataStoreMap:

  • Fetches the controller's node list via LinstorUtil.getLinstorNodeNames
  • If destHost.getName() is missing from the list, throws CloudRuntimeException with a clear actionable message that lists the known satellites and tells the operator to either linstor node create or pick a different destination
  • Silently skips on transient controller errors (so a controller hiccup doesn't block an otherwise valid migration — the existing flow surfaces the real failure downstream)

Non-LINSTOR destination pools are skipped (mixed-storage migrations are unaffected).

Test plan

  • CI build + unit tests
  • Manual: migrate a VM to a host that IS a LINSTOR satellite — works as before, no behaviour change
  • Manual: migrate a VM to a host that is NOT a LINSTOR satellite — fails fast with the new error message naming the host and listing known satellites

…ore live migration

LinstorDataMotionStrategy.copyAsync would call createResource on the
destination pool's controller without first verifying that the
destination KVM host is registered as a LINSTOR satellite there. Two
failure modes:

  1. The resource group's auto-placement filter happens to match a
     different node (a registered satellite that is NOT the migration
     destination), and the resource is silently created on the wrong
     node. The subsequent migrate then fails because the destination
     KVM host has no DRBD device for the resource.

  2. The auto-placement filter has no candidates and the LINSTOR API
     returns an opaque error. The operator has to correlate the
     migration failure with an unrelated controller log entry to
     understand what happened.

This change adds verifyDestinationIsLinstorSatellite() called at the
top of copyAsync. For each LINSTOR-typed destination pool it:

  - fetches the controller's node list via LinstorUtil.getLinstorNodeNames
  - throws CloudRuntimeException with a clear actionable message
    (lists known satellites) if destHost.getName() is missing from
    that list
  - silently skips on transient controller errors so a network blip
    against the controller doesn't block an otherwise valid migration

Non-LINSTOR destination pools in the volumeDataStoreMap are skipped
(mixed-storage migrations are unaffected).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a pre-flight validation to LINSTOR live migration to ensure the destination KVM host is registered as a LINSTOR satellite on the relevant controller(s), so migrations fail fast with an actionable error instead of failing later via LINSTOR auto-placement.

Changes:

  • Introduces verifyDestinationIsLinstorSatellite(...) to query LINSTOR node lists per destination pool/controller.
  • Calls the pre-flight check at the start of copyAsync(...) to abort early with a clear CloudRuntimeException when the destination host is not a satellite.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +358 to +365
} catch (ApiException apiEx) {
// Don't block migration on a transient controller hiccup — log and let the
// downstream resource creation handle the real failure.
logger.warn("LINSTOR pre-flight check could not contact controller {}: {}; " +
"letting downstream resource creation proceed",
destStoragePool.getHostAddress(), apiEx.getBestMessage());
return;
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch (ApiException) path currently logs and then returns, which stops checking any additional destination pools/controllers in volumeDataStoreMap. To match the method Javadoc/PR description (“best-effort” per pool), this should usually continue so that other pools can still be validated and can fail fast if the destination host is not a satellite there.

Copilot uses AI. Check for mistakes.
Comment on lines +378 to +386
// Pre-flight: verify the destination KVM host is registered as a satellite on the
// LINSTOR controller backing each destination pool. Without this check, resource
// creation falls through to the resource-group's auto-placement filters and may
// either silently place the resource on the wrong node or fail with an opaque
// auto-place error from the LINSTOR API. Failing fast here gives operators a clear
// actionable message instead of having to correlate the live-migration failure with
// an unrelated LINSTOR controller log entry.
verifyDestinationIsLinstorSatellite(volumeDataStoreMap, destHost);

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verifyDestinationIsLinstorSatellite(...) is invoked before the method’s try/catch/finally. If it throws CloudRuntimeException (the intended failure-fast path), copyAsync will exit without reaching the finally block and callback.complete(...) will never be called, potentially leaving the migration task hanging. Move this pre-flight call inside the existing try so failures are handled consistently and the callback is always completed.

Copilot uses AI. Check for mistakes.
Comment on lines +343 to +347
if (nodes == null) {
logger.warn("LINSTOR controller {} returned null node list; skipping pre-flight",
destStoragePool.getHostAddress());
return;
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside the loop, the method returns when nodes == null. That exits the entire pre-flight check and skips verifying any remaining destination pools/volumes. If the intent is best-effort per pool, this should continue to the next entry (or better: de-duplicate pools/controllers and keep checking the rest).

Copilot uses AI. Check for mistakes.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 0% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.02%. Comparing base (6f4445c) to head (1a1f8ff).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...tack/storage/motion/LinstorDataMotionStrategy.java 0.00% 29 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13077      +/-   ##
============================================
- Coverage     18.02%   18.02%   -0.01%     
+ Complexity    16621    16620       -1     
============================================
  Files          6029     6029              
  Lines        542184   542210      +26     
  Branches      66451    66455       +4     
============================================
- Hits          97740    97737       -3     
- Misses       433428   433457      +29     
  Partials      11016    11016              
Flag Coverage Δ
uitests 3.52% <ø> (ø)
unittests 19.18% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants