diff --git a/docs/en/solutions/VM_live_migration_hangs_at_99_when_memory_dirty_rate_exceeds_bandwidth.md b/docs/en/solutions/VM_live_migration_hangs_at_99_when_memory_dirty_rate_exceeds_bandwidth.md
new file mode 100644
index 00000000..4e62e78c
--- /dev/null
+++ b/docs/en/solutions/VM_live_migration_hangs_at_99_when_memory_dirty_rate_exceeds_bandwidth.md
@@ -0,0 +1,141 @@
+---
+kind:
+  - Troubleshooting
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+## Issue
+
+A live migration of a VirtualMachineInstance on ACP virtualization reports progress but never converges. The migration sits at roughly 99% for an extended period; `DataRemaining` and `MemoryRemaining` oscillate up and down rather than trending towards zero, and the `virt-launcher` source pod eventually logs an abort of the form `Live migration abort detected with reason: Live migration stuck for sec`.
+
+Representative source-pod log excerpt from `virt-launcher`:
+
+```text
+Migration info for :
+  TimeElapsed:561972ms DataProcessed:167182MiB DataRemaining:2350MiB
+  DataTotal:24597MiB MemoryProcessed:167181MiB MemoryRemaining:2350MiB
+  MemoryTotal:24596MiB MemoryBandwidth:38Mbps DirtyRate:41Mbps
+  Iteration:80 PostcopyRequests:0 ConstantPages:1543059
+  NormalPages:42711515 ...
+
+Live migration stuck for 181006630894 sec
+Live migration abort detected with reason: Live migration stuck for
+  181004616448 sec and has been aborted
+```
+
+The `DirtyRate` in the log is consistently higher than the `MemoryBandwidth`, so pages are being dirtied faster than the copy stream can ship them to the destination node. The migration is effectively chasing its own tail.
+
+## Root Cause
+
+KubeVirt's default live-migration algorithm is **pre-copy**: memory pages are streamed to the destination while the VM keeps running on the source, with dirty pages resent iteratively until the remaining delta is small enough for a final brief pause ("cutover"). Convergence is only possible when, per iteration, bandwidth × time > dirty-bytes; in other words, when `MemoryBandwidth` exceeds `DirtyRate` for long enough to drive `MemoryRemaining` down to the cutover threshold.
+
+For memory-intensive workloads (databases, caching tiers, VMs under heavy write load) the dirty rate can stay above the migration bandwidth ceiling for the full duration of the migration. Pre-copy then never converges; eventually KubeVirt's progress timeout fires and the migration is aborted. The symptom is exactly the `DataRemaining` oscillation seen in the log.
+
+## Resolution
+
+Give the migration a convergence strategy other than pure pre-copy. KubeVirt exposes three practical options; pick based on whether the VM is mid-migration, whether a brief pause is acceptable, and whether the change should apply cluster-wide or per-VM.
+
+1. **Pause the running VM to force cutover.** For a migration that is already in progress and almost converged, temporarily pausing the source VM lets the final copy round finish because no more pages are dirtied. The VM auto-resumes on the destination once cutover completes. This imposes a short downtime window but requires no configuration change. Use `virtctl`:
+
+   ```bash
+   virtctl pause vm -n
+   ```
+
+2. **Enable post-copy as a cluster-wide fallback.** Post-copy flips the model: after a bounded pre-copy phase, the VM is resumed on the destination, and page faults for not-yet-shipped memory are pulled on demand across the network. This always converges for any dirty rate, at the cost of a brief period where the VM's memory latency depends on network round-trips.
+
+   Post-copy is an OSS KubeVirt feature (`allowPostCopy` on the migration configuration) and is available on ACP virtualization. Turn it on at the KubeVirt cluster-config level:
+
+   ```yaml
+   apiVersion: kubevirt.io/v1
+   kind: KubeVirt
+   metadata:
+     name: kubevirt
+     namespace:
+   spec:
+     configuration:
+       migrations:
+         bandwidthPerMigration: 64Mi
+         completionTimeoutPerGiB: 800
+         parallelMigrationsPerCluster: 5
+         parallelOutboundMigrationsPerNode: 2
+         progressTimeout: 150
+         allowPostCopy: true
+   ```
+
+   Apply with:
+
+   ```bash
+   kubectl -n edit kubevirt kubevirt
+   ```
+
+   After enabling post-copy, cancel any hung migration so the next retry picks up the new policy:
+
+   ```bash
+   kubectl -n delete virtualmachineinstancemigration
+   ```
+
+   Once post-copy is active, lowering `completionTimeoutPerGiB` in the same block accelerates the transition from pre-copy into post-copy: the pre-copy phase ends sooner and the VM resumes on the destination faster. The default timeout is tuned for converging pre-copy workloads; trimming it is what makes post-copy actually kick in for dirty VMs.
+
+3. **Enable post-copy only for a specific VM with a MigrationPolicy.** When cluster-wide post-copy is too broad, a `MigrationPolicy` object selects individual VMs by label and applies a tailored migration configuration:
+
+   ```yaml
+   apiVersion: migrations.kubevirt.io/v1alpha1
+   kind: MigrationPolicy
+   metadata:
+     name: my-vm-post-copy
+   spec:
+     allowPostCopy: true
+     selectors:
+       virtualMachineInstanceSelector:
+         kubevirt.io/domain:
+   ```
+
+   ```bash
+   kubectl apply -f migrationpolicy.yaml
+   ```
+
+   KubeVirt matches each migration against the `MigrationPolicy` set. When more than one policy matches, the most specific one (the policy with the most matching labels) is used, and its settings take precedence over the cluster defaults.
+
+For migrations that must complete during a drain (planned node maintenance, rolling upgrade of the hypervisor fleet), the generally safer option is post-copy enabled at the cluster level: it guarantees forward progress regardless of workload, so drains do not stall on a single busy VM. Where a bounded-downtime window is acceptable, the `virtctl pause` option is the least invasive.
+
+## Diagnostic Steps
+
+1. Confirm the migration is actually failing on convergence (rather than a network or storage error). Examine the `virt-launcher` source pod for the stuck VM:
+
+   ```bash
+   kubectl logs -n \
+     | grep -E "Migration info|stuck|abort"
+   ```
+
+   Convergence failure shows `DirtyRate` greater than `MemoryBandwidth` and `MemoryRemaining` oscillating across iterations.
+
+2. Check the `VirtualMachineInstanceMigration` object for the abort reason:
+
+   ```bash
+   kubectl -n get virtualmachineinstancemigration
+   kubectl -n get virtualmachineinstancemigration -o yaml \
+     | sed -n '/status:/,$p'
+   ```
+
+3. Verify whether post-copy is permitted on the current cluster configuration:
+
+   ```bash
+   kubectl -n get kubevirt kubevirt \
+     -o jsonpath='{.spec.configuration.migrations.allowPostCopy}{"\n"}'
+   ```
+
+   An empty output or `false` means pre-copy is the only strategy the cluster will attempt.
+
+4. After changing `allowPostCopy` or applying a `MigrationPolicy`, re-trigger the migration and watch the `Migration info` lines. A successful post-copy transition shows `PostcopyRequests` climbing from zero and `MemoryRemaining` draining to zero instead of oscillating. `MemoryBandwidth` may temporarily drop after switchover because the VM is now paging on demand across the network; that is expected for the post-copy phase.
+
+5. If `parallelMigrationsPerCluster` or `parallelOutboundMigrationsPerNode` is saturated during a node drain, migrations queue rather than abort. Inspect outstanding migrations cluster-wide:
+
+   ```bash
+   kubectl get virtualmachineinstancemigration -A
+   ```
+
+   Tune the migration parallelism limits in the KubeVirt `migrations` block to match the available cross-node bandwidth so queued VMs pick up the new policy promptly.
+
+
\ No newline at end of file
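The convergence check from step 1 of Diagnostic Steps can be sketched in Python. This is a minimal illustration under stated assumptions, not KubeVirt tooling: the helper names `parse_migration_info` and `precopy_can_converge` are hypothetical, and the field layout is assumed to match the `Migration info` excerpt in the Issue section.

```python
import re

# Field names come from the log excerpt in this document; the parsing
# approach is an assumption and may need adjusting for other log layouts.
FIELD_RE = re.compile(r"(MemoryBandwidth|DirtyRate|MemoryRemaining):(\d+)(Mbps|MiB)")


def parse_migration_info(text: str) -> dict[str, int]:
    """Extract the numeric fields of interest from one `Migration info` record."""
    return {name: int(value) for name, value, _unit in FIELD_RE.findall(text)}


def precopy_can_converge(sample: dict[str, int]) -> bool:
    """Pre-copy only makes progress while bandwidth exceeds the dirty rate."""
    return sample["MemoryBandwidth"] > sample["DirtyRate"]


# Values copied from the representative log excerpt.
record = """
  TimeElapsed:561972ms DataProcessed:167182MiB DataRemaining:2350MiB
  DataTotal:24597MiB MemoryProcessed:167181MiB MemoryRemaining:2350MiB
  MemoryTotal:24596MiB MemoryBandwidth:38Mbps DirtyRate:41Mbps
"""
sample = parse_migration_info(record)
print(sample)
print("pre-copy can converge:", precopy_can_converge(sample))
```

A single sample is only a hint; a dirty rate that stays above the bandwidth across many consecutive records is what indicates that pre-copy will not converge.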