DNM: Test: disable pod restart workaround after file-system restore#2134
DNM: Test: disable pod restart workaround after file-system restore#2134kaovilai wants to merge 2 commits intoopenshift:oadp-devfrom
Conversation
Comment out the KOPIA post-restore pod deletion to test whether the OVN-Kubernetes networking workaround is still required. If E2E tests pass without it, this workaround can be removed entirely. Made-with: Cursor
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kaovilai The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThe changes comment out post-restore pod restart workarounds for KOPIA backup restore operations in two test files. One file removes unused Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Disables (by commenting out) the post-restore “delete pods to recover networking” workaround for file-system (KOPIA) backup/restore E2E flows, to validate whether OVN-Kubernetes networking is now correctly wired after Velero restores.
Changes:
- Commented out the post-restore pod deletion workaround in the main backup/restore E2E suite.
- Commented out the same workaround in the CLI-driven backup/restore E2E suite.
- Cleaned up now-unused imports in the CLI suite file.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/e2e/backup_restore_suite_test.go | Disables the KOPIA post-restore pod-restart workaround in the core E2E flow. |
| tests/e2e/backup_restore_cli_suite_test.go | Disables the same workaround in the CLI E2E flow and removes unused imports. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // TODO: Testing whether this workaround is still needed. Remove if | ||
| // file-system restore tests pass without it. | ||
| // | ||
| // For file-system backup restores (KOPIA/restic), the restored pods may have |
There was a problem hiding this comment.
The comment says "KOPIA/restic", but this code path is guarded by lib.KOPIA and there is no restic BackupRestoreType in tests/e2e/lib. Please update the wording if restic is no longer applicable.
| // For file-system backup restores (KOPIA/restic), the restored pods may have | |
| // For file-system backup restores (KOPIA), the restored pods may have |
| // TODO: Testing whether this workaround is still needed. Remove if | ||
| // file-system restore tests pass without it. | ||
| // | ||
| // For file-system backup restores (KOPIA/restic), the restored pods may have | ||
| // broken networking because OVN-Kubernetes doesn't fully wire the network | ||
| // namespace for pods recreated by Velero with a restore-wait init container. | ||
| // Deleting the pods lets the deployment controller create fresh ones with | ||
| // proper networking while preserving the restored PVC data. | ||
| if brCase.BackupRestoreType == lib.KOPIA { | ||
| log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| context.Background(), | ||
| metav1.DeleteOptions{}, | ||
| metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| ) | ||
| gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| } | ||
| // if brCase.BackupRestoreType == lib.KOPIA { | ||
| // log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| // err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| // context.Background(), | ||
| // metav1.DeleteOptions{}, | ||
| // metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| // ) | ||
| // gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| // } |
There was a problem hiding this comment.
Instead of leaving the workaround as a large commented-out code block, consider keeping it as live code and gating it behind an explicit toggle (e.g., a ginkgo flag or env var like E2E_DISABLE_KOPIA_POD_RESTART_WORKAROUND). That keeps the test intent clear, avoids accumulating dead code in the suite, and makes it easy to re-enable in CI without editing source again.
| // TODO: Testing whether this workaround is still needed. Remove if | ||
| // file-system restore tests pass without it. | ||
| // | ||
| // For file-system backup restores (KOPIA/restic), the restored pods may have |
There was a problem hiding this comment.
The comment mentions "KOPIA/restic", but this test suite only has a KOPIA backup/restore type (no restic type in lib.BackupRestoreType). If restic is no longer supported here, please update the wording to avoid confusion/outdated documentation.
| // For file-system backup restores (KOPIA/restic), the restored pods may have | |
| // For file-system backup restores using KOPIA, the restored pods may have |
| // TODO: Testing whether this workaround is still needed. Remove if | ||
| // file-system restore tests pass without it. | ||
| // | ||
| // For file-system backup restores (KOPIA/restic), the restored pods may have | ||
| // broken networking because OVN-Kubernetes doesn't fully wire the network | ||
| // namespace for pods recreated by Velero with a restore-wait init container. | ||
| // Deleting the pods lets the deployment controller create fresh ones with | ||
| // proper networking while preserving the restored PVC data. | ||
| if brCase.BackupRestoreType == lib.KOPIA { | ||
| log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| context.Background(), | ||
| metav1.DeleteOptions{}, | ||
| metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| ) | ||
| gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| } | ||
| // if brCase.BackupRestoreType == lib.KOPIA { | ||
| // log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| // err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| // context.Background(), | ||
| // metav1.DeleteOptions{}, | ||
| // metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| // ) | ||
| // gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| // } |
There was a problem hiding this comment.
Instead of commenting out the pod-restart workaround (and having to also remove/re-add imports when toggling), consider keeping it as live code behind an explicit toggle (ginkgo flag or env var). That makes it easy to run both variants in CI and avoids leaving large blocks of commented code in the suite long-term.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/backup_restore_cli_suite_test.go`:
- Around line 176-192: The CLI test disables the pod-restart workaround
unconditionally; instead gate it behind the same opt-in switch used by the
non-CLI path so clusters that need the workaround opt into it; update the
commented block around brCase.BackupRestoreType == lib.KOPIA to check the shared
env-var/helper (the same helper used by the non-CLI code) before calling
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
and preserve the existing behavior (log.Printf and
gomega.Expect(err).ToNot(gomega.HaveOccurred())) when the gate is enabled.
In `@tests/e2e/backup_restore_suite_test.go`:
- Around line 269-285: The commented-out pod-restart workaround for KOPIA
restores should be restored but gated behind a configurable opt-in flag rather
than enabled by default: re-enable the block that checks if
brCase.BackupRestoreType == lib.KOPIA and calls
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
but wrap it in a condition that reads an explicit environment variable or
test-case flag (e.g., ENABLE_KOPIA_RESTART or brCase.ExperimentalRestart bool)
that defaults to false; update test setup to read
os.Getenv("ENABLE_KOPIA_RESTART") (or the test harness config) and only perform
the DeleteCollection when that flag is true, and document the flag so jobs can
opt into the experimental behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 49d55c37-d2e6-4fe8-9661-1f21801714a9
📒 Files selected for processing (2)
tests/e2e/backup_restore_cli_suite_test.gotests/e2e/backup_restore_suite_test.go
| // TODO: Testing whether this workaround is still needed. Remove if | ||
| // file-system restore tests pass without it. | ||
| // | ||
| // For file-system backup restores (KOPIA/restic), the restored pods may have | ||
| // broken networking because OVN-Kubernetes doesn't fully wire the network | ||
| // namespace for pods recreated by Velero with a restore-wait init container. | ||
| // Deleting the pods lets the deployment controller create fresh ones with | ||
| // proper networking while preserving the restored PVC data. | ||
| if brCase.BackupRestoreType == lib.KOPIA { | ||
| log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| context.Background(), | ||
| metav1.DeleteOptions{}, | ||
| metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| ) | ||
| gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| } | ||
| // if brCase.BackupRestoreType == lib.KOPIA { | ||
| // log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| // err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| // context.Background(), | ||
| // metav1.DeleteOptions{}, | ||
| // metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| // ) | ||
| // gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| // } |
There was a problem hiding this comment.
Keep the CLI lane behind the same opt-in switch.
This has the same reliability risk in the CLI suite: if a cluster still needs the workaround, all KOPIA CLI restore jobs will start failing by default. If this moves past DNM, please wire this through the same env-var/helper gate as the non-CLI path instead of hard-disabling it here.
As per coding guidelines, **: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/backup_restore_cli_suite_test.go` around lines 176 - 192, The CLI
test disables the pod-restart workaround unconditionally; instead gate it behind
the same opt-in switch used by the non-CLI path so clusters that need the
workaround opt into it; update the commented block around
brCase.BackupRestoreType == lib.KOPIA to check the shared env-var/helper (the
same helper used by the non-CLI code) before calling
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
and preserve the existing behavior (log.Printf and
gomega.Expect(err).ToNot(gomega.HaveOccurred())) when the gate is enabled.
| // TODO: Testing whether this workaround is still needed. Remove if | ||
| // file-system restore tests pass without it. | ||
| // | ||
| // For file-system backup restores (KOPIA/restic), the restored pods may have | ||
| // broken networking because OVN-Kubernetes doesn't fully wire the network | ||
| // namespace for pods recreated by Velero with a restore-wait init container. | ||
| // Deleting the pods lets the deployment controller create fresh ones with | ||
| // proper networking while preserving the restored PVC data. | ||
| if brCase.BackupRestoreType == lib.KOPIA { | ||
| log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| context.Background(), | ||
| metav1.DeleteOptions{}, | ||
| metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| ) | ||
| gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| } | ||
| // if brCase.BackupRestoreType == lib.KOPIA { | ||
| // log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | ||
| // err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | ||
| // context.Background(), | ||
| // metav1.DeleteOptions{}, | ||
| // metav1.ListOptions{LabelSelector: "e2e-app=true"}, | ||
| // ) | ||
| // gomega.Expect(err).ToNot(gomega.HaveOccurred()) | ||
| // } |
There was a problem hiding this comment.
Don’t make this experiment the default KOPIA path.
The code comment still documents this restart as a workaround for a known post-restore networking issue, so commenting it out here will make every non-CLI KOPIA job depend on cluster-specific OVN behavior again. If this goes past DNM, keep the current behavior as default and disable it only via an env var/label or a dedicated experimental lane.
One way to gate the experiment
- // TODO: Testing whether this workaround is still needed. Remove if
- // file-system restore tests pass without it.
- //
+ // TODO: Remove this once filesystem-restore tests are stable without the OVN workaround.
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
- // if brCase.BackupRestoreType == lib.KOPIA {
- // log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
- // err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
- // context.Background(),
- // metav1.DeleteOptions{},
- // metav1.ListOptions{LabelSelector: "e2e-app=true"},
- // )
- // gomega.Expect(err).ToNot(gomega.HaveOccurred())
- // }
+ if brCase.BackupRestoreType == lib.KOPIA && os.Getenv("OADP_DISABLE_FS_RESTORE_POD_RESTART_WORKAROUND") != "true" {
+ log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
+ err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
+ context.Background(),
+ metav1.DeleteOptions{},
+ metav1.ListOptions{LabelSelector: "e2e-app=true"},
+ )
+ gomega.Expect(err).ToNot(gomega.HaveOccurred())
+ }As per coding guidelines, **: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // TODO: Testing whether this workaround is still needed. Remove if | |
| // file-system restore tests pass without it. | |
| // | |
| // For file-system backup restores (KOPIA/restic), the restored pods may have | |
| // broken networking because OVN-Kubernetes doesn't fully wire the network | |
| // namespace for pods recreated by Velero with a restore-wait init container. | |
| // Deleting the pods lets the deployment controller create fresh ones with | |
| // proper networking while preserving the restored PVC data. | |
| if brCase.BackupRestoreType == lib.KOPIA { | |
| log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | |
| err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | |
| context.Background(), | |
| metav1.DeleteOptions{}, | |
| metav1.ListOptions{LabelSelector: "e2e-app=true"}, | |
| ) | |
| gomega.Expect(err).ToNot(gomega.HaveOccurred()) | |
| } | |
| // if brCase.BackupRestoreType == lib.KOPIA { | |
| // log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | |
| // err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | |
| // context.Background(), | |
| // metav1.DeleteOptions{}, | |
| // metav1.ListOptions{LabelSelector: "e2e-app=true"}, | |
| // ) | |
| // gomega.Expect(err).ToNot(gomega.HaveOccurred()) | |
| // } | |
| // TODO: Remove this once filesystem-restore tests are stable without the OVN workaround. | |
| // For file-system backup restores (KOPIA/restic), the restored pods may have | |
| // broken networking because OVN-Kubernetes doesn't fully wire the network | |
| // namespace for pods recreated by Velero with a restore-wait init container. | |
| // Deleting the pods lets the deployment controller create fresh ones with | |
| // proper networking while preserving the restored PVC data. | |
| if brCase.BackupRestoreType == lib.KOPIA && os.Getenv("OADP_DISABLE_FS_RESTORE_POD_RESTART_WORKAROUND") != "true" { | |
| log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace) | |
| err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection( | |
| context.Background(), | |
| metav1.DeleteOptions{}, | |
| metav1.ListOptions{LabelSelector: "e2e-app=true"}, | |
| ) | |
| gomega.Expect(err).ToNot(gomega.HaveOccurred()) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/backup_restore_suite_test.go` around lines 269 - 285, The
commented-out pod-restart workaround for KOPIA restores should be restored but
gated behind a configurable opt-in flag rather than enabled by default:
re-enable the block that checks if brCase.BackupRestoreType == lib.KOPIA and
calls
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
but wrap it in a condition that reads an explicit environment variable or
test-case flag (e.g., ENABLE_KOPIA_RESTART or brCase.ExperimentalRestart bool)
that defaults to false; update test setup to read
os.Getenv("ENABLE_KOPIA_RESTART") (or the test harness config) and only perform
the DeleteCollection when that flag is true, and document the flag so jobs can
opt into the experimental behavior.
|
Show me some 🌽 :flakes: /test all |
|
@kaovilai: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
backup_restore_suite_test.goandbackup_restore_cli_suite_test.goTest plan
Made with Cursor