
DNM: Test: disable pod restart workaround after file-system restore#2134

Open
kaovilai wants to merge 2 commits into openshift:oadp-dev from kaovilai:disable-pod-restart-after-fs-restore

Conversation

@kaovilai
Member

Summary

  • Comments out the post-restore pod deletion workaround for KOPIA (file-system) backups in both backup_restore_suite_test.go and backup_restore_cli_suite_test.go
  • This workaround was added because OVN-Kubernetes didn't fully wire the network namespace for pods recreated by Velero with a restore-wait init container
  • Goal: determine whether this workaround is still needed by running E2E tests without it

Test plan

  • KOPIA backup/restore E2E tests pass without the pod restart workaround
  • CLI KOPIA backup/restore E2E tests pass without the pod restart workaround
  • If tests fail with networking issues, revert this change and keep the workaround

Made with Cursor

weshayutin and others added 2 commits March 24, 2026 11:43
Comment out the KOPIA post-restore pod deletion to test whether the
OVN-Kubernetes networking workaround is still required. If E2E tests
pass without it, this workaround can be removed entirely.

Made-with: Cursor
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Copilot AI review requested due to automatic review settings March 25, 2026 06:21
@openshift-ci

openshift-ci bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Walkthrough

The changes comment out post-restore pod restart workarounds for KOPIA backup restore operations in two test files. One file removes unused context and metav1 imports. TODO comments document that workarounds require re-evaluation once filesystem restore tests pass independently.

Changes

Cohort: KOPIA Backup Restore Workaround Cleanup
Files: tests/e2e/backup_restore_cli_suite_test.go, tests/e2e/backup_restore_suite_test.go
Summary: Commented out the post-restore pod restart workaround triggered for the lib.KOPIA restore type. Pod deletion via DeleteCollection and its associated error handling are removed from the execution path. Unused context and metav1 imports are removed from one file. TODO comments are added to track re-evaluation of the workaround.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Contributor

Copilot AI left a comment

Pull request overview

Disables (by commenting out) the post-restore “delete pods to recover networking” workaround for file-system (KOPIA) backup/restore E2E flows, to validate whether OVN-Kubernetes networking is now correctly wired after Velero restores.

Changes:

  • Commented out the post-restore pod deletion workaround in the main backup/restore E2E suite.
  • Commented out the same workaround in the CLI-driven backup/restore E2E suite.
  • Cleaned up now-unused imports in the CLI suite file.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

tests/e2e/backup_restore_suite_test.go: disables the KOPIA post-restore pod-restart workaround in the core E2E flow.
tests/e2e/backup_restore_cli_suite_test.go: disables the same workaround in the CLI E2E flow and removes unused imports.


// TODO: Testing whether this workaround is still needed. Remove if
// file-system restore tests pass without it.
//
// For file-system backup restores (KOPIA/restic), the restored pods may have
Copilot AI Mar 25, 2026

The comment says "KOPIA/restic", but this code path is guarded by lib.KOPIA and there is no restic BackupRestoreType in tests/e2e/lib. Please update the wording if restic is no longer applicable.

Suggested change
// For file-system backup restores (KOPIA/restic), the restored pods may have
// For file-system backup restores (KOPIA), the restored pods may have

Comment on lines +269 to +285
// TODO: Testing whether this workaround is still needed. Remove if
// file-system restore tests pass without it.
//
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
if brCase.BackupRestoreType == lib.KOPIA {
	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "e2e-app=true"},
	)
	gomega.Expect(err).ToNot(gomega.HaveOccurred())
}
// if brCase.BackupRestoreType == lib.KOPIA {
// 	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
// 	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
// 		context.Background(),
// 		metav1.DeleteOptions{},
// 		metav1.ListOptions{LabelSelector: "e2e-app=true"},
// 	)
// 	gomega.Expect(err).ToNot(gomega.HaveOccurred())
// }
Copilot AI Mar 25, 2026

Instead of leaving the workaround as a large commented-out code block, consider keeping it as live code and gating it behind an explicit toggle (e.g., a ginkgo flag or env var like E2E_DISABLE_KOPIA_POD_RESTART_WORKAROUND). That keeps the test intent clear, avoids accumulating dead code in the suite, and makes it easy to re-enable in CI without editing source again.

// TODO: Testing whether this workaround is still needed. Remove if
// file-system restore tests pass without it.
//
// For file-system backup restores (KOPIA/restic), the restored pods may have
Copilot AI Mar 25, 2026

The comment mentions "KOPIA/restic", but this test suite only has a KOPIA backup/restore type (no restic type in lib.BackupRestoreType). If restic is no longer supported here, please update the wording to avoid confusion/outdated documentation.

Suggested change
// For file-system backup restores (KOPIA/restic), the restored pods may have
// For file-system backup restores using KOPIA, the restored pods may have

Comment on lines +176 to +192
// TODO: Testing whether this workaround is still needed. Remove if
// file-system restore tests pass without it.
//
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
if brCase.BackupRestoreType == lib.KOPIA {
	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "e2e-app=true"},
	)
	gomega.Expect(err).ToNot(gomega.HaveOccurred())
}
// if brCase.BackupRestoreType == lib.KOPIA {
// 	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
// 	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
// 		context.Background(),
// 		metav1.DeleteOptions{},
// 		metav1.ListOptions{LabelSelector: "e2e-app=true"},
// 	)
// 	gomega.Expect(err).ToNot(gomega.HaveOccurred())
// }
Copilot AI Mar 25, 2026

Instead of commenting out the pod-restart workaround (and having to also remove/re-add imports when toggling), consider keeping it as live code behind an explicit toggle (ginkgo flag or env var). That makes it easy to run both variants in CI and avoids leaving large blocks of commented code in the suite long-term.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/backup_restore_cli_suite_test.go`:
- Around line 176-192: The CLI test disables the pod-restart workaround
unconditionally; instead gate it behind the same opt-in switch used by the
non-CLI path so clusters that need the workaround opt into it; update the
commented block around brCase.BackupRestoreType == lib.KOPIA to check the shared
env-var/helper (the same helper used by the non-CLI code) before calling
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
and preserve the existing behavior (log.Printf and
gomega.Expect(err).ToNot(gomega.HaveOccurred())) when the gate is enabled.

In `@tests/e2e/backup_restore_suite_test.go`:
- Around line 269-285: The commented-out pod-restart workaround for KOPIA
restores should be restored but gated behind a configurable opt-in flag rather
than enabled by default: re-enable the block that checks if
brCase.BackupRestoreType == lib.KOPIA and calls
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
but wrap it in a condition that reads an explicit environment variable or
test-case flag (e.g., ENABLE_KOPIA_RESTART or brCase.ExperimentalRestart bool)
that defaults to false; update test setup to read
os.Getenv("ENABLE_KOPIA_RESTART") (or the test harness config) and only perform
the DeleteCollection when that flag is true, and document the flag so jobs can
opt into the experimental behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 49d55c37-d2e6-4fe8-9661-1f21801714a9

📥 Commits

Reviewing files that changed from the base of the PR and between f2f8c5e and dbee900.

📒 Files selected for processing (2)
  • tests/e2e/backup_restore_cli_suite_test.go
  • tests/e2e/backup_restore_suite_test.go

Comment on lines +176 to +192
// TODO: Testing whether this workaround is still needed. Remove if
// file-system restore tests pass without it.
//
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
if brCase.BackupRestoreType == lib.KOPIA {
	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "e2e-app=true"},
	)
	gomega.Expect(err).ToNot(gomega.HaveOccurred())
}
// if brCase.BackupRestoreType == lib.KOPIA {
// 	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
// 	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
// 		context.Background(),
// 		metav1.DeleteOptions{},
// 		metav1.ListOptions{LabelSelector: "e2e-app=true"},
// 	)
// 	gomega.Expect(err).ToNot(gomega.HaveOccurred())
// }
Contributor

⚠️ Potential issue | 🟠 Major

Keep the CLI lane behind the same opt-in switch.

This has the same reliability risk in the CLI suite: if a cluster still needs the workaround, all KOPIA CLI restore jobs will start failing by default. If this moves past DNM, please wire this through the same env-var/helper gate as the non-CLI path instead of hard-disabling it here.

As per coding guidelines: focus on major issues impacting performance, readability, maintainability, and security. Avoid nitpicks and avoid verbosity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/backup_restore_cli_suite_test.go` around lines 176 - 192, The CLI
test disables the pod-restart workaround unconditionally; instead gate it behind
the same opt-in switch used by the non-CLI path so clusters that need the
workaround opt into it; update the commented block around
brCase.BackupRestoreType == lib.KOPIA to check the shared env-var/helper (the
same helper used by the non-CLI code) before calling
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
and preserve the existing behavior (log.Printf and
gomega.Expect(err).ToNot(gomega.HaveOccurred())) when the gate is enabled.

Comment on lines +269 to +285
// TODO: Testing whether this workaround is still needed. Remove if
// file-system restore tests pass without it.
//
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
if brCase.BackupRestoreType == lib.KOPIA {
	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "e2e-app=true"},
	)
	gomega.Expect(err).ToNot(gomega.HaveOccurred())
}
// if brCase.BackupRestoreType == lib.KOPIA {
// 	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
// 	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
// 		context.Background(),
// 		metav1.DeleteOptions{},
// 		metav1.ListOptions{LabelSelector: "e2e-app=true"},
// 	)
// 	gomega.Expect(err).ToNot(gomega.HaveOccurred())
// }
Contributor

⚠️ Potential issue | 🟠 Major

Don’t make this experiment the default KOPIA path.

The code comment still documents this restart as a workaround for a known post-restore networking issue, so commenting it out here will make every non-CLI KOPIA job depend on cluster-specific OVN behavior again. If this goes past DNM, keep the current behavior as default and disable it only via an env var/label or a dedicated experimental lane.

One way to gate the experiment
- // TODO: Testing whether this workaround is still needed. Remove if
- // file-system restore tests pass without it.
- //
+ // TODO: Remove this once filesystem-restore tests are stable without the OVN workaround.
  // For file-system backup restores (KOPIA/restic), the restored pods may have
  // broken networking because OVN-Kubernetes doesn't fully wire the network
  // namespace for pods recreated by Velero with a restore-wait init container.
  // Deleting the pods lets the deployment controller create fresh ones with
  // proper networking while preserving the restored PVC data.
- // if brCase.BackupRestoreType == lib.KOPIA {
- // 	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
- // 	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
- // 		context.Background(),
- // 		metav1.DeleteOptions{},
- // 		metav1.ListOptions{LabelSelector: "e2e-app=true"},
- // 	)
- // 	gomega.Expect(err).ToNot(gomega.HaveOccurred())
- // }
+ if brCase.BackupRestoreType == lib.KOPIA && os.Getenv("OADP_DISABLE_FS_RESTORE_POD_RESTART_WORKAROUND") != "true" {
+ 	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
+ 	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
+ 		context.Background(),
+ 		metav1.DeleteOptions{},
+ 		metav1.ListOptions{LabelSelector: "e2e-app=true"},
+ 	)
+ 	gomega.Expect(err).ToNot(gomega.HaveOccurred())
+ }

As per coding guidelines: focus on major issues impacting performance, readability, maintainability, and security. Avoid nitpicks and avoid verbosity.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// TODO: Testing whether this workaround is still needed. Remove if
// file-system restore tests pass without it.
//
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
if brCase.BackupRestoreType == lib.KOPIA {
	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "e2e-app=true"},
	)
	gomega.Expect(err).ToNot(gomega.HaveOccurred())
}
// if brCase.BackupRestoreType == lib.KOPIA {
// 	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
// 	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
// 		context.Background(),
// 		metav1.DeleteOptions{},
// 		metav1.ListOptions{LabelSelector: "e2e-app=true"},
// 	)
// 	gomega.Expect(err).ToNot(gomega.HaveOccurred())
// }

// TODO: Remove this once filesystem-restore tests are stable without the OVN workaround.
// For file-system backup restores (KOPIA/restic), the restored pods may have
// broken networking because OVN-Kubernetes doesn't fully wire the network
// namespace for pods recreated by Velero with a restore-wait init container.
// Deleting the pods lets the deployment controller create fresh ones with
// proper networking while preserving the restored PVC data.
if brCase.BackupRestoreType == lib.KOPIA && os.Getenv("OADP_DISABLE_FS_RESTORE_POD_RESTART_WORKAROUND") != "true" {
	log.Printf("Restarting pods in namespace %s to ensure proper networking after file-system restore", brCase.Namespace)
	err = kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "e2e-app=true"},
	)
	gomega.Expect(err).ToNot(gomega.HaveOccurred())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/backup_restore_suite_test.go` around lines 269 - 285, The
commented-out pod-restart workaround for KOPIA restores should be restored but
gated behind a configurable opt-in flag rather than enabled by default:
re-enable the block that checks if brCase.BackupRestoreType == lib.KOPIA and
calls
kubernetesClientForSuiteRun.CoreV1().Pods(brCase.Namespace).DeleteCollection(...),
but wrap it in a condition that reads an explicit environment variable or
test-case flag (e.g., ENABLE_KOPIA_RESTART or brCase.ExperimentalRestart bool)
that defaults to false; update test setup to read
os.Getenv("ENABLE_KOPIA_RESTART") (or the test harness config) and only perform
the DeleteCollection when that flag is true, and document the flag so jobs can
opt into the experimental behavior.

@kaovilai
Member Author

Show me some 🌽 :flakes:

/test all

@openshift-ci

openshift-ci bot commented Mar 25, 2026

@kaovilai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/4.22-e2e-test-aws
Commit: dbee900
Required: true
Rerun command: /test 4.22-e2e-test-aws


