
WIP OCPNODE-4055: Add comprehensive test suite for Additional Storage Support feature #31083

Draft
BhargaviGudi wants to merge 1 commit into openshift:main from BhargaviGudi:additionalArtifactsAutomation

Conversation

@BhargaviGudi
Contributor

@BhargaviGudi BhargaviGudi commented Apr 28, 2026

This adds API validation and E2E tests for the Additional Storage Configurations feature (TechPreviewNoUpgrade):

  • additionalArtifactStores (max 10 stores)
  • additionalImageStores (max 10 stores)
  • additionalLayerStores (max 5 stores, supports :ref suffix)
  • Combined storage configurations

Test suite: openshift/additional-storage-support

Summary by CodeRabbit

  • Tests
    • Added comprehensive API validation and E2E tests for CRI-O additional storage (artifact, image, layer, combined): creation, updates, deletions, edge-case validations, rollout behavior, and performance checks; includes disruptive and non-disruptive suites gated behind the feature flag.
  • Chores
    • Added test utilities and helpers for directory orchestration, stargz-store deployment/verification, storage config checks, MCP rollout polling, and registered a new built-in test suite for additional storage support.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Apr 28, 2026

@BhargaviGudi: This pull request references OCPNODE-4055 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

This adds API validation and E2E tests for the Additional Storage Configurations feature (TechPreviewNoUpgrade):

  • additionalArtifactStores (max 10 stores)
  • additionalImageStores (max 10 stores)
  • additionalLayerStores (max 5 stores, supports :ref suffix)
  • Combined storage configurations

Test suite: openshift/additional-storage-support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 28, 2026
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2026
@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Walkthrough

Adds end-to-end and API-validation tests for CRI-O additional storage (layer/image/artifact/combined), supporting helpers for node directory management, stargz-store deployment, and test utilities; and registers a new built-in test suite entry.

Changes

Additional storage tests and helpers

Layer / File(s) Summary
Helpers / Data
test/extended/node/node_utils.go
New MCP-wait helper, node directory creation (direct and MachineConfig), ContainerRuntimeConfig/DaemonSet/Pod helpers, image prepopulation, and polling utilities.
Stargz support
test/extended/node/stargz_store_setup.go
Adds StargzStoreSetup type and methods to deploy/verify/cleanup stargz-store via privileged DaemonSet and systemd unit; helpers to verify FUSE mount, check storage.conf, count snapshots, and create node-pinned test pods.
API validation tests
test/extended/node/additional_image_stores.go, .../additional_layer_stores.go, .../additional_artifact_stores.go, .../additional_stores_combined.go
Non-disruptive Ginkgo suites (feature-gated) exercising ContainerRuntimeConfig API validation: path formats, max counts, duplicates, invalid chars, length, consecutive slashes, and :ref rules across layer/image/artifact/combined configurations.
Disruptive E2E tests
test/extended/node/additional_image_stores.go, .../additional_layer_stores.go, .../additional_artifact_stores.go, .../additional_stores_combined.go
Serial disruptive suites (feature-gated, Azure-skipped) that create/remove/update CRCs, provision directories on worker nodes, wait for MCO/MCP rollouts, verify /etc/containers/storage.conf or CRI-O config entries, confirm crio active and node Ready, and exercise stargz-specific snapshot/pod behaviors.
Test registration
pkg/testsuites/standard_suites.go
Registers new built-in suite openshift/additional-storage-support (Feature:AdditionalStorageSupport), Parallelism=1, 60m timeout, marked Disruptive.

Sequence Diagram(s)

sequenceDiagram
    actor Tester
    participant API as Kubernetes API
    participant MCO as Machine Config Operator
    participant MCP as MachineConfigPool
    participant Node as Worker Node
    participant CRIO as CRI-O Service
    participant Stargz as stargz-store

    Tester->>API: Create/Update/Delete ContainerRuntimeConfig
    API->>MCO: Notify new CRC
    MCO->>MCO: Validate & emit MachineConfig
    MCO->>MCP: Deliver MachineConfig
    MCP->>Node: Trigger rollout (drain/upgrade/uncordon)
    Note over Node: Apply MachineConfig files
    Node->>CRIO: Restart CRI-O
    CRIO->>CRIO: Load updated storage configuration
    alt stargz enabled
      MCO->>Node: Deploy/enable stargz-store
      Node->>Stargz: Start FUSE mount & service
      Stargz->>CRIO: Provide layer store backend
    end
    Node->>Tester: Report config file presence & service active
    Tester->>API: Delete/Update CRC
    API->>MCO: Propagate removal/update
    MCO->>MCP: Emit updated MachineConfig
    MCP->>Node: Rollout update
    Node->>Tester: Confirm config removed/updated and nodes Ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 6 | ❌ 6

❌ Failed checks (6 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning TC11 lacks defer cleanup immediately after CRC creation (~line 507). TC12/TC13 don't guard empty pureWorkers slice. ~173 assertions lack meaningful failure messages across test files. Add defer cleanupContainerRuntimeConfig() after CRC creation in TC11. Add pureWorkers length guards to TC12/TC13. Add descriptive messages to all assertions per: Expect(x).NotTo(HaveOccurred(), "description")
Microshift Test Compatibility ⚠️ Warning ~50 test specs use config.openshift.io/v1 (FeatureGate) and machineconfiguration.openshift.io/v1 APIs unavailable on MicroShift. No MicroShift protections found. Add [apigroup:config.openshift.io] and [apigroup:machineconfiguration.openshift.io] tags to test names, or use [Skipped:MicroShift] labels for MicroShift CI auto-skip behavior.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning TC12 and TC13 in additional_layer_stores.go call getPureWorkerNodes() without checking if empty, violating SNO compatibility. Tests silently pass without validation on SNO. Add SNO guards after getPureWorkerNodes() calls: if len(pureWorkers) < 1 { e2eskipper.Skipf("Need at least 1 worker node") }. Or add [Skipped:SingleReplicaTopology] labels to all disruptive E2E test specs.
Topology-Aware Scheduling Compatibility ⚠️ Warning Test DaemonSets and Pods use nodeSelector targeting node-role.kubernetes.io/worker without topology-awareness checks. Fails on SNO/TNF/TNA where dedicated workers don't exist. Add topology checks using GetControlPlaneTopology() similar to kubeletconfig_features.go. Skip tests on non-HA topologies or handle control-plane nodes as fallback.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning Tests contain hardcoded external registry image references (quay.io) and external downloads (github.com) without [Skipped:Disconnected] markers, failing in IPv6-only disconnected environments. Add [Skipped:Disconnected] to TC10, TC11 test names in additional_image_stores.go and additional_layer_stores.go. For stargz_store_setup.go, either skip the tests or pre-bundle the stargz-snapshotter binary in the test image.
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding a comprehensive test suite for the Additional Storage Support feature (OCPNODE-4055), which aligns with all the file additions in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic with static strings only. No dynamic values like pod/node names, timestamps, UUIDs, or IP addresses in titles.
Ote Binary Stdout Contract ✅ Passed All test files follow proper Ginkgo patterns with only simple const declarations and g.Describe() suite registrations at module level. No process-level stdout writes detected.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/testsuites/standard_suites.go`:
- Around line 504-506: The suite's Qualifiers currently filter for
`"[Suite:openshift/additional-storage-support]"` but the tests are tagged
`"[Feature:AdditionalStorageSupport]"`, so update the Qualifiers slice in the
suite definition (the `Qualifiers: []string{ ... }` entry) to match the test
tag—e.g. replace or add `name.contains("[Feature:AdditionalStorageSupport]")`
(or include both selectors if you want to run both forms) so the new tests are
actually selected.

In `@test/extended/node/additional_artifact_stores.go`:
- Around line 460-481: The test races because it calls waitForMCP(...)
immediately after mutating a ContainerRuntimeConfig; add a prior gate call to
waitForMCPToStartUpdating(...) so the helper observes the pool beginning to
update before waiting for rollout: call waitForMCPToStartUpdating(ctx, mcClient,
"worker", <short timeout>) and check the error, then call waitForMCP(ctx,
mcClient, "worker", 25*time.Minute) as before; apply the same change for all
other ContainerRuntimeConfig mutation sites (e.g., the blocks referencing
ContainerRuntimeConfig, waitForMCP, and generatedMCName).
- Around line 45-89: The test currently only logs when an invalid
ContainerRuntimeConfig (ctrcfg) is accepted instead of failing the spec; change
the negative-path assertions so each invalid case explicitly fails the test:
inside the loop replace the "if err != nil { framework.Logf(...) } else { ... }"
logic with an assertion that the create returned an error (e.g.,
o.Expect(err).To(o.HaveOccurred())) OR, if the API accepts the object, poll MCO
status and assert the object reaches ContainerRuntimeConfigFailure=True (use the
existing cleanupContainerRuntimeConfig helper for accepted cases). Apply this
pattern for the current g.It block and the other similar blocks referenced
(lines 91-141, 143-177, 209-255, 257-291, 293-325) so tests fail when invalid
configs are accepted instead of merely logging.

In `@test/extended/node/additional_image_stores.go`:
- Around line 534-587: The test currently only logs failures from
prepopulateImageOnNode and registry pulls instead of failing; change it to
assert failures so the spec enforces use of additionalImageStores: replace the
soft log when prepopulateImageOnNode returns an error with a hard assertion
(e.g., o.Expect(err).NotTo(o.HaveOccurred()) or framework.Failf) referencing
prepopulateImageOnNode, and after scanning pod events use the pulledFromRegistry
boolean to assert the pod did NOT pull from the registry
(o.Expect(pulledFromRegistry).To(o.BeFalse()) or framework.Failf if true) so the
test fails when prepopulation fails or when a registry pull is observed for
testPod.
- Around line 612-615: The tests mutate ContainerRuntimeConfig then immediately
call waitForMCP(...), which can race if the MCP is still marked Updated on the
old config; insert a call to waitForMCPToStartUpdating(ctx, mcClient, "worker",
<suitable timeout>) immediately after each
waitForContainerRuntimeConfigSuccess(...) (and before the subsequent
waitForMCP(...)) to ensure the MCP has begun rolling before asserting readiness.
Update all occurrences mentioned (e.g., the instances around
waitForContainerRuntimeConfigSuccess -> waitForMCP at lines referenced and the
other listed locations) so the sequence is:
waitForContainerRuntimeConfigSuccess(...), waitForMCPToStartUpdating(...), then
waitForMCP(...).
- Around line 848-966: The baseline measurement is contaminated because the
image pulled by baselinePod remains in CRI-O local storage on testNode and
speeds up optimizedPod; after baselinePod finishes (or before creating
optimizedPod) remove the image from the node's local container storage so the
optimized test only benefits from additionalImageStores. Specifically, after
waitForPodDeleted/baseline cleanup and before prepopulateImageOnNode/creating
optimizedPod, run a removal on testNode (via ExecOnNodeWithChroot) to delete the
image (e.g., crictl rmi or podman/rpm command appropriate for CRI-O) targeting
testImageDefault, or alternatively change the baseline/optimized pod placement
or pull policies so the baseline pull occurs on a different node than testNode;
update references: baselinePod, optimizedPod, testNode, testImageDefault,
ExecOnNodeWithChroot, prepopulateImageOnNode, waitForPodDeleted accordingly.
- Around line 47-91: Tests currently treat API acceptance alone as success,
masking validation regressions; update the spec so each invalid-path case fails
unless the Create call returns an error or the MachineConfigOperator surfaces
the ContainerRuntimeConfigFailure=True condition. Concretely, in the test that
calls mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and in
the other similar blocks), change the success branch to poll the MCO for the
ContainerRuntimeConfigFailure condition for the created resource (and only defer
cleanupContainerRuntimeConfig if you will delete it on success), and assert the
test fails if neither the Create returned an error nor the MCO condition becomes
True within a timeout. Include these checks for all analogous test blocks
mentioned (lines ranges in the comment).

In `@test/extended/node/additional_layer_stores.go`:
- Around line 721-725: After mutating the ContainerRuntimeConfig (e.g., after
calling waitForContainerRuntimeConfigSuccess with ctrcfg.Name), ensure you block
until the MCP has left its old steady state by calling
waitForMCPToStartUpdating(ctx, mcClient, "worker", <suitable timeout>) before
calling waitForMCP(ctx, mcClient, "worker", ...); specifically, insert a call to
waitForMCPToStartUpdating between waitForContainerRuntimeConfigSuccess(...) and
waitForMCP(...). Apply the same insertion for the other similar flows referenced
(lines around the other waitForMCP usages) so the final waitForMCP polls the new
rollout rather than returning on the previous rendered config.
- Around line 167-213: The test currently treats successful creation of a
ContainerRuntimeConfig with an invalid path as a warning; change it to fail the
test when the API accepts an invalid path. In the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and
currently logs "Warning: Path ... accepted"), replace that branch to defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and then assert
failure using the test's assertion helpers (e.g.,
o.Expect(err).To(o.HaveOccurred(), "expected create to fail for path %s",
tc.path) or explicitly call g.Fail/fatal) so the test fails if the create
succeeds; keep the existing behavior that logs success when creation is
rejected.
- Around line 1448-1453: Replace the hardcoded personal image string assigned to
eStargzImage with the project-managed image lookup using image.LocationFor(...)
and add that image's repository:tag to the allowedImages registry in the image
utility so CI doesn't depend on a personal quay repo; specifically, add the
stable repository entry to allowedImages (in the image registry map) and change
eStargzImage to use image.LocationFor("your-org/your-image:tag") or the chosen
stable org image name following the same pattern used by other node tests.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 209-218: The test currently treats both success and failure as
passing because it only logs on error; change the branching to assert
explicitly: where you call
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}) (and the similar block at the other occurrence), replace
the current if/else with an assertion that err did not occur (e.g.,
o.Expect(err).NotTo(o.HaveOccurred()) or equivalent) when the spec is supposed
to be accepted, then defer cleanupContainerRuntimeConfig and run the o.Expect
checks on created.Spec...; if the test is intended to assert rejection instead,
assert that err did occur (o.Expect(err).To(HaveOccurred())) and remove the
success-path expectations. Reference symbols:
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create,
cleanupContainerRuntimeConfig,
created.Spec.ContainerRuntimeConfig.AdditionalLayerStores/AdditionalImageStores/AdditionalArtifactStores,
and apply the same change to the second block around the other Create call.
- Around line 563-566: After computing pureWorkers via getPureWorkerNodes
(called after getNodesByLabel), add a guard that fails the test when pureWorkers
is empty; specifically, immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) assert that len(pureWorkers) > 0 (e.g.,
o.Expect(len(pureWorkers)).NotTo(o.BeZero()) with a descriptive message) so
TC12/TC13 don’t silently no-op when there are zero pure workers.
- Around line 363-365: The current test only asserts an error occurred when
creating the ContainerRuntimeConfig but may match unrelated failures; after the
existing o.Expect(err).To(o.HaveOccurred()) add a stronger assertion that the
error message references the specific invalid image path used in the test (the
invalid image path value supplied when building ctrcfg). Locate the Create call
(mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and the
ctrcfg variable and assert err.Error() (or the testing matcher equivalent)
contains the exact invalid image path string used to construct ctrcfg so only
the intended validation rejection passes.
- Around line 70-73: The test is creating real ContainerRuntimeConfig objects
(mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create with
metav1.CreateOptions{}) which triggers MCO reconciliation; change these TC1–TC10
Create calls to use a dry-run by passing metav1.CreateOptions{DryRun:
[]string{"All"}} so the API validates without persisting, and remove or skip the
matching deferred cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) for
those dry-run cases since nothing is created; update every occurrence of
Create(...) and the subsequent defer in the file (references:
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create,
metav1.CreateOptions{}, and cleanupContainerRuntimeConfig) accordingly.

In `@test/extended/node/stargz_store_setup.go`:
- Around line 409-425: The loop over pureWorkers currently swallows errors from
ExecOnNodeWithChroot (used for "systemctl is-active stargz-store" and "mount"),
logging warnings and continuing, which allows Deploy to falsely succeed; change
both error branches to return a wrapped error instead of continuing so the
caller fails fast (include node.Name and the original err in the returned
fmt.Errorf) and keep the existing status check that returns an error when the
service is not "active"; update the branches around ExecOnNodeWithChroot calls
in this block to propagate errors rather than logging and continuing.
- Around line 79-107: The Cleanup method (StargzStoreSetup.Cleanup) currently
only deletes the namespace (s.namespace) and must also undo node-level changes:
for each worker node reachable via s.oc (e.g., using the cluster's node exec/SSH
helper), stop and disable the stargz-store systemd unit, remove the installed
binary and config directories (the host paths the deployment created), reload
systemd, and restart CRI-O to clear side effects; perform these steps before or
after namespace delete, ensure s.deployed is set false only after node cleanup,
and log warnings on non-fatal failures while returning an error only on fatal
failures.
- Around line 45-75: Set the setup as deployed earlier and make cleanup safe to
call on partial failures: in Deploy, mark s.deployed = true immediately after
the cluster-mutating resources exist (e.g., after successful createNamespace and
createConfigMap or at least right after createDaemonSet is called) so
AfterEach/Cleanup will run even on later verification failures; also update
Cleanup/IsDeployed/Cleanup helpers to be idempotent (safe to delete
namespace/daemonset/configmap even if some resources were never created) and
make createDaemonSet, waitForDaemonSetReady, and verifyStargzStoreRunning
tolerant of repeated cleanup attempts.
- Around line 217-220: The test currently downloads the stargz-snapshotter
binary via curl (the stargz-snapshotter release fetch block); remove that
external GitHub download and instead load the binary from a cluster-local or
prepackaged source: change the setup to first check a configurable local path or
environment variable (e.g., STARGZ_SNAPSHOTTER_BIN) and use that binary if
present, otherwise fall back to a cluster-local payload URL served inside CI;
also remove the curl/URL usage and add a clear error message if neither local
binary nor cluster payload is available so tests fail fast in restricted-network
environments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 626a6a31-816a-4934-9431-1563660f7dba

📥 Commits

Reviewing files that changed from the base of the PR and between b6f09c2 and bf419f3.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go

Comment thread test/extended/node/additional_artifact_stores.go
Comment on lines +45 to +89
// TC1: Validate Path Format Restrictions
g.It("should reject invalid path formats for additionalArtifactStores [TC1]", func(ctx context.Context) {
	mcClient, err := mcclient.NewForConfig(oc.KubeFramework().ClientConfig())
	o.Expect(err).NotTo(o.HaveOccurred())

	invalidPaths := []struct {
		name        string
		path        string
		description string
	}{
		{"relative-path", "relative/path", "relative path without leading slash"},
		{"empty-path", "", "empty path"},
	}

	for _, tc := range invalidPaths {
		g.By(fmt.Sprintf("Testing invalid path: %s (%s)", tc.path, tc.description))
		ctrcfg := &machineconfigv1.ContainerRuntimeConfig{
			ObjectMeta: metav1.ObjectMeta{
				Name: fmt.Sprintf("artifact-invalid-path-test-%s", tc.name),
			},
			Spec: machineconfigv1.ContainerRuntimeConfigSpec{
				MachineConfigPoolSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{
						"pools.operator.machineconfiguration.openshift.io/worker": "",
					},
				},
				ContainerRuntimeConfig: &machineconfigv1.ContainerRuntimeConfiguration{
					AdditionalArtifactStores: []machineconfigv1.AdditionalArtifactStore{
						{Path: machineconfigv1.StorePath(tc.path)},
					},
				},
			},
		}

		_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
		if err != nil {
			framework.Logf("Path '%s' correctly rejected at API level: %v", tc.path, err)
		} else {
			defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
			framework.Logf("Path '%s' accepted at API level, checking MCO validation", tc.path)
		}
	}

	framework.Logf("Test PASSED: Invalid path formats handled correctly")
})

⚠️ Potential issue | 🟠 Major

These “should reject …” specs do not actually fail when the invalid config is accepted.

Several negative-path tests only log a warning, or poll the status without asserting that MCO set a failure condition. If validation regresses, the suite will still go green. Each of these specs should Expect a create error or explicitly Expect ContainerRuntimeConfigFailure=True.

Also applies to: 91-141, 143-177, 209-255, 257-291, 293-325


Comment on lines +563 to +566
workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
o.Expect(err).NotTo(o.HaveOccurred())
pureWorkers := getPureWorkerNodes(workerNodes)


⚠️ Potential issue | 🟠 Major

Guard TC12/TC13 against zero pure workers.

Line 565 and Line 660 compute pureWorkers but don’t enforce non-empty. If it’s empty, setup and verification loops can no-op and the test passes without validation.

Proposed guard (same pattern in TC12 and TC13)
 workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
 o.Expect(err).NotTo(o.HaveOccurred())
 pureWorkers := getPureWorkerNodes(workerNodes)
+if len(pureWorkers) < 1 {
+    e2eskipper.Skipf("Need at least 1 pure worker node for this test")
+}

Also applies to: 658-661


Comment on lines +45 to +75
func (s *StargzStoreSetup) Deploy(ctx context.Context) error {
	framework.Logf("Deploying stargz-store to cluster...")

	// Create namespace
	if err := s.createNamespace(ctx); err != nil {
		return fmt.Errorf("failed to create namespace: %w", err)
	}

	// Create ConfigMap with stargz-store config and systemd service
	if err := s.createConfigMap(ctx); err != nil {
		return fmt.Errorf("failed to create configmap: %w", err)
	}

	// Create DaemonSet
	if err := s.createDaemonSet(ctx); err != nil {
		return fmt.Errorf("failed to create daemonset: %w", err)
	}

	// Wait for DaemonSet to be ready
	if err := s.waitForDaemonSetReady(ctx, 10*time.Minute); err != nil {
		return fmt.Errorf("failed waiting for daemonset: %w", err)
	}

	// Verify stargz-store is running on nodes
	if err := s.verifyStargzStoreRunning(ctx); err != nil {
		return fmt.Errorf("stargz-store verification failed: %w", err)
	}

	s.deployed = true
	framework.Logf("stargz-store deployed successfully")
	return nil

⚠️ Potential issue | 🟠 Major

Partial deploy failures will leak cluster state.

AfterEach only calls Cleanup when IsDeployed() is true, but deployed is set only after all verification succeeds. If createDaemonSet, rollout, or node verification fails after any host mutation, cleanup is skipped entirely. Make cleanup idempotent and mark the setup as deployed as soon as the namespace/resources exist.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 45 - 75, Set the setup
as deployed earlier and make cleanup safe to call on partial failures: in
Deploy, mark s.deployed = true immediately after the cluster-mutating resources
exist (e.g., after successful createNamespace and createConfigMap or at least
right after createDaemonSet is called) so AfterEach/Cleanup will run even on
later verification failures; also update Cleanup/IsDeployed/Cleanup helpers to
be idempotent (safe to delete namespace/daemonset/configmap even if some
resources were never created) and make createDaemonSet, waitForDaemonSetReady,
and verifyStargzStoreRunning tolerant of repeated cleanup attempts.
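The shape of that fix can be sketched with an in-memory stand-in for the cluster API (the `setup` type, the `cluster` map, and the step names are illustrative, not the real client calls): mark `deployed` before the fallible verification steps, and make cleanup a no-op for resources that never got created.

```go
package main

import (
	"errors"
	"fmt"
)

// setup simulates StargzStoreSetup; cluster stands in for real API objects.
type setup struct {
	deployed bool
	cluster  map[string]bool
}

var errVerify = errors.New("verification failed")

func (s *setup) deploy(failVerify bool) error {
	s.cluster["namespace"] = true
	// Set deployed BEFORE later steps, so AfterEach still runs Cleanup
	// when daemonset rollout or node verification fails partway through.
	s.deployed = true
	s.cluster["configmap"] = true
	s.cluster["daemonset"] = true
	if failVerify {
		return fmt.Errorf("deploy: %w", errVerify)
	}
	return nil
}

// cleanup is idempotent: removing a missing resource is a no-op,
// mirroring the apierrors.IsNotFound tolerance the real code needs.
func (s *setup) cleanup() {
	for _, r := range []string{"daemonset", "configmap", "namespace"} {
		delete(s.cluster, r)
	}
	s.deployed = false
}

func main() {
	s := &setup{cluster: map[string]bool{}}
	err := s.deploy(true) // verification fails after resources exist
	fmt.Println("deployed after failed deploy:", s.deployed, "err:", err != nil)
	if s.deployed {
		s.cleanup()
	}
	fmt.Println("resources left:", len(s.cluster))
}
```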

Comment on lines +79 to +107
func (s *StargzStoreSetup) Cleanup(ctx context.Context) error {
if !s.deployed {
return nil
}

framework.Logf("Cleaning up stargz-store...")

// Delete namespace (cascades to all resources)
err := s.oc.AdminKubeClient().CoreV1().Namespaces().Delete(ctx, s.namespace, metav1.DeleteOptions{})
if err != nil && !apierrors.IsNotFound(err) {
framework.Logf("Warning: failed to delete namespace %s: %v", s.namespace, err)
}

// Wait for namespace to be deleted
err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
_, err := s.oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, s.namespace, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
return true, nil
}
return false, nil
})

if err != nil {
framework.Logf("Warning: namespace deletion timed out: %v", err)
}

s.deployed = false
framework.Logf("stargz-store cleanup completed")
return nil

⚠️ Potential issue | 🔴 Critical

Cleanup does not actually uninstall stargz-store from the nodes.

This only deletes the namespace. The binary, config, systemd unit, enabled service, and CRI-O restart side effects all remain on each worker, so later specs inherit modified nodes and can pass or fail for the wrong reason. Cleanup needs to stop/disable the service on every worker, remove the host files/directories, reload systemd, and restart CRI-O before dropping the namespace.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 79 - 107, The Cleanup
method (StargzStoreSetup.Cleanup) currently only deletes the namespace
(s.namespace) and must also undo node-level changes: for each worker node
reachable via s.oc (e.g., using the cluster's node exec/SSH helper), stop and
disable the stargz-store systemd unit, remove the installed binary and config
directories (the host paths the deployment created), reload systemd, and restart
CRI-O to clear side effects; perform these steps before or after namespace
delete, ensure s.deployed is set false only after node cleanup, and log warnings
on non-fatal failures while returning an error only on fatal failures.
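The per-node teardown ordering can be sketched as follows; `cleanupNode` and the injected `execOnNode` are hypothetical stand-ins for a helper built on `ExecOnNodeWithChroot`, and the exact host paths are assumptions about what the deploy script installs:

```go
package main

import "fmt"

// cleanupNode runs best-effort teardown steps on one worker. Failures are
// logged and skipped so the whole sequence is safe to re-run.
func cleanupNode(execOnNode func(args ...string) error) error {
	steps := [][]string{
		{"systemctl", "stop", "stargz-store"},
		{"systemctl", "disable", "stargz-store"},
		{"rm", "-f", "/usr/local/bin/stargz-store", "/etc/systemd/system/stargz-store.service"},
		{"rm", "-rf", "/etc/stargz-store", "/var/lib/stargz-store"},
		{"systemctl", "daemon-reload"},
		{"systemctl", "restart", "crio"},
	}
	for _, step := range steps {
		if err := execOnNode(step...); err != nil {
			fmt.Println("warning: cleanup step failed:", err) // non-fatal
		}
	}
	return nil
}

func main() {
	var ran [][]string
	_ = cleanupNode(func(args ...string) error {
		ran = append(ran, args)
		return nil
	})
	fmt.Println("steps executed:", len(ran))
}
```

The real Cleanup would loop this over every pure worker before deleting the namespace, and only then set `s.deployed = false`.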

Comment on lines +217 to +220
# Download stargz-snapshotter release
echo "Downloading stargz-snapshotter %s..."
curl -L -o /tmp/stargz.tar.gz \
https://github.com/containerd/stargz-snapshotter/releases/download/%s/stargz-snapshotter-%s-linux-amd64.tar.gz

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the target file and surrounding context
head -n 230 test/extended/node/stargz_store_setup.go | tail -n 25

Repository: openshift/origin

Length of output: 949


🏁 Script executed:

# Search for references to stargz in the test setup
rg "stargz" test/extended/node/ -A 3 -B 3

Repository: openshift/origin

Length of output: 50372


🏁 Script executed:

# Check if there are any test image definitions or Dockerfiles
fd -e Dockerfile -e dockerfile test/

Repository: openshift/origin

Length of output: 386


🏁 Script executed:

# Look for environment variable handling or cluster-local payload alternatives
rg "stargz|cluster.*local|payload" test/extended/ --type go -i

Repository: openshift/origin

Length of output: 50374


Remove GitHub release download dependency during test initialization.

The curl download from a public GitHub release (lines 217-220) introduces an external internet dependency that makes the test vulnerable to network flakiness and restricted network policies. In CI environments with limited or no external connectivity, this turns the test suite into an infrastructure verification rather than a feature verification.

Pre-package the stargz-snapshotter binary (v0.18.2) in the test container image or implement a cluster-local payload source with a fallback mechanism to handle environments where GitHub is unreachable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 217 - 220, The test
currently downloads the stargz-snapshotter binary via curl (the
stargz-snapshotter release fetch block); remove that external GitHub download
and instead load the binary from a cluster-local or prepackaged source: change
the setup to first check a configurable local path or environment variable
(e.g., STARGZ_SNAPSHOTTER_BIN) and use that binary if present, otherwise fall
back to a cluster-local payload URL served inside CI; also remove the curl/URL
usage and add a clear error message if neither local binary nor cluster payload
is available so tests fail fast in restricted-network environments.

Comment on lines +409 to +425
for _, node := range pureWorkers {
// Check stargz-store service status
output, err := ExecOnNodeWithChroot(s.oc, node.Name, "systemctl", "is-active", "stargz-store")
if err != nil {
framework.Logf("Warning: failed to check stargz-store status on node %s: %v", node.Name, err)
continue
}

if strings.TrimSpace(output) != "active" {
return fmt.Errorf("stargz-store is not active on node %s (status: %s)", node.Name, strings.TrimSpace(output))
}

// Check FUSE mount
mountOutput, err := ExecOnNodeWithChroot(s.oc, node.Name, "mount")
if err != nil {
framework.Logf("Warning: failed to check mounts on node %s: %v", node.Name, err)
continue

⚠️ Potential issue | 🟠 Major

Fail verification on per-node command errors instead of logging and continuing.

If ExecOnNodeWithChroot fails on a worker, Deploy can still report success even though that node was never validated. That produces false greens and very hard-to-debug downstream failures in the E2E specs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 409 - 425, The loop
over pureWorkers currently swallows errors from ExecOnNodeWithChroot (used for
"systemctl is-active stargz-store" and "mount"), logging warnings and
continuing, which allows Deploy to falsely succeed; change both error branches
to return a wrapped error instead of continuing so the caller fails fast
(include node.Name and the original err in the returned fmt.Errorf) and keep the
existing status check that returns an error when the service is not "active";
update the branches around ExecOnNodeWithChroot calls in this block to propagate
errors rather than logging and continuing.
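The fail-fast shape can be sketched like this; `verifyNodes` and the injected `execOnNode` are hypothetical stand-ins for the loop over `pureWorkers` and `ExecOnNodeWithChroot`:

```go
package main

import (
	"fmt"
	"strings"
)

// verifyNodes propagates per-node command failures instead of logging and
// continuing, so a partially verified deploy can never report success.
func verifyNodes(nodes []string, execOnNode func(node string, args ...string) (string, error)) error {
	for _, n := range nodes {
		out, err := execOnNode(n, "systemctl", "is-active", "stargz-store")
		if err != nil {
			return fmt.Errorf("checking stargz-store on node %s: %w", n, err)
		}
		if strings.TrimSpace(out) != "active" {
			return fmt.Errorf("stargz-store is not active on node %s (status: %s)", n, strings.TrimSpace(out))
		}
	}
	return nil
}

func main() {
	err := verifyNodes([]string{"w1", "w2"}, func(node string, args ...string) (string, error) {
		if node == "w2" {
			return "", fmt.Errorf("connection refused")
		}
		return "active\n", nil
	})
	fmt.Println(err)
}
```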

@BhargaviGudi force-pushed the additionalArtifactsAutomation branch from bf419f3 to f09da80 on April 28, 2026 07:43
@openshift-ci Bot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 28, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (9)
test/extended/node/additional_layer_stores.go (3)

167-213: ⚠️ Potential issue | 🟠 Major

TC5 still allows a green pass when invalid paths are accepted.

In the success branch, the test only logs a warning and continues, so invalid-character validation regressions won’t fail the spec.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 167 - 213, The
test loop currently treats acceptance of invalid paths as a warning instead of
failing the spec; update the loop in the "should reject path with invalid
characters for additionalLayerStores [TC5]" test so that when
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create returns no
error for a test case (i.e., the API accepted the invalid path), the test fails
(use the test assertion helper, e.g., o.Expect(err).To(o.HaveOccurred()) or
o.Expect(false).To(o.BeTrue()) with a clear message) rather than merely logging
a warning; keep the cleanup via cleanupContainerRuntimeConfig(ctx, mcClient,
ctrcfg.Name) for the success branch but ensure the assertion runs so the spec
fails on regressions.

721-725: ⚠️ Potential issue | 🟠 Major

Several mutation flows still miss MCP “start updating” synchronization.

Direct waitForMCP(...) calls here can complete against the pre-change state and race subsequent assertions.

Also applies to: 738-740, 791-795, 815-819, 861-866, 938-942, 1049-1053, 1129-1135, 1192-1212, 1496-1500, 1726-1730

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 721 - 725, The
direct waitForMCP(...) calls can race the MCP "start updating" transition;
before each waitForMCP(ctx, mcClient, "worker", ...), insert a call to a
synchronization helper that waits for the MCP to enter the "Updating" state
(e.g., call waitForMCPStartUpdating(ctx, mcClient, "worker", <short timeout>))
so the subsequent waitForMCP observes the in-progress update rather than the
pre-change state; update every occurrence (the shown call and the other
duplicated spots) to perform waitForMCPStartUpdating(...) immediately before
waitForMCP(...).

1435-1453: ⚠️ Potential issue | 🟠 Major

Replace the personal external image dependency with a project-managed image lookup.

Hardcoding quay.io/bgudi/test-small:estargz makes CI fragile (ownership/tag availability risk). Prefer a repository-managed image via the standard test-image registry mechanism.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 1435 - 1453,
Replace the hardcoded personal image string assigned to eStargzImage in the test
(additional_layer_stores.go) with the project-managed test-image lookup helper
(instead of "quay.io/bgudi/test-small:estargz"); call the centralized test image
registry/helper used elsewhere in the repo (for example the project’s test image
getter or framework image helper) to obtain the estargz test image, assign its
result to eStargzImage, and remove the literal so CI uses a repository-managed
image tag.
test/extended/node/additional_image_stores.go (4)

612-616: ⚠️ Potential issue | 🟠 Major

Rollout wait is racy without MCP “start updating” gate.

These flows mutate ContainerRuntimeConfig and then immediately call waitForMCP(...), which can observe the prior steady state and return early.

Proposed sequence for each mutation path
 err = waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, 5*time.Minute)
 o.Expect(err).NotTo(o.HaveOccurred())

+err = waitForMCPToStartUpdating(ctx, mcClient, "worker", 5*time.Minute)
+o.Expect(err).NotTo(o.HaveOccurred())
+
 err = waitForMCP(ctx, mcClient, "worker", 25*time.Minute)
 o.Expect(err).NotTo(o.HaveOccurred())

Also applies to: 629-631, 682-686, 706-710, 753-757, 809-813, 1028-1032, 1100-1104, 1210-1214, 1290-1296, 1353-1357, 1369-1373

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 612 - 616, The
test is racy because after mutating ContainerRuntimeConfig (ctrcfg.Name) the
code calls waitForMCP(...) immediately and may see the old steady state; update
each mutation flow (where waitForContainerRuntimeConfigSuccess(ctx, mcClient,
ctrcfg.Name, ...) is followed by waitForMCP(ctx, mcClient, "worker", ...)) to
first wait for the MCP to enter the "updating" phase (i.e., a start-updating
gate) before calling waitForMCP for completion: add a helper/check that polls
the MachineConfigPool/MC status for .status.degraded===false and
.status.conditions includes an Updating/Progressing condition (or reuse an
existing waitForMCPState helper if present) using mcClient and ctrcfg.Name, then
only call waitForMCP(ctx, mcClient, "worker", ...) to wait for finished rollout;
apply this change to all similar sites (the listed line groups).

47-91: ⚠️ Potential issue | 🟠 Major

Negative-path specs can pass even when invalid configs are accepted.

These branches only log on successful create and/or don’t assert the poll result, so validation regressions can still report green.

Proposed pattern fix
- _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
- if err != nil {
-   framework.Logf("... rejected ...")
- } else {
-   defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
-   framework.Logf("... accepted ..., checking MCO validation")
- }
+ _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+ if err == nil {
+   defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
+   err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
+     cfg, getErr := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Get(ctx, ctrcfg.Name, metav1.GetOptions{})
+     if getErr != nil {
+       return false, getErr
+     }
+     for _, condition := range cfg.Status.Conditions {
+       if condition.Type == machineconfigv1.ContainerRuntimeConfigFailure && condition.Status == corev1.ConditionTrue {
+         return true, nil
+       }
+     }
+     return false, nil
+   })
+ }
+ o.Expect(err).To(o.HaveOccurred(), "invalid config must be rejected by API or by MCO failure condition")

Also applies to: 120-142, 145-179, 211-257, 259-293, 295-327

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 47 - 91, The test
currently only logs when an invalid ContainerRuntimeConfig (ctrcfg) is accepted,
so create success can silently pass; update the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to assert
outcomes: if Create returns an error, keep the current log; if Create succeeds,
immediately defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and
then poll the MCO validation (replace the current log path) — use an assertion
(o.Expect / g.Eventually) to fail the test if the config remains accepted by the
MCO (i.e., wait for the object to be rejected/rolled back or for the MCO to
surface an error), and fail the test instead of just logging; ensure these
checks reference ctrcfg, ContainerRuntimeConfigs().Create, and
cleanupContainerRuntimeConfig so the validation branch cannot silently pass.

535-538: ⚠️ Potential issue | 🟠 Major

Feature assertion is soft; tests still pass on prepopulation or fallback pull failures.

Both TC10 and TC15 only log when prepopulation fails or registry pulls occur, so the core “use additionalImageStores” behavior isn’t enforced.

Also applies to: 559-587, 933-937, 989-999

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 535 - 538, The
tests currently only log failures from prepopulateImageOnNode and registry pulls
(e.g., calls using prepopulateImageOnNode(ctx, oc, testNode, testImageDefault,
additionalImageStorePath)), making the feature check soft; change those log-only
branches to assert/fail the test instead (use the test framework's failure
helper such as framework.Failf or ExpectNoError) so the test fails when
prepopulation or pull fails; apply the same change to the other occurrences
where prepopulateImageOnNode or registry pull errors are merely logged (the
duplicated call sites mentioned) so the behavior is actually enforced rather
than only advertised in logs.

850-883: ⚠️ Potential issue | 🟠 Major

Performance comparison is biased by baseline cache warm-up.

The baseline pod pulls the same image on the same node, but local image cache is not cleared before the optimized run. The “optimized faster” result can be unrelated to additionalImageStores.

Also applies to: 939-987

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 850 - 883, The
baseline measurement is biased by a warm image cache because the same image
(testImageDefault) is pulled on the node and not removed before the optimized
run; fix by ensuring a consistent cache state before each measurement: remove
the image from the node(s) (or use a fresh unique image tag) prior to creating
baselinePod and again prior to the optimized run so both runs start from the
same cold-cache state. Locate the pod creation flow (baselinePod,
waitForPodRunning) and add a pre-step that evicts the image (for example by
running a node-level image remove via the test harness/oc or executing a
privileged pod that runs the container runtime image remove) or switch to using
a uniquely tagged test image for each run; apply the same change for the similar
block around the second range referenced in the comment.
test/extended/node/additional_artifact_stores.go (2)

45-89: ⚠️ Potential issue | 🟠 Major

Invalid-input tests can false-pass when bad configs are accepted.

Several “should reject …” specs only log acceptance (or poll without asserting final failure), so regressions can pass undetected.

Also applies to: 91-141, 143-177, 209-255, 257-291, 293-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 45 - 89, The
test currently just logs when an invalid ContainerRuntimeConfig is accepted (in
the loop around invalidPaths) which can false-pass; change the logic so that
after mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
returns nil you do not merely log but explicitly fail the test if MCO does not
reject the bad config: use the Gomega/o.Expect (or g.Fail) to assert the
creation should not succeed, and if you must accept API creation for negative
tests then replace the log+defer with a deterministic check that the MCO rejects
the config (polling status/conditions or checking the rendered/validated state)
and call o.Expect(...).To(o.BeTrue())/o.Fail() on the rejection condition; keep
cleanupContainerRuntimeConfig to remove any created resources on failure. Ensure
this pattern is applied to the invalid-path loop and the other similar blocks
referenced (the other test ranges).

460-481: ⚠️ Potential issue | 🟠 Major

Add MCP update-start gating before readiness waits.

After CRC create/update/delete, these tests call waitForMCP(...) directly. That can return on the old rendered state before rollout actually starts.

Also applies to: 548-567, 618-646, 688-693, 744-749, 810-814, 920-924, 1000-1006, 1063-1083

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 460 - 481, The
test must gate the rollout readiness wait with an MCP "update-start" check after
creating/updating/deleting the ContainerRuntimeConfig and after you discover the
generated MachineConfig (variables/functions: ContainerRuntimeConfig,
generatedMCName, mcClient, waitForMCP); add a short wait that asserts the worker
MachineConfigPool has begun rolling out the new config (for example
implement/use a helper like waitForMCPUpdateStart(ctx, mcClient, "worker",
generatedMCName, shortTimeout) or poll the worker MachineConfigPool status and
return when its current/configuration name equals generatedMCName and
UpdatedMachineCount < MachineCount or when a Condition/Updating flag is set)
before calling the existing waitForMCP(...) readiness waiter so the test no
longer returns the old rendered state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/additional_image_stores.go`:
- Line 883: The call to waitForPodDeleted(ctx, oc, baselinePod.Name,
2*time.Minute) ignores its returned error which can mask pod-deletion timing
failures; update the test to capture and assert the error (e.g., use
require.NoError/Expect/if err != nil { t.Fatalf(...) }) immediately after
calling waitForPodDeleted with the same ctx, oc and baselinePod.Name so the test
fails fast on deletion timeouts instead of proceeding and introducing racey
cleanup behavior.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 1605-1623: The test currently ignores errors from
ExecOnNodeWithChroot calls (used for "crictl rmi", "crictl images", and "ls
/var/lib/stargz-store/store/"), which can hide failures; modify the code around
the ExecOnNodeWithChroot invocations that set rmiOutput, crioImages, and
storeAfterDelete to capture and check the returned error and fail the test when
non-nil (e.g., use o.Expect(err).NotTo(o.HaveOccurred()) or return the error),
and ensure any useful output (rmiOutput, crioImages, storeAfterDelete) is still
logged only after the error check so the test reliably asserts that the CRI-O
image removal and subsequent verifications actually succeeded.

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 45-89: The test currently just logs when an invalid
ContainerRuntimeConfig is accepted (in the loop around invalidPaths) which can
false-pass; change the logic so that after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...) returns
nil you do not merely log but explicitly fail the test if MCO does not reject
the bad config: use the Gomega/o.Expect (or g.Fail) to assert the creation
should not succeed, and if you must accept API creation for negative tests then
replace the log+defer with a deterministic check that the MCO rejects the config
(polling status/conditions or checking the rendered/validated state) and call
o.Expect(...).To(o.BeTrue())/o.Fail() on the rejection condition; keep
cleanupContainerRuntimeConfig to remove any created resources on failure. Ensure
this pattern is applied to the invalid-path loop and the other similar blocks
referenced (the other test ranges).
- Around line 460-481: The test must gate the rollout readiness wait with an MCP
"update-start" check after creating/updating/deleting the ContainerRuntimeConfig
and after you discover the generated MachineConfig (variables/functions:
ContainerRuntimeConfig, generatedMCName, mcClient, waitForMCP); add a short wait
that asserts the worker MachineConfigPool has begun rolling out the new config
(for example implement/use a helper like waitForMCPUpdateStart(ctx, mcClient,
"worker", generatedMCName, shortTimeout) or poll the worker MachineConfigPool
status and return when its current/configuration name equals generatedMCName and
UpdatedMachineCount < MachineCount or when a Condition/Updating flag is set)
before calling the existing waitForMCP(...) readiness waiter so the test no
longer returns the old rendered state.

In `@test/extended/node/additional_image_stores.go`:
- Around line 612-616: The test is racy because after mutating
ContainerRuntimeConfig (ctrcfg.Name) the code calls waitForMCP(...) immediately
and may see the old steady state; update each mutation flow (where
waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, ...) is
followed by waitForMCP(ctx, mcClient, "worker", ...)) to first wait for the MCP
to enter the "updating" phase (i.e., a start-updating gate) before calling
waitForMCP for completion: add a helper/check that polls the
MachineConfigPool/MC status for .status.degraded===false and .status.conditions
includes an Updating/Progressing condition (or reuse an existing waitForMCPState
helper if present) using mcClient and ctrcfg.Name, then only call
waitForMCP(ctx, mcClient, "worker", ...) to wait for finished rollout; apply
this change to all similar sites (the listed line groups).
- Around line 47-91: The test currently only logs when an invalid
ContainerRuntimeConfig (ctrcfg) is accepted, so create success can silently
pass; update the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to assert
outcomes: if Create returns an error, keep the current log; if Create succeeds,
immediately defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and
then poll the MCO validation (replace the current log path) — use an assertion
(o.Expect / g.Eventually) to fail the test if the config remains accepted by the
MCO (i.e., wait for the object to be rejected/rolled back or for the MCO to
surface an error), and fail the test instead of just logging; ensure these
checks reference ctrcfg, ContainerRuntimeConfigs().Create, and
cleanupContainerRuntimeConfig so the validation branch cannot silently pass.
- Around line 535-538: The tests currently only log failures from
prepopulateImageOnNode and registry pulls (e.g., calls using
prepopulateImageOnNode(ctx, oc, testNode, testImageDefault,
additionalImageStorePath)), making the feature check soft; change those log-only
branches to assert/fail the test instead (use the test framework's failure
helper such as framework.Failf or ExpectNoError) so the test fails when
prepopulation or pull fails; apply the same change to the other occurrences
where prepopulateImageOnNode or registry pull errors are merely logged (the
duplicated call sites mentioned) so the behavior is actually enforced rather
than only advertised in logs.
- Around line 850-883: The baseline measurement is biased by a warm image cache
because the same image (testImageDefault) is pulled on the node and not removed
before the optimized run; fix by ensuring a consistent cache state before each
measurement: remove the image from the node(s) (or use a fresh unique image tag)
prior to creating baselinePod and again prior to the optimized run so both runs
start from the same cold-cache state. Locate the pod creation flow (baselinePod,
waitForPodRunning) and add a pre-step that evicts the image (for example by
running a node-level image remove via the test harness/oc or executing a
privileged pod that runs the container runtime image remove) or switch to using
a uniquely tagged test image for each run; apply the same change for the similar
block around the second range referenced in the comment.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 167-213: The test loop currently treats acceptance of invalid
paths as a warning instead of failing the spec; update the loop in the "should
reject path with invalid characters for additionalLayerStores [TC5]" test so
that when mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create
returns no error for a test case (i.e., the API accepted the invalid path), the
test fails (use the test assertion helper, e.g.,
o.Expect(err).To(o.HaveOccurred()) or o.Expect(false).To(o.BeTrue()) with a
clear message) rather than merely logging a warning; keep the cleanup via
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) for the success branch
but ensure the assertion runs so the spec fails on regressions.
- Around line 721-725: The direct waitForMCP(...) calls can race the MCP "start
updating" transition; before each waitForMCP(ctx, mcClient, "worker", ...),
insert a call to a synchronization helper that waits for the MCP to enter the
"Updating" state (e.g., call waitForMCPStartUpdating(ctx, mcClient, "worker",
<short timeout>)) so the subsequent waitForMCP observes the in-progress update
rather than the pre-change state; update every occurrence (the shown call and
the other duplicated spots) to perform waitForMCPStartUpdating(...) immediately
before waitForMCP(...).
- Around line 1435-1453: Replace the hardcoded personal image string assigned to
eStargzImage in the test (additional_layer_stores.go) with the project-managed
test-image lookup helper (instead of "quay.io/bgudi/test-small:estargz"); call
the centralized test image registry/helper used elsewhere in the repo (for
example the project’s test image getter or framework image helper) to obtain the
estargz test image, assign its result to eStargzImage, and remove the literal so
CI uses a repository-managed image tag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 60404f2d-f230-4e26-93ed-7e3d1c85b473

📥 Commits

Reviewing files that changed from the base of the PR and between bf419f3 and f09da80.

📒 Files selected for processing (6)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/node/stargz_store_setup.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/node/additional_stores_combined.go

@BhargaviGudi force-pushed the additionalArtifactsAutomation branch from f09da80 to af7b257 on April 30, 2026 09:50
@BhargaviGudi
Contributor Author

/test all


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (6)
test/extended/node/stargz_store_setup.go (4)

411-425: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on per-node verification command errors.

If ExecOnNodeWithChroot(...) fails on a worker, Deploy() currently logs and continues. That allows success without actually verifying every targeted node.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 411 - 425, The
per-node checks in Deploy currently swallow ExecOnNodeWithChroot errors (they
call framework.Logf and continue), which can allow Deploy to report success
without verifying all nodes; change the behavior so that when
ExecOnNodeWithChroot returns a non-nil err (both for the "systemctl is-active"
call and the "mount" call) the function returns an error immediately (e.g.,
fmt.Errorf) that includes the node name and the original error; update the logic
around ExecOnNodeWithChroot usages in Deploy to fail-fast instead of continuing
so calling code knows verification failed.

79-107: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Cleanup only deletes the namespace; it does not revert host changes.

The service, binary, systemd unit, config, mount, and CRI-O side effects remain on each worker after Cleanup(). That contaminates later specs with modified nodes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 79 - 107, The Cleanup
method on StargzStoreSetup only deletes the namespace but must also undo
host-level changes applied during deployment: update Cleanup to iterate the
affected nodes and idempotently stop/disable the stargz-store systemd unit,
remove the installed binary and config files, unmount any mounts created, revert
any CRI-O config changes (and reload/restart CRI-O/kubelet as needed), and
remove any sidecar/static-pod artifacts; use the StargzStoreSetup.Cleanup
function to call helper routines (or add new helpers) that run the required
commands via the existing cluster exec/ssh mechanism (referencing s.namespace,
s.oc, s.deployed and the Cleanup method) and ensure each step logs failures but
continues so cleanup is best-effort and safe to re-run.

217-220: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid downloading the stargz binary from public GitHub during test setup.

This makes the suite depend on external internet reachability and GitHub availability instead of just cluster behavior. A prepackaged or cluster-local artifact source would make the test much more stable in CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 217 - 220, Replace the
external GitHub curl download (the curl -L -o /tmp/stargz.tar.gz and the
hardcoded https://github.com/.../stargz-snapshotter-%s-linux-amd64.tar.gz URL
and the echo "Downloading stargz-snapshotter %s...") with a
cluster-local/prepackaged artifact lookup: read the stargz-snapshotter tarball
from the test artifact store or a local registry (e.g., a helper like
getLocalArtifact("stargz-snapshotter", version) or the test fixture artifact
directory) and write it to /tmp/stargz.tar.gz, failing fast if the artifact is
missing; ensure any code that expects the filename
stargz-snapshotter-%s-linux-amd64.tar.gz still works with the local artifact
source.

48-75: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Partial deploy failures can still leak cluster state.

deployed flips only after all verification succeeds, but node mutation starts much earlier. If namespace/configmap/daemonset creation succeeds and a later step fails, callers using IsDeployed() will skip cleanup entirely.

Also applies to: 79-83

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 48 - 75, The current
flow only sets s.deployed = true after all verification succeeds, so partial
failures (namespace/configmap/daemonset created but verification failed) leave
resources in cluster because IsDeployed() returns false; change the logic to
mark the deployment as present as soon as cluster-affecting resources are
created: set s.deployed = true immediately after successful createDaemonSet (or
right after createConfigMap/createDaemonSet sequence) and before calling
waitForDaemonSetReady/verifyStargzStoreRunning, and mirror the same change in
the similar block around lines 79-83 so callers invoking IsDeployed() can
trigger cleanup when verification fails.
test/extended/node/additional_artifact_stores.go (1)

79-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still do not fail when invalid configs are accepted.

Several should reject ... cases only log success on API rejection and otherwise continue, and TC2 still does not assert the MCO failure path. A validation regression can still produce a green run here.

Also applies to: 118-140, 168-177, 246-255, 283-291, 317-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 89, The
test currently only logs when an invalid ContainerRuntimeConfig is accepted
(using mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
cleanupContainerRuntimeConfig) instead of failing; update each negative case
(including the blocks around the ranges noted) to assert failure when the API
accepts an invalid config by calling t.Fatalf or framework.Failf (so the test
fails immediately) and, for cases that expect MCO-level rejection (e.g., TC2),
add the additional wait/assert logic to verify the MCO validation rejects the
config (polling the MCO status or relevant resource and asserting the expected
failure) rather than just logging “checking MCO validation”; ensure each branch
that currently only logs on API acceptance is replaced with an explicit test
failure or a follow-up MCO validation assertion referencing tc.path, the Create
call, and cleanupContainerRuntimeConfig to locate the affected code.
test/extended/node/additional_image_stores.go (1)

80-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still go green when invalid configs are accepted.

Several should reject ... cases only log acceptance, and TC2 also ignores the result of the MCO polling path. If API validation regresses, this suite can still pass without ever asserting ContainerRuntimeConfigFailure=True.

Also applies to: 119-140, 169-177, 247-255, 284-291, 318-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 80 - 89, The test
currently only logs when invalid ContainerRuntimeConfig paths are accepted by
the API (using
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and never
fails the test if MCO validation doesn't set ContainerRuntimeConfigFailure;
update the test logic so that for each tc.path: if Create returns an error,
assert/log that as expected, but if Create succeeds (and you defer
cleanupContainerRuntimeConfig), actively poll the MCO status for the created
config (check for the ContainerRuntimeConfigFailure condition on the
corresponding object) within a timeout and fail the test if the failure
condition is not observed; apply this change to the other similar blocks
referenced (the ranges around lines 119-140, 169-177, 247-255, 284-291, 318-325)
so all “should reject” cases assert MCO-level rejection rather than only logging
acceptance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 567-570: After calling getPureWorkerNodes(workerNodes) check
whether pureWorkers is empty and skip the spec if so; specifically, add a guard
like if len(pureWorkers) == 0 { o.Skip("no pure worker nodes available; skipping
spec") } immediately after the pureWorkers := getPureWorkerNodes(workerNodes)
line in the block that uses getNodesByLabel and getPureWorkerNodes (this
prevents the node-verification loops from being vacuously successful). Apply the
same guard to the other occurrence where getPureWorkerNodes is used (the block
around the other call at the later occurrence).
- Around line 540-545: Replace the fragile "cat" check with a command that
succeeds but only prints when the file exists so we can assert absence
specifically: use ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "if [ -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores ]; then echo present;
fi") and then assert the command returned no error and produced an empty stdout
(meaning the file is absent); keep framework.Logf("Node %s: CRI-O config file
removed successfully", node.Name) and update the expectation to check output ==
"" rather than just expecting ExecOnNodeWithChroot to fail.

In `@test/extended/node/additional_image_stores.go`:
- Around line 718-720: The tests call getNodesByLabel(...) and derive
pureWorkers via getPureWorkerNodes(workerNodes) but don't skip when pureWorkers
is empty, causing SNO/compact clusters to exercise no-op loops and falsely pass;
after computing pureWorkers (the variable), add a guard like if len(pureWorkers)
== 0 { o.Skip("no pure worker nodes for this cluster topology") } where the
first instance occurs (around the getNodesByLabel/getPureWorkerNodes usage) and
apply the same guard at the other similar location that computes pureWorkers
later in the file so these specs are skipped when no pure workers exist.
- Around line 631-644: The hard 2x timing assertion using podStartupTime and
pod2StartupTime is too brittle; change the Expect that currently uses
o.BeNumerically("<", pod2StartupTime/2) to a relaxed check (e.g.,
o.BeNumerically("<", pod2StartupTime)) or remove the numeric timing assertion
entirely and rely on the existing event/source checks for additionalImageStores;
also update the failure message near the Expect and the Logf that prints
"Speedup" to avoid implying a strict 2x requirement.
- Around line 550-576: The test reads Events once right after
waitForPodRunning(...) which can miss the asynchronous "Pulled" event; change
the logic around oc.AdminKubeClient().CoreV1().Events(...).List(...) and the
foundAlreadyPresentEvent check to poll until a timeout (e.g., using
framework/poll or wait.PollImmediate) and retry listing events and scanning for
event.Reason == "Pulled" and the "already present on machine and can be accessed
by the pod" substring, failing only after the timeout; apply the same polling
fix to the other similar block that checks events (the block around the second
occurrence noted in the comment, currently at lines 607-629) so both checks
reliably wait for the Pulled event before asserting using
additionalImageStorePath and testPod.
- Around line 583-599: After calling Delete on testPod, wait for the pod to be
fully removed (or to reach a terminal state) before running the podman rmi and
creating the fallback pod: poll with wait.PollImmediate
(k8s.io/apimachinery/pkg/util/wait) against
oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions{}) and loop until the Get returns a NotFound error or the
pod.Status.Phase is Succeeded/Failed (or deletionTimestamp is set and pod is
gone); only then run ExecOnNodeWithChroot (the removeCmd) and call createTestPod
for testPod2. Ensure the wait has a reasonable timeout and fail the test if the
pod does not terminate in time.

In `@test/extended/node/node_utils.go`:
- Around line 950-959: waitForDaemonSetReady currently always uses
oc.Namespace() when calling
oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(...), which can
poll the wrong DaemonSet when createTestDaemonSet created resources in a
different namespace; change the helper to accept an explicit namespace parameter
(e.g., add ns string to waitForDaemonSetReady signature) and use
oc.AdminKubeClient().AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
instead, and apply the same namespace-parameter fix to the other occurrence
referenced (lines 1125-1158) so callers pass the intended namespace rather than
relying on oc.Namespace().
- Around line 1085-1092: In waitForPodDeleted, non-NotFound GET errors are being
swallowed (returning false, nil) which masks RBAC/transport/apiserver failures;
update the closure in waitForPodDeleted to return the actual error (return
false, err) for any err that is not apierrors.IsNotFound(err) so polling will
surface and abort on real API errors while still returning (true, nil) when
apierrors.IsNotFound(err).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5c9d8fe-95ec-4b74-8c24-f0b1dcb4c3b2

📥 Commits

Reviewing files that changed from the base of the PR and between f09da80 and af7b257.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/testsuites/standard_suites.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go

Comment on lines +540 to +545
g.By("PHASE 6: Verifying CRI-O config file is removed after CRC deletion")
for _, node := range pureWorkers {
_, err := ExecOnNodeWithChroot(oc, node.Name, "cat", "/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores")
o.Expect(err).To(o.HaveOccurred(),
"CRI-O config file should be removed after ContainerRuntimeConfig deletion on node %s", node.Name)
framework.Logf("Node %s: CRI-O config file removed successfully", node.Name)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert that the file is absent, not just that cat failed.

Any ExecOnNodeWithChroot(...) failure satisfies this expectation today, so a broken debug session or transient node error looks identical to successful cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" check with a command that succeeds but only prints
when the file exists so we can assert absence specifically: use
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "if [ -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores ]; then echo present;
fi") and then assert the command returned no error and produced an empty stdout
(meaning the file is absent); keep framework.Logf("Node %s: CRI-O config file
removed successfully", node.Name) and update the expectation to check output ==
"" rather than just expecting ExecOnNodeWithChroot to fail.

Comment on lines +567 to +570
workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
o.Expect(err).NotTo(o.HaveOccurred())
pureWorkers := getPureWorkerNodes(workerNodes)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

These cases don't have the SNO/compact guard used in TC10. If getPureWorkerNodes(...) returns empty, the node-verification loops never run and the specs can pass without checking a worker.

Suggested fix
 		workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
 		o.Expect(err).NotTo(o.HaveOccurred())
 		pureWorkers := getPureWorkerNodes(workerNodes)
+		if len(pureWorkers) < 1 {
+			e2eskipper.Skipf("Need at least 1 worker node for this test")
+		}

Also applies to: 656-658

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 567 - 570,
After calling getPureWorkerNodes(workerNodes) check whether pureWorkers is empty
and skip the spec if so; specifically, add a guard like if len(pureWorkers) == 0
{ o.Skip("no pure worker nodes available; skipping spec") } immediately after
the pureWorkers := getPureWorkerNodes(workerNodes) line in the block that uses
getNodesByLabel and getPureWorkerNodes (this prevents the node-verification
loops from being vacuously successful). Apply the same guard to the other
occurrence where getPureWorkerNodes is used (the block around the other call at
the later occurrence).

Comment on lines +550 to +576
g.By("Verifying pod events and image pull behavior")
events, err := oc.AdminKubeClient().CoreV1().Events(oc.Namespace()).List(ctx, metav1.ListOptions{
FieldSelector: fmt.Sprintf("involvedObject.name=%s", testPod.Name),
})
o.Expect(err).NotTo(o.HaveOccurred())

var foundAlreadyPresentEvent bool
for _, event := range events.Items {
if event.Reason == "Pulled" {
framework.Logf("Image pulled event: %s", event.Message)
// Check if the event indicates image was already present on machine
// This is the expected message when image is loaded from additionalImageStores
if strings.Contains(event.Message, "already present on machine and can be accessed by the pod") {
foundAlreadyPresentEvent = true
framework.Logf("SUCCESS: Image was loaded from additionalImageStore - event message: %s", event.Message)
break
}
}
}

// Validation: Verify image was loaded from additional storage
// Expected event message: "Container image ... already present on machine and can be accessed by the pod"
o.Expect(foundAlreadyPresentEvent).To(o.BeTrue(),
"Image should have been loaded from additionalImageStore (%s). "+
"Expected event message containing 'already present on machine and can be accessed by the pod' but did not find it. "+
"This indicates the image was not pre-populated correctly or not loaded from additional storage.", additionalImageStorePath)
framework.Logf("Verified: Image was loaded from additional storage at %s", additionalImageStorePath)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll for the pull event instead of reading events once.

Event recording is asynchronous. Listing events immediately after waitForPodRunning(...) can miss the Pulled message and make both the additional-store and fallback assertions flaky.

Also applies to: 607-629

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 550 - 576, The
test reads Events once right after waitForPodRunning(...) which can miss the
asynchronous "Pulled" event; change the logic around
oc.AdminKubeClient().CoreV1().Events(...).List(...) and the
foundAlreadyPresentEvent check to poll until a timeout (e.g., using
framework/poll or wait.PollImmediate) and retry listing events and scanning for
event.Reason == "Pulled" and the "already present on machine and can be accessed
by the pod" substring, failing only after the timeout; apply the same polling
fix to the other similar block that checks events (the block around the second
occurrence noted in the comment, currently at lines 607-629) so both checks
reliably wait for the Pulled event before asserting using
additionalImageStorePath and testPod.

Comment on lines +583 to +599
g.By("Deleting first pod")
err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("First pod deleted")

g.By("Removing image from additional storage to test fallback")
removeCmd := fmt.Sprintf("podman --root %s rmi %s", additionalImageStorePath, testImageDefault)
removeOutput, err := ExecOnNodeWithChroot(oc, testNode, "sh", "-c", removeCmd)
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("Image removed from additional storage: %s", removeOutput)

g.By("Creating second pod to test fallback to registry")
testPod2 := createTestPod("imagestore-fallback-pod", testImageDefault, testNode)
startTime2 := time.Now()
_, err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Create(ctx, testPod2, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
defer oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod2.Name, metav1.DeleteOptions{})

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the first pod to terminate before removing the image and creating the fallback pod.

Delete is asynchronous. Right now Phase 5 can race a terminating first pod that is still using the image, which makes the podman rmi step and the fallback behavior nondeterministic.

Suggested fix
 		g.By("Deleting first pod")
 		err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
 		o.Expect(err).NotTo(o.HaveOccurred())
+		err = waitForPodDeleted(ctx, oc, testPod.Name, 2*time.Minute)
+		o.Expect(err).NotTo(o.HaveOccurred())
 		framework.Logf("First pod deleted")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 583 - 599, After
calling Delete on testPod, wait for the pod to be fully removed (or to reach a
terminal state) before running the podman rmi and creating the fallback pod:
poll with wait.PollImmediate (k8s.io/apimachinery/pkg/util/wait) against
oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions{}) and loop until the Get returns a NotFound error or the
pod.Status.Phase is Succeeded/Failed (or deletionTimestamp is set and pod is
gone); only then run ExecOnNodeWithChroot (the removeCmd) and call createTestPod
for testPod2. Ensure the wait has a reasonable timeout and fail the test if the
pod does not terminate in time.

Comment on lines +631 to +644
g.By("Verifying performance improvement with additionalImageStores")
framework.Logf("Performance comparison:")
framework.Logf(" Pod 1 (prepopulated from additionalImageStore): %v", podStartupTime)
framework.Logf(" Pod 2 (pulled from registry): %v", pod2StartupTime)
speedup := float64(pod2StartupTime) / float64(podStartupTime)
framework.Logf(" Speedup: %.2fx faster with additionalImageStores", speedup)

// Verify that prepopulated image is significantly faster than registry pull
// For a 6GB image, prepopulated should be at least 2x faster
o.Expect(podStartupTime).To(o.BeNumerically("<", pod2StartupTime/2),
"Pod using prepopulated image from additionalImageStore should be significantly faster. "+
"Expected pod1 (%v) to be at least 2x faster than pod2 (%v) pulling from registry.",
podStartupTime, pod2StartupTime)
framework.Logf("Performance improvement verified: Prepopulated image is %.2fx faster", speedup)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The hard 2x timing assertion is likely to be noisy in CI.

The event checks already validate the source of the image. Requiring a strict wall-clock speedup couples the test to registry performance and cluster load, which can fail even when the feature works.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 631 - 644, The
hard 2x timing assertion using podStartupTime and pod2StartupTime is too
brittle; change the Expect that currently uses o.BeNumerically("<",
pod2StartupTime/2) to a relaxed check (e.g., o.BeNumerically("<",
pod2StartupTime)) or remove the numeric timing assertion entirely and rely on
the existing event/source checks for additionalImageStores; also update the
failure message near the Expect and the Logf that prints "Speedup" to avoid
implying a strict 2x requirement.

Comment thread test/extended/node/additional_image_stores.go
Comment on lines +950 to +959
func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, name string, timeout time.Duration) error {
return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(ctx, name, metav1.GetOptions{})
if err != nil {
return false, nil
}
return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled &&
ds.Status.DesiredNumberScheduled > 0, nil
})
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the DaemonSet namespace when polling readiness.

createTestDaemonSet(...) lets callers choose a namespace, but waitForDaemonSetReady(...) always reads from oc.Namespace(). If those differ, this helper waits on the wrong DaemonSet and can fail or pass for unrelated resources.

Suggested fix
-func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, name string, timeout time.Duration) error {
+func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, namespace, name string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(ctx, name, metav1.GetOptions{})
+		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
 		if err != nil {
 			return false, nil
 		}

Also applies to: 1125-1158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/node_utils.go` around lines 950 - 959,
waitForDaemonSetReady currently always uses oc.Namespace() when calling
oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(...), which can
poll the wrong DaemonSet when createTestDaemonSet created resources in a
different namespace; change the helper to accept an explicit namespace parameter
(e.g., add ns string to waitForDaemonSetReady signature) and use
oc.AdminKubeClient().AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
instead, and apply the same namespace-parameter fix to the other occurrence
referenced (lines 1125-1158) so callers pass the intended namespace rather than
relying on oc.Namespace().

Comment on lines +1085 to +1092
func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil
		}
		return false, nil
	})

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't turn pod GET failures into silent timeouts.

Any non-NotFound error is currently returned as false, nil, so RBAC, transport, or apiserver failures get masked until the poll times out.

Suggested fix
 func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
 		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
 		if apierrors.IsNotFound(err) {
 			return true, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return false, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/node_utils.go` around lines 1085 - 1092, In
waitForPodDeleted, non-NotFound GET errors are being swallowed (returning false,
nil) which masks RBAC/transport/apiserver failures; update the closure in
waitForPodDeleted to return the actual error (return false, err) for any err
that is not apierrors.IsNotFound(err) so polling will surface and abort on real
API errors while still returning (true, nil) when apierrors.IsNotFound(err).

@openshift-ci
Contributor

openshift-ci Bot commented Apr 30, 2026

@BhargaviGudi: all tests passed!


This adds API validation and E2E tests for the Additional Storage
Configurations feature (TechPreviewNoUpgrade):
- additionalArtifactStores (max 10 stores)
- additionalImageStores (max 10 stores)
- additionalLayerStores (max 5 stores, supports :ref suffix)
- Combined storage configurations
Test suite: openshift/additional-storage-support

Signed-off-by: Bhargavi Gudi <BhargaviGudi@users.noreply.github.com>

Add comprehensive test cases for Additional Storage Support feature

Signed-off-by: Bhargavi Gudi <BhargaviGudi@users.noreply.github.com>

Updated testcases
@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from af7b257 to 990e5f3 Compare May 5, 2026 08:43
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BhargaviGudi
Once this PR has been reviewed and has the lgtm label, please assign sosiouxme for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (8)
test/extended/node/node_utils.go (2)

1085-1092: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't swallow non-NotFound pod GET errors.

Line 1091 returns false, nil for every error besides IsNotFound, so real API failures only show up as a timeout instead of failing the poll immediately.

Suggested fix
 func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
 		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
 		if apierrors.IsNotFound(err) {
 			return true, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return false, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 1085 - 1092, The poll in
function waitForPodDeleted is swallowing non-NotFound GET errors; modify the
closure used by wait.PollUntilContextTimeout so that after calling
oc.AdminKubeClient().CoreV1().Pods(...).Get(...) you check
apierrors.IsNotFound(err) -> return true,nil, else if err != nil -> return
false, err (so the poll fails fast with the actual API error), and only return
false,nil when err == nil and the pod still exists; update the closure around
the Get call in waitForPodDeleted accordingly.

950-958: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll the intended DaemonSet and surface real GET failures.

Line 952 hardcodes oc.Namespace() even though createTestDaemonSet(...) lets callers choose a namespace, so this helper can wait on the wrong DaemonSet. Lines 953-954 also turn every API error into a silent timeout, which makes RBAC/apiserver failures look like readiness flakes.

Suggested fix
-func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, name string, timeout time.Duration) error {
+func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, namespace, name string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(ctx, name, metav1.GetOptions{})
-		if err != nil {
+		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
+		if apierrors.IsNotFound(err) {
 			return false, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled &&
 			ds.Status.DesiredNumberScheduled > 0, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 950 - 958, The helper
waitForDaemonSetReady currently always polls oc.Namespace() and silences GET
errors; change its signature to accept an explicit namespace (e.g.
waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, namespace string,
name string, timeout time.Duration)) then call
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) so it polls the
intended DaemonSet created by createTestDaemonSet, and do not swallow API
errors—return the actual err from the GET (return false, err) instead of
converting all errors into silent timeouts; update all call sites to pass the
correct namespace.
test/extended/node/additional_layer_stores.go (3)

460-462: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid a personal Quay image here.

Line 461 hardcodes quay.io/bgudi/test-small:estargz, which makes this suite depend on a non-repository-owned external image. If that tag is removed, permissions change, or mirroring rules differ, CI breaks for reasons unrelated to the code under test. Prefer a repository-managed image routed through image.LocationFor(...) and registered in the allowed image list.

#!/bin/bash
set -euo pipefail

rg -n 'quay\.io/bgudi/test-small:estargz' test/extended/node
fd -i 'image.go' test/extended/util | xargs rg -n 'allowedImages|bgudi/test-small|estargz'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 460 - 462,
Replace the hardcoded external image value assigned to eStargzImage
("quay.io/bgudi/test-small:estargz") with a repository-managed image lookup
using image.LocationFor(...) (or another test helper that returns an allowed
image), and ensure the chosen image is registered in the allowed image list used
by the test suite; update the assignment where eStargzImage is defined and any
test harness references so they use the LocationFor result rather than the
personal quay URL.

207-213: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail TC5 if the invalid path is accepted.

Lines 207-213 only log a warning on a successful create, so this spec still passes if invalid-character validation regresses.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 207 - 213, The
test currently only logs a warning when an invalid path is accepted, so add a
failing assertion instead: after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
succeeds, call the test-failure helper (e.g. framework.Failf or t.Fatalf
depending on test harness) with a clear message including tc.char and
ctrcfg.Name rather than just logging; keep the existing
cleanupContainerRuntimeConfig defer and replace the framework.Logf warning
branch with a failure call to ensure the spec fails when invalid paths are
accepted.

744-748: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the worker MCP to leave the old steady state before waitForMCP(...).

These paths jump straight from waitForContainerRuntimeConfigSuccess(...) to waitForMCP(...). If the pool is still Updated when polling starts, waitForMCP(...) can return on the previous rendered config and race the later storage.conf assertions.

Also applies to: 770-774, 840-844

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 744 - 748, After
waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, ...), ensure
the worker MachineConfigPool leaves its previous steady state before calling
waitForMCP(...): insert a short wait that polls the "worker" MCP until its
status no longer reports the prior steady-state condition (e.g., not
"Updated"/not matching the old rendered configuration) — you can reuse the
existing waitForMCP helper with a flag/variant or add a small helper (e.g.,
waitForMCPToLeaveSteadyState) to poll the "worker" pool and return only when it
has transitioned away from the old steady state, then proceed to the existing
waitForMCP(ctx, mcClient, "worker", 25*time.Minute); apply the same insertion at
the other occurrences noted (around the blocks at the other two locations).
test/extended/node/additional_artifact_stores.go (3)

567-575: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

After getPureWorkerNodes(...), both specs continue with an empty slice. On SNO/compact clusters that makes the node-verification loops vacuously succeed and the test can pass without checking a worker at all.

Also applies to: 656-664

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 567 - 575, The
test continues even when getPureWorkerNodes(...) returns an empty slice, causing
vacuous success on SNO/compact clusters; add an early skip when pureWorkers is
empty by checking len(pureWorkers)==0 and invoking the test skip helper (e.g.,
o.Skip or t.Skip) with a clear message before calling createDirectoriesOnNodes;
apply the same guard at the other occurrence around the block referencing
createDirectoriesOnNodes/cleanupDirectoriesOnNodes (the second instance around
lines 656-664) so both specs bail out when no pure workers exist.

79-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make these negative-path specs fail when invalid configs are accepted.

Lines 79-85 and the similar blocks only log success when the API accepts an invalid ContainerRuntimeConfig, so the spec still goes green on a validation regression. TC2 has the same problem after the poll: if ObservedGeneration == Generation without ever reaching ContainerRuntimeConfigFailure=True, the test still passes.

Also applies to: 118-140, 168-174, 246-253, 283-290, 317-324

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 85, The
test currently treats API acceptance of invalid ContainerRuntimeConfig as
success; change the Create() success branch (the block around
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and the defer
cleanupContainerRuntimeConfig call) to explicitly fail the test when an invalid
config is accepted (use the test framework's failure/assert method instead of
only logging), and likewise update the poll/validation checks that compare
ObservedGeneration to Generation to fail if ObservedGeneration==Generation
occurs without ever observing the ContainerRuntimeConfigFailure=True condition
(i.e., after your existing poll of the resource state, assert failure when no
failure condition was ever set). Target the create/accept branches and the poll
that checks ObservedGeneration/Generation and ContainerRuntimeConfigFailure to
make negative-path specs fail on invalid-config acceptance.

540-545: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert file absence explicitly instead of expecting cat to fail.

Any ExecOnNodeWithChroot(...) failure satisfies this check today, so a broken debug session is indistinguishable from successful cleanup. Check for existence with a shell test and assert empty output instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" failure check with an explicit existence test: call
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "test -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores && echo exists") for
each node in pureWorkers, capture the stdout, and assert that the output is
empty (e.g. o.Expect(out).To(o.BeEmpty(), "CRI-O config file should be removed
after ContainerRuntimeConfig deletion on node %s", node.Name)); leave the
framework.Logf success message unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_layer_stores.go`:
- Around line 507-514: After successfully creating the ContainerRuntimeConfig
(the call to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{})), immediately schedule its cleanup by calling defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) right after the
o.Expect(err).NotTo(o.HaveOccurred()) that follows the create; this ensures the
ctrcfg (variable ctrcfg) is removed on test exit even if subsequent assertions
(including waitForContainerRuntimeConfigSuccess) fail.
- Around line 712-719: The test currently proceeds even when
getPureWorkerNodes(...) returns an empty slice, letting directory
setup/verification be no-ops and falsely passing; fix by checking
len(pureWorkers) immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) and, if zero, skip the spec (or return) with a
clear message so the createDirectoriesOnNodes and subsequent verification loops
are not run; apply the same guard in both places where pureWorkers is used (the
block around createDirectoriesOnNodes and the later verification loop) and
reference getPureWorkerNodes, pureWorkers, and createDirectoriesOnNodes when
adding the check.

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 567-575: The test continues even when getPureWorkerNodes(...)
returns an empty slice, causing vacuous success on SNO/compact clusters; add an
early skip when pureWorkers is empty by checking len(pureWorkers)==0 and
invoking the test skip helper (e.g., o.Skip or t.Skip) with a clear message
before calling createDirectoriesOnNodes; apply the same guard at the other
occurrence around the block referencing
createDirectoriesOnNodes/cleanupDirectoriesOnNodes (the second instance around
lines 656-664) so both specs bail out when no pure workers exist.
- Around line 79-85: The test currently treats API acceptance of invalid
ContainerRuntimeConfig as success; change the Create() success branch (the block
around mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
the defer cleanupContainerRuntimeConfig call) to explicitly fail the test when
an invalid config is accepted (use the test framework's failure/assert method
instead of only logging), and likewise update the poll/validation checks that
compare ObservedGeneration to Generation to fail if
ObservedGeneration==Generation occurs without ever observing the
ContainerRuntimeConfigFailure=True condition (i.e., after your existing poll of
the resource state, assert failure when no failure condition was ever set).
Target the create/accept branches and the poll that checks
ObservedGeneration/Generation and ContainerRuntimeConfigFailure to make
negative-path specs fail on invalid-config acceptance.
- Around line 540-545: Replace the fragile "cat" failure check with an explicit
existence test: call ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "test -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores && echo exists") for
each node in pureWorkers, capture the stdout, and assert that the output is
empty (e.g. o.Expect(out).To(o.BeEmpty(), "CRI-O config file should be removed
after ContainerRuntimeConfig deletion on node %s", node.Name)); leave the
framework.Logf success message unchanged.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 460-462: Replace the hardcoded external image value assigned to
eStargzImage ("quay.io/bgudi/test-small:estargz") with a repository-managed
image lookup using image.LocationFor(...) (or another test helper that returns
an allowed image), and ensure the chosen image is registered in the allowed
image list used by the test suite; update the assignment where eStargzImage is
defined and any test harness references so they use the LocationFor result
rather than the personal quay URL.
- Around line 207-213: The test currently only logs a warning when an invalid
path is accepted, so add a failing assertion instead: after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
succeeds, call the test-failure helper (e.g. framework.Failf or t.Fatalf
depending on test harness) with a clear message including tc.char and
ctrcfg.Name rather than just logging; keep the existing
cleanupContainerRuntimeConfig defer and replace the framework.Logf warning
branch with a failure call to ensure the spec fails when invalid paths are
accepted.
- Around line 744-748: After waitForContainerRuntimeConfigSuccess(ctx, mcClient,
ctrcfg.Name, ...), ensure the worker MachineConfigPool leaves its previous
steady state before calling waitForMCP(...): insert a short wait that polls the
"worker" MCP until its status no longer reports the prior steady-state condition
(e.g., not "Updated"/not matching the old rendered configuration) — you can
reuse the existing waitForMCP helper with a flag/variant or add a small helper
(e.g., waitForMCPToLeaveSteadyState) to poll the "worker" pool and return only
when it has transitioned away from the old steady state, then proceed to the
existing waitForMCP(ctx, mcClient, "worker", 25*time.Minute); apply the same
insertion at the other occurrences noted (around the blocks at the other two
locations).

In `@test/extended/node/node_utils.go`:
- Around line 1085-1092: The poll in function waitForPodDeleted is swallowing
non-NotFound GET errors; modify the closure used by wait.PollUntilContextTimeout
so that after calling oc.AdminKubeClient().CoreV1().Pods(...).Get(...) you check
apierrors.IsNotFound(err) -> return true,nil, else if err != nil -> return
false, err (so the poll fails fast with the actual API error), and only return
false,nil when err == nil and the pod still exists; update the closure around
the Get call in waitForPodDeleted accordingly.
- Around line 950-958: The helper waitForDaemonSetReady currently always polls
oc.Namespace() and silences GET errors; change its signature to accept an
explicit namespace (e.g. waitForDaemonSetReady(ctx context.Context, oc
*exutil.CLI, namespace string, name string, timeout time.Duration)) then call
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) so it polls the
intended DaemonSet created by createTestDaemonSet, and do not swallow API
errors—return the actual err from the GET (return false, err) instead of
converting all errors into silent timeouts; update all call sites to pass the
correct namespace.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8bbb8226-1e43-42a4-8e65-d214abf75a43

📥 Commits

Reviewing files that changed from the base of the PR and between af7b257 and 990e5f3.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (2)
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/stargz_store_setup.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/node/additional_image_stores.go

Comment on lines +507 to +514
_, err = mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("ContainerRuntimeConfig %s created with path: %s", ctrcfg.Name, stargzSetup.GetStorePath())

g.By("Waiting for ContainerRuntimeConfig to be processed by MCO")
err = waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, 5*time.Minute)
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("ContainerRuntimeConfig processed by MCO")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Defer CRC cleanup immediately after create.

This spec deletes the ContainerRuntimeConfig only in Phase 11. If any assertion fails earlier, the worker pool stays mutated and can poison the rest of the suite. Add defer cleanupContainerRuntimeConfig(...) right after the successful create.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 507 - 514, After
successfully creating the ContainerRuntimeConfig (the call to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{})), immediately schedule its cleanup by calling defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) right after the
o.Expect(err).NotTo(o.HaveOccurred()) that follows the create; this ensures the
ctrcfg (variable ctrcfg) is removed on test exit even if subsequent assertions
(including waitForContainerRuntimeConfigSuccess) fail.

Comment on lines +712 to +719
workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
o.Expect(err).NotTo(o.HaveOccurred())
pureWorkers := getPureWorkerNodes(workerNodes)

g.By("Creating shared layer directories on worker nodes")
layerDirs := []string{"/var/lib/layerstore-1", "/var/lib/layerstore-2", "/var/lib/layerstore-3"}
err = createDirectoriesOnNodes(oc, pureWorkers, layerDirs)
o.Expect(err).NotTo(o.HaveOccurred())

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip when getPureWorkerNodes(...) returns none.

TC12 and TC13 do not guard the pureWorkers slice. On SNO/compact clusters, the directory setup and verification loops become no-ops, so the specs can pass without validating a worker node.

Also applies to: 798-812

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 712 - 719, The
test currently proceeds even when getPureWorkerNodes(...) returns an empty
slice, letting directory setup/verification be no-ops and falsely passing; fix
by checking len(pureWorkers) immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) and, if zero, skip the spec (or return) with a
clear message so the createDirectoriesOnNodes and subsequent verification loops
are not run; apply the same guard in both places where pureWorkers is used (the
block around createDirectoriesOnNodes and the later verification loop) and
reference getPureWorkerNodes, pureWorkers, and createDirectoriesOnNodes when
adding the check.


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
