ROSAENG-59268: Add ROSA cluster standby pool for operator e2e testing by dustman9000 · Pull Request #80232 · openshift/release

dustman9000 · 2026-06-08T15:35:17Z

Summary

Add step registry refs for pool-based cluster checkout/checkin, replacing 30+ minute ephemeral cluster provisioning with ~1 minute checkout from pre-provisioned ROSA Classic STS clusters
New rosa-pool-checkout ref uses ConfigMap-based lease tracking with optimistic concurrency (CAS via oc replace) for safe concurrent claims
New rosa-pool-checkin ref returns clusters to the pool (best-effort, health check recovers stale leases)
New rosa-pool-health ref for periodic cluster health checks and stale lease recovery
New rosa-pool-e2e-workflow workflow: checkout -> operator-install -> e2e -> cleanup -> checkin
Provision helper script for initial cluster creation and ConfigMap registration

This is the step registry foundation for cluster standby pools. Infrastructure setup (namespace + RBAC on app.ci, pool cluster provisioning) tracked separately.

Jira: https://redhat.atlassian.net/browse/ROSAENG-59268

Test plan

YAML syntax validated locally (all files pass yaml.safe_load)
Shell syntax validated locally (all scripts pass bash -n)
make registry-metadata succeeds (metadata files auto-generated)
pj-rehearse validates step registry changes
End-to-end validation after infrastructure setup (tracked in Jira)

A detailed high-level summary could not be generated for this review. Here is an overview derived from the analyzed file changes:

ci-operator/step-registry/rosa/pool/OWNERS: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkin/OWNERS: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-commands.sh: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.metadata.json: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.yaml: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkout/OWNERS: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.metadata.json: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.yaml: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/e2e-workflow/OWNERS: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.metadata.json: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.yaml: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/health/OWNERS: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.metadata.json: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.yaml: ## AI-generated summary of changes
ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh: ## AI-generated summary of changes

Add step registry refs for pool-based cluster checkout/checkin, replacing 30+ minute ephemeral cluster provisioning with ~1 minute checkout from pre-provisioned clusters. New components: - rosa-pool-checkout: CAS-based ConfigMap claim with wait loop - rosa-pool-checkin: best-effort cluster return - rosa-pool-health: periodic health check and stale lease recovery - rosa-pool-e2e-workflow: pool variant of operator e2e workflow - rosa-pool-provision.sh: helper for initial cluster provisioning Jira: https://redhat.atlassian.net/browse/ROSAENG-59268

openshift-ci-robot · 2026-06-08T15:35:21Z

@dustman9000: This pull request references ROSAENG-59268 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add step registry refs for pool-based cluster checkout/checkin, replacing 30+ minute ephemeral cluster provisioning with ~1 minute checkout from pre-provisioned ROSA Classic STS clusters

New rosa-pool-checkout ref uses ConfigMap-based lease tracking with optimistic concurrency (CAS via oc replace) for safe concurrent claims

New rosa-pool-checkin ref returns clusters to the pool (best-effort, health check recovers stale leases)

New rosa-pool-health ref for periodic cluster health checks and stale lease recovery

New rosa-pool-e2e-workflow workflow: checkout -> operator-install -> e2e -> cleanup -> checkin

Provision helper script for initial cluster creation and ConfigMap registration

This is the step registry foundation for cluster standby pools. Infrastructure setup (namespace + RBAC on app.ci, pool cluster provisioning) tracked separately.

Jira: https://redhat.atlassian.net/browse/ROSAENG-59268

Test plan

YAML syntax validated locally (all files pass yaml.safe_load)

Shell syntax validated locally (all scripts pass bash -n)

make registry-metadata succeeds (metadata files auto-generated)

pj-rehearse validates step registry changes

End-to-end validation after infrastructure setup (tracked in Jira)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-08T15:35:42Z

Walkthrough

This PR adds CI step registry infrastructure for ROSA pool management comprising five components: pool checkout (atomic ConfigMap-based lease claiming with OCM/backplane auth), pool checkin (returning clusters to available state), pool health monitoring (OCM status, stale lease recovery, backplane connectivity), pool provisioning (cluster creation and registration), and E2E workflow composition orchestrating the full lifecycle.

Changes

ROSA Pool Lifecycle Operations

Layer / File(s)	Summary
Pool claim and checkout `ci-operator/step-registry/rosa/pool/checkout/OWNERS`, `ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh`, `ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.{metadata.json,yaml}`	Checkout implements atomic ConfigMap-based lease claiming with timeout/retry, OCM authentication, backplane kubeconfig retrieval, and cluster access verification for pool-allocated ROSA clusters.
Pool return and checkin `ci-operator/step-registry/rosa/pool/checkin/OWNERS`, `ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-commands.sh`, `ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.{metadata.json,yaml}`	Checkin returns checked-out clusters to the pool by patching ConfigMap status to available, clearing holder/build-id annotations, and recording released-at timestamp.
Pool health monitoring and recovery `ci-operator/step-registry/rosa/pool/health/OWNERS`, `ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh`, `ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.{metadata.json,yaml}`	Health monitoring queries OCM readiness status, checks backplane connectivity and node readiness, detects and force-recovers stale leases via ConfigMap patching, manages error state transitions, and writes summary report.
Pool provisioning and registration `ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh`	Provision script creates new ROSA clusters via rosa-cli (when not register-only), queries OCM for cluster metadata, computes version labels, and registers clusters into the pool namespace by applying ConfigMap manifests with status fields.
E2E workflow composition and root OWNERS `ci-operator/step-registry/rosa/pool/OWNERS`, `ci-operator/step-registry/rosa/pool/e2e-workflow/OWNERS`, `ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.{metadata.json,yaml}`	E2E workflow composes checkout/install/test/cleanup/checkin steps into a parameterized test workflow with default pool/OCM configuration and environment variable documentation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm, rehearsals-ack

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
No-Sensitive-Data-In-Logs	❌ Error	Scripts expose API tokens and secrets as command-line arguments to `ocm login`, visible in process logs and CI output.	Pass credentials via OCM_TOKEN environment variable or stdin instead of --token/--client-secret command-line arguments.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (13 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding a ROSA cluster standby pool for operator e2e testing, which is the central objective of the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR contains no Ginkgo test files or test code; check for stable test names is not applicable to infrastructure configuration and bash scripts only.
Test Structure And Quality	✅ Passed	This PR contains no Ginkgo test code to review. It adds infrastructure automation (OWNERS files, YAML configurations, and bash scripts) for ROSA cluster pool management, not Go/Ginkgo tests.
Microshift Test Compatibility	✅ Passed	PR adds no Ginkgo e2e tests—only CI step registry configs, helper scripts, and metadata files. The check targeting MicroShift test compatibility does not apply.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR adds only CI/CD infrastructure (bash scripts, YAML configs, metadata) for ROSA pool management, no Ginkgo e2e tests. Check not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds CI step registry references, workflows, and bash scripts for ROSA pool management. No Kubernetes manifests with scheduling constraints are introduced.
Ote Binary Stdout Contract	✅ Passed	PR adds shell scripts for CI infrastructure (pool checkout/checkin/health), not Go test binaries. OTE Binary Stdout Contract applies only to Go test binaries integrating with openshift-tests.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR. Changes consist only of bash scripts, YAML CI step registry configurations, and OWNERS files. The custom check is not applicable.
No-Weak-Crypto	✅ Passed	No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret comparisons found in the PR files.
Container-Privileges	✅ Passed	No container manifests with privilege escalation flags found. PR adds ci-operator step refs and bash scripts only; no privileged/hostPID/hostNetwork/SYS_ADMIN settings present.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-08T15:35:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/rosa/OWNERS~~ [dustman9000]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot · 2026-06-08T15:38:06Z

@dustman9000, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not determine changed registry steps: could not load step registry: invalid file name: /var/tmp/gitrepo2906168756/ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

openshift-ci · 2026-06-08T15:41:58Z

@dustman9000: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/owners	`09da796`	link	true	`/test owners`
ci/prow/ci-operator-registry	`09da796`	link	true	`/test ci-operator-registry`
ci/prow/ci-operator-config	`09da796`	link	true	`/test ci-operator-config`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh`:
- Around line 149-163: The kubeconfig fallback can authenticate to a different
cluster; after copying BACKPLANE_KUBECONFIG (or the fallback
${HOME}/.kube/config) to "${SHARED_DIR}/kubeconfig", explicitly verify the
target matches the expected cluster before proceeding: set
KUBECONFIG="${SHARED_DIR}/kubeconfig", extract the server/cluster identifier
from the produced kubeconfig (e.g. via oc whoami --show-server or oc config
view) and compare it to the expected server/cluster derived from
BACKPLANE_KUBECONFIG (or an explicit expected variable); if they differ, log an
error and exit instead of continuing to run E2E commands. Ensure this check is
performed prior to the existing oc whoami/oc get nodes verification.

In `@ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh`:
- Around line 128-150: After a successful ocm backplane login, the script
currently treats a missing BACKPLANE_KC or a failing oc node query as healthy
and may abort the run due to errexit/pipefail; update the branch that defines
BACKPLANE_KC to explicitly check [[ -f "${BACKPLANE_KC}" ]] and if missing mark
the pool as UNHEALTHY (increment UNHEALTHY, log UNHEALTHY message, patch
configmap as in the Ready=0 case, and continue), and change the NODE_COUNT and
READY_NODES retrieval to guard against oc failures (run oc in an if/then or
append || true and test command exit status) so a failing oc get nodes does not
exit the script but instead triggers the same UNHEALTHY handling used when
READY_NODES is 0; ensure references to BACKPLANE_KC, NODE_COUNT, READY_NODES,
and the ocm backplane login branch are used to locate the changes.

In `@ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh`:
- Around line 125-130: The current ocm query masks real failures because
CLUSTER_JSON is assigned with "ocm list clusters ... || true", causing
network/auth errors to appear as "cluster not found"; remove the "|| true" and
instead check the ocm command exit status after running the command (the
invocation around CLUSTER_JSON and the ocm list clusters call), log the actual
error output and exit non‑zero on failures, then only treat an empty CLUSTER_ID
(the jq result assigned to CLUSTER_ID) as the true "cluster not found" case for
POOL_NAME; ensure logs include the failing command, its stderr, and the
POOL_NAME to aid debugging.
- Around line 86-87: The current FULL_VERSION assignment uses string sorting of
.raw_id which picks e.g. 4.22.9 over 4.22.10; update the jq sort to perform
version-aware numeric sorting by splitting .raw_id into components and mapping
them to numbers (e.g., sort_by(.raw_id | split(".") | map(tonumber))) before
taking last .raw_id so the latest patch is chosen; adjust the pipeline around
the rosa list versions invocation (the command that sets FULL_VERSION) to use
this numeric/version-aware sort while keeping the same filters for CHANNEL_GROUP
and VERSION.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 66cf22e6-806a-49b4-8e1c-cb934fbf1c35

📥 Commits

Reviewing files that changed from the base of the PR and between 92a93e9 and 09da796.

📒 Files selected for processing (17)

ci-operator/step-registry/rosa/pool/OWNERS
ci-operator/step-registry/rosa/pool/checkin/OWNERS
ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-commands.sh
ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.metadata.json
ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.yaml
ci-operator/step-registry/rosa/pool/checkout/OWNERS
ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh
ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.metadata.json
ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.yaml
ci-operator/step-registry/rosa/pool/e2e-workflow/OWNERS
ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.metadata.json
ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.yaml
ci-operator/step-registry/rosa/pool/health/OWNERS
ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh
ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.metadata.json
ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.yaml
ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh

coderabbitai · 2026-06-08T15:46:14Z

+    cp "${BACKPLANE_KUBECONFIG}" "${SHARED_DIR}/kubeconfig"
+elif [[ -f "${HOME}/.kube/config" ]]; then
+    cp "${HOME}/.kube/config" "${SHARED_DIR}/kubeconfig"
+else
+    log "ERROR: No kubeconfig produced by backplane login"
+    exit 1
+fi
+
+# Verify cluster access
+export KUBECONFIG="${SHARED_DIR}/kubeconfig"
+if oc whoami &>/dev/null; then
+    log "Verified cluster access: $(oc whoami --show-server)"
+    log "Nodes: $(oc get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')"
+else
+    log "WARNING: Could not verify cluster access (oc whoami failed)"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate kubeconfig target matches the claimed cluster before continuing.

Line 150-Line 163 can succeed with a kubeconfig that authenticates but points at the wrong cluster (fallback path). That can run E2E actions against an unintended target.

Suggested fix

# Verify cluster access export KUBECONFIG="${SHARED_DIR}/kubeconfig" if oc whoami &>/dev/null; then - log "Verified cluster access: $(oc whoami --show-server)" + CURRENT_SERVER="$(oc whoami --show-server)" + EXPECTED_SERVER="$(cat "${SHARED_DIR}/api-url" 2>/dev/null || true)" + if [[ -n "${EXPECTED_SERVER}" && "${CURRENT_SERVER}" != "${EXPECTED_SERVER}" ]]; then + log "ERROR: Kubeconfig server mismatch. Expected ${EXPECTED_SERVER}, got ${CURRENT_SERVER}" + exit 1 + fi + log "Verified cluster access: ${CURRENT_SERVER}" log "Nodes: $(oc get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')" else log "WARNING: Could not verify cluster access (oc whoami failed)" fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh` around lines 149 - 163, The kubeconfig fallback can authenticate to a different cluster; after copying BACKPLANE_KUBECONFIG (or the fallback ${HOME}/.kube/config) to "${SHARED_DIR}/kubeconfig", explicitly verify the target matches the expected cluster before proceeding: set KUBECONFIG="${SHARED_DIR}/kubeconfig", extract the server/cluster identifier from the produced kubeconfig (e.g. via oc whoami --show-server or oc config view) and compare it to the expected server/cluster derived from BACKPLANE_KUBECONFIG (or an explicit expected variable); if they differ, log an error and exit instead of continuing to run E2E commands. Ensure this check is performed prior to the existing oc whoami/oc get nodes verification.

coderabbitai · 2026-06-08T15:46:15Z

+    if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then
+        BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config"
+        if [[ -f "${BACKPLANE_KC}" ]]; then
+            NODE_COUNT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
+            READY_NODES=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | grep -c " Ready" || echo "0")
+
+            echo "  OCM status: ready" >> "${REPORT}"
+            echo "  Nodes: ${READY_NODES}/${NODE_COUNT} ready" >> "${REPORT}"
+
+            if [[ "${READY_NODES}" -eq 0 ]]; then
+                log "UNHEALTHY: ${CM_NAME} has no ready nodes"
+                if [[ "${STATUS}" != "error" ]]; then
+                    pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{
+                        "metadata": {
+                            "labels": { "rosa-pool/status": "error" },
+                            "annotations": { "rosa-pool/error-reason": "No ready nodes" }
+                        }
+                    }' || true
+                fi
+                UNHEALTHY=$((UNHEALTHY + 1))
+                continue
+            fi
+        fi


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh" echo "== File header / shell options ==" sed -n '1,220p' "$FILE" | nl -ba | sed -n '1,220p' echo echo "== Focused region around 110-190 ==" sed -n '100,200p' "$FILE" | nl -ba echo echo "== Find occurrences ==" rg -n "errexit|pipefail|UNHEALTHY|STATUS|continue|pool_oc patch|Nodes:" "$FILE" || true echo echo "== Outer loop / health classification context ==" # Try to capture a bit more around where UNHEALTHY/STATUS are set and where final health is decided sed -n '200,340p' "$FILE" | nl -ba

Repository: openshift/release

Length of output: 156

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh" echo "== File header / shell options ==" sed -n '1,120p' "$FILE" | cat -n echo echo "== Focused region around 110-190 ==" sed -n '100,220p' "$FILE" | cat -n echo echo "== Find occurrences ==" rg -n "errexit|pipefail|UNHEALTHY|STATUS|continue|pool_oc patch|Nodes:" "$FILE" || true echo echo "== Outer loop / health classification context ==" sed -n '220,420p' "$FILE" | cat -n

Repository: openshift/release

Length of output: 12198

Fix backplane health checks to treat missing kubeconfig and node-query failures as UNHEALTHY (avoid aborting the run)

When ocm backplane login succeeds but ${HOME}/.kube/backplane/${CLUSTER_ID}/config is missing, the logic falls through and the cluster is counted as HEALTHY.

With set -o errexit + set -o pipefail, NODE_COUNT=$(oc ... get nodes ... | wc -l ...) will terminate the whole script if the oc node listing fails.

Suggested fix

- if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then - BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config" - if [[ -f "${BACKPLANE_KC}" ]]; then - NODE_COUNT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ') - READY_NODES=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | grep -c " Ready" || echo "0") + if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then + BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config" + if [[ ! -f "${BACKPLANE_KC}" ]]; then + log "UNHEALTHY: ${CM_NAME} backplane kubeconfig not found" + echo " Backplane: kubeconfig missing (UNHEALTHY)" >> "${REPORT}" + if [[ "${STATUS}" != "error" ]]; then + pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{ + "metadata": { + "labels": { "rosa-pool/status": "error" }, + "annotations": { "rosa-pool/error-reason": "Backplane kubeconfig missing" } + } + }' + fi + UNHEALTHY=$((UNHEALTHY + 1)) + continue + fi + + if ! NODES_OUTPUT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null); then + log "UNHEALTHY: ${CM_NAME} failed to query nodes" + echo " Nodes: query failed (UNHEALTHY)" >> "${REPORT}" + if [[ "${STATUS}" != "error" ]]; then + pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{ + "metadata": { + "labels": { "rosa-pool/status": "error" }, + "annotations": { "rosa-pool/error-reason": "Node query failed" } + } + }' + fi + UNHEALTHY=$((UNHEALTHY + 1)) + continue + fi + + NODE_COUNT=$(printf "%s\n" "${NODES_OUTPUT}" | sed '/^$/d' | wc -l | tr -d ' ') + READY_NODES=$(printf "%s\n" "${NODES_OUTPUT}" | grep -c " Ready" || true) echo " OCM status: ready" >> "${REPORT}" echo " Nodes: ${READY_NODES}/${NODE_COUNT} ready" >> "${REPORT}" @@ - fi else

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh` around lines 128 - 150, After a successful ocm backplane login, the script currently treats a missing BACKPLANE_KC or a failing oc node query as healthy and may abort the run due to errexit/pipefail; update the branch that defines BACKPLANE_KC to explicitly check [[ -f "${BACKPLANE_KC}" ]] and if missing mark the pool as UNHEALTHY (increment UNHEALTHY, log UNHEALTHY message, patch configmap as in the Ready=0 case, and continue), and change the NODE_COUNT and READY_NODES retrieval to guard against oc failures (run oc in an if/then or append || true and test command exit status) so a failing oc get nodes does not exit the script but instead triggers the same UNHEALTHY handling used when READY_NODES is 0; ensure references to BACKPLANE_KC, NODE_COUNT, READY_NODES, and the ocm backplane login branch are used to locate the changes.

coderabbitai · 2026-06-08T15:46:15Z

+    FULL_VERSION=$(rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null | \
+        jq -r '[.[] | select(.raw_id | startswith("'"${VERSION}"'")) | select(.enabled == true)] | sort_by(.raw_id) | last | .raw_id')


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use version-aware sorting when selecting the latest ROSA version.

At Line 86–87, sorting by .raw_id as a plain string can choose the wrong patch (e.g., 4.22.9 over 4.22.10).

Suggested fix

- FULL_VERSION=$(rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null | \ - jq -r '[.[] | select(.raw_id | startswith("'"${VERSION}"'")) | select(.enabled == true)] | sort_by(.raw_id) | last | .raw_id') + FULL_VERSION=$( + rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null \ + | jq -r '.[] | select(.enabled == true) | select(.raw_id | startswith("'"${VERSION}"'")) | .raw_id' \ + | sort -V \ + | tail -n1 + )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh` around lines 86 - 87, The current FULL_VERSION assignment uses string sorting of .raw_id which picks e.g. 4.22.9 over 4.22.10; update the jq sort to perform version-aware numeric sorting by splitting .raw_id into components and mapping them to numbers (e.g., sort_by(.raw_id | split(".") | map(tonumber))) before taking last .raw_id so the latest patch is chosen; adjust the pipeline around the rosa list versions invocation (the command that sets FULL_VERSION) to use this numeric/version-aware sort while keeping the same filters for CHANNEL_GROUP and VERSION.

coderabbitai · 2026-06-08T15:46:15Z

+CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true)
+CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')
+
+if [[ -z "${CLUSTER_ID}" ]]; then
+    log "ERROR: Cluster ${POOL_NAME} not found in OCM"
+    exit 1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not mask OCM query failures as “cluster not found.”

At Line 125, || true suppresses real API/auth/network errors and turns them into a misleading not-found path at Line 129.

Suggested fix

-CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true) +if ! CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null); then + log "ERROR: Failed to query OCM for cluster ${POOL_NAME}" + exit 1 +fi CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true)

CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')

if [[ -z "${CLUSTER_ID}" ]]; then

log "ERROR: Cluster ${POOL_NAME} not found in OCM"

exit 1

if ! CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null); then

log "ERROR: Failed to query OCM for cluster ${POOL_NAME}"

exit 1

fi

CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')

if [[ -z "${CLUSTER_ID}" ]]; then

log "ERROR: Cluster ${POOL_NAME} not found in OCM"

exit 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh` around lines 125 - 130, The current ocm query masks real failures because CLUSTER_JSON is assigned with "ocm list clusters ... || true", causing network/auth errors to appear as "cluster not found"; remove the "|| true" and instead check the ocm command exit status after running the command (the invocation around CLUSTER_JSON and the ocm list clusters call), log the actual error output and exit non‑zero on failures, then only treat an empty CLUSTER_ID (the jq result assigned to CLUSTER_ID) as the true "cluster not found" case for POOL_NAME; ensure logs include the failing command, its stderr, and the POOL_NAME to aid debugging.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 8, 2026

openshift-ci Bot requested review from amandahla and jtaleric June 8, 2026 15:35

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 8, 2026

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ROSAENG-59268: Add ROSA cluster standby pool for operator e2e testing#80232

ROSAENG-59268: Add ROSA cluster standby pool for operator e2e testing#80232
dustman9000 wants to merge 1 commit into
openshift:mainfrom
dustman9000:rosa-pool-standby

dustman9000 commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci-robot commented Jun 8, 2026 •

edited by atlassian Bot

Loading

Summary

Test plan

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Pre-merge checks failed

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 8, 2026

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Uh oh!

coderabbitai Bot Jun 8, 2026

Uh oh!

coderabbitai Bot Jun 8, 2026

Uh oh!

coderabbitai Bot Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		FULL_VERSION=$(rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null \| \
		jq -r '[.[] \| select(.raw_id \| startswith("'"${VERSION}"'")) \| select(.enabled == true)] \| sort_by(.raw_id) \| last \| .raw_id')

Conversation

dustman9000 commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

openshift-ci-robot commented Jun 8, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning)

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 8, 2026

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dustman9000 commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented Jun 8, 2026 •

edited by atlassian Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading