Skip to content

ROSAENG-59268: Add ROSA cluster standby pool for operator e2e testing#80232

Open
dustman9000 wants to merge 1 commit into
openshift:mainfrom
dustman9000:rosa-pool-standby
Open

ROSAENG-59268: Add ROSA cluster standby pool for operator e2e testing#80232
dustman9000 wants to merge 1 commit into
openshift:mainfrom
dustman9000:rosa-pool-standby

Conversation

@dustman9000

@dustman9000 dustman9000 commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

  • Add step registry refs for pool-based cluster checkout/checkin, replacing 30+ minute ephemeral cluster provisioning with ~1 minute checkout from pre-provisioned ROSA Classic STS clusters
  • New rosa-pool-checkout ref uses ConfigMap-based lease tracking with optimistic concurrency (CAS via oc replace) for safe concurrent claims
  • New rosa-pool-checkin ref returns clusters to the pool (best-effort, health check recovers stale leases)
  • New rosa-pool-health ref for periodic cluster health checks and stale lease recovery
  • New rosa-pool-e2e-workflow workflow: checkout -> operator-install -> e2e -> cleanup -> checkin
  • Provision helper script for initial cluster creation and ConfigMap registration

This is the step registry foundation for cluster standby pools. Infrastructure setup (namespace + RBAC on app.ci, pool cluster provisioning) tracked separately.

Jira: https://redhat.atlassian.net/browse/ROSAENG-59268

Test plan

  • YAML syntax validated locally (all files pass yaml.safe_load)
  • Shell syntax validated locally (all scripts pass bash -n)
  • make registry-metadata succeeds (metadata files auto-generated)
  • pj-rehearse validates step registry changes
  • End-to-end validation after infrastructure setup (tracked in Jira)

A detailed high-level summary could not be generated for this review. Here is an overview derived from the analyzed file changes:

  • ci-operator/step-registry/rosa/pool/OWNERS: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkin/OWNERS: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-commands.sh: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.metadata.json: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.yaml: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkout/OWNERS: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.metadata.json: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.yaml: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/e2e-workflow/OWNERS: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.metadata.json: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.yaml: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/health/OWNERS: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.metadata.json: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.yaml: ## AI-generated summary of changes
  • ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh: ## AI-generated summary of changes

Add step registry refs for pool-based cluster checkout/checkin,
replacing 30+ minute ephemeral cluster provisioning with ~1 minute
checkout from pre-provisioned clusters.

New components:
- rosa-pool-checkout: CAS-based ConfigMap claim with wait loop
- rosa-pool-checkin: best-effort cluster return
- rosa-pool-health: periodic health check and stale lease recovery
- rosa-pool-e2e-workflow: pool variant of operator e2e workflow
- rosa-pool-provision.sh: helper for initial cluster provisioning

Jira: https://redhat.atlassian.net/browse/ROSAENG-59268
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 8, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@dustman9000: This pull request references ROSAENG-59268 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add step registry refs for pool-based cluster checkout/checkin, replacing 30+ minute ephemeral cluster provisioning with ~1 minute checkout from pre-provisioned ROSA Classic STS clusters
  • New rosa-pool-checkout ref uses ConfigMap-based lease tracking with optimistic concurrency (CAS via oc replace) for safe concurrent claims
  • New rosa-pool-checkin ref returns clusters to the pool (best-effort, health check recovers stale leases)
  • New rosa-pool-health ref for periodic cluster health checks and stale lease recovery
  • New rosa-pool-e2e-workflow workflow: checkout -> operator-install -> e2e -> cleanup -> checkin
  • Provision helper script for initial cluster creation and ConfigMap registration

This is the step registry foundation for cluster standby pools. Infrastructure setup (namespace + RBAC on app.ci, pool cluster provisioning) tracked separately.

Jira: https://redhat.atlassian.net/browse/ROSAENG-59268

Test plan

  • YAML syntax validated locally (all files pass yaml.safe_load)
  • Shell syntax validated locally (all scripts pass bash -n)
  • make registry-metadata succeeds (metadata files auto-generated)
  • pj-rehearse validates step registry changes
  • End-to-end validation after infrastructure setup (tracked in Jira)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR adds CI step registry infrastructure for ROSA pool management comprising five components: pool checkout (atomic ConfigMap-based lease claiming with OCM/backplane auth), pool checkin (returning clusters to available state), pool health monitoring (OCM status, stale lease recovery, backplane connectivity), pool provisioning (cluster creation and registration), and E2E workflow composition orchestrating the full lifecycle.

Changes

ROSA Pool Lifecycle Operations

Layer / File(s) Summary
Pool claim and checkout
ci-operator/step-registry/rosa/pool/checkout/OWNERS, ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh, ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.{metadata.json,yaml}
Checkout implements atomic ConfigMap-based lease claiming with timeout/retry, OCM authentication, backplane kubeconfig retrieval, and cluster access verification for pool-allocated ROSA clusters.
Pool return and checkin
ci-operator/step-registry/rosa/pool/checkin/OWNERS, ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-commands.sh, ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.{metadata.json,yaml}
Checkin returns checked-out clusters to the pool by patching ConfigMap status to available, clearing holder/build-id annotations, and recording released-at timestamp.
Pool health monitoring and recovery
ci-operator/step-registry/rosa/pool/health/OWNERS, ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh, ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.{metadata.json,yaml}
Health monitoring queries OCM readiness status, checks backplane connectivity and node readiness, detects and force-recovers stale leases via ConfigMap patching, manages error state transitions, and writes summary report.
Pool provisioning and registration
ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh
Provision script creates new ROSA clusters via rosa-cli (when not register-only), queries OCM for cluster metadata, computes version labels, and registers clusters into the pool namespace by applying ConfigMap manifests with status fields.
E2E workflow composition and root OWNERS
ci-operator/step-registry/rosa/pool/OWNERS, ci-operator/step-registry/rosa/pool/e2e-workflow/OWNERS, ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.{metadata.json,yaml}
E2E workflow composes checkout/install/test/cleanup/checkin steps into a parameterized test workflow with default pool/OCM configuration and environment variable documentation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm, rehearsals-ack


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error Scripts expose API tokens and secrets as command-line arguments to ocm login, visible in process logs and CI output. Pass credentials via OCM_TOKEN environment variable or stdin instead of --token/--client-secret command-line arguments.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding a ROSA cluster standby pool for operator e2e testing, which is the central objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR contains no Ginkgo test files or test code; check for stable test names is not applicable to infrastructure configuration and bash scripts only.
Test Structure And Quality ✅ Passed This PR contains no Ginkgo test code to review. It adds infrastructure automation (OWNERS files, YAML configurations, and bash scripts) for ROSA cluster pool management, not Go/Ginkgo tests.
Microshift Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests—only CI step registry configs, helper scripts, and metadata files. The check targeting MicroShift test compatibility does not apply.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds only CI/CD infrastructure (bash scripts, YAML configs, metadata) for ROSA pool management, no Ginkgo e2e tests. Check not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds CI step registry references, workflows, and bash scripts for ROSA pool management. No Kubernetes manifests with scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed PR adds shell scripts for CI infrastructure (pool checkout/checkin/health), not Go test binaries. OTE Binary Stdout Contract applies only to Go test binaries integrating with openshift-tests.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. Changes consist only of bash scripts, YAML CI step registry configurations, and OWNERS files. The custom check is not applicable.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret comparisons found in the PR files.
Container-Privileges ✅ Passed No container manifests with privilege escalation flags found. PR adds ci-operator step refs and bash scripts only; no privileged/hostPID/hostNetwork/SYS_ADMIN settings present.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from amandahla and jtaleric June 8, 2026 15:35
@openshift-ci

openshift-ci Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 8, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

@dustman9000, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not determine changed registry steps: could not load step registry: invalid file name: /var/tmp/gitrepo2906168756/ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci

openshift-ci Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@dustman9000: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/owners 09da796 link true /test owners
ci/prow/ci-operator-registry 09da796 link true /test ci-operator-registry
ci/prow/ci-operator-config 09da796 link true /test ci-operator-config

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh`:
- Around line 149-163: The kubeconfig fallback can authenticate to a different
cluster; after copying BACKPLANE_KUBECONFIG (or the fallback
${HOME}/.kube/config) to "${SHARED_DIR}/kubeconfig", explicitly verify the
target matches the expected cluster before proceeding: set
KUBECONFIG="${SHARED_DIR}/kubeconfig", extract the server/cluster identifier
from the produced kubeconfig (e.g. via oc whoami --show-server or oc config
view) and compare it to the expected server/cluster derived from
BACKPLANE_KUBECONFIG (or an explicit expected variable); if they differ, log an
error and exit instead of continuing to run E2E commands. Ensure this check is
performed prior to the existing oc whoami/oc get nodes verification.

In `@ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh`:
- Around line 128-150: After a successful ocm backplane login, the script
currently treats a missing BACKPLANE_KC or a failing oc node query as healthy
and may abort the run due to errexit/pipefail; update the branch that defines
BACKPLANE_KC to explicitly check [[ -f "${BACKPLANE_KC}" ]] and if missing mark
the pool as UNHEALTHY (increment UNHEALTHY, log UNHEALTHY message, patch
configmap as in the Ready=0 case, and continue), and change the NODE_COUNT and
READY_NODES retrieval to guard against oc failures (run oc in an if/then or
append || true and test command exit status) so a failing oc get nodes does not
exit the script but instead triggers the same UNHEALTHY handling used when
READY_NODES is 0; ensure references to BACKPLANE_KC, NODE_COUNT, READY_NODES,
and the ocm backplane login branch are used to locate the changes.

In `@ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh`:
- Around line 125-130: The current ocm query masks real failures because
CLUSTER_JSON is assigned with "ocm list clusters ... || true", causing
network/auth errors to appear as "cluster not found"; remove the "|| true" and
instead check the ocm command exit status after running the command (the
invocation around CLUSTER_JSON and the ocm list clusters call), log the actual
error output and exit non‑zero on failures, then only treat an empty CLUSTER_ID
(the jq result assigned to CLUSTER_ID) as the true "cluster not found" case for
POOL_NAME; ensure logs include the failing command, its stderr, and the
POOL_NAME to aid debugging.
- Around line 86-87: The current FULL_VERSION assignment uses string sorting of
.raw_id which picks e.g. 4.22.9 over 4.22.10; update the jq sort to perform
version-aware numeric sorting by splitting .raw_id into components and mapping
them to numbers (e.g., sort_by(.raw_id | split(".") | map(tonumber))) before
taking last .raw_id so the latest patch is chosen; adjust the pipeline around
the rosa list versions invocation (the command that sets FULL_VERSION) to use
this numeric/version-aware sort while keeping the same filters for CHANNEL_GROUP
and VERSION.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 66cf22e6-806a-49b4-8e1c-cb934fbf1c35

📥 Commits

Reviewing files that changed from the base of the PR and between 92a93e9 and 09da796.

📒 Files selected for processing (17)
  • ci-operator/step-registry/rosa/pool/OWNERS
  • ci-operator/step-registry/rosa/pool/checkin/OWNERS
  • ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-commands.sh
  • ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.metadata.json
  • ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.yaml
  • ci-operator/step-registry/rosa/pool/checkout/OWNERS
  • ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh
  • ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.metadata.json
  • ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.yaml
  • ci-operator/step-registry/rosa/pool/e2e-workflow/OWNERS
  • ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.metadata.json
  • ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.yaml
  • ci-operator/step-registry/rosa/pool/health/OWNERS
  • ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh
  • ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.metadata.json
  • ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.yaml
  • ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh

Comment on lines +149 to +163
cp "${BACKPLANE_KUBECONFIG}" "${SHARED_DIR}/kubeconfig"
elif [[ -f "${HOME}/.kube/config" ]]; then
cp "${HOME}/.kube/config" "${SHARED_DIR}/kubeconfig"
else
log "ERROR: No kubeconfig produced by backplane login"
exit 1
fi

# Verify cluster access
export KUBECONFIG="${SHARED_DIR}/kubeconfig"
if oc whoami &>/dev/null; then
log "Verified cluster access: $(oc whoami --show-server)"
log "Nodes: $(oc get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')"
else
log "WARNING: Could not verify cluster access (oc whoami failed)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate kubeconfig target matches the claimed cluster before continuing.

Line 150-Line 163 can succeed with a kubeconfig that authenticates but points at the wrong cluster (fallback path). That can run E2E actions against an unintended target.

Suggested fix
 # Verify cluster access
 export KUBECONFIG="${SHARED_DIR}/kubeconfig"
 if oc whoami &>/dev/null; then
-    log "Verified cluster access: $(oc whoami --show-server)"
+    CURRENT_SERVER="$(oc whoami --show-server)"
+    EXPECTED_SERVER="$(cat "${SHARED_DIR}/api-url" 2>/dev/null || true)"
+    if [[ -n "${EXPECTED_SERVER}" && "${CURRENT_SERVER}" != "${EXPECTED_SERVER}" ]]; then
+        log "ERROR: Kubeconfig server mismatch. Expected ${EXPECTED_SERVER}, got ${CURRENT_SERVER}"
+        exit 1
+    fi
+    log "Verified cluster access: ${CURRENT_SERVER}"
     log "Nodes: $(oc get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')"
 else
     log "WARNING: Could not verify cluster access (oc whoami failed)"
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh`
around lines 149 - 163, The kubeconfig fallback can authenticate to a different
cluster; after copying BACKPLANE_KUBECONFIG (or the fallback
${HOME}/.kube/config) to "${SHARED_DIR}/kubeconfig", explicitly verify the
target matches the expected cluster before proceeding: set
KUBECONFIG="${SHARED_DIR}/kubeconfig", extract the server/cluster identifier
from the produced kubeconfig (e.g. via oc whoami --show-server or oc config
view) and compare it to the expected server/cluster derived from
BACKPLANE_KUBECONFIG (or an explicit expected variable); if they differ, log an
error and exit instead of continuing to run E2E commands. Ensure this check is
performed prior to the existing oc whoami/oc get nodes verification.

Comment on lines +128 to +150
if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then
BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config"
if [[ -f "${BACKPLANE_KC}" ]]; then
NODE_COUNT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
READY_NODES=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | grep -c " Ready" || echo "0")

echo " OCM status: ready" >> "${REPORT}"
echo " Nodes: ${READY_NODES}/${NODE_COUNT} ready" >> "${REPORT}"

if [[ "${READY_NODES}" -eq 0 ]]; then
log "UNHEALTHY: ${CM_NAME} has no ready nodes"
if [[ "${STATUS}" != "error" ]]; then
pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{
"metadata": {
"labels": { "rosa-pool/status": "error" },
"annotations": { "rosa-pool/error-reason": "No ready nodes" }
}
}' || true
fi
UNHEALTHY=$((UNHEALTHY + 1))
continue
fi
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh"

echo "== File header / shell options =="
sed -n '1,220p' "$FILE" | nl -ba | sed -n '1,220p'

echo
echo "== Focused region around 110-190 =="
sed -n '100,200p' "$FILE" | nl -ba

echo
echo "== Find occurrences =="
rg -n "errexit|pipefail|UNHEALTHY|STATUS|continue|pool_oc patch|Nodes:" "$FILE" || true

echo
echo "== Outer loop / health classification context =="
# Try to capture a bit more around where UNHEALTHY/STATUS are set and where final health is decided
sed -n '200,340p' "$FILE" | nl -ba

Repository: openshift/release

Length of output: 156


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh"

echo "== File header / shell options =="
sed -n '1,120p' "$FILE" | cat -n

echo
echo "== Focused region around 110-190 =="
sed -n '100,220p' "$FILE" | cat -n

echo
echo "== Find occurrences =="
rg -n "errexit|pipefail|UNHEALTHY|STATUS|continue|pool_oc patch|Nodes:" "$FILE" || true

echo
echo "== Outer loop / health classification context =="
sed -n '220,420p' "$FILE" | cat -n

Repository: openshift/release

Length of output: 12198


Fix backplane health checks to treat missing kubeconfig and node-query failures as UNHEALTHY (avoid aborting the run)

  • When ocm backplane login succeeds but ${HOME}/.kube/backplane/${CLUSTER_ID}/config is missing, the logic falls through and the cluster is counted as HEALTHY.
  • With set -o errexit + set -o pipefail, NODE_COUNT=$(oc ... get nodes ... | wc -l ...) will terminate the whole script if the oc node listing fails.
Suggested fix
-    if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then
-        BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config"
-        if [[ -f "${BACKPLANE_KC}" ]]; then
-            NODE_COUNT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
-            READY_NODES=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | grep -c " Ready" || echo "0")
+    if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then
+        BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config"
+        if [[ ! -f "${BACKPLANE_KC}" ]]; then
+            log "UNHEALTHY: ${CM_NAME} backplane kubeconfig not found"
+            echo "  Backplane: kubeconfig missing (UNHEALTHY)" >> "${REPORT}"
+            if [[ "${STATUS}" != "error" ]]; then
+                pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{
+                    "metadata": {
+                        "labels": { "rosa-pool/status": "error" },
+                        "annotations": { "rosa-pool/error-reason": "Backplane kubeconfig missing" }
+                    }
+                }'
+            fi
+            UNHEALTHY=$((UNHEALTHY + 1))
+            continue
+        fi
+
+        if ! NODES_OUTPUT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null); then
+            log "UNHEALTHY: ${CM_NAME} failed to query nodes"
+            echo "  Nodes: query failed (UNHEALTHY)" >> "${REPORT}"
+            if [[ "${STATUS}" != "error" ]]; then
+                pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{
+                    "metadata": {
+                        "labels": { "rosa-pool/status": "error" },
+                        "annotations": { "rosa-pool/error-reason": "Node query failed" }
+                    }
+                }'
+            fi
+            UNHEALTHY=$((UNHEALTHY + 1))
+            continue
+        fi
+
+        NODE_COUNT=$(printf "%s\n" "${NODES_OUTPUT}" | sed '/^$/d' | wc -l | tr -d ' ')
+        READY_NODES=$(printf "%s\n" "${NODES_OUTPUT}" | grep -c " Ready" || true)
 
             echo "  OCM status: ready" >> "${REPORT}"
             echo "  Nodes: ${READY_NODES}/${NODE_COUNT} ready" >> "${REPORT}"
@@
-        fi
     else
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh`
around lines 128 - 150, After a successful ocm backplane login, the script
currently treats a missing BACKPLANE_KC or a failing oc node query as healthy
and may abort the run due to errexit/pipefail; update the branch that defines
BACKPLANE_KC to explicitly check [[ -f "${BACKPLANE_KC}" ]] and if missing mark
the pool as UNHEALTHY (increment UNHEALTHY, log UNHEALTHY message, patch
configmap as in the Ready=0 case, and continue), and change the NODE_COUNT and
READY_NODES retrieval to guard against oc failures (run oc in an if/then or
append || true and test command exit status) so a failing oc get nodes does not
exit the script but instead triggers the same UNHEALTHY handling used when
READY_NODES is 0; ensure references to BACKPLANE_KC, NODE_COUNT, READY_NODES,
and the ocm backplane login branch are used to locate the changes.

Comment on lines +86 to +87
FULL_VERSION=$(rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null | \
jq -r '[.[] | select(.raw_id | startswith("'"${VERSION}"'")) | select(.enabled == true)] | sort_by(.raw_id) | last | .raw_id')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use version-aware sorting when selecting the latest ROSA version.

At Line 86–87, sorting by .raw_id as a plain string can choose the wrong patch (e.g., 4.22.9 over 4.22.10).

Suggested fix
-    FULL_VERSION=$(rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null | \
-        jq -r '[.[] | select(.raw_id | startswith("'"${VERSION}"'")) | select(.enabled == true)] | sort_by(.raw_id) | last | .raw_id')
+    FULL_VERSION=$(
+        rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null \
+        | jq -r '.[] | select(.enabled == true) | select(.raw_id | startswith("'"${VERSION}"'")) | .raw_id' \
+        | sort -V \
+        | tail -n1
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh` around
lines 86 - 87, The current FULL_VERSION assignment uses string sorting of
.raw_id which picks e.g. 4.22.9 over 4.22.10; update the jq sort to perform
version-aware numeric sorting by splitting .raw_id into components and mapping
them to numbers (e.g., sort_by(.raw_id | split(".") | map(tonumber))) before
taking last .raw_id so the latest patch is chosen; adjust the pipeline around
the rosa list versions invocation (the command that sets FULL_VERSION) to use
this numeric/version-aware sort while keeping the same filters for CHANNEL_GROUP
and VERSION.

Comment on lines +125 to +130
CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true)
CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')

if [[ -z "${CLUSTER_ID}" ]]; then
log "ERROR: Cluster ${POOL_NAME} not found in OCM"
exit 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not mask OCM query failures as “cluster not found.”

At Line 125, || true suppresses real API/auth/network errors and turns them into a misleading not-found path at Line 129.

Suggested fix
-CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true)
+if ! CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null); then
+    log "ERROR: Failed to query OCM for cluster ${POOL_NAME}"
+    exit 1
+fi
 CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true)
CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')
if [[ -z "${CLUSTER_ID}" ]]; then
log "ERROR: Cluster ${POOL_NAME} not found in OCM"
exit 1
if ! CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null); then
log "ERROR: Failed to query OCM for cluster ${POOL_NAME}"
exit 1
fi
CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')
if [[ -z "${CLUSTER_ID}" ]]; then
log "ERROR: Cluster ${POOL_NAME} not found in OCM"
exit 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh` around
lines 125 - 130, The current ocm query masks real failures because CLUSTER_JSON
is assigned with "ocm list clusters ... || true", causing network/auth errors to
appear as "cluster not found"; remove the "|| true" and instead check the ocm
command exit status after running the command (the invocation around
CLUSTER_JSON and the ocm list clusters call), log the actual error output and
exit non‑zero on failures, then only treat an empty CLUSTER_ID (the jq result
assigned to CLUSTER_ID) as the true "cluster not found" case for POOL_NAME;
ensure logs include the failing command, its stderr, and the POOL_NAME to aid
debugging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants