ROSAENG-59268: Add ROSA cluster standby pool for operator e2e testing #80232
dustman9000 wants to merge 1 commit into
Add step registry refs for pool-based cluster checkout/checkin, replacing 30+ minute ephemeral cluster provisioning with ~1 minute checkout from pre-provisioned clusters.
New components:
- rosa-pool-checkout: CAS-based ConfigMap claim with wait loop
- rosa-pool-checkin: best-effort cluster return
- rosa-pool-health: periodic health check and stale lease recovery
- rosa-pool-e2e-workflow: pool variant of operator e2e workflow
- rosa-pool-provision.sh: helper for initial cluster provisioning
Jira: https://redhat.atlassian.net/browse/ROSAENG-59268
@dustman9000: This pull request references ROSAENG-59268 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Walkthrough
This PR adds CI step registry infrastructure for ROSA pool management, comprising five components:
- pool checkout (atomic ConfigMap-based lease claiming with OCM/backplane auth)
- pool checkin (returning clusters to the available state)
- pool health monitoring (OCM status, stale lease recovery, backplane connectivity)
- pool provisioning (cluster creation and registration)
- E2E workflow composition orchestrating the full lifecycle
Changes: ROSA Pool Lifecycle Operations
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (13 passed)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dustman9000
@dustman9000: The following tests failed, say
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh`:
- Around line 149-163: The kubeconfig fallback can authenticate to a different
cluster; after copying BACKPLANE_KUBECONFIG (or the fallback
${HOME}/.kube/config) to "${SHARED_DIR}/kubeconfig", explicitly verify the
target matches the expected cluster before proceeding: set
KUBECONFIG="${SHARED_DIR}/kubeconfig", extract the server/cluster identifier
from the produced kubeconfig (e.g. via oc whoami --show-server or oc config
view) and compare it to the expected server/cluster derived from
BACKPLANE_KUBECONFIG (or an explicit expected variable); if they differ, log an
error and exit instead of continuing to run E2E commands. Ensure this check is
performed prior to the existing oc whoami/oc get nodes verification.
In `@ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh`:
- Around line 128-150: After a successful ocm backplane login, the script
currently treats a missing BACKPLANE_KC or a failing oc node query as healthy
and may abort the run due to errexit/pipefail; update the branch that defines
BACKPLANE_KC to explicitly check [[ -f "${BACKPLANE_KC}" ]] and if missing mark
the pool as UNHEALTHY (increment UNHEALTHY, log UNHEALTHY message, patch
configmap as in the Ready=0 case, and continue), and change the NODE_COUNT and
READY_NODES retrieval to guard against oc failures (run oc in an if/then or
append || true and test command exit status) so a failing oc get nodes does not
exit the script but instead triggers the same UNHEALTHY handling used when
READY_NODES is 0; ensure references to BACKPLANE_KC, NODE_COUNT, READY_NODES,
and the ocm backplane login branch are used to locate the changes.
In `@ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh`:
- Around line 125-130: The current ocm query masks real failures because
CLUSTER_JSON is assigned with "ocm list clusters ... || true", causing
network/auth errors to appear as "cluster not found"; remove the "|| true" and
instead check the ocm command exit status after running the command (the
invocation around CLUSTER_JSON and the ocm list clusters call), log the actual
error output and exit non‑zero on failures, then only treat an empty CLUSTER_ID
(the jq result assigned to CLUSTER_ID) as the true "cluster not found" case for
POOL_NAME; ensure logs include the failing command, its stderr, and the
POOL_NAME to aid debugging.
- Around line 86-87: The current FULL_VERSION assignment uses string sorting of
.raw_id which picks e.g. 4.22.9 over 4.22.10; update the jq sort to perform
version-aware numeric sorting by splitting .raw_id into components and mapping
them to numbers (e.g., sort_by(.raw_id | split(".") | map(tonumber))) before
taking last .raw_id so the latest patch is chosen; adjust the pipeline around
the rosa list versions invocation (the command that sets FULL_VERSION) to use
this numeric/version-aware sort while keeping the same filters for CHANNEL_GROUP
and VERSION.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 66cf22e6-806a-49b4-8e1c-cb934fbf1c35
📒 Files selected for processing (17)
- ci-operator/step-registry/rosa/pool/OWNERS
- ci-operator/step-registry/rosa/pool/checkin/OWNERS
- ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-commands.sh
- ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.metadata.json
- ci-operator/step-registry/rosa/pool/checkin/rosa-pool-checkin-ref.yaml
- ci-operator/step-registry/rosa/pool/checkout/OWNERS
- ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-commands.sh
- ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.metadata.json
- ci-operator/step-registry/rosa/pool/checkout/rosa-pool-checkout-ref.yaml
- ci-operator/step-registry/rosa/pool/e2e-workflow/OWNERS
- ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.metadata.json
- ci-operator/step-registry/rosa/pool/e2e-workflow/rosa-pool-e2e-workflow-workflow.yaml
- ci-operator/step-registry/rosa/pool/health/OWNERS
- ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh
- ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.metadata.json
- ci-operator/step-registry/rosa/pool/health/rosa-pool-health-ref.yaml
- ci-operator/step-registry/rosa/pool/provision/rosa-pool-provision.sh
  cp "${BACKPLANE_KUBECONFIG}" "${SHARED_DIR}/kubeconfig"
elif [[ -f "${HOME}/.kube/config" ]]; then
  cp "${HOME}/.kube/config" "${SHARED_DIR}/kubeconfig"
else
  log "ERROR: No kubeconfig produced by backplane login"
  exit 1
fi

# Verify cluster access
export KUBECONFIG="${SHARED_DIR}/kubeconfig"
if oc whoami &>/dev/null; then
  log "Verified cluster access: $(oc whoami --show-server)"
  log "Nodes: $(oc get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')"
else
  log "WARNING: Could not verify cluster access (oc whoami failed)"
Validate kubeconfig target matches the claimed cluster before continuing.
Line 150-Line 163 can succeed with a kubeconfig that authenticates but points at the wrong cluster (fallback path). That can run E2E actions against an unintended target.
Suggested fix
# Verify cluster access
export KUBECONFIG="${SHARED_DIR}/kubeconfig"
if oc whoami &>/dev/null; then
- log "Verified cluster access: $(oc whoami --show-server)"
+ CURRENT_SERVER="$(oc whoami --show-server)"
+ EXPECTED_SERVER="$(cat "${SHARED_DIR}/api-url" 2>/dev/null || true)"
+ if [[ -n "${EXPECTED_SERVER}" && "${CURRENT_SERVER}" != "${EXPECTED_SERVER}" ]]; then
+ log "ERROR: Kubeconfig server mismatch. Expected ${EXPECTED_SERVER}, got ${CURRENT_SERVER}"
+ exit 1
+ fi
+ log "Verified cluster access: ${CURRENT_SERVER}"
log "Nodes: $(oc get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')"
else
log "WARNING: Could not verify cluster access (oc whoami failed)"
fi
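The server comparison in this comment can also be pulled into a small helper. The following is a minimal sketch, not the PR's code: `server_matches` is a hypothetical name, and it assumes (as the suggested fix does) that the expected API URL is available from `${SHARED_DIR}/api-url`. Unlike the suggested fix, it fails closed when the expected URL is empty instead of skipping the check.

```shell
# Hypothetical helper: succeeds only when both API server URLs are the same.
# A single trailing slash on either side is ignored.
server_matches() {
  local expected="${1%/}" actual="${2%/}"
  # An empty expected URL is treated as a mismatch (fail closed).
  [ -n "${expected}" ] && [ "${expected}" = "${actual}" ]
}

# Possible use in the checkout step (assumed wiring, shown as comments only):
#   expected="$(cat "${SHARED_DIR}/api-url" 2>/dev/null || true)"
#   actual="$(oc whoami --show-server)"
#   server_matches "${expected}" "${actual}" || { log "ERROR: kubeconfig server mismatch"; exit 1; }
```

Whether a missing `api-url` file should block the job or only skip the comparison is a design choice for the PR author; the sketch picks the stricter option.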
if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then
  BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config"
  if [[ -f "${BACKPLANE_KC}" ]]; then
    NODE_COUNT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
    READY_NODES=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | grep -c " Ready" || echo "0")

    echo " OCM status: ready" >> "${REPORT}"
    echo " Nodes: ${READY_NODES}/${NODE_COUNT} ready" >> "${REPORT}"

    if [[ "${READY_NODES}" -eq 0 ]]; then
      log "UNHEALTHY: ${CM_NAME} has no ready nodes"
      if [[ "${STATUS}" != "error" ]]; then
        pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{
          "metadata": {
            "labels": { "rosa-pool/status": "error" },
            "annotations": { "rosa-pool/error-reason": "No ready nodes" }
          }
        }' || true
      fi
      UNHEALTHY=$((UNHEALTHY + 1))
      continue
    fi
  fi
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh"
echo "== File header / shell options =="
sed -n '1,220p' "$FILE" | nl -ba | sed -n '1,220p'
echo
echo "== Focused region around 110-190 =="
sed -n '100,200p' "$FILE" | nl -ba
echo
echo "== Find occurrences =="
rg -n "errexit|pipefail|UNHEALTHY|STATUS|continue|pool_oc patch|Nodes:" "$FILE" || true
echo
echo "== Outer loop / health classification context =="
# Try to capture a bit more around where UNHEALTHY/STATUS are set and where final health is decided
sed -n '200,340p' "$FILE" | nl -ba
Repository: openshift/release
Length of output: 156
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="ci-operator/step-registry/rosa/pool/health/rosa-pool-health-commands.sh"
echo "== File header / shell options =="
sed -n '1,120p' "$FILE" | cat -n
echo
echo "== Focused region around 110-190 =="
sed -n '100,220p' "$FILE" | cat -n
echo
echo "== Find occurrences =="
rg -n "errexit|pipefail|UNHEALTHY|STATUS|continue|pool_oc patch|Nodes:" "$FILE" || true
echo
echo "== Outer loop / health classification context =="
sed -n '220,420p' "$FILE" | cat -n
Repository: openshift/release
Length of output: 12198
Fix backplane health checks to treat missing kubeconfig and node-query failures as UNHEALTHY (avoid aborting the run)
- When `ocm backplane login` succeeds but `${HOME}/.kube/backplane/${CLUSTER_ID}/config` is missing, the logic falls through and the cluster is counted as HEALTHY.
- With `set -o errexit` + `set -o pipefail`, `NODE_COUNT=$(oc ... get nodes ... | wc -l ...)` will terminate the whole script if the `oc` node listing fails.
Suggested fix
- if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then
- BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config"
- if [[ -f "${BACKPLANE_KC}" ]]; then
- NODE_COUNT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
- READY_NODES=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null | grep -c " Ready" || echo "0")
+ if ocm backplane login "${CLUSTER_ID}" --multi 2>/dev/null; then
+ BACKPLANE_KC="${HOME}/.kube/backplane/${CLUSTER_ID}/config"
+ if [[ ! -f "${BACKPLANE_KC}" ]]; then
+ log "UNHEALTHY: ${CM_NAME} backplane kubeconfig not found"
+ echo " Backplane: kubeconfig missing (UNHEALTHY)" >> "${REPORT}"
+ if [[ "${STATUS}" != "error" ]]; then
+ pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{
+ "metadata": {
+ "labels": { "rosa-pool/status": "error" },
+ "annotations": { "rosa-pool/error-reason": "Backplane kubeconfig missing" }
+ }
+ }'
+ fi
+ UNHEALTHY=$((UNHEALTHY + 1))
+ continue
+ fi
+
+ if ! NODES_OUTPUT=$(oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers 2>/dev/null); then
+ log "UNHEALTHY: ${CM_NAME} failed to query nodes"
+ echo " Nodes: query failed (UNHEALTHY)" >> "${REPORT}"
+ if [[ "${STATUS}" != "error" ]]; then
+ pool_oc patch configmap "${CM_NAME}" -n "${POOL_NAMESPACE}" --type merge -p '{
+ "metadata": {
+ "labels": { "rosa-pool/status": "error" },
+ "annotations": { "rosa-pool/error-reason": "Node query failed" }
+ }
+ }'
+ fi
+ UNHEALTHY=$((UNHEALTHY + 1))
+ continue
+ fi
+
+ NODE_COUNT=$(printf "%s\n" "${NODES_OUTPUT}" | sed '/^$/d' | wc -l | tr -d ' ')
+ READY_NODES=$(printf "%s\n" "${NODES_OUTPUT}" | grep -c " Ready" || true)
echo " OCM status: ready" >> "${REPORT}"
echo " Nodes: ${READY_NODES}/${NODE_COUNT} ready" >> "${REPORT}"
@@
- fi
else
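The node-count guard described in this comment can be sketched as one function that runs the listing command once and counts from its captured output. This is an illustrative sketch under assumptions: `node_summary` is a hypothetical name, and the STATUS value is assumed to be the second column of `oc get nodes --no-headers`. It also sidesteps a quirk of the original `grep -c " Ready" || echo "0"`: `grep -c` prints `0` and exits non-zero when nothing matches, so the fallback can emit a second `0`.

```shell
# Hypothetical helper: runs the node-listing command given as arguments and
# prints "READY/TOTAL". Returns non-zero if the listing command itself fails,
# so the caller can mark the cluster UNHEALTHY instead of aborting.
node_summary() {
  local nodes
  # Declaring `nodes` separately keeps the command's exit status visible.
  if ! nodes="$("$@" 2>/dev/null)"; then
    return 1
  fi
  # Count non-empty lines as nodes; a STATUS that starts with "Ready"
  # (including "Ready,SchedulingDisabled") counts as ready.
  printf '%s\n' "${nodes}" |
    awk 'NF { total++ } $2 ~ /^Ready/ { ready++ } END { printf "%d/%d\n", ready, total }'
}

# Possible use in the health step (assumed wiring, shown as comments only):
#   if ! summary="$(node_summary oc --kubeconfig="${BACKPLANE_KC}" get nodes --no-headers)"; then
#     log "UNHEALTHY: ${CM_NAME} failed to query nodes"
#   fi
```

Because the command runs inside an `if` condition, a failing `oc` call does not trigger `errexit`; the caller decides how to record the failure.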
FULL_VERSION=$(rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null | \
  jq -r '[.[] | select(.raw_id | startswith("'"${VERSION}"'")) | select(.enabled == true)] | sort_by(.raw_id) | last | .raw_id')
Use version-aware sorting when selecting the latest ROSA version.
At Line 86–87, sorting by .raw_id as a plain string can choose the wrong patch (e.g., 4.22.9 over 4.22.10).
Suggested fix
- FULL_VERSION=$(rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null | \
- jq -r '[.[] | select(.raw_id | startswith("'"${VERSION}"'")) | select(.enabled == true)] | sort_by(.raw_id) | last | .raw_id')
+ FULL_VERSION=$(
+ rosa list versions --channel-group "${CHANNEL_GROUP}" -o json 2>/dev/null \
+ | jq -r '.[] | select(.enabled == true) | select(.raw_id | startswith("'"${VERSION}"'")) | .raw_id' \
+ | sort -V \
+ | tail -n1
)
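The difference between string order and version order is easy to check in isolation. The sketch below assumes a `sort` that supports `-V` (GNU coreutils does); `latest_version` is a hypothetical helper name, and the sample versions are made up for illustration.

```shell
# Hypothetical helper: reads one version per line on stdin and prints the
# highest one according to version ordering.
latest_version() {
  sort -V | tail -n 1
}

# Plain string sorting puts 4.22.9 last, because "9" sorts after "1":
#   printf '4.22.9\n4.22.10\n4.22.2\n' | sort | tail -n 1
# Version-aware sorting picks 4.22.10 instead:
#   printf '4.22.9\n4.22.10\n4.22.2\n' | latest_version
```

The jq alternative mentioned in the prompt text, `split(".") | map(tonumber)`, would likely fail on any `raw_id` with a non-numeric component such as a release-candidate suffix. How `sort -V` orders such suffixes relative to the final release is also worth verifying before relying on either approach.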
CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true)
CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')

if [[ -z "${CLUSTER_ID}" ]]; then
  log "ERROR: Cluster ${POOL_NAME} not found in OCM"
  exit 1
Do not mask OCM query failures as “cluster not found.”
At Line 125, || true suppresses real API/auth/network errors and turns them into a misleading not-found path at Line 129.
Suggested fix
-CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null || true)
+if ! CLUSTER_JSON=$(ocm list clusters --parameter search="name = '${POOL_NAME}'" --json 2>/dev/null); then
+ log "ERROR: Failed to query OCM for cluster ${POOL_NAME}"
+ exit 1
+fi
CLUSTER_ID=$(echo "${CLUSTER_JSON}" | jq -r '.items[0].id // empty')
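The prompt text asks for the failing command's stderr to be logged, while the suggested fix still discards it with `2>/dev/null`. One way to keep both is to capture stderr in a temporary file. This is a minimal sketch with a hypothetical helper name (`run_query`), not the PR's implementation.

```shell
# Hypothetical helper: runs the command given as arguments. On success it
# prints the command's stdout. On failure it reports the command line and the
# captured stderr on our stderr, then returns the command's exit code.
run_query() {
  local err_file out rc
  err_file="$(mktemp)"
  if out="$("$@" 2>"${err_file}")"; then
    rm -f "${err_file}"
    printf '%s\n' "${out}"
    return 0
  else
    rc=$?
    echo "ERROR: command failed (exit ${rc}): $*" >&2
    cat "${err_file}" >&2
    rm -f "${err_file}"
    return "${rc}"
  fi
}

# Possible use in the provision script (assumed wiring, shown as comments only):
#   if ! CLUSTER_JSON="$(run_query ocm list clusters --parameter search="name = '${POOL_NAME}'" --json)"; then
#     exit 1   # a real query failure, not "cluster not found"
#   fi
```

With this split, an empty `CLUSTER_ID` after a successful query is the only path that reports "cluster not found".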
Summary
- `rosa-pool-checkout` ref uses ConfigMap-based lease tracking with optimistic concurrency (CAS via `oc replace`) for safe concurrent claims
- `rosa-pool-checkin` ref returns clusters to the pool (best-effort; the health check recovers stale leases)
- `rosa-pool-health` ref for periodic cluster health checks and stale lease recovery
- `rosa-pool-e2e-workflow` workflow: checkout -> operator-install -> e2e -> cleanup -> checkin

This is the step registry foundation for cluster standby pools. Infrastructure setup (namespace + RBAC on app.ci, pool cluster provisioning) is tracked separately.
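The optimistic-concurrency claim can be illustrated with a short sketch. It is not the PR's implementation: it uses `oc label --resource-version` as the compare-and-swap rather than the `oc replace` named in the summary, `claim_any` is a hypothetical function, and the `available` and `claimed` label values are assumptions. Only the `rosa-pool/status` label key and `POOL_NAMESPACE` come from the snippets in this review.

```shell
# Hypothetical helper: tries to claim one available pool ConfigMap and prints
# its name. Returns non-zero when nothing could be claimed.
claim_any() {
  local candidates name rv
  # List available entries together with the resourceVersion seen at read time.
  candidates="$(oc get configmap -n "${POOL_NAMESPACE}" \
    -l rosa-pool/status=available \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.resourceVersion}{"\n"}{end}')" || return 1
  while read -r name rv; do
    [ -n "${name}" ] || continue
    # The update is rejected if the object changed after the listing, so two
    # jobs racing for the same entry cannot both succeed.
    if oc label configmap "${name}" -n "${POOL_NAMESPACE}" \
         rosa-pool/status=claimed --overwrite \
         --resource-version="${rv}" </dev/null >/dev/null 2>&1; then
      echo "${name}"
      return 0
    fi
  done <<EOF
${candidates}
EOF
  return 1
}
```

A wait loop, as the checkout ref describes, would call a function like this repeatedly with a sleep between attempts until it succeeds or a timeout expires.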
Jira: https://redhat.atlassian.net/browse/ROSAENG-59268
Test plan
- … (`yaml.safe_load`)
- … (`bash -n`)
- `make registry-metadata` succeeds (metadata files auto-generated)

A detailed high-level summary could not be generated for this review. Here is an overview derived from the analyzed file changes: