Add a monitortest for cluster region/zone/instance type autodl data#31305
Add a monitortest for cluster region/zone/instance type autodl data#31305dgoodwin wants to merge 2 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughNew clusterinstancetypes monitor test collects cloud instance types and machine roles for AWS, Azure, and GCP. It queries cluster Infrastructure for platform and region, lists machines in openshift-machine-api, classifies roles via labels, extracts InstanceType from provider specs, deduplicates, and exports as AutoDL dataset and JSON. Registered in universal monitor registry. ChangesCluster Instance Types Monitor Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
pkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go (1)
92-97: ⚡ Quick winImplement or remove the placeholder role test.
TestGetMachineRolehas no assertions, so it always passes and currently provides no coverage for role-classification behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go` around lines 92 - 97, The TestGetMachineRole function is currently a placeholder with no test logic or assertions, so it provides no actual test coverage despite always passing. Either implement the test by adding actual test cases with assertions that verify the role-classification behavior indirectly through the pure functions classifyAWSZone and buildAutodlRows as noted in the comments, or remove the TestGetMachineRole function entirely if the role-classification logic is sufficiently covered by tests of those underlying pure functions elsewhere in the test suite.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/testframework/clusterinstancetypes/monitortest.go`:
- Around line 189-198: The getRegionFromInfrastructure function currently
handles AWS and GCP platform types in its switch statement but does not include
a case for Azure, even though Azure is explicitly supported as a platform. Add a
case for AzurePlatformType in the switch statement that checks if the Azure
platform status is not nil and returns the Region field from it, following the
same pattern used for AWS and GCP cases.
- Around line 217-249: The extractInstanceInfo function is silently ignoring
json.Unmarshal errors by only processing the specification when err == nil,
which causes invalid or missing data to be persisted. Modify the
extractInstanceInfo function signature to return both instanceInfo and an error.
For each platform case (aws, azure, gcp), when json.Unmarshal fails, return the
zero-value instanceInfo along with the error instead of silently proceeding.
Update all call sites of extractInstanceInfo throughout the file to handle the
returned error appropriately.
- Around line 143-149: The fallback path for region resolution silently ignores
errors from kubernetes.NewForConfig and Nodes().List operations by using if err
== nil checks, which violates the coding guideline to never ignore error
returns. Refactor this block to explicitly log or handle the errors from both
the kubernetes.NewForConfig call and the kubeClient.CoreV1().Nodes().List call
instead of discarding them. This will make region-resolution failures visible
and easier to diagnose.
- Around line 133-137: The code at line 133 dereferences
infra.Status.PlatformStatus.Type without first verifying that PlatformStatus is
not nil, which can cause a panic if the status is unset. Add a nil check for
infra.Status.PlatformStatus before attempting to access its Type field, and
return early (such as returning nil, nil) if the check fails, similar to the
existing platform type validation logic that follows.
- Around line 61-65: The collect method in the monitortest.go file currently
returns nil for the error when w.collect(ctx) fails, which masks the failure and
reports success instead of surfacing the error. Change the return statement in
the error handling block to return the actual err value as the third return
value instead of nil, so that collection failures are properly propagated to the
caller as expected by the monitor test contract.
- Around line 83-86: The early return when `len(rows) == 0` skips the JSON
snapshot output, losing valid collected metadata like platform and region
information. Remove the early return check after calling buildAutodlRows in the
code block containing the length check, allowing the JSON snapshot to be written
even when there are no AutoDL rows. This ensures that all collected metadata is
preserved in the output regardless of whether any instances were found.
---
Nitpick comments:
In `@pkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go`:
- Around line 92-97: The TestGetMachineRole function is currently a placeholder
with no test logic or assertions, so it provides no actual test coverage despite
always passing. Either implement the test by adding actual test cases with
assertions that verify the role-classification behavior indirectly through the
pure functions classifyAWSZone and buildAutodlRows as noted in the comments, or
remove the TestGetMachineRole function entirely if the role-classification logic
is sufficiently covered by tests of those underlying pure functions elsewhere in
the test suite.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b48dd313-3cf3-4641-8fc8-302fce9ad76e
📒 Files selected for processing (3)
pkg/defaultmonitortests/types.gopkg/monitortests/testframework/clusterinstancetypes/monitortest.gopkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go
3e3c214 to
009b20d
Compare
|
Scheduling required tests: |
|
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 009b20d
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go`:
- Around line 8-21: These tests are tautological as they construct literal rows
slices and then assert those same literals without exercising the actual dedup,
sort, or role extraction logic from the collect function. Refactor both test
cases to instead call the real collect function (or a factored pure helper used
by collect) with duplicate and mixed input data, and then assert the
deduplicated and properly sorted output matches the expected result. This
ensures the tests actually verify the dedup and sorting behavior rather than
just testing hard-coded literals.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6e0a5fd3-f5dc-49a7-b0e5-f05a532ea139
📒 Files selected for processing (2)
pkg/monitortests/testframework/clusterinstancetypes/monitortest.gopkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go
75a4890 to
554daf9
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go`:
- Line 16: The json.Marshal call on line 16 is discarding the error return value
using a blank identifier, which can produce invalid fixtures and mask test
failures. Instead of using the blank identifier, capture the error return value
from json.Marshal into a named variable and add proper error handling to check
if the error is non-nil. If an error occurs, handle it appropriately for the
test context (such as using a fatal assertion to fail the test with the error
details).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 67316fb2-8d12-4a90-a624-fdfbf4830b1b
📒 Files selected for processing (2)
pkg/monitortests/testframework/clusterinstancetypes/monitortest.gopkg/monitortests/testframework/clusterinstancetypes/monitortest_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/monitortests/testframework/clusterinstancetypes/monitortest.go
554daf9 to
8ebc142
Compare
|
Scheduling required tests: |
|
/retest |
|
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 8ebc142
New tests seen in this PR at sha: 8ebc142
|
|
/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-azure-ovn |
|
@dgoodwin: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a06cead0-6a70-11f1-9349-b8dc57b0112b-0 |
|
/test unit |
|
Scheduling required tests: |
|
@dgoodwin: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Job Failure Risk Analysis for sha: 57eb894
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 57eb894
|
tthvo
left a comment
There was a problem hiding this comment.
I just have a few questions, but this is really cool 🤘✨
| machines, err := machineClientSet.MachineV1beta1().Machines("openshift-machine-api").List(ctx, metav1.ListOptions{}) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to list machines: %w", err) | ||
| } |
There was a problem hiding this comment.
I think we can get all the machine details from node labels. For example, we can extract the arch, instance type and region for all 3 aws, azure and gcp platforms👇:
kubernetes.io/arch: <arch>
node.kubernetes.io/instance-type: <type>
topology.kubernetes.io/region: <region>For control plane nodes, they will have these labels:
node-role.kubernetes.io/control-plane: ""
node-role.kubernetes.io/master: ""For worker nodes, they will have:
node-role.kubernetes.io/worker: ""There was a problem hiding this comment.
My thought is that we likely need to switch to check ClusterAPI machines in a near future. So if we can inspect nodes, it'd ease our maintenance. WDYT?
| Platform string `json:"platform"` | ||
| Region string `json:"region"` | ||
| Role string `json:"role"` | ||
| InstanceType string `json:"instance_type"` |
There was a problem hiding this comment.
On AWS, we also support local and wavelength zones via edge compute pool.
Let's also collect zone details (for AWS only?). We can get it by inspecting node label topology.kubernetes.io/zone, for example, in this run:
topology.kubernetes.io/zone: us-west-2-pdx-1a
Goal is to begin driving our support matrix based on actual data of what was tested. These autodl artifacts are automatically ingested into bigquery tables giving us an easy path to query what was successfully tested. (joining in details for the job run and limiting to only job runs that were fully passing)
We will soon explore porting the qe job which churns through the requested regions/instance types slowly over time with an equivalent on the main prow system.
With this we can then query for where we successfully validated openshift was working and generate whatever artifacts and customer docs needed.
Example output:
aws: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/31305/pull-ci-openshift-origin-main-e2e-aws-ovn-fips/2067008173180456960/artifacts/e2e-aws-ovn-fips/openshift-e2e-test/artifacts/junit/cluster-instance-types_20260616-233046-autodl.json
gcp: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/31305/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/2067008173406949376/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/cluster-instance-types_20260616-232245-autodl.json
azure: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-31305-ci-5.0-e2e-azure-ovn/2067295976640483328/artifacts/e2e-azure-ovn/openshift-e2e-test/artifacts/junit/cluster-instance-types_20260617-183416-autodl.json
Summary by CodeRabbit
Release Notes
New Features
Tests