
feat: Add ComponentStatus model in Flow — derive, persist, and surface per-component readiness#2264

Open
kunzhao-nv wants to merge 8 commits into NVIDIA:main from kunzhao-nv:feat/flow-component-status

Conversation

@kunzhao-nv
Contributor

@kunzhao-nv kunzhao-nv commented Jun 5, 2026

Description

Adds ComponentStatus, Flow's unified view of a component's operability, derived from Core's per-type state machine (ManagedHostState for compute, SwitchControllerState for switches, PowerShelfControllerState for power shelves). Inventory sync recomputes it every cycle and stores it in a jsonb column on the component table, and the gRPC API surfaces it on every Component response.
ComponentStatus has three fields:

phase              UNKNOWN | INITIALIZING | READY | IN_USE | ERROR | DELETING
reason             raw core state string, kept for diagnostics
blocked_operations subset of {POWER_CONTROL, FIRMWARE_CONTROL} disallowed in this phase
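
For illustration, the three fields could be modelled in Go roughly as follows. This is a minimal sketch, not the PR's code: the string-backed enums, the constant names, and the rule that only READY counts as ready are all assumptions. IsReady and Blocks are the helper names mentioned in the commit messages; their bodies here are guesses.

```go
package main

// Phase is the coarse lifecycle bucket. The value names follow the PR
// description; representing them as strings is an assumption of this sketch.
type Phase string

const (
	PhaseUnknown      Phase = "UNKNOWN"
	PhaseInitializing Phase = "INITIALIZING"
	PhaseReady        Phase = "READY"
	PhaseInUse        Phase = "IN_USE"
	PhaseError        Phase = "ERROR"
	PhaseDeleting     Phase = "DELETING"
)

// OperationType names an operation that a phase may disallow.
type OperationType string

const (
	OpPowerControl    OperationType = "POWER_CONTROL"
	OpFirmwareControl OperationType = "FIRMWARE_CONTROL"
)

// ComponentStatus carries the three fields listed in the description.
type ComponentStatus struct {
	Phase             Phase
	Reason            string          // raw Core state string, kept for diagnostics
	BlockedOperations []OperationType // operations disallowed in this phase
}

// IsReady reports whether the component is in the READY phase.
// Assumption: no other phase is treated as ready.
func (s ComponentStatus) IsReady() bool {
	return s.Phase == PhaseReady
}

// Blocks reports whether op appears in BlockedOperations.
func (s ComponentStatus) Blocks(op OperationType) bool {
	for _, blocked := range s.BlockedOperations {
		if blocked == op {
			return true
		}
	}
	return false
}
```

A caller would then check `status.Blocks(OpPowerControl)` before issuing a power operation.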

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@kunzhao-nv kunzhao-nv requested a review from a team as a code owner June 5, 2026 20:14
@copy-pr-bot

copy-pr-bot Bot commented Jun 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kunzhao-nv kunzhao-nv requested a review from jw-nvidia June 5, 2026 20:15
@coderabbitai
Contributor

coderabbitai Bot commented Jun 5, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ffa0768e-9bcb-4c0c-9b46-d81a1f3c0cff

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@kunzhao-nv kunzhao-nv force-pushed the feat/flow-component-status branch 2 times, most recently from 156e122 to 0e415ad Compare June 5, 2026 21:01
Introduces the Flow-side representation of a component's operability,
derived from core's per-component state machine:

- Phase: coarse lifecycle bucket shared by compute, nvswitch, and
  power shelf (Unknown / Initializing / Ready / InUse / Error /
  Deleting).
- ComponentStatus: Phase + Reason + BlockedOperations, with
  IsReady / Blocks helpers.
- MapComponentStatus: per-type translation from core's raw
  controller_state to ComponentStatus.

Compute uses ManagedHostState's path-form Display ("Ready",
"Assigned/...", etc.); switch and power shelf use the serde-tagged
JSON form ({"state":"ready"}). Unrecognized inputs map to
PhaseUnknown so downstream gating fails closed.

No callers yet; integration follows in subsequent commits.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
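
A rough sketch of the per-type translation described in the commit message above: compute states arrive in path form, switch and power-shelf states as tagged JSON, and anything unrecognized falls back to Unknown. The ComponentType enum, the helper names, the "Assigned" to InUse mapping, and the choice to block both operations only for Unknown are assumptions made for illustration; the actual mapping table is not shown in this PR page.

```go
package main

import (
	"encoding/json"
	"strings"
)

type Phase string

const (
	PhaseUnknown Phase = "UNKNOWN"
	PhaseReady   Phase = "READY"
	PhaseInUse   Phase = "IN_USE"
)

type ComponentStatus struct {
	Phase             Phase
	Reason            string
	BlockedOperations []string
}

// ComponentType is a hypothetical discriminator for the three component kinds.
type ComponentType int

const (
	Compute ComponentType = iota
	Switch
	PowerShelf
)

// MapComponentStatus translates a raw controller_state string into a status.
// Unrecognized input maps to PhaseUnknown so that gating fails closed.
func MapComponentStatus(t ComponentType, raw string) ComponentStatus {
	phase := PhaseUnknown
	switch t {
	case Compute:
		phase = mapPathForm(raw)
	case Switch, PowerShelf:
		phase = mapTaggedJSON(raw)
	}
	st := ComponentStatus{Phase: phase, Reason: raw}
	if phase == PhaseUnknown {
		// Assumption: an unknown phase blocks every gated operation.
		st.BlockedOperations = []string{"POWER_CONTROL", "FIRMWARE_CONTROL"}
	}
	return st
}

// mapPathForm handles the path-style form, e.g. "Ready" or "Assigned/...".
func mapPathForm(raw string) Phase {
	head, _, _ := strings.Cut(raw, "/")
	switch head {
	case "Ready":
		return PhaseReady
	case "Assigned":
		return PhaseInUse // assumed mapping, not confirmed by the PR text
	}
	return PhaseUnknown
}

// mapTaggedJSON handles the serde-tagged form, e.g. {"state":"ready"}.
func mapTaggedJSON(raw string) Phase {
	var tagged struct {
		State string `json:"state"`
	}
	if err := json.Unmarshal([]byte(raw), &tagged); err != nil {
		return PhaseUnknown
	}
	if tagged.State == "ready" {
		return PhaseReady
	}
	return PhaseUnknown
}
```

Keeping the raw string in Reason means a status that lands in Unknown can still be diagnosed from the stored value.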
Adds the wire types for the new component status concept:

- Phase enum mirroring pkg/types.Phase (Unknown / Initializing /
  Ready / InUse / Error / Deleting).
- ComponentStatus message with phase, reason, and blocked_operations.
- Component.status field (= 9) carrying the live status of each
  component returned by Flow's inventory APIs.

Regenerates flow.pb.go, flow_grpc.pb.go, and docs/grpc-api.{md,html}.
Server code does not populate the field yet — wired up in subsequent
commits.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
Adds a jsonb `status` column to the component table to persist Flow's per-component ComponentStatus (phase, reason, blocked_operations) computed by the inventory loop from core's controller_state. Single jsonb so the shape can evolve without further DDL.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
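
To make the "shape can evolve without further DDL" point concrete, here is a small sketch of how a status might be serialized into and read back from the jsonb document. The JSON key names and the EncodeStatus / DecodeStatus helpers are hypothetical. The relevant behaviour is that Go's encoding/json ignores keys it does not know, so a reader built against an older struct can still decode a document that gained a field later.

```go
package main

import "encoding/json"

// ComponentStatus as it might be stored in the jsonb column.
// The key names are assumptions for this sketch.
type ComponentStatus struct {
	Phase             string   `json:"phase"`
	Reason            string   `json:"reason"`
	BlockedOperations []string `json:"blocked_operations,omitempty"`
}

// EncodeStatus renders the status as the JSON document to store.
func EncodeStatus(s ComponentStatus) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// DecodeStatus parses a stored document. Unknown keys are ignored,
// which is what lets the stored shape grow without a schema change.
func DecodeStatus(doc string) (ComponentStatus, error) {
	var s ComponentStatus
	err := json.Unmarshal([]byte(doc), &s)
	return s, err
}
```

With these tags, `EncodeStatus(ComponentStatus{Phase: "READY", Reason: "Ready"})` yields `{"phase":"READY","reason":"Ready"}`.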
Wires the per-type ComponentStatus mapper into the inventory loop, which
recomputes each status on every cycle and persists it to the new
component.status column.

- nicoapi: FindSwitchControllerStates / FindPowerShelfControllerStates
  expose the raw controller_state string Core returns for switches and
  power shelves (compute already carried it on MachineDetail.State).
  Mock helpers (SetSwitchControllerState /
  SetPowerShelfControllerState) follow the existing rack-id pattern.
- model.Component: SetStatusByComponentID writes status by external_id.
- inventorysync: syncMachineStatuses uses the pre-fetched MachineDetail
  map (no extra RPC); syncSwitchStatuses and syncPowershelfStatuses each
  add one nicoapi round-trip. persistComponentStatuses centralises the
  delta-detect-and-write pattern.
- pkg/types: ComponentStatus.Equal lets the delta check avoid pointless
  writes (struct contains a slice and is therefore not == comparable).

Status is only written when it actually changes.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
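
The delta-detect-and-write pattern above might look something like the following. This is a sketch under assumptions: persistChanged and its callback signature are hypothetical stand-ins for persistComponentStatuses, and Equal here compares BlockedOperations in order, which may differ from the PR's definition. It does show why a hand-written Equal is needed: a struct with a slice field cannot be compared with ==.

```go
package main

type ComponentStatus struct {
	Phase             string
	Reason            string
	BlockedOperations []string
}

// Equal compares two statuses field by field. The slice comparison is
// order-sensitive in this sketch (an assumption).
func (s ComponentStatus) Equal(o ComponentStatus) bool {
	if s.Phase != o.Phase || s.Reason != o.Reason {
		return false
	}
	if len(s.BlockedOperations) != len(o.BlockedOperations) {
		return false
	}
	for i := range s.BlockedOperations {
		if s.BlockedOperations[i] != o.BlockedOperations[i] {
			return false
		}
	}
	return true
}

// persistChanged calls write only for components whose freshly computed
// status differs from the stored one (or that have no stored status yet).
// It returns the number of writes performed.
func persistChanged(
	stored, computed map[string]ComponentStatus,
	write func(externalID string, s ComponentStatus) error,
) (int, error) {
	written := 0
	for id, next := range computed {
		if prev, ok := stored[id]; ok && prev.Equal(next) {
			continue // unchanged: skip the write
		}
		if err := write(id, next); err != nil {
			return written, err
		}
		written++
	}
	return written, nil
}
```

In steady state, when nothing has changed since the previous cycle, this performs no writes at all.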
Carry the persisted ComponentStatus through the DB → domain → proto
chain so ListComponents / GetComponent callers see the Flow-derived
view of operability.

- domain Component (pkg/inventoryobjects/component) gains an optional
  Status pointer; nil means "no status computed yet".
- dao.ComponentFrom copies model.Component.Status through.
- protobuf.ComponentTo populates the new pb.Component.status field via
  the new ComponentStatusTo / PhaseTo converters.
- Added operationTypeFromTypesTo for the types.OperationType → proto
  enum mapping (kept separate from the existing
  OperationTypeToProto, which converts from taskcommon.TaskType).

No reverse converter: status is read-only over the API surface.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
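
As a sketch of the one-way domain-to-proto conversion described above, with nil meaning "no status computed yet": the PB-prefixed types below are hypothetical stand-ins for the generated protobuf types, and the signatures of ComponentStatusTo and PhaseTo are assumed, since only their names appear in the commit message.

```go
package main

// Domain-side types (stand-ins for pkg/types; shapes are assumed).
type Phase int

const (
	PhaseUnknown Phase = iota
	PhaseInitializing
	PhaseReady
	PhaseInUse
	PhaseError
	PhaseDeleting
)

type ComponentStatus struct {
	Phase  Phase
	Reason string
}

// Wire-side stand-ins for the generated protobuf types (hypothetical).
type PBPhase int32

const (
	PBPhaseUnknown PBPhase = iota
	PBPhaseInitializing
	PBPhaseReady
	PBPhaseInUse
	PBPhaseError
	PBPhaseDeleting
)

type PBComponentStatus struct {
	Phase  PBPhase
	Reason string
}

// PhaseTo maps a domain phase to its wire value. Values outside the
// known set fall back to unknown rather than being passed through.
func PhaseTo(p Phase) PBPhase {
	switch p {
	case PhaseInitializing:
		return PBPhaseInitializing
	case PhaseReady:
		return PBPhaseReady
	case PhaseInUse:
		return PBPhaseInUse
	case PhaseError:
		return PBPhaseError
	case PhaseDeleting:
		return PBPhaseDeleting
	default:
		return PBPhaseUnknown
	}
}

// ComponentStatusTo converts an optional domain status to its wire form.
// A nil input stays nil so the API field is simply left unset.
func ComponentStatusTo(s *ComponentStatus) *PBComponentStatus {
	if s == nil {
		return nil
	}
	return &PBComponentStatus{Phase: PhaseTo(s.Phase), Reason: s.Reason}
}
```

An explicit switch, rather than a numeric cast, keeps the two enums free to diverge in numbering.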
ReadinessGate holds mutating component operations until every target
component's persisted ComponentStatus permits them. Inventory sync
already writes the status, so the gate reads it from the component
table instead of polling Core's state-machine endpoints on every
iteration.

Layout in this commit:
- gate.go defines the Gate / StatusReader interfaces and DBGate, the
  production polling loop. Permissive on missing status (fail-open on
  transient gaps); the rack-scoped form resolves rack -> host
  components.
- db_reader.go implements StatusReader against bun.IDB by reading the
  component table.
- gate_test.go drives DBGate with an in-memory fake reader: covers
  empty / nil-gate short-circuits, ready / missing / blocking /
  partial-blocking / op-scoped, transition-mid-poll, dedup, context
  cancellation, and the rack delegation path.

Production call sites are not migrated in this commit.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
Call sites in compute/nvswitch/powershelf managers have Core
(external) component / rack IDs as []string — that's what flows
through the Temporal task target. Rekey the gate to match so future
callers don't have to convert at every call site.

- StatusReader / Gate methods now take []string. DBReader joins by
  component.external_id (string) and component.rack_id (uuid parsed
  from string).
- MemReader is a new exported in-memory StatusReader for test packages
  outside readiness — mirrors nicoapi.NewMockClient so manager tests
  can build a realistic gate without spinning up a DB. The
  package-internal gate tests keep their own fakeReader so they can
  count poll iterations.
- Test IDs renamed away from words like "ready"/"blocked" to avoid
  matching substrings of the gate's own log / error messages.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
Mirrors the convention already used by rack, nvldomain, and task_schedule: a single updated_at column stamped by the shared set_updated_at trigger on every UPDATE. This gives callers one freshness signal for the row regardless of which field changed (power_state, firmware_version, status, description, ...), rather than per-field timestamps.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
@kunzhao-nv kunzhao-nv force-pushed the feat/flow-component-status branch from 0e415ad to be6abe4 Compare June 5, 2026 21:12
@github-actions

github-actions Bot commented Jun 5, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-05 21:15:24 UTC | Commit: be6abe4

@github-actions

github-actions Bot commented Jun 5, 2026

🔍 Container Scan Summary

Service                 Total  Critical  High  Medium  Low  Other
nico-flow                 116        13    50      41    4      8
nico-nsm                  133        11    45      66   11      0
nico-psm                  118        13    52      41    4      8
nico-rest-api             182        16    84      67    7      8
nico-rest-cert-manager     95         5    47      32    3      8
nico-rest-db              116        13    50      41    4      8
nico-rest-site-agent      115        13    50      41    3      8
nico-rest-site-manager    102         6    48      37    3      8
nico-rest-workflow        118        13    52      41    4      8
TOTAL                    1095       103   478     407   43     64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@kunzhao-nv kunzhao-nv changed the title feat: Add ComponentStatus model — derive, persist, and surface per-component readiness feat: Add ComponentStatus model in Flow — derive, persist, and surface per-component readiness Jun 5, 2026
