feat: Log every gRPC call on server and client paths in Flow #2276

Open
kunzhao-nv wants to merge 3 commits into NVIDIA:main from kunzhao-nv:feat/flow-grpc-logging

@kunzhao-nv kunzhao-nv commented Jun 5, 2026

Description

  • Add flow/internal/common/grpclog/ with unary interceptors and wire them into the Flow gRPC server and the NICo (Core) gRPC client. Each RPC emits one structured log line with grpc.service, grpc.method, grpc.code, grpc.duration_ms, and grpc.peer or grpc.target. The log level is derived from the status code. Payloads are not logged because some requests carry BMC credentials.
  • Raise force-power-on Stage 4 timeouts to 20m stage / 5m action across all 12 entries (was 4m / 3m, too tight for PowerShelf and NVSwitch).

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Adds a shared grpclog package and wires unary interceptors into both the
Flow gRPC server and the Core (NICo) gRPC client. Each RPC now emits one
structured log line with method, duration_ms, status code, and peer or
target. Payloads are intentionally not logged: AddExpectedMachine /
AddExpectedSwitch carry BMC credentials.
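
The per-RPC log line described above can be sketched with the standard library alone. This is not the code from the grpclog package: `splitFullMethod`, `rpcAttrs`, `logCall`, the service name, and the peer address are all hypothetical, and a plain `func() error` stands in for the handler that a real `grpc.UnaryServerInterceptor` would wrap. The real interceptor would presumably read the code with `status.Code(err)`; the sketch uses a placeholder.

```go
package main

import (
	"errors"
	"log/slog"
	"os"
	"strings"
	"time"
)

// splitFullMethod splits a gRPC full method name of the form
// "/package.Service/Method" into its service and method parts.
func splitFullMethod(full string) (service, method string) {
	full = strings.TrimPrefix(full, "/")
	i := strings.LastIndex(full, "/")
	if i < 0 {
		return "unknown", full
	}
	return full[:i], full[i+1:]
}

// rpcAttrs builds the structured fields for one RPC. It takes no request or
// response value, so payloads cannot leak into the log line.
func rpcAttrs(fullMethod, code, peer string, d time.Duration) []any {
	service, method := splitFullMethod(fullMethod)
	return []any{
		"grpc.service", service,
		"grpc.method", method,
		"grpc.code", code,
		"grpc.duration_ms", d.Milliseconds(),
		"grpc.peer", peer,
	}
}

// logCall times call and emits exactly one log line for it.
// Assumption: any error is reported as "Internal"; a real interceptor
// would derive the code from the gRPC status carried by the error.
func logCall(logger *slog.Logger, fullMethod, peer string, call func() error) error {
	start := time.Now()
	err := call()
	code := "OK"
	if err != nil {
		code = "Internal"
	}
	logger.Info("grpc call", rpcAttrs(fullMethod, code, peer, time.Since(start))...)
	return err
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	// Hypothetical service name and peer address, for illustration only.
	_ = logCall(logger, "/example.v1.Example/AddExpectedMachine", "127.0.0.1:50000",
		func() error { return nil })
	_ = logCall(logger, "/example.v1.Example/AddExpectedSwitch", "127.0.0.1:50000",
		func() error { return errors.New("backend failure") })
}
```

Keeping the field builder free of any request argument is one way to make the "no payloads" rule structural rather than a matter of discipline at each call site.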

Log level is derived from the status code: OK and expected business
outcomes (NotFound, AlreadyExists, ...) at Info; transient outcomes
(DeadlineExceeded, Aborted) at Warn; infrastructure failures
(Unavailable, Internal) at Error. Same classifier on both sides so a
given code reads identically wherever it surfaces.
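
A minimal sketch of such a classifier follows, assuming codes are passed by their canonical names (the strings that grpc-go's `codes.Code` prints). `levelFor` is a hypothetical name. Only the codes named above are mapped; treating every other code as Error is an assumption of this sketch, not something the commit message states.

```go
package main

import (
	"fmt"
	"log/slog"
)

// levelFor maps a gRPC status code name to a log level.
// A single function shared by server and client keeps the mapping identical
// on both sides.
func levelFor(code string) slog.Level {
	switch code {
	case "OK", "NotFound", "AlreadyExists":
		// Success and expected business outcomes.
		return slog.LevelInfo
	case "DeadlineExceeded", "Aborted":
		// Transient outcomes.
		return slog.LevelWarn
	case "Unavailable", "Internal":
		// Infrastructure failures.
		return slog.LevelError
	default:
		// Assumption: codes not listed in the description are logged at Error.
		return slog.LevelError
	}
}

func main() {
	for _, code := range []string{"OK", "NotFound", "Aborted", "Unavailable"} {
		fmt.Printf("%-12s -> %s\n", code, levelFor(code))
	}
}
```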

Closes the observability gap in step 4 of the documented debug flow
(REST -> workflow -> activity -> Flow -> Core), which previously had
zero per-RPC logs in either direction.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>

Bump the per-component verification stage from 4m / 3m (stage / action)
to 20m / 5m. The previous values left the inner verify-power-status
action with only 3m to converge, which is tight for PowerShelves and
NVSwitches after a forced power transition.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
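
To illustrate why the stage and action limits interact, here is a small sketch. It assumes an inner action can never outlive its enclosing stage, so its usable budget is the action limit capped by the time left in the stage. That capping rule, the `stageTimeouts` type, `actionBudget`, and the two-minute elapsed time are assumptions for illustration and are not taken from the Flow configuration.

```go
package main

import (
	"fmt"
	"time"
)

// stageTimeouts is a hypothetical pair of limits for one verification entry.
type stageTimeouts struct {
	Stage  time.Duration // budget for the whole stage
	Action time.Duration // budget for one inner action attempt
}

// actionBudget returns how long an action that starts `elapsed` into the
// stage can run: its own limit, capped by what is left of the stage.
func actionBudget(t stageTimeouts, elapsed time.Duration) time.Duration {
	remaining := t.Stage - elapsed
	if remaining <= 0 {
		return 0
	}
	if t.Action < remaining {
		return t.Action
	}
	return remaining
}

func main() {
	before := stageTimeouts{Stage: 4 * time.Minute, Action: 3 * time.Minute}
	after := stageTimeouts{Stage: 20 * time.Minute, Action: 5 * time.Minute}

	// Illustrative only: suppose the action starts two minutes into the stage.
	elapsed := 2 * time.Minute
	fmt.Println("before:", actionBudget(before, elapsed))
	fmt.Println("after: ", actionBudget(after, elapsed))
}
```

Under that assumption, the old 4m stage leaves little slack around a 3m action, while the 20m stage leaves room for the full 5m action even after a delayed start.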
@kunzhao-nv kunzhao-nv requested a review from a team as a code owner June 5, 2026 22:58
coderabbitai Bot commented Jun 5, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@kunzhao-nv kunzhao-nv changed the title from "feat: log every gRPC call on server and client paths in Flow" to "feat: Log every gRPC call on server and client paths in Flow" Jun 5, 2026
@kunzhao-nv kunzhao-nv requested a review from aswaroop-nv June 5, 2026 22:59
github-actions Bot commented Jun 5, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-06-05 23:01:20 UTC | Commit: 4c30d23

@kunzhao-nv kunzhao-nv requested a review from jw-nvidia June 5, 2026 23:03
github-actions Bot commented Jun 5, 2026

🔍 Container Scan Summary

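| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| nico-flow | 116 | 13 | 50 | 41 | 4 | 8 |
| nico-nsm | 133 | 11 | 45 | 66 | 11 | 0 |
| nico-psm | 118 | 13 | 52 | 41 | 4 | 8 |
| nico-rest-api | 182 | 16 | 84 | 67 | 7 | 8 |
| nico-rest-cert-manager | 95 | 5 | 47 | 32 | 3 | 8 |
| nico-rest-db | 116 | 13 | 50 | 41 | 4 | 8 |
| nico-rest-site-agent | 115 | 13 | 50 | 41 | 3 | 8 |
| nico-rest-site-manager | 102 | 6 | 48 | 37 | 3 | 8 |
| nico-rest-workflow | 118 | 13 | 52 | 41 | 4 | 8 |
| TOTAL | 1095 | 103 | 478 | 407 | 43 | 64 |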

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

… Core

Each inventory sync iteration now logs one summary line with
compute / nvswitches / powershelves counts received from Core's
GetMachines / GetAllExpectedSwitchesLinked / GetAllExpectedPowerShelvesLinked
calls. Lets operators sanity-check "did Core return anything for this
type?" without grepping per-component logs.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
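
The per-iteration summary line could look roughly like the sketch below. `inventoryCounts`, its `attrs` method, the message text, and the sample numbers are hypothetical; only the three field names (compute, nvswitches, powershelves) come from the commit message.

```go
package main

import (
	"log/slog"
	"os"
)

// inventoryCounts holds how many items Core returned for each component type
// in one sync iteration. The type is a hypothetical stand-in.
type inventoryCounts struct {
	Compute      int
	NVSwitches   int
	PowerShelves int
}

// attrs renders the counts as slog key/value pairs so that one call emits
// one summary line.
func (c inventoryCounts) attrs() []any {
	return []any{
		"compute", c.Compute,
		"nvswitches", c.NVSwitches,
		"powershelves", c.PowerShelves,
	}
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))

	// Made-up counts for illustration; real values would come from the
	// responses to the three Core calls.
	counts := inventoryCounts{Compute: 2, NVSwitches: 1, PowerShelves: 0}
	logger.Info("inventory sync summary", counts.attrs()...)
}
```

A zero in any field is then visible in a single line, which is the "did Core return anything for this type?" check the commit describes.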