SynaXG plugin dev scheme review #626
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: einsteinXue. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hi @einsteinXue. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
@einsteinXue: The following tests failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@bn222 some tests failed.
// Firmware image path/package path (e.g. /quay.io/openshift/firmware/dpu:v1.0.8)
// +kubebuilder:validation:Required
FirmwarePath string `json:"firmwarePath,omitempty"`
Does having your firmware inside a container (using it as a file system) work for you?
Yes, thanks for the POC you provided, it works now. We already verified it in SynaXG-operator v1.2.
We need a default firmware version hard-coded in the code. Since we want to ship the whole stack, we need full control of all components, including the firmware. Otherwise we can't ensure that it works e2e.
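A minimal sketch of what that could look like, assuming a hypothetical defaultFirmwareVersion constant baked into the operator and a trimmed-down DpuFirmwareSpec; all names and the version value here are illustrative, not part of this PR:

```go
package sketch

// Trimmed-down stand-in for the DpuFirmwareSpec in this PR; only the field
// needed for the sketch is shown.
type DpuFirmwareSpec struct {
	TargetVersion string `json:"targetVersion,omitempty"`
}

// Hypothetical default pinned at build time so the shipped stack always has
// a known-good firmware, even when the user specifies nothing.
const defaultFirmwareVersion = "v1.0.8"

// resolveFirmwareVersion falls back to the shipped default when the user
// does not pin a version explicitly.
func resolveFirmwareVersion(spec DpuFirmwareSpec) string {
	if spec.TargetVersion != "" {
		return spec.TargetVersion
	}
	return defaultFirmwareVersion
}
```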
// Detailed configuration for firmware upgrade, required when Operation is upgrade type
// +kubebuilder:validation:RequiredWhen=Operation,FirmwareUpgrade
Firmware DpuFirmwareSpec `json:"firmware,omitempty"`
Configuration for fw upgrade? That's new to me. Please explain the situation where that would be useful.
Take the SynaXG card as an example: the card contains an internal SDK. To upgrade this SDK, we want to avoid manual intervention and instead automate the process via an Operator.
It is important to note that the SynaXG card does not run Kubernetes or OpenShift (OCP); instead, it runs OAM (Operations, Administration, and Maintenance) software, which communicates with the SynaXG VSP through gRPC. The SynaXG VSP downloads the SDK image and transmits it to the SynaXG card via gRPC, where the local OAM software then executes the SDK installation.
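A rough sketch of that hand-off from the VSP side, assuming a hypothetical client-streaming RPC for pushing the SDK image to the OAM server on the card; the stream and message types below are illustrative stand-ins, not the actual SynaXG proto:

```go
package sketch

import (
	"io"
	"os"
)

// Illustrative stand-ins for generated gRPC client-stream types.
type SdkChunk struct{ Data []byte }

type UploadStream interface {
	Send(chunk *SdkChunk) error
	CloseAndRecv() error
}

// uploadSdk streams a locally downloaded SDK image to the OAM software on
// the card in fixed-size chunks; the OAM side then runs the installation.
func uploadSdk(stream UploadStream, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, 1<<20) // 1 MiB per chunk
	for {
		n, readErr := f.Read(buf)
		if n > 0 {
			if err := stream.Send(&SdkChunk{Data: buf[:n]}); err != nil {
				return err
			}
		}
		if readErr == io.EOF {
			return stream.CloseAndRecv() // wait for the card to acknowledge
		}
		if readErr != nil {
			return readErr
		}
	}
}
```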
The host runs OCP, and that means the host needs to be compatible with what's on the card, even if custom things run on the card. Once the host is implicated in any way, there is at least some level of compatibility that needs to be taken into account.
Are we talking about custom firmware? Where is the firmware released?
> The host runs OCP, and that means the host needs to be compatible with what's on the card, even if custom things run on the card. Once the host is implicated in any way, there is at least some level of compatibility that needs to be taken into account.

The custom software on the card is just a gRPC server; the VSP has a gRPC client connected to that server. This means the host side and the card side are compatible.
Could this implementation potentially deviate from the intended architecture of the dpu-operator?
> Are we talking about custom firmware? Where is the firmware released?

Yes. We released the firmware as a Docker image here:
https://quay.io/repository/synaxgcom/sdk-img?tab=tags
If we are "supporting SynaXG" from the host side, it means we need to know what firmware is on the card (not what it does, only that it's a specific version). Can we extend the API to report the firmware version as a string?
Could you elaborate on the specific concerns you have regarding this?
Do you mean we shouldn't have three fields in DpuFirmwareSpec, and that just a firmware version is enough?
If we simply allow an opaque blob to configure the card, we don't have any way to have test coverage for this. It opens the door to programming any firmware and any configuration. At a minimum, we should have a check somewhere that ties down what is allowed in OCP. This should hold even if it's on the card, since the proper functioning of the OCP deployment depends on what's running on the card.
We want to manually control what is allowed. If you have a specific FW version (the ones you linked on quay), we need the code to check that it's a known-good firmware to use. This way, our QE knows what to test.
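A minimal sketch of such a check, assuming a hard-coded allowlist maintained in the operator; the variable, function, and versions below are illustrative:

```go
package sketch

import "fmt"

// Hypothetical allowlist of firmware versions that QE has qualified for OCP.
var allowedFirmwareVersions = map[string]bool{
	"v1.0.8": true,
	"v1.2.0": true,
}

// validateFirmwareVersion rejects any version that has not been explicitly
// qualified, so an opaque blob cannot be programmed onto the card.
func validateFirmwareVersion(version string) error {
	if !allowedFirmwareVersions[version] {
		return fmt.Errorf("firmware version %q is not on the qualified allowlist", version)
	}
	return nil
}
```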
CompletionTime *metav1.Time `json:"completionTime,omitempty"`

// Upgrade-related versions (valid only when SubOperation is FirmwareUpgrade)
OriginalVersion string `json:"originalVersion,omitempty"` // Version before upgrade
Why omitempty?
This DpuNodeOperationStatus field manages both 'Reboot' and 'Firmware Upgrade' status. Note that OriginalVersion is not required for 'Reboot' tasks.
The DPU config CR is user facing, and I'm wondering what this field gives us.
If you upgrade the firmware, this field will tell you the firmware version before the upgrade, so the user can compare it with the targetVersion.
But the name of this field is not accurate; maybe previousVersion is better?
previousVersion is indeed better. It seems to me that there is always some version running on the card. Can we report the previous version when we've detected it and remove the omitempty portion?
No problem.
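The agreed change could look roughly like this; the comment wording is illustrative and only the relevant field is shown:

```go
package sketch

// Sketch of the revised status field.
type DpuNodeOperationStatus struct {
	// PreviousVersion is the firmware version that was running before the
	// upgrade. Some version is always running on the card, so the field is
	// reported once detected and carries no omitempty.
	PreviousVersion string `json:"previousVersion"`
}
```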
-// DataProcessingUnitConfigStatus defines the observed state of DataProcessingUnitConfig.
-type DataProcessingUnitConfigStatus struct {
+// DpuNodeOperationStatus defines the observed state of DataProcessingUnitConfig.
+type DpuNodeOperationStatus struct {
Please make sure that we can use kubectl wait ... to block on firmware upgrades.
Do you mean that the update logic for the Phase field must be highly reliable? In other words, it must accurately reflect the actual real-time status of the operation?
Yes. We have the right infrastructure in place with APIs to ensure that we know when the card is up. We already ping the card from the host, and we track the state of the success of the ping. You could hook into that logic.
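One way to make kubectl wait work is to publish the upgrade outcome as a standard condition, sketched here with the apimachinery condition helpers; the condition type FirmwareUpgraded and the reason strings are illustrative names, not something defined in this PR:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setFirmwareUpgraded records the upgrade outcome as a condition that
// `kubectl wait --for=condition=FirmwareUpgraded ...` can block on.
func setFirmwareUpgraded(conditions *[]metav1.Condition, generation int64, done bool, msg string) {
	cond := metav1.Condition{
		Type:               "FirmwareUpgraded",
		Status:             metav1.ConditionFalse,
		Reason:             "UpgradeInProgress",
		Message:            msg,
		ObservedGeneration: generation,
	}
	if done {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "UpgradeSucceeded"
	}
	meta.SetStatusCondition(conditions, cond)
}
```

Hooking this into the existing ping/health tracking would let the condition flip only once the card is reachable again.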
@@ -517,6 +517,126 @@ func (x *PingResponse) GetHealthy() bool {
	return false
We're going to need 2 commits: one for the actual changes, and one for the generated code, to make review and history easier. That means one commit won't compile. I know that's painful. We've decided we favor review-ability over being able to compile each commit. Note, we should still be able to compile each PR in its entirety.
type DataProcessingUnitConfigReconciler struct {
	client.Client
	Scheme *runtime.Scheme
	vsp    *plugin.VendorPlugin
Why do we need to pass in a vsp? We are going to use dpu selectors. Please refer to https://docs.google.com/document/d/1niEIreHlMsnFF4Ol2RidvHBuTpXcRphLKGY7Ebtzo-0/edit?tab=t.0 and implement that design.
} else {
	// if labels do not match, skip processing
	log.Info("Labels are not matched, skip processing")
	return ctrl.Result{}, nil
Where is the err handling?
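A sketch of the pattern being asked for, as a fragment of Reconcile; labelsMatchNode is a hypothetical helper standing in for whatever the matching logic ends up being:

```go
// Propagate errors from the match check instead of assuming success;
// returning the error makes controller-runtime requeue with backoff.
matched, err := labelsMatchNode(dpuConfig, nodeName) // hypothetical helper
if err != nil {
	log.Error(err, "failed to evaluate DPU selector")
	return ctrl.Result{}, err
}
if !matched {
	log.Info("Labels are not matched, skip processing")
	return ctrl.Result{}, nil
}
```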
}
// If the CR is targeted at DPU rebooting
if dpuConfig.Spec.DpuManagement.Operation == "Reboot" {
else if
// If the CR is targeted at DPU rebooting
if dpuConfig.Spec.DpuManagement.Operation == "Reboot" {
	r.vsp.RebootDpu(r.pciAddr)
Keep the level of abstraction consistent as you move deeper into the call stack.
Do you mean that pciAddr shouldn't be exposed at this level, and instead should be encapsulated within the vsp client?
Yes. We don't use PCI addresses to identify cards. We use the DPU CRs (see the doc that I linked earlier) and DPU selectors.
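A sketch of that encapsulation, with the caller naming the DPU by its CR identifier and the plugin resolving device-level details internally; the interface and method shape are illustrative:

```go
package sketch

import "context"

// DpuIdentifier mirrors the plugin.DpuIdentifier already used in this PR.
type DpuIdentifier string

// Hypothetical plugin-facing interface: callers stay at the CR level of
// abstraction, and the plugin maps the identifier to a PCI address (or
// whatever the vendor needs) internally.
type DpuManager interface {
	RebootDpu(ctx context.Context, dpu DpuIdentifier) error
}
```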
// Extract nodeName and pci-address from selector
if dpuConfig.Spec.DpuSelector != nil {
	nodeName, pciAddr, err = GetNodenameAndPCIAddressFromSelector(dpuConfig.Spec.DpuSelector)
If we already do this, why then do we need to pass in the vsp in the CR?
General rule: we don't expect the user to repeat themselves. If we can derive fields from other fields, then they will only need to provide the information once.
func (r *DataProcessingUnitConfigReconciler) CheckPCIAddressExists(pciAddr string) (string, error) {
	// check /sys/bus/pci/devices/pciAddr
	cmd := exec.Command("ls", fmt.Sprintf("/sys/bus/pci/devices/%s", pciAddr))
Abstract all file system access (see tests).
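A minimal sketch of that abstraction, replacing the exec of ls with an injectable interface so tests can fake the sysfs tree; all names are illustrative:

```go
package sketch

import (
	"fmt"
	"os"
)

// FileSystem abstracts the filesystem operations the reconciler needs, so
// tests can substitute a fake instead of touching the real sysfs.
type FileSystem interface {
	Exists(path string) (bool, error)
}

// osFileSystem is the production implementation backed by the real OS.
type osFileSystem struct{}

func (osFileSystem) Exists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if os.IsNotExist(err) {
		return false, nil
	}
	return false, err
}

// checkPCIAddressExists queries sysfs through the abstraction rather than
// shelling out to ls.
func checkPCIAddressExists(fs FileSystem, pciAddr string) (bool, error) {
	return fs.Exists(fmt.Sprintf("/sys/bus/pci/devices/%s", pciAddr))
}
```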
}

func (r *DataProcessingUnitConfigReconciler) getCurrentNodeName() string {
	return os.Getenv("MY_NODE_NAME")
You're increasing cohesion between pieces of code within the patch. We don't want some code to depend on the node name and then have another piece of code depend on the node name again. Can we please use the serial number concept that we're already using elsewhere?
Will do.
BTW: currently the dpu-operator only supports one DPU per node, right?
We have not yet tested multiple DPUs per node. However, in the latest refactor, we have laid the foundation for multiple DPUs.
Without testing, and looking at the code, it's more likely that a node with two DPUs from different vendors will work than a node with two DPUs from the same vendor.
OK, I see the current status.
Testing nodes with up to four SynaXG DPUs is a potential requirement for our project.
That sounds good and most of the groundwork to support that is already here.
	return g.dsClient.SetNumVfs(context.Background(), c)
}

func (g *GrpcPlugin) RebootDpu(dmr pb.DPUManagementRequest) (*pb.DPUManagementResponse, error) {
I'm missing where you're checking the health so that you know you can complete the reboot operation.
Sorry, I did miss the health check.
Are you saying you don't know where it is in the code? Or are you saying you will adjust the patch to add it?
I mean I will adjust the patch.
}

func (g *GrpcPlugin) UpgradeFirmware(dmr pb.DPUManagementRequest) (*pb.DPUManagementResponse, error) {
	err := g.ensureConnected()
I'm missing where you're checking the health so that you know you can complete the update operation.
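A sketch of the health gate both comments ask for, reusing the Ping RPC visible in this diff; ensureHealthy and the PingRequest shape are assumptions about the plugin API, not code from the PR:

```go
// ensureHealthy gates management operations on the card answering the
// existing ping, so reboot/upgrade completion can actually be observed.
func (g *GrpcPlugin) ensureHealthy(ctx context.Context) error {
	resp, err := g.dsClient.Ping(ctx, &pb.PingRequest{}) // assumes the Ping RPC from this diff
	if err != nil {
		return fmt.Errorf("ping failed: %w", err)
	}
	if !resp.GetHealthy() {
		return fmt.Errorf("DPU is not healthy; refusing to run the operation")
	}
	return nil
}
```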
pathManager   utils.PathManager
stopRequested bool
dpListener    net.Listener
dpuMgrClient  pb2.DataProcessingUnitManagementServiceClient
dpuManagerClient.
// FIXME: Must be a unique value on the DPU that is non-changing.
func (d *SynaXGDetector) DpuPlatformIdentifier(platform Platform) (plugin.DpuIdentifier, error) {
	return plugin.DpuIdentifier("SynaXG-dpu"), nil
}
Add a test. Patch one: add the SynaXG detector (if we really need it) plus a test. Patch two: the rest.
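A minimal sketch of such a test, assuming it sits in the detector's own package so SynaXGDetector and Platform resolve; the package name and zero-value Platform are placeholders:

```go
package sketch // place alongside the detector; name illustrative

import "testing"

// Pins the identifier behavior; once the FIXME is resolved with a real
// unique, non-changing value, this test makes the change deliberate.
func TestSynaXGDetectorDpuPlatformIdentifier(t *testing.T) {
	d := &SynaXGDetector{}
	id, err := d.DpuPlatformIdentifier(Platform{})
	if err != nil {
		t.Fatalf("DpuPlatformIdentifier returned error: %v", err)
	}
	if id == "" {
		t.Fatal("expected a non-empty DPU identifier")
	}
}
```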
bn222
left a comment
General idea is ok. Make sure to align with https://docs.google.com/document/d/1niEIreHlMsnFF4Ol2RidvHBuTpXcRphLKGY7Ebtzo-0/edit?tab=t.0
Thank you so much! I will go through your comments one by one and address them accordingly.
Thank you. Please also make the PR pass most of the CI lanes.
Hi @bn222 @wizhaoredhat @thom311
I have submitted a pull request for review of the design scheme. Would you be kind enough to help assess whether the design is reasonable?
Certain implementation details — including the Dockerfile, Makefile, and SynaXG Plugin logic — have not yet been fully finalized. Please disregard these items for the time being and focus solely on the design scheme. Thank you.
First, I would like to provide some background:
The SynaXG card will not have Kubernetes or OpenShift Container Platform (OCP) installed. As such, the SynaXG VSP will run exclusively on the host side. The VSP’s core functions are twofold:
1) To perform DPU reboot by unbinding the target PCI address (a minimal sketch of the unbind mechanism follows this list).
2) To implement firmware upgrade via gRPC — specifically, a gRPC server runs on the SynaXG card itself, while the corresponding gRPC client is deployed within the VSP pod. (This gRPC is only for firmware upgrade, has nothing to do with dpu-operator)
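For (1), the usual Linux mechanism is to write the PCI address into the device driver's sysfs unbind file; a minimal sketch, with the helper name illustrative and the path being the standard sysfs layout:

```go
package sketch

import (
	"fmt"
	"os"
)

// unbindPCIDevice detaches a device from its driver by writing the PCI
// address to /sys/bus/pci/devices/<addr>/driver/unbind.
func unbindPCIDevice(pciAddr string) error {
	path := fmt.Sprintf("/sys/bus/pci/devices/%s/driver/unbind", pciAddr)
	if err := os.WriteFile(path, []byte(pciAddr), 0200); err != nil {
		return fmt.Errorf("unbinding %s: %w", pciAddr, err)
	}
	return nil
}
```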
Let me briefly introduce my idea:
A "DataProcessingUnitConfig" CRD will be created for each DPU.
Daemon pods will be deployed on all DPU nodes in the cluster, meaning "HostSideManagers" will run on every DPU node.
The "DataProcessingUnitConfigReconciler" is configured within the "HostSideManager", so each DPU node can monitor changes to the "DataProcessingUnitConfig" CRD.
When a user adds "nodeName" and specifies "reboot DPU" in the CR, each "DataProcessingUnitConfigReconciler" will verify whether its node is the target node. If the labels match, the "DataProcessingUnitConfigReconciler" will call the gRPC method to execute the reboot operation.
From my understanding, the gRPC connection is already established by the "HostSideManager"—thus, when the "DataProcessingUnitConfigReconciler" is initialized, "vsp" is passed in as an input parameter.
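To make the described flow concrete, a condensed sketch of that reconcile path; the field names and the node-match helper are stand-ins for the PR's actual types, and per the review feedback the final design should identify DPUs via DPU CRs and selectors rather than a vsp handed to the reconciler:

```go
// Condensed Reconcile fragment; error handling and status updates trimmed.
func (r *DataProcessingUnitConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var dpuConfig DataProcessingUnitConfig
	if err := r.Get(ctx, req.NamespacedName, &dpuConfig); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only the daemon on the targeted node acts; all other nodes skip.
	if !r.isTargetNode(dpuConfig) { // hypothetical helper
		return ctrl.Result{}, nil
	}

	if dpuConfig.Spec.DpuManagement.Operation == "Reboot" {
		if err := r.vsp.RebootDpu(ctx); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```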