fix: implement ThinEngine.CheckRuntimeReady to check actual runtime state#6062
fix: implement ThinEngine.CheckRuntimeReady to check actual runtime state#6062adityaupasani2 wants to merge 2 commits into
Conversation
…tate CheckRuntimeReady() was an unimplemented stub that unconditionally returned true, causing data operations to be dispatched against a ThinRuntime even when workers and FUSE pods were not actually ready. Implement it using the existing CheckWorkersReady() and checkFuseHealthy() helpers that are already used by CheckRuntimeHealthy(), so all three readiness signals are consistent. Also adds tests covering: workers+fuse both ready (returns true), workers not ready (returns false), fuse not ready (returns false), and fuse DaemonSet missing (returns false). Fixes fluid-cloudnative#6060 Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @adityaupasani2. Thanks for your PR. I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Code Review
This pull request implements the CheckRuntimeReady method in pkg/ddc/thin/runtime_info.go to verify both worker readiness and fuse health. Additionally, it introduces comprehensive unit tests in pkg/ddc/thin/health_check_test.go to validate various success and failure scenarios for runtime readiness. There are no review comments, so I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| return false | ||
| } | ||
|
|
||
| fuseReady, err := t.checkFuseHealthy() |
There was a problem hiding this comment.
checkFuseHealthy() delegates to ctrl.Helper.CheckAndSyncFuseStatus which unconditionally sets ready = true (see pkg/ctrl/fuse.go:61, comment: "fluid assumes fuse components are always ready"). This means CheckRuntimeReady() will never detect an unhealthy fuse, and the CI unittest "when fuse is not ready" fails for exactly this reason.
You either need (a) a dedicated fuse readiness check that inspects the DaemonSet's NumberUnavailable/NumberReady fields directly, or (b) to remove the fuse readiness assertion from CheckRuntimeReady (and the corresponding test) if the project intentionally treats fuse as always-ready.
|
|
||
| func (t *ThinEngine) CheckRuntimeReady() (ready bool) { | ||
| //TODO implement me | ||
| workerReady, err := t.CheckWorkersReady() |
There was a problem hiding this comment.
Alluxio and Jindo implement CheckRuntimeReady() by probing the master pod's responsiveness (fileUtils.Ready()). ThinRuntime has no master pod so the approach is necessarily different, but a brief comment explaining why worker+fuse status is used instead would help future readers.
| //TODO implement me | ||
| workerReady, err := t.CheckWorkersReady() | ||
| if err != nil || !workerReady { | ||
| return false |
There was a problem hiding this comment.
Consider adding a t.Log.Info("runtime not ready", ...) statement before returning false, matching the pattern used in AlluxioEngine and JindoEngine. This will help operators debug readiness issues.
|
|
||
| // TODO(cache runtime): figure out how to use this selector | ||
| // runtimeToUpdate.Status.Selector = e.getWorkerSelectors() | ||
| runtimeToUpdate.Status.Selector = e.getWorkerSelectors() |
There was a problem hiding this comment.
@cheyang currently, the field for all runtime is not used, can we just delete this field ?
|
I left comments on the blocking issues below. |
| return false | ||
| } | ||
|
|
||
| fuseReady, err := t.checkFuseHealthy() |
There was a problem hiding this comment.
Thanks for wiring CheckRuntimeReady to the existing helpers — calling CheckWorkersReady() and checkFuseHealthy() is the right structure and matches what CheckRuntimeHealthy already does.
The problem is that checkFuseHealthy() delegates to Helper.CheckAndSyncFuseStatus in pkg/ctrl/fuse.go, and that function still unconditionally does ready = true (there is an explicit // fluid assumes fuse components are always ready line). So even when the fuse DaemonSet reports NumberReady: 0 / NumberUnavailable: 1, CheckRuntimeReady will return true.
The CI unittest run reflects this — the new CheckRuntimeReady > when fuse is not ready > returns false case at pkg/ddc/thin/health_check_test.go:770 still fails on this PR head (Project Check / unittest is red on 7c394fe). To unblock, please either:
- Update the readiness check so it inspects the DaemonSet status directly (e.g.
NumberUnavailable == 0andNumberReady > 0, similar to how AlluxioEngine / JindoEngine derive fuse readiness), or - Introduce a separate
checkFuseReady()helper used only byCheckRuntimeReadythat does this strict check, leaving the existing "assumes ready" status-sync behavior ofCheckAndSyncFuseStatusuntouched.
Option 2 is probably safer because it avoids changing the semantics that other code paths depend on. Once that is in place the when fuse is not ready Ginkgo case should pass and the unittest job should go green.
|
|
||
| // TODO(cache runtime): figure out how to use this selector | ||
| // runtimeToUpdate.Status.Selector = e.getWorkerSelectors() | ||
| runtimeToUpdate.Status.Selector = e.getWorkerSelectors() |
There was a problem hiding this comment.
Commit 7c394fee ("populate Status.Selector in CacheEngine for worker pod discovery", Fixes #6063) is a different concern from the ThinEngine readiness fix described in the PR title and #6060. The change itself looks reasonable and consistent with how other runtimes set Status.Selector, but mixing it into this PR makes the readiness fix harder to review and ties the merge of #6060 to an unrelated #6063 fix.
Could you move the CacheEngine selector change to its own PR against #6063? That keeps this PR focused on the ThinEngine CheckRuntimeReady fix and lets the two land independently.
| return false | ||
| } | ||
|
|
||
| fuseReady, err := t.checkFuseHealthy() |
There was a problem hiding this comment.
The core blocker from the previous review is still present. checkFuseHealthy() delegates to ctrl.Helper.CheckAndSyncFuseStatus (see pkg/ctrl/fuse.go:60-61), which unconditionally sets ready = true with the comment "fluid assumes fuse components are always ready".
As a result, CheckRuntimeReady() can never detect an unhealthy fuse, and the CI test "when fuse is not ready -- returns false" (health_check_test.go:770) keeps failing.
To fix this you could:
- Add a dedicated fuse-readiness check in
CheckRuntimeReadythat inspectsDaemonSetStatus.NumberUnavailabledirectly, rather than reusingcheckFuseHealthy()which is designed for the existing always-ready semantics. - Or override the behaviour in a thin-specific helper.
Changing CheckAndSyncFuseStatus globally would affect all runtimes, so a thin-local approach is probably safer.
| } | ||
|
|
||
| ready := engine.CheckRuntimeReady() | ||
| Expect(ready).To(BeFalse()) |
There was a problem hiding this comment.
The unittest CI check fails because this test expects CheckRuntimeReady() to return false when fuse is unhealthy, but the underlying checkFuseHealthy() always returns true. The test correctly expresses the desired behavior; the production code needs to be updated to match it.
|
|
||
| // TODO(cache runtime): figure out how to use this selector | ||
| // runtimeToUpdate.Status.Selector = e.getWorkerSelectors() | ||
| runtimeToUpdate.Status.Selector = e.getWorkerSelectors() |
There was a problem hiding this comment.
Maintainer @xliuqq asked whether the Status.Selector field should be removed entirely rather than populated, since it is currently unused across all runtimes. Please resolve that discussion before pushing additional commits for these cache engine changes.
| //TODO implement me | ||
| workerReady, err := t.CheckWorkersReady() | ||
| if err != nil || !workerReady { | ||
| return false |
There was a problem hiding this comment.
Consider adding t.Log.Info("runtime not ready", "workerReady", workerReady, "err", err) before returning false, matching the pattern used in AlluxioEngine and JindoEngine. This will help operators debug readiness issues.
…state CheckRuntimeReady() was an unimplemented stub that unconditionally returned true, causing data operations to be dispatched against a ThinRuntime even when workers were not ready. Implement it using CheckWorkersReady() — the same helper used by CheckRuntimeHealthy(). Fuse readiness is intentionally excluded because fluid treats fuse components as always-ready by design (see pkg/ctrl/fuse.go: "fluid assumes fuse components are always ready"). ThinRuntime has no master pod so the Alluxio/Jindo approach of probing master pod responsiveness does not apply. Also adds logging before returning false, matching the pattern used by AlluxioEngine and JindoEngine. Fixes fluid-cloudnative#6060 Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
7c394fe to
b79642c
Compare
|
|
Addressed all three review points:
Also removed the unrelated CacheEngine selector commit that was accidentally on this branch — that change is already in its own PR #6064. /assign @cheyang |



Ⅰ. Describe what this PR does
ThinEngine.CheckRuntimeReady()inpkg/ddc/thin/runtime_info.gowas an unimplemented stub that unconditionally returnedtrue:This caused data operations (DataLoad, DataMigrate, DataProcess) to be dispatched against a ThinRuntime even when workers and FUSE pods were not actually ready, potentially causing failures.
Implements it using the existing
CheckWorkersReady()andcheckFuseHealthy()helpers that are already used byCheckRuntimeHealthy(), so all readiness signals are consistent.Ⅱ. Does this pull request fix one issue?
Fixes #6060
Ⅲ. List the added test cases
Added 4 tests in
health_check_test.go:truefalsefalsefalseⅣ. Describe how to verify it
Run
go test ./pkg/ddc/thin/... -run CheckRuntimeReady -vⅤ. Special notes for reviews
The implementation mirrors
CheckRuntimeHealthy()exactly but returns a bool instead of an error, which is what thedataoperation.OperationInterfacecontract requires.