Skip to content

fix: implement ThinEngine.CheckRuntimeReady to check actual runtime state#6062

Open
adityaupasani2 wants to merge 2 commits into
fluid-cloudnative:masterfrom
adityaupasani2:fix/thin-runtime-check-ready
Open

fix: implement ThinEngine.CheckRuntimeReady to check actual runtime state#6062
adityaupasani2 wants to merge 2 commits into
fluid-cloudnative:masterfrom
adityaupasani2:fix/thin-runtime-check-ready

Conversation

@adityaupasani2

Copy link
Copy Markdown
Contributor

Ⅰ. Describe what this PR does

ThinEngine.CheckRuntimeReady() in pkg/ddc/thin/runtime_info.go was an unimplemented stub that unconditionally returned true:

func (t *ThinEngine) CheckRuntimeReady() (ready bool) {
    //TODO implement me
    return true
}

This caused data operations (DataLoad, DataMigrate, DataProcess) to be dispatched against a ThinRuntime even when workers and FUSE pods were not actually ready, potentially causing failures.

Implements it using the existing CheckWorkersReady() and checkFuseHealthy() helpers that are already used by CheckRuntimeHealthy(), so all readiness signals are consistent.

Ⅱ. Does this pull request fix one issue?

Fixes #6060

Ⅲ. List the added test cases

Added 4 tests in health_check_test.go:

  • workers + fuse both ready → returns true
  • workers not ready → returns false
  • fuse not ready → returns false
  • fuse DaemonSet missing → returns false

Ⅳ. Describe how to verify it

Run go test ./pkg/ddc/thin/... -run CheckRuntimeReady -v

Ⅴ. Special notes for reviews

The implementation mirrors CheckRuntimeHealthy() exactly but returns a bool instead of an error, which is what the dataoperation.OperationInterface contract requires.

…tate

CheckRuntimeReady() was an unimplemented stub that unconditionally
returned true, causing data operations to be dispatched against a
ThinRuntime even when workers and FUSE pods were not actually ready.

Implement it using the existing CheckWorkersReady() and
checkFuseHealthy() helpers that are already used by
CheckRuntimeHealthy(), so all three readiness signals are consistent.

Also adds tests covering: workers+fuse both ready (returns true),
workers not ready (returns false), fuse not ready (returns false),
and fuse DaemonSet missing (returns false).

Fixes fluid-cloudnative#6060

Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
@fluid-e2e-bot

fluid-e2e-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cheyang for approval by writing /assign @cheyang in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fluid-e2e-bot

fluid-e2e-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown

Hi @adityaupasani2. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the CheckRuntimeReady method in pkg/ddc/thin/runtime_info.go to verify both worker readiness and fuse health. Additionally, it introduces comprehensive unit tests in pkg/ddc/thin/health_check_test.go to validate various success and failure scenarios for runtime readiness. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/ddc/thin/runtime_info.go Outdated
return false
}

fuseReady, err := t.checkFuseHealthy()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkFuseHealthy() delegates to ctrl.Helper.CheckAndSyncFuseStatus which unconditionally sets ready = true (see pkg/ctrl/fuse.go:61, comment: "fluid assumes fuse components are always ready"). This means CheckRuntimeReady() will never detect an unhealthy fuse, and the CI unittest "when fuse is not ready" fails for exactly this reason.

You either need (a) a dedicated fuse readiness check that inspects the DaemonSet's NumberUnavailable/NumberReady fields directly, or (b) to remove the fuse readiness assertion from CheckRuntimeReady (and the corresponding test) if the project intentionally treats fuse as always-ready.


func (t *ThinEngine) CheckRuntimeReady() (ready bool) {
//TODO implement me
workerReady, err := t.CheckWorkersReady()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alluxio and Jindo implement CheckRuntimeReady() by probing the master pod's responsiveness (fileUtils.Ready()). ThinRuntime has no master pod so the approach is necessarily different, but a brief comment explaining why worker+fuse status is used instead would help future readers.

//TODO implement me
workerReady, err := t.CheckWorkersReady()
if err != nil || !workerReady {
return false

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a t.Log.Info("runtime not ready", ...) statement before returning false, matching the pattern used in AlluxioEngine and JindoEngine. This will help operators debug readiness issues.

Comment thread pkg/ddc/cache/engine/master.go Outdated

// TODO(cache runtime): figure out how to use this selector
// runtimeToUpdate.Status.Selector = e.getWorkerSelectors()
runtimeToUpdate.Status.Selector = e.getWorkerSelectors()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cheyang currently, the field for all runtime is not used, can we just delete this field ?

@cheyang

cheyang commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

I left comments on the blocking issues below.

Comment thread pkg/ddc/thin/runtime_info.go Outdated
return false
}

fuseReady, err := t.checkFuseHealthy()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for wiring CheckRuntimeReady to the existing helpers — calling CheckWorkersReady() and checkFuseHealthy() is the right structure and matches what CheckRuntimeHealthy already does.

The problem is that checkFuseHealthy() delegates to Helper.CheckAndSyncFuseStatus in pkg/ctrl/fuse.go, and that function still unconditionally does ready = true (there is an explicit // fluid assumes fuse components are always ready line). So even when the fuse DaemonSet reports NumberReady: 0 / NumberUnavailable: 1, CheckRuntimeReady will return true.

The CI unittest run reflects this — the new CheckRuntimeReady > when fuse is not ready > returns false case at pkg/ddc/thin/health_check_test.go:770 still fails on this PR head (Project Check / unittest is red on 7c394fe). To unblock, please either:

  1. Update the readiness check so it inspects the DaemonSet status directly (e.g. NumberUnavailable == 0 and NumberReady > 0, similar to how AlluxioEngine / JindoEngine derive fuse readiness), or
  2. Introduce a separate checkFuseReady() helper used only by CheckRuntimeReady that does this strict check, leaving the existing "assumes ready" status-sync behavior of CheckAndSyncFuseStatus untouched.

Option 2 is probably safer because it avoids changing the semantics that other code paths depend on. Once that is in place the when fuse is not ready Ginkgo case should pass and the unittest job should go green.

Comment thread pkg/ddc/cache/engine/master.go Outdated

// TODO(cache runtime): figure out how to use this selector
// runtimeToUpdate.Status.Selector = e.getWorkerSelectors()
runtimeToUpdate.Status.Selector = e.getWorkerSelectors()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 7c394fee ("populate Status.Selector in CacheEngine for worker pod discovery", Fixes #6063) is a different concern from the ThinEngine readiness fix described in the PR title and #6060. The change itself looks reasonable and consistent with how other runtimes set Status.Selector, but mixing it into this PR makes the readiness fix harder to review and ties the merge of #6060 to an unrelated #6063 fix.

Could you move the CacheEngine selector change to its own PR against #6063? That keeps this PR focused on the ThinEngine CheckRuntimeReady fix and lets the two land independently.

Comment thread pkg/ddc/thin/runtime_info.go Outdated
return false
}

fuseReady, err := t.checkFuseHealthy()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core blocker from the previous review is still present. checkFuseHealthy() delegates to ctrl.Helper.CheckAndSyncFuseStatus (see pkg/ctrl/fuse.go:60-61), which unconditionally sets ready = true with the comment "fluid assumes fuse components are always ready".

As a result, CheckRuntimeReady() can never detect an unhealthy fuse, and the CI test "when fuse is not ready -- returns false" (health_check_test.go:770) keeps failing.

To fix this you could:

  1. Add a dedicated fuse-readiness check in CheckRuntimeReady that inspects DaemonSetStatus.NumberUnavailable directly, rather than reusing checkFuseHealthy() which is designed for the existing always-ready semantics.
  2. Or override the behaviour in a thin-specific helper.

Changing CheckAndSyncFuseStatus globally would affect all runtimes, so a thin-local approach is probably safer.

}

ready := engine.CheckRuntimeReady()
Expect(ready).To(BeFalse())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unittest CI check fails because this test expects CheckRuntimeReady() to return false when fuse is unhealthy, but the underlying checkFuseHealthy() always returns true. The test correctly expresses the desired behavior; the production code needs to be updated to match it.

Comment thread pkg/ddc/cache/engine/master.go Outdated

// TODO(cache runtime): figure out how to use this selector
// runtimeToUpdate.Status.Selector = e.getWorkerSelectors()
runtimeToUpdate.Status.Selector = e.getWorkerSelectors()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainer @xliuqq asked whether the Status.Selector field should be removed entirely rather than populated, since it is currently unused across all runtimes. Please resolve that discussion before pushing additional commits for these cache engine changes.

//TODO implement me
workerReady, err := t.CheckWorkersReady()
if err != nil || !workerReady {
return false

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding t.Log.Info("runtime not ready", "workerReady", workerReady, "err", err) before returning false, matching the pattern used in AlluxioEngine and JindoEngine. This will help operators debug readiness issues.

…state

CheckRuntimeReady() was an unimplemented stub that unconditionally
returned true, causing data operations to be dispatched against a
ThinRuntime even when workers were not ready.

Implement it using CheckWorkersReady() — the same helper used by
CheckRuntimeHealthy(). Fuse readiness is intentionally excluded because
fluid treats fuse components as always-ready by design
(see pkg/ctrl/fuse.go: "fluid assumes fuse components are always ready").
ThinRuntime has no master pod so the Alluxio/Jindo approach of probing
master pod responsiveness does not apply.

Also adds logging before returning false, matching the pattern used by
AlluxioEngine and JindoEngine.

Fixes fluid-cloudnative#6060

Signed-off-by: Aditya Upasani <adityaupasani29@gmail.com>
@adityaupasani2 adityaupasani2 force-pushed the fix/thin-runtime-check-ready branch from 7c394fe to b79642c Compare June 26, 2026 10:08
@sonarqubecloud

Copy link
Copy Markdown

@adityaupasani2

Copy link
Copy Markdown
Contributor Author

Addressed all three review points:

  1. Fuse check removedcheckFuseHealthy() unconditionally returns ready = true because fluid intentionally treats fuse as always-ready (see pkg/ctrl/fuse.go:61). Removed the fuse check from CheckRuntimeReady() and the corresponding two fuse tests. Readiness is now determined solely by worker availability.

  2. Comment added — explaining why worker status is used instead of master pod probing (ThinRuntime has no master component, unlike Alluxio/Jindo).

  3. Logging addedt.Log.Error for errors and t.Log.Info("runtime not ready", ...) before returning false, matching the AlluxioEngine/JindoEngine pattern.

Also removed the unrelated CacheEngine selector commit that was accidentally on this branch — that change is already in its own PR #6064.

/assign @cheyang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] ThinEngine.CheckRuntimeReady() always returns true without checking actual runtime state

3 participants