fix: ai-proxy-multi health checker creation always fails in timer context due to missing _dns_value #13101#13136
Open
guilongyang wants to merge 1 commit intoapache:masterfrom
Open
fix: ai-proxy-multi health checker creation always fails in timer context due to missing _dns_value #13101#13136guilongyang wants to merge 1 commit intoapache:masterfrom
guilongyang wants to merge 1 commit intoapache:masterfrom
Conversation
Contributor
|
The current PR does not fix the root cause as described in the original issue. Please modify the technical solution. |
Contributor
|
Please revise your technical solution and reopen the PR. Thank you for your contribution. |
Author
|
sorry for late. Technical Solution: Fix ai-proxy-multi Health Check MechanismRoot CauseThe
SolutionI. Changes to
|
| State | Handling Method |
|---|---|
| Checker created | fetch_node_status to get status |
| Checker being created | get_target_status_by_path from SHM |
| No checker configured | Include by default |
2. Modified get_checkers_status_ver Function (Lines 223-232)
local function get_checkers_status_ver(checkers)
local status_ver_total = 0
for _, checker in pairs(checkers) do
-- Only count status_ver from real checkers (proxy checkers don't have this attribute)
if checker.status_ver then
status_ver_total = status_ver_total + checker.status_ver
end
end
return status_ver_total
endReason: Proxy checker doesn't have status_ver attribute, needs type check
3. Modified pick_target Function (Lines 356-359)
instances[i]._dns_value = instance._dns_value
instances[i]._nodes_ver = instance._nodes_ver
instances[i]._index = i - 1 -- Store index for later useKey points:
_dns_value: Syncs node info computed in request context to config object_index: Stores instance array index for building correct resource_path
Core Mechanism: Why _dns_value Can Be Correctly Accessed in Timer Context
Request Context:
resolve_endpoint(instance) → Sets instance._dns_value
↓
instances[i]._dns_value = instance._dns_value [Syncs to etcd copy]
↓
res_conf.value.plugins['ai-proxy-multi'].instances[i]._dns_value is set
Timer Context:
res_conf.value fetched from etcd
↓
jp.value(res_conf.value, json_path) extracts instance config
↓
upstream_constructor_config._dns_value exists ← because it was synced
↓
construct_upstream works correctly
Key Understanding:
- On first request,
_dns_valueis computed byresolve_endpointand set - Immediately afterward,
instances[i]._dns_value = instance._dns_valuewrites the value to the config object - When timer triggers, the config read from etcd already contains
_dns_value - The entire flow is "lazy initialization + state persistence", not recalculation each time
Summary
| Problem | Solution |
|---|---|
_dns_value missing in timer context |
Sync computed value to config object in request context, timer reads the synced value |
| Health checker lost after creation failure | Add SHM proxy, can read status even if checker not created |
| Caller blocks waiting for checker | Returns proxy checker immediately usable |
status_ver attribute missing |
Add type check to skip proxy checker |
Author
I don't have permission to reopen the PR. What should I do next? |
ffe2a47 to
476b727
Compare
Contributor
|
Hi @guilongyang, I've further refined the relevant code |
…g _dns_value The construct_upstream function requires _dns_value (a runtime-only field set by resolve_endpoint during request processing), but is also called from timer context by healthcheck_manager where _dns_value may not exist on the config object. This causes two crash scenarios: 1. timer_create_checker crashes when creating new checkers if _dns_value is missing 2. timer_working_pool_check crashes after any config update (route PATCH/PUT via Admin API) because the config cache is refreshed with a new object lacking _dns_value Both crashes produce "attempt to index local upstream (a nil value)" errors every second in an infinite loop, breaking health check management permanently. Changes: - Extract compute_endpoint_node() as a pure function from resolve_endpoint() in ai-proxy-multi.lua, so construct_upstream can compute the node from static config (endpoint URL or provider defaults) when _dns_value is not available - Add pcall protection and nil checks in healthcheck_manager.lua for both timer_create_checker and timer_working_pool_check to prevent crashes when construct_upstream fails for any reason Closes apache#13101
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[##] Description
hi, @Baoyuantop . Based on the 3.15.0-ubuntu Docker image, I have verified the changes in this PR and confirmed that the issue has been resolved.
Which issue(s) this PR fixes:
Fixes # #13101
Checklist