Summary
In the watch read loop, lib/resty/etcd/v3.lua does:
    elseif body.error and body.error.http_code >= 500 then
        health_check.report_failure(endpoint.http_host)
        return nil, endpoint.http_host .. ": " .. body.error.http_status
    end
When etcd returns a gRPC stream-level error such as EOF, the body is {"error":{"code":14,"message":"error reading from server: EOF"}}: the error object has code/message but no http_code / http_status. The condition body.error.http_code >= 500 therefore evaluates nil >= 500 and raises "attempt to compare nil with number", aborting the watch Lua thread.
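The failing comparison can be reproduced outside the library in plain Lua. This is a minimal sketch: the table only mirrors the JSON shape quoted above, and pcall is used so the error is captured instead of aborting the script. The exact wording of the message may differ between Lua implementations; the text reported in this issue is what OpenResty's LuaJIT printed.

```lua
-- Decoded shape of the stream-level error quoted above (no http_code field).
local body = {
    error = { code = 14, message = "error reading from server: EOF" },
}

-- Same condition as the library line, wrapped in pcall to capture the error
-- instead of letting it abort the script.
local ok, err = pcall(function()
    return body.error and body.error.http_code >= 500
end)

print(ok, err)
```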
Version
- lua-resty-etcd 1.10.6 (current latest release; master still has the same line).
- Reproduced as bundled in Apache APISIX 3.17.0.
Impact (real production incident)
Because the watch coroutine aborts on the nil compare, it skips the caller's normal error-handling path, including the explicit cancel_watch(http_cli) path used by APISIX config_etcd.lua. Depending on when the HTTP connection is actually closed/GC'd, this can leave the etcd-side watch alive longer than intended. The consumer (APISIX config_etcd.lua) then restarts the watch via ngx_timer_at(0, run_watch) with no backoff, so recurring EOFs can become a tight reconnect loop and amplify etcd watch pressure.
This was discovered while investigating a production etcd watcher storm. Follow-up investigation found a separate primary cause for the sustained multi-million watcher growth: apisix-dashboard 3.0.0 leaked native Go etcd watchers during store reinitialization (see apisix/apisix-dashboard-watch-leak-upstream.md). Therefore this lua-resty-etcd bug should be presented upstream as an independent correctness/stability bug, not as the sole root cause of the final watcher storm.
The lua-resty-etcd bug is still real and reproducible: a stream-level watch error object without http_code crashes the Lua watch reader instead of returning a normal error to the caller.
Root cause (code)
lib/resty/etcd/v3.lua (~line 914 in 1.10.6):
- body.error.http_code is nil for gRPC stream errors → nil >= 500 crash.
- Even if that compare were guarded, body.error.http_status is also nil for these errors → the .. concat would crash next.
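The second failure mode, the string concatenation, can be shown the same way. This is a sketch in plain Lua; the http_host value is a made-up placeholder, not taken from any real configuration.

```lua
-- Stream-level error object: neither http_code nor http_status is present.
local body = {
    error = { code = 14, message = "error reading from server: EOF" },
}
local http_host = "http://127.0.0.1:2379" -- hypothetical endpoint for illustration

-- Mirrors the return expression of the library branch. Concatenating a nil
-- field raises an error, so guarding only the comparison is not enough.
local ok, err = pcall(function()
    return http_host .. ": " .. body.error.http_status
end)

print(ok, err)
```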
Reproduction
While watching a prefix, cause the etcd watch stream to terminate with EOF (etcd restart, intermediary proxy/LB idle timeout, or a network reset). The watch coroutine crashes with v3.lua:NNN: attempt to compare nil with number.
Suggested fix
Type-safe parse, and treat a missing http_code (transport/stream error) the same as 5xx — report failure and return gracefully so the connection is closed/rebuilt and the watcher is cancelled, instead of crashing:
    elseif body.error then
        local raw = body.error.http_code
        local http_code = (type(raw) == "number" and raw)
            or (type(raw) == "string" and tonumber(raw)) or nil
        if http_code == nil or http_code >= 500 then
            health_check.report_failure(endpoint.http_host)
            return nil, endpoint.http_host .. ": "
                .. (body.error.http_status or body.error.message or "watch stream error")
        end
    end
Optionally dispatch by gRPC code so client-side errors (PermissionDenied=7, InvalidArgument=3, …) are returned to the caller instead of being reported as endpoint failures.
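One possible shape for that dispatch is sketched below in standalone Lua. The helper name classify_watch_error, the "client"/"endpoint" labels and the choice of which gRPC codes count as client-side are assumptions made for illustration; none of them exist in lua-resty-etcd. The numeric codes themselves are the standard gRPC status codes.

```lua
-- gRPC status codes treated here as caller mistakes rather than endpoint
-- failures. Which codes belong in this set is a judgment call.
local CLIENT_SIDE_GRPC_CODES = {
    [3]  = "InvalidArgument",
    [7]  = "PermissionDenied",
    [16] = "Unauthenticated",
}

-- Hypothetical helper: returns "client" or "endpoint", plus a message that
-- is never nil and is therefore safe to concatenate.
local function classify_watch_error(err_obj)
    local raw = err_obj.http_code
    local http_code = (type(raw) == "number" and raw)
        or (type(raw) == "string" and tonumber(raw)) or nil
    local msg = err_obj.http_status or err_obj.message or "watch stream error"

    if http_code ~= nil then
        -- HTTP-level error: only 5xx counts against the endpoint.
        return (http_code >= 500) and "endpoint" or "client", msg
    end
    if CLIENT_SIDE_GRPC_CODES[err_obj.code] then
        return "client", msg
    end
    -- No http_code and no known client-side gRPC code: transport/stream error.
    return "endpoint", msg
end

print(classify_watch_error({ code = 14, message = "error reading from server: EOF" }))
print(classify_watch_error({ code = 7, message = "permission denied" }))
```

With a helper like this, the watch loop would call health_check.report_failure only for the "endpoint" case and return the message to the caller in both cases.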
After applying this patch in production, the attempt to compare nil with number crash stopped and EOF started returning through the normal error path. It did not fully stop the later watcher storm, which was traced to APISIX Dashboard's native Go etcd watcher leak. Keep the scope of any upstream PR limited to the nil-safe error handling bug.
Related (downstream)
Apache APISIX config_etcd.lua restarts the watch with ngx_timer_at(0, run_watch) (no backoff). Even with this lib fixed, a backoff there would harden against tight reconnect loops when etcd is briefly unavailable. This is worth filing as a separate APISIX hardening issue.
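As a rough illustration of what such a backoff could look like, the sketch below computes a capped exponential delay in plain Lua. The function name watch_restart_delay, the 0.5 s base and the 30 s cap are hypothetical values chosen for the example, not anything APISIX currently defines.

```lua
-- Hypothetical helper: capped exponential backoff, in seconds.
-- attempt is 1 for the first restart after a failure.
local function watch_restart_delay(attempt, base, cap)
    base = base or 0.5
    cap = cap or 30
    local delay = base * 2 ^ (attempt - 1)
    if delay > cap then
        delay = cap
    end
    return delay
end

for attempt = 1, 8 do
    print(attempt, watch_restart_delay(attempt))
end
```

In config_etcd.lua the computed delay would take the place of the 0 passed to ngx_timer_at, with the attempt counter reset once a watch succeeds. How that counter is stored and reset is left open here.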