
watch: "attempt to compare nil with number" in v3.lua on etcd gRPC stream error (EOF) — body.error.http_code is nil #222

Description

@pubyun

Summary

In the watch read loop, lib/resty/etcd/v3.lua does:

elseif body.error and body.error.http_code >= 500 then
    health_check.report_failure(endpoint.http_host)
    return nil, endpoint.http_host .. ": " .. body.error.http_status
end

When etcd returns a gRPC stream-level error such as EOF — body is {"error":{"code":14,"message":"error reading from server: EOF"}} — the error object has code/message but no http_code / http_status. So body.error.http_code >= 500 evaluates nil >= 500 and raises attempt to compare nil with number, aborting the watch lua thread.
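The failing comparison can be reproduced outside the library. The following is a minimal sketch that builds the error body quoted above as a plain Lua table (no JSON decoding, no etcd) and wraps the condition in pcall so the runtime error is captured rather than aborting the script. The exact wording and operand order of the message depend on the Lua implementation; the text quoted in this issue is what LuaJIT under OpenResty prints.

```lua
-- Shape of the stream-level error quoted above: code/message only,
-- no http_code and no http_status.
local body = {
    error = { code = 14, message = "error reading from server: EOF" },
}

-- pcall captures the runtime error instead of aborting the script.
local ok, err = pcall(function()
    return body.error and body.error.http_code >= 500
end)

print(ok, err)
```

Here ok is false and err carries an "attempt to compare ..." message, which is the same failure that aborts the watch thread.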

Version

  • lua-resty-etcd 1.10.6 (current latest release; master still has the same line).
  • Reproduced as bundled in Apache APISIX 3.17.0.

Impact (real production incident)

Because the watch coroutine aborts on the nil compare, it skips the caller's normal error-handling path, including the explicit cancel_watch(http_cli) path used by APISIX config_etcd.lua. Depending on when the HTTP connection is actually closed/GC'd, this can leave the etcd-side watch alive longer than intended. The consumer (APISIX config_etcd.lua) then restarts the watch via ngx_timer_at(0, run_watch) with no backoff, so recurring EOFs can become a tight reconnect loop and amplify etcd watch pressure.

This was discovered while investigating a production etcd watcher storm. Follow-up investigation found a separate primary cause for the sustained multi-million watcher growth: apisix-dashboard 3.0.0 leaked native Go etcd watchers during store reinitialization (see apisix/apisix-dashboard-watch-leak-upstream.md). This lua-resty-etcd bug is therefore reported as an independent correctness/stability bug, not as the sole root cause of the final watcher storm.

The lua-resty-etcd bug is still real and reproducible: a stream-level watch error object without http_code crashes the Lua watch reader instead of returning a normal error to the caller.

Root cause (code)

lib/resty/etcd/v3.lua (~line 914 in 1.10.6):

  • body.error.http_code is nil for gRPC stream errors → nil >= 500 crash.
  • Even if that compare were guarded, body.error.http_status is also nil for these errors → the .. concat would crash next.
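The second failure point can be shown the same way. This sketch assumes a hypothetical endpoint table with an http_host field and concatenates the missing http_status exactly as the current return statement does; the host value is a placeholder.

```lua
-- Hypothetical endpoint; only http_host matters for this sketch.
local endpoint = { http_host = "http://127.0.0.1:2379" }

-- Stream-level error object: http_status is absent (nil).
local err_obj = { code = 14, message = "error reading from server: EOF" }

-- Concatenating nil raises a runtime error, captured here by pcall.
local ok, err = pcall(function()
    return endpoint.http_host .. ": " .. err_obj.http_status
end)

print(ok, err)
```

So guarding only the comparison would move the crash one line down rather than remove it.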

Reproduction

While watching a prefix, cause the etcd watch stream to terminate with EOF (etcd restart, intermediary proxy/LB idle timeout, or a network reset). The watch coroutine crashes with v3.lua:NNN: attempt to compare nil with number.

Suggested fix

Parse http_code in a type-safe way, and treat a missing http_code (a transport/stream error) the same as a 5xx: report the failure and return gracefully, so the connection is closed/rebuilt and the watcher is cancelled instead of the coroutine crashing:

elseif body.error then
    local raw = body.error.http_code
    local http_code = (type(raw) == "number" and raw)
                      or (type(raw) == "string" and tonumber(raw)) or nil
    if http_code == nil or http_code >= 500 then
        health_check.report_failure(endpoint.http_host)
        return nil, endpoint.http_host .. ": "
                    .. (body.error.http_status or body.error.message or "watch stream error")
    end
end
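To exercise the suggested branch without a running etcd, it can be lifted into a standalone function. The sketch below is only a test harness: handle_watch_error, the stubbed health_check table, and the endpoint value are hypothetical names introduced for illustration and are not part of lua-resty-etcd. Returning true for errors below 500 is also an assumption standing in for whatever the surrounding code does next.

```lua
-- Hypothetical wrapper around the suggested branch so it can run standalone.
local function handle_watch_error(body, endpoint, health_check)
    if not body.error then
        return true
    end
    local raw = body.error.http_code
    local http_code = (type(raw) == "number" and raw)
                      or (type(raw) == "string" and tonumber(raw)) or nil
    if http_code == nil or http_code >= 500 then
        health_check.report_failure(endpoint.http_host)
        return nil, endpoint.http_host .. ": "
                    .. (body.error.http_status or body.error.message
                        or "watch stream error")
    end
    return true  -- assumption: non-5xx errors are left to later handling
end

-- Stub that records which hosts were reported as failed.
local failures = {}
local health_check = {
    report_failure = function(host) failures[#failures + 1] = host end,
}
local endpoint = { http_host = "http://127.0.0.1:2379" }

-- 1) gRPC stream error without http_code / http_status.
local res1, err1 = handle_watch_error(
    { error = { code = 14, message = "error reading from server: EOF" } },
    endpoint, health_check)
-- 2) http_code delivered as a string.
local res2, err2 = handle_watch_error(
    { error = { http_code = "503", http_status = "503 Service Unavailable" } },
    endpoint, health_check)
-- 3) client error below 500: not reported as an endpoint failure.
local res3, err3 = handle_watch_error(
    { error = { http_code = 404, http_status = "404 Not Found" } },
    endpoint, health_check)

print(res1, err1)
print(res2, err2)
print(res3, err3, #failures)
```

The first two calls return nil plus a message and each record one failure; the third returns true and records none.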

Optionally dispatch by gRPC code so client-side errors (PermissionDenied=7, InvalidArgument=3, …) are returned to the caller instead of being reported as endpoint failures.
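A possible shape for that dispatch, as a rough sketch: the numeric values are the standard gRPC status codes, but which of them count as "client-side" is an assumption made here for illustration, and is_endpoint_failure is a hypothetical helper name, not an existing lua-resty-etcd function.

```lua
-- Assumption: these gRPC status codes point at the request, not the endpoint.
-- The exact set is a judgment call and may need adjusting.
local CLIENT_SIDE_GRPC_CODES = {
    [3]  = "InvalidArgument",
    [5]  = "NotFound",
    [7]  = "PermissionDenied",
    [16] = "Unauthenticated",
}

-- Hypothetical helper: decide whether an error object should count
-- against the endpoint's health.
local function is_endpoint_failure(err_obj)
    local http_code = tonumber(err_obj.http_code)
    if http_code then
        return http_code >= 500
    end
    if CLIENT_SIDE_GRPC_CODES[err_obj.code] then
        return false
    end
    -- No usable http_code and not a known client-side code:
    -- treat as a transport/stream failure.
    return true
end

print(is_endpoint_failure({ code = 14, message = "error reading from server: EOF" }))
print(is_endpoint_failure({ code = 7, message = "permission denied" }))
print(is_endpoint_failure({ http_code = 503 }))
print(is_endpoint_failure({ http_code = 400 }))
```

With this split, an EOF (code 14, Unavailable) still triggers report_failure, while a permission error would be handed back to the caller without marking the endpoint unhealthy.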

After applying this patch in production, the attempt to compare nil with number crash stopped and EOF errors started returning through the normal error path. The patch did not fully stop the later watcher storm, which was traced to APISIX Dashboard's native Go etcd watcher leak, so any PR for this issue should stay limited to the nil-safe error handling.

Related (downstream)

Apache APISIX config_etcd.lua restarts the watch with ngx_timer_at(0, run_watch) (no backoff). Even with this library fixed, a backoff there would harden against tight reconnect loops when etcd is briefly unavailable. That is worth filing as a separate APISIX hardening issue.
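For illustration only, a capped exponential backoff could look like the sketch below. next_delay and its default values (0.5 s base, 30 s cap) are hypothetical and not taken from APISIX; the function is kept pure so it runs in plain Lua, with the timer call shown only as a comment.

```lua
-- Hypothetical helper: delay in seconds before restart number `attempt`
-- (1-based), doubling each time and capped at max_delay.
local function next_delay(attempt, base, max_delay)
    base = base or 0.5            -- assumed starting delay
    max_delay = max_delay or 30   -- assumed cap
    local delay = base * 2 ^ (attempt - 1)
    if delay > max_delay then
        delay = max_delay
    end
    return delay
end

-- In OpenResty this could replace the fixed 0 delay, for example:
--   ngx.timer.at(next_delay(attempt), run_watch)
-- with `attempt` reset to 1 after a watch has been established successfully.

for attempt = 1, 8 do
    print(attempt, next_delay(attempt))
end
```

Adding random jitter on top would further spread reconnects from many workers, at the cost of less predictable timing.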
