cachedb_redis: add dynamic cluster topology management and observability#3855
cachedb_redis: add dynamic cluster topology management and observability#3855NormB wants to merge 4 commits intoOpenSIPS:masterfrom
Conversation
Fix several correctness and safety issues in parse_moved_reply() and the MOVED redirect handler: - Add slot value overflow protection: return ERR_INVALID_SLOT when parsed slot exceeds 16383 during digit accumulation, preventing signed integer overflow on malformed MOVED replies. - Add port value overflow protection: return ERR_INVALID_PORT when parsed port exceeds 65535 during digit accumulation, complementing the existing post-loop range check and preventing signed integer overflow on malformed input. - Fix undefined behavior in the no-colon endpoint fallback path: replace comparison of potentially-NULL out->endpoint.s against end pointer with (p < end), which achieves the same logic using the scan position variable that is always valid. - Replace pkg_malloc heap allocation of redis_moved struct with stack allocation in the MOVED handler. The struct is small (~24 bytes) and never outlives the enclosing scope, making heap allocation unnecessary. This eliminates the OOM error path and two pkg_free() calls.
Replace the static cluster topology (built once at startup, never
refreshed) with runtime discovery and automatic refresh:
Topology discovery and refresh:
- Probe CLUSTER SHARDS (Redis 7+) with fallback to CLUSTER SLOTS
(Redis 3+) for backward compatibility
- O(1) slot_table[16384] lookup replaces per-query linked-list scan
- Automatic topology refresh on MOVED redirect, connection failure,
or query targeting an unmapped slot (rate-limited to 1/sec)
- Dynamic node creation when MOVED points to an unknown endpoint
- Stale node pruning during refresh with safe connection cleanup
- Cap redirect loop at 5 max redirects to prevent worker hang on
pathological cluster state
Cluster observability via MI commands:
- redis_cluster_info: full topology dump including per-node connection
status, slot assignments, query/error/moved/ask counters, and
last activity timestamp
- redis_cluster_refresh: trigger manual topology refresh (bypasses
rate limit)
- redis_ping_nodes: per-node PING with microsecond latency reporting
- All MI commands support optional group filter parameter
Statistics:
- redis_queries, redis_queries_failed, redis_moved, redis_ask,
redis_topology_refreshes (module-level stat counters)
- Per-node query, error, moved, ask counters in redis_cluster_info
Hash slot correctness:
- Hash tag {…} extraction per Redis Cluster specification
- CRC16 modulo 16384 replaces bitwise AND with slots_assigned
ASK redirect handling:
- Detect ASK responses alongside existing MOVED handling
- Send ASKING command to target node before retrying original query
- Do not update slot map (ASK is a temporary mid-migration redirect)
- Refactor parse_moved_reply into parse_redirect_reply with prefix
parameter; inline wrappers for backward compatibility
Connection reliability:
- TCP keepalive via redis_keepalive parameter (default 10s)
- Stack allocation for redis_moved structs (eliminates OOM paths)
- NULL guards on malformed CLUSTER SHARDS/SLOTS reply elements
- Integer overflow protection in slot and port parsing
- NULL guards in MI command handlers for group_name/initial_url
Documentation:
- New section: Redis Cluster Support (topology discovery, automatic
refresh, MOVED/ASK handling, hash tags)
- MI command reference: redis_cluster_info, redis_cluster_refresh,
redis_ping_nodes
- Authentication URL format documentation (classic, ACL, no-auth)
- New parameter: redis_keepalive
Test suite (186 tests):
- C unit tests: hash slot calculation (37), MI counter helpers (41)
- Integration: topology startup (12), ASK redirect (16), topology
refresh (13), MI commands (50), edge cases (16)
- Trap EXIT handlers for safe cluster state restoration
- python3 preflight checks for JSON-dependent tests
Depends on: OpenSIPS#3815 (hash tag + modulo fix), OpenSIPS#3852 (ASK redirect)
dondetir
left a comment
There was a problem hiding this comment.
@NormB — I went through the full PR and ran it against a local 3-node Redis 7.x cluster. Solid work across the board.
Everything checked out well in testing:
- Unit tests: 37/37 (hash) + 34/34 (MI counters), clean under valgrind
- Cluster operations: store/fetch across all nodes, MOVED handling after live reshard, node failure + recovery — all working as expected
- MI commands (
redis_cluster_info,redis_cluster_refresh,redis_ping_nodes) all functional with correct output
One minor thing — the test files (test_hash.c, test_mi_counters.c, hash_under_test.c) are standalone executables with their own main(), which works great when built via the test Makefile. However, when UNIT_TESTS is enabled in Makefile.conf, the build system's auto-discovery (Makefile.defs:1695) pulls test/*.c into the module .so, causing multiple-definition linker errors. A quick #ifndef UNIT_TESTS guard around main() or an exclude rule in the module Makefile should do the trick.
Also spotted a committed binary — test/test_mi_counters (the compiled ELF) is tracked in git. Looks like test/.gitignore lists test_hash but not test_mi_counters.
Code looks great overall. The rate-limited topology refresh and the seen flag reconciliation pattern are clean and effective.
Exclude standalone test binaries (test_hash, test_mi_counters, hash_under_test) from the UNIT_TESTS auto-discovery in Makefile.modules. These files have their own main() and are built via test/Makefile; pulling them into the module .so causes multiple-definition linker errors. Also remove the accidentally committed test/test_mi_counters ELF binary and add it to .gitignore alongside test_hash. Reported-by: dondetir <dondetir@users.noreply.github.com>
redisEnableKeepAliveWithInterval() was added in hiredis 1.0.0. Ubuntu 20.04 ships hiredis 0.14, causing an implicit-function-declaration error with -Werror. Gate on HIREDIS_MAJOR >= 1, falling back to redisEnableKeepAlive() (no interval parameter) on older versions.
Summary
redis_cluster_info,redis_cluster_refresh,redis_ping_nodes){...}extraction and fix hash slot calculationDetails
The
cachedb_redismodule currently builds its cluster topology once duringchild_initviaCLUSTER NODESand never updates it. Adding or removing a Redis node requires restarting OpenSIPS. This PR replaces that static model with full runtime topology management.Topology discovery and refresh
At startup, the module probes the cluster using
CLUSTER SHARDS(Redis 7+) with automatic fallback toCLUSTER SLOTS(Redis 3+). The topology is stored in an O(1)slot_table[16384]array mapping each slot directly to its owning master node.The topology refreshes automatically when:
MOVEDredirection is received (permanent slot migration)redis_cluster_refreshvia MIRefreshes are rate-limited to once per second. The MI command bypasses the rate limit. During refresh, nodes not present in the new topology are pruned and their connections freed. New nodes are created and connected on demand.
The redirect retry loop is capped at 5 total redirects to prevent a worker from hanging on a pathological cluster state.
Hash slot correctness
The existing
redisHash()usescrc16(key) & con->slots_assignedwhich only produces correct results whenslots_assignedis2^n - 1. This is replaced withcrc16(key) % 16384per the Redis Cluster specification.Hash tag extraction is added: if a key contains
{substring}, only the substring between the first{and the next}is hashed, enabling key co-location (e.g.,{user:1000}.sessionand{user:1000}.profileland on the same slot).MI commands
redis_cluster_infogroupfilter.redis_cluster_refreshgroupfilter.redis_ping_nodesgroupfilter.Statistics
redis_queries,redis_queries_failed,redis_moved,redis_topology_refreshesNew parameter
redis_keepaliveTesting
170 tests across 7 suites:
All integration tests include
trap EXITcleanup handlers to restore cluster state on unexpected exit.Compatibility
{...}) produce identical slot values — no behavioral change for existing deploymentsCLUSTER NODESparsing is replaced byCLUSTER SHARDS/CLUSTER SLOTSslots_assignedfield is removed fromredis_conDependencies
This PR includes hash tag support for standalone functionality. If #3815 merges first, the overlap resolves cleanly on rebase.
Closing issues
Partially addresses #2811