cachedb_redis: add Unix socket transport and lazy connection#3856

Open
NormB wants to merge 6 commits into OpenSIPS:master from NormB:mr/feature-redis-unix-socket-lazy

NormB commented Mar 30, 2026

Summary

  • Add Unix domain socket support as an alternative to TCP
  • Add lazy_connect parameter to defer connection until first use

Details

Unix socket transport

The module now supports connecting to Redis via Unix domain sockets using a query parameter in the cachedb_url:

modparam("cachedb_redis", "cachedb_url",
    "redis:local://localhost/?socket=/var/run/redis/redis.sock")

The REDIS_UNIX_SOCKET flag distinguishes socket connections from TCP. MI output includes transport (tcp/unix) and socket_path fields. Unix socket connections are supported in both single-instance and cluster topology refresh paths.
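
To make the URL form concrete, here is a minimal sketch of pulling the `socket` query parameter out of a `cachedb_url` string. The helper name `url_socket_path` and its return convention are assumptions for illustration; the module's actual URL parser is not shown in this PR description and may work differently.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the module's actual parser): copy the value of the
 * "socket" query parameter from a cachedb_url into `out`.
 * Returns 0 on success, -1 if the parameter is absent, empty, or too long.
 */
int url_socket_path(const char *url, char *out, size_t out_len)
{
    const char *q = strchr(url, '?');
    if (q == NULL)
        return -1;                      /* no query string: plain TCP URL */

    const char *p = q + 1;
    while (*p != '\0') {
        /* each parameter ends at '&' or at the end of the string */
        size_t len = strcspn(p, "&");

        if (len > 7 && strncmp(p, "socket=", 7) == 0) {
            size_t vlen = len - 7;
            if (vlen >= out_len)
                return -1;              /* value would not fit in `out` */
            memcpy(out, p + 7, vlen);
            out[vlen] = '\0';
            return 0;
        }

        p += len;
        if (*p == '&')
            p++;                        /* step over the separator */
    }
    return -1;
}
```

A caller could treat a return value of 0 as "use the Unix socket transport" and anything else as "fall back to TCP host and port".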

Lazy connection

The new lazy_connect parameter (integer, default 0) defers the Redis connection from child_init to the first cache operation. This is useful when:

  • Redis may not be available at OpenSIPS startup
  • Multiple cachedb_url groups are configured but not all are needed immediately
  • Startup time matters and Redis connection latency is non-trivial

Both TCP and Unix socket connections support lazy mode.

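| Parameter    | Type    | Default | Description                                        |
|--------------|---------|---------|----------------------------------------------------|
| lazy_connect | integer | 0       | Defer Redis connection until first cache operation |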
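
The deferred-connect behavior can be sketched roughly as follows. Everything here (`struct lazy_con`, `con_init`, `con_ensure`, `cache_op`, and the placeholder `do_connect`) is a hypothetical stand-in, not the module's code; it only illustrates the pattern of skipping the connect at startup and connecting on the first operation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in connection record; the real module uses its own redis_con struct. */
struct lazy_con {
    bool lazy;           /* mirrors lazy_connect != 0 */
    bool connected;
    int  connect_calls;  /* counts connect attempts, for illustration only */
};

/* Placeholder for the real connect routine (TCP or Unix socket). */
static int do_connect(struct lazy_con *con)
{
    con->connect_calls++;
    con->connected = true;
    return 0;
}

/* Startup hook: connect eagerly unless lazy mode is enabled. */
int con_init(struct lazy_con *con, bool lazy)
{
    con->lazy = lazy;
    con->connected = false;
    con->connect_calls = 0;
    return lazy ? 0 : do_connect(con);
}

/* Called at the top of every cache operation: connect only if needed. */
int con_ensure(struct lazy_con *con)
{
    if (con->connected)
        return 0;
    return do_connect(con);
}

/* A cache operation that triggers the deferred connection on first use. */
int cache_op(struct lazy_con *con)
{
    if (con_ensure(con) != 0)
        return -1;
    /* ... issue the Redis command here ... */
    return 0;
}
```

With `lazy` set, startup succeeds even when Redis is unreachable, and a connection failure only surfaces when the first operation runs.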

Testing

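| Suite             | Tests | Description                                        |
|-------------------|-------|----------------------------------------------------|
| test_unix_socket  | 19    | Store/fetch, MI reporting, PING latency, recovery  |
| test_lazy_connect | 17    | Deferred connect for all transport types, recovery |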

Compatibility

No behavioral change for existing TCP configurations. The lazy_connect parameter defaults to 0 (disabled).

Dependencies

Debian added 2 commits March 30, 2026 04:30
Fix several correctness and safety issues in parse_moved_reply()
and the MOVED redirect handler:

- Add slot value overflow protection: return ERR_INVALID_SLOT
  when parsed slot exceeds 16383 during digit accumulation,
  preventing signed integer overflow on malformed MOVED replies.

- Add port value overflow protection: return ERR_INVALID_PORT
  when parsed port exceeds 65535 during digit accumulation,
  complementing the existing post-loop range check and preventing
  signed integer overflow on malformed input.

- Fix undefined behavior in the no-colon endpoint fallback path:
  replace comparison of potentially-NULL out->endpoint.s against
  end pointer with (p < end), which achieves the same logic using
  the scan position variable that is always valid.

- Replace pkg_malloc heap allocation of redis_moved struct with
  stack allocation in the MOVED handler. The struct is small
  (~24 bytes) and never outlives the enclosing scope, making heap
  allocation unnecessary. This eliminates the OOM error path and
  two pkg_free() calls.
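
The overflow protection described above amounts to checking the bound inside the digit loop rather than only after it. A minimal sketch, assuming a hypothetical helper `parse_bounded` that returns -1 where the commit returns ERR_INVALID_SLOT or ERR_INVALID_PORT:

```c
#include <stddef.h>

#define MAX_SLOT 16383
#define MAX_PORT 65535

/*
 * Parse decimal digits from [p, end) into *out, rejecting the value as soon
 * as it exceeds `max`. Because `val` never grows past `max` before the next
 * multiply, the accumulation cannot overflow a signed int.
 * Returns the number of bytes consumed, or -1 if there are no digits or the
 * value is out of range.
 */
int parse_bounded(const char *p, const char *end, int max, int *out)
{
    const char *start = p;
    int val = 0;

    while (p < end && *p >= '0' && *p <= '9') {
        val = val * 10 + (*p - '0');
        if (val > max)
            return -1;      /* bail out before the next multiply can overflow */
        p++;
    }
    if (p == start)
        return -1;          /* no digits at all */

    *out = val;
    return (int)(p - start);
}
```

The same helper serves both fields: pass `MAX_SLOT` for the slot number and `MAX_PORT` for the port.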

Replace the static cluster topology (built once at startup, never
refreshed) with runtime discovery and automatic refresh:

Topology discovery and refresh:
- Probe CLUSTER SHARDS (Redis 7+) with fallback to CLUSTER SLOTS
  (Redis 3+) for backward compatibility
- O(1) slot_table[16384] lookup replaces per-query linked-list scan
- Automatic topology refresh on MOVED redirect, connection failure,
  or query targeting an unmapped slot (rate-limited to 1/sec)
- Dynamic node creation when MOVED points to an unknown endpoint
- Stale node pruning during refresh with safe connection cleanup
- Cap redirect loop at 5 max redirects to prevent worker hang on
  pathological cluster state
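
As a rough illustration of two of the points above, the sketch below shows a 16384-entry slot table giving O(1) lookup, plus a simple once-per-second refresh gate. The types and function names (`struct node`, `struct cluster`, `cluster_assign`, `cluster_lookup`, `cluster_refresh_allowed`) are assumptions made for this example and do not come from the module source.

```c
#include <stddef.h>
#include <time.h>

#define SLOT_COUNT 16384

/* Stand-in for the module's cluster node record. */
struct node {
    const char *endpoint;
};

struct cluster {
    struct node *slot_table[SLOT_COUNT]; /* NULL means "slot not mapped" */
    time_t last_refresh;                 /* 0 means "never refreshed" */
};

/* Assign the inclusive slot range [first, last] to `n`, as a refresh would. */
int cluster_assign(struct cluster *c, int first, int last, struct node *n)
{
    if (first < 0 || last >= SLOT_COUNT || first > last)
        return -1;
    for (int s = first; s <= last; s++)
        c->slot_table[s] = n;
    return 0;
}

/* O(1) lookup: index the table directly instead of scanning a node list. */
struct node *cluster_lookup(const struct cluster *c, int slot)
{
    if (slot < 0 || slot >= SLOT_COUNT)
        return NULL;
    return c->slot_table[slot];
}

/*
 * Rate limiter: returns 1 and records `now` if a refresh may run, or 0 if
 * one already ran within the last second.
 */
int cluster_refresh_allowed(struct cluster *c, time_t now)
{
    if (c->last_refresh != 0 && now - c->last_refresh < 1)
        return 0;
    c->last_refresh = now;
    return 1;
}
```

A NULL result from the lookup corresponds to the "query targeting an unmapped slot" case, which is one of the listed refresh triggers.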

Cluster observability via MI commands:
- redis_cluster_info: full topology dump including per-node connection
  status, slot assignments, query/error/moved/ask counters, and
  last activity timestamp
- redis_cluster_refresh: trigger manual topology refresh (bypasses
  rate limit)
- redis_ping_nodes: per-node PING with microsecond latency reporting
- All MI commands support optional group filter parameter
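
Microsecond latency of the kind redis_ping_nodes reports can be measured with a monotonic clock around the request. The following is a sketch only: `elapsed_us` and `timed_probe` are hypothetical names, and `probe` stands in for whatever sends PING and waits for the reply.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime on strict compilers */
#endif
#include <stdint.h>
#include <time.h>

/* Difference between two monotonic timestamps, in microseconds. */
int64_t elapsed_us(const struct timespec *start, const struct timespec *end)
{
    int64_t sec  = (int64_t)end->tv_sec  - (int64_t)start->tv_sec;
    int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;
    return sec * 1000000 + nsec / 1000;
}

/*
 * Time one call to `probe` (a stand-in for sending PING and reading the
 * reply). Returns the latency in microseconds, or -1 if the probe fails or
 * the clock is unavailable.
 */
int64_t timed_probe(int (*probe)(void *), void *arg)
{
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    if (probe(arg) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;
    return elapsed_us(&t0, &t1);
}
```

CLOCK_MONOTONIC is used here so that wall-clock adjustments during the probe cannot produce a negative or inflated latency.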

Statistics:
- redis_queries, redis_queries_failed, redis_moved, redis_ask,
  redis_topology_refreshes (module-level stat counters)
- Per-node query, error, moved, ask counters in redis_cluster_info

Hash slot correctness:
- Hash tag {…} extraction per Redis Cluster specification
- CRC16 modulo 16384 replaces bitwise AND with slots_assigned
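
For reference, the slot calculation the Redis Cluster specification describes looks roughly like this: hash only the text between the first `{` and the next `}` when that text is non-empty, otherwise hash the whole key, then take CRC16 modulo 16384. This is a standalone sketch with a bitwise CRC16 (XMODEM variant); the function names are illustrative and the module's own implementation may be table-driven or structured differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection. */
uint16_t crc16_xmodem(const char *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/*
 * Map a key to one of 16384 slots. If the key contains "{...}" with at least
 * one character between the first '{' and the first '}' after it, only that
 * substring is hashed; otherwise the whole key is hashed.
 */
unsigned key_hash_slot(const char *key, size_t len)
{
    const char *open = memchr(key, '{', len);

    if (open != NULL) {
        size_t rest = len - (size_t)(open - key) - 1;
        const char *close = memchr(open + 1, '}', rest);

        if (close != NULL && close > open + 1)
            return crc16_xmodem(open + 1, (size_t)(close - open - 1)) % 16384;
    }
    return crc16_xmodem(key, len) % 16384;
}
```

Keys such as `{user1000}.following` and `{user1000}.followers` share the tag `user1000` and therefore land in the same slot, which is what makes multi-key operations on them possible in a cluster.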

ASK redirect handling:
- Detect ASK responses alongside existing MOVED handling
- Send ASKING command to target node before retrying original query
- Do not update slot map (ASK is a temporary mid-migration redirect)
- Refactor parse_moved_reply into parse_redirect_reply with prefix
  parameter; inline wrappers for backward compatibility
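
The refactor described in the last bullet can be pictured as one parser that takes the expected prefix, with thin wrappers for MOVED and ASK. The sketch below is an assumption about the shape only: `struct redirect`, `parse_redirect`, `parse_moved`, and `parse_ask` are hypothetical names, and unlike the real parser it rejects endpoints without a port instead of handling a no-colon fallback.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct redirect {
    int slot;
    const char *host;   /* points into the reply buffer, not NUL-terminated */
    size_t host_len;
    int port;
};

/*
 * Parse "<prefix> <slot> <host>:<port>", for example
 * "MOVED 3999 127.0.0.1:6381". `reply` must be NUL-terminated.
 * Returns 0 on success, -1 on any malformed or out-of-range field.
 */
int parse_redirect(const char *reply, const char *prefix, struct redirect *out)
{
    size_t plen = strlen(prefix);
    char *end;

    if (strncmp(reply, prefix, plen) != 0 || reply[plen] != ' ')
        return -1;

    const char *p = reply + plen + 1;
    if (*p < '0' || *p > '9')
        return -1;                      /* strtol would accept signs/spaces */
    long slot = strtol(p, &end, 10);
    if (slot < 0 || slot > 16383 || *end != ' ')
        return -1;

    const char *host = end + 1;
    const char *colon = strrchr(host, ':');   /* last colon separates the port */
    if (colon == NULL || colon == host)
        return -1;
    if (colon[1] < '0' || colon[1] > '9')
        return -1;
    long port = strtol(colon + 1, &end, 10);
    if (port < 1 || port > 65535 || *end != '\0')
        return -1;

    out->slot = (int)slot;
    out->host = host;
    out->host_len = (size_t)(colon - host);
    out->port = (int)port;
    return 0;
}

/* Thin wrappers so existing MOVED call sites keep their shape. */
static inline int parse_moved(const char *reply, struct redirect *out)
{
    return parse_redirect(reply, "MOVED", out);
}

static inline int parse_ask(const char *reply, struct redirect *out)
{
    return parse_redirect(reply, "ASK", out);
}
```

After a successful `parse_ask`, the caller would send ASKING to the target and retry without touching the slot map, whereas a successful `parse_moved` would also update the map.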

Connection reliability:
- TCP keepalive via redis_keepalive parameter (default 10s)
- Stack allocation for redis_moved structs (eliminates OOM paths)
- NULL guards on malformed CLUSTER SHARDS/SLOTS reply elements
- Integer overflow protection in slot and port parsing
- NULL guards in MI command handlers for group_name/initial_url

Documentation:
- New section: Redis Cluster Support (topology discovery, automatic
  refresh, MOVED/ASK handling, hash tags)
- MI command reference: redis_cluster_info, redis_cluster_refresh,
  redis_ping_nodes
- Authentication URL format documentation (classic, ACL, no-auth)
- New parameter: redis_keepalive

Test suite (186 tests):
- C unit tests: hash slot calculation (37), MI counter helpers (41)
- Integration: topology startup (12), ASK redirect (16), topology
  refresh (13), MI commands (50), edge cases (16)
- Trap EXIT handlers for safe cluster state restoration
- python3 preflight checks for JSON-dependent tests

Depends on: OpenSIPS#3815 (hash tag + modulo fix), OpenSIPS#3852 (ASK redirect)

@dondetir dondetir left a comment

Nice addition @NormB — Unix socket support and lazy connect are both clean and well-tested.

I verified both features against a local setup:

  • Unix socket: store/fetch, MI reporting ("transport": "unix"), ping — all working
  • Lazy connect: confirmed empty node list before first operation, then full cluster auto-discovery on demand

One small thing: in redis_get_ctx() (line ~83) and redis_get_ctx_unix() (line ~129), the if (!ctx) NULL check after redisConnect()/redisConnectUnix() seems to have been dropped during the rewrite. The if (ctx && ctx->err != REDIS_OK) guard handles the error-context case, but if hiredis returns NULL (OOM), execution falls through to redisSetTimeout(ctx, ...). It looks like commit 8fb569cb3 had the fix but 16652c169 lost it in the rewrite. Easy to re-add:

if (!ctx) {
    LM_ERR("failed to connect to redis - out of memory\n");
    return NULL;
}

Everything else looks solid. The Unix socket + cluster mode guard and the lazy connect flow are well thought out.

NormB and others added 4 commits April 1, 2026 08:25
Exclude standalone test binaries (test_hash, test_mi_counters,
hash_under_test) from the UNIT_TESTS auto-discovery in Makefile.modules.
These files have their own main() and are built via test/Makefile;
pulling them into the module .so causes multiple-definition linker
errors.

Also remove the accidentally committed test/test_mi_counters ELF
binary and add it to .gitignore alongside test_hash.

Reported-by: dondetir <dondetir@users.noreply.github.com>

redisEnableKeepAliveWithInterval() was added in hiredis 1.0.0.
Ubuntu 20.04 ships hiredis 0.14, causing an implicit-function-declaration
error with -Werror. Gate on HIREDIS_MAJOR >= 1, falling back to
redisEnableKeepAlive() (no interval parameter) on older versions.

Add Unix domain socket support as an alternative to TCP connections:
- New URL format: redis:group://localhost/?socket=/path/to/sock
- REDIS_UNIX_SOCKET flag for connection and node identification
- MI output includes transport type (tcp/unix) and socket_path
- Unix socket path tracked in redis_con and cluster_node structs

Add lazy connection establishment:
- New lazy_connect module parameter (integer, default 0)
- Defers Redis connection until first cache operation
- Works for both TCP and Unix socket transport modes

Test suite:
- test_unix_socket.sh: 19 integration tests
- test_lazy_connect.sh: 17 integration tests
- Test stubs synced with production struct layout

Depends on: MR B (feature/redis-cluster-management)

If hiredis returns NULL (OOM), the previous `ctx && ctx->err` guard
skipped the error branch and fell through to redisSetTimeout(ctx, ...),
causing a NULL-pointer dereference. Split into separate !ctx and
ctx->err checks so OOM is caught before any ctx dereference.

Reported-by: dondetir <dondetir@users.noreply.github.com>
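
The ordering of the two checks is the whole fix, so a small model may help. This sketch uses a stand-in `struct fake_ctx` instead of the hiredis context type, and the `connect_status` values and `check_ctx` name are invented for illustration; only the order of the tests reflects the commit message.

```c
#include <stddef.h>

/* Stand-in for the hiredis context; only the `err` field matters here. */
struct fake_ctx {
    int err;    /* 0 means no error, as with REDIS_OK */
};

enum connect_status {
    CONNECT_OK     =  0,
    CONNECT_OOM    = -1,   /* allocation failed, no context returned */
    CONNECT_FAILED = -2    /* context returned but carries an error */
};

/*
 * Classify the result of a connect call. The NULL test must come first: a
 * combined `ctx && ctx->err` test is false for NULL, so code after it would
 * go on to dereference the NULL pointer.
 */
enum connect_status check_ctx(const struct fake_ctx *ctx)
{
    if (ctx == NULL)
        return CONNECT_OOM;
    if (ctx->err != 0)
        return CONNECT_FAILED;
    return CONNECT_OK;
}
```

Only when the result is `CONNECT_OK` is it safe to go on and configure the context, for example by setting a timeout.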
NormB force-pushed the mr/feature-redis-unix-socket-lazy branch from a392dbc to 829d118 on April 1, 2026 12:31