Server thread safety #275

bigbrett · 2026-01-22T22:34:57Z

Server thread safety

TL;DR: Makes wolfHSM server safe to use in multithreaded scenarios.

Overview

This pull request implements thread-safe access to shared server resources in wolfHSM, specifically targeting the NVM (non-volatile memory) subsystem, whose lock also protects the global key cache. Crypto is left to a subsequent PR but is the likely next candidate.

Note that a server context itself still cannot be shared across threads without proper serialization by the caller. This PR adds the mechanisms such that, when multiple server contexts share an NVM instance (which includes the global keystore), access to those shared resources is properly serialized, allowing requests from multiple clients to be processed concurrently in separate threads.

Changes

Introduces lock abstraction layer (wh_lock.{c,h}) with callback-based design for platform independence
Example POSIX lock implementation using pthread_mutex
Adds server-level NVM locking API (wh_Server_NvmLock()/wh_Server_NvmUnlock()) with convenience macros WH_SERVER_NVM_LOCK()/WH_SERVER_NVM_UNLOCK()
All request handlers that access NVM or global keystore resources acquire the lock at the handler level before performing operations
Lower-level modules (NVM, keystore, counter, cert, etc.) remain lock-free; synchronization is the responsibility of the request handler layer
Thread-safe functionality is enabled with the WOLFHSM_CFG_THREADSAFE build option. When this option is NOT defined, all lock macros compile to no-ops with zero overhead
Adds "thread safe stress test" to test suite that attempts to flush out data races via a large number of contention cases, meant to be run under ThreadSanitizer
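
To make the callback-based design above concrete, here is a minimal sketch of what such a lock abstraction could look like. Every name in it (DemoLock, DemoLockCb, DEMO_LOCK, DEMO_CFG_THREADSAFE) is hypothetical and chosen for illustration; it is not the actual wh_lock.h API, only the general shape of a platform-independent lock with a pthread_mutex backend and macros that collapse to no-ops when the option is undefined.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical callback table: a port supplies these four functions. */
typedef struct {
    int (*init)(void* ctx);
    int (*acquire)(void* ctx);
    int (*release)(void* ctx);
    int (*cleanup)(void* ctx);
} DemoLockCb;

/* Hypothetical lock object: callbacks plus an opaque platform context. */
typedef struct {
    const DemoLockCb* cb;
    void*             ctx;
} DemoLock;

/* Example POSIX backend built on pthread_mutex. */
static int posixInit(void* ctx)
{
    return pthread_mutex_init((pthread_mutex_t*)ctx, NULL);
}
static int posixAcquire(void* ctx)
{
    return pthread_mutex_lock((pthread_mutex_t*)ctx);
}
static int posixRelease(void* ctx)
{
    return pthread_mutex_unlock((pthread_mutex_t*)ctx);
}
static int posixCleanup(void* ctx)
{
    return pthread_mutex_destroy((pthread_mutex_t*)ctx);
}

static const DemoLockCb demoPosixCb = {posixInit, posixAcquire, posixRelease,
                                       posixCleanup};

/* Stand-in for the build option; delete this line to get the no-op macros. */
#define DEMO_CFG_THREADSAFE

#ifdef DEMO_CFG_THREADSAFE
#define DEMO_LOCK(l) ((l)->cb->acquire((l)->ctx))
#define DEMO_UNLOCK(l) ((l)->cb->release((l)->ctx))
#else
#define DEMO_LOCK(l) (0)
#define DEMO_UNLOCK(l) (0)
#endif

/* Runs one init/lock/unlock/cleanup cycle; returns 0 on success. */
int demoLockRoundTrip(void)
{
    pthread_mutex_t mutex;
    DemoLock        lock = {&demoPosixCb, &mutex};
    int             rc;
    int             rcCleanup;

    assert(lock.cb->acquire != NULL && lock.cb->release != NULL);

    rc = lock.cb->init(lock.ctx);
    if (rc != 0) {
        return rc;
    }
    rc = DEMO_LOCK(&lock);
    if (rc == 0) {
        rc = DEMO_UNLOCK(&lock);
    }
    rcCleanup = lock.cb->cleanup(lock.ctx);
    return (rc != 0) ? rc : rcCleanup;
}
```

Calling demoLockRoundTrip() exercises the full lifecycle through the callback table; a different port would only need to supply its own four callbacks and context.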

Design Rationale

The locking strategy is intentionally simple: acquire the NVM lock at the start of a request handler, perform all operations (including any compound operations involving multiple NVM/cache accesses), then release the lock. This approach:

Avoids TOCTOU issues - No risk of metadata becoming stale or objects being destroyed/replaced between checks
Makes lock scope visible - Locking is explicit at the handler level rather than hidden in lower layers
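
The handler-level pattern described above might look roughly like the following sketch. It uses a hypothetical in-memory object (DemoNvm) and hypothetical helper names rather than the real wolfHSM NVM API; the point is only that the length check, the clamp, and the read all happen inside one critical section, while the lower-layer helpers stay lock-free.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BADARGS (-1)

/* Hypothetical shared store: one object guarded by one lock. */
typedef struct {
    pthread_mutex_t lock;
    uint8_t         data[64];
    uint16_t        len;
} DemoNvm;

/* Lower layer stays lock-free; the caller must already hold nvm->lock. */
static uint16_t demoNvmGetLen(const DemoNvm* nvm)
{
    return nvm->len;
}
static void demoNvmRead(const DemoNvm* nvm, uint16_t off, uint16_t len,
                        uint8_t* out)
{
    memcpy(out, nvm->data + off, len);
}

/* Request handler: returns the number of bytes read, or a negative error. */
int demoHandleRead(DemoNvm* nvm, uint16_t offset, uint16_t len, uint8_t* out)
{
    int      rc;
    uint16_t objLen;

    assert(nvm != NULL && out != NULL);

    pthread_mutex_lock(&nvm->lock);
    objLen = demoNvmGetLen(nvm); /* check ... */
    if (offset >= objLen) {
        rc = DEMO_BADARGS;
    }
    else {
        /* Clamp length to object size */
        if ((uint32_t)offset + len > objLen) {
            len = (uint16_t)(objLen - offset);
        }
        demoNvmRead(nvm, offset, len, out); /* ... and use, same lock hold */
        rc = (int)len;
    }
    pthread_mutex_unlock(&nvm->lock);
    return rc;
}

/* Reads 16 bytes at offset 4 of a 10-byte object; the clamp leaves 6. */
int demoHandleReadExample(void)
{
    DemoNvm nvm = {.len = 10};
    uint8_t out[16] = {0};
    int     n;

    pthread_mutex_init(&nvm.lock, NULL);
    memset(nvm.data, 0xAB, sizeof(nvm.data));
    n = demoHandleRead(&nvm, 4, 16, out);
    pthread_mutex_destroy(&nvm.lock);
    return n;
}
```

Because the lock is taken once in the handler, no other thread can resize or replace the object between the length check and the copy.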

Gaps/Future Work

Serializing access to global crypto state, specifically hardware crypto for ports. A bit of a tricky problem since offload is provided at the port level, and there isn't a good way for wolfHSM to know which algos will be accelerated and which won't. A naive implementation might consider simply locking the server crypto context, but this contains a mixture of local (CMAC) and quasi-global (RNG) elements and no abstraction for hardware. Locks also need to be synchronized with the wolfCrypt port mutex. We should refactor the server crypto context and perhaps split it into local and global structures, with the global supporting hardware state. Future work...


billphipps

Truly excellent! You solved this just the way I had hoped for!
My requested changes are very limited and not really functional. More just fleshing out the exact requirements for a real implementation and a few minor typos and renaming opportunities.

The stress testing framework is outstanding!

wolfhsm/wh_lock.h

billphipps · 2026-01-25T16:28:39Z

src/wh_lock.c

+#include "wolfhsm/wh_lock.h"
+#include "wolfhsm/wh_error.h"
+
+#ifdef WOLFHSM_CFG_THREADSAFE


Is this the best name? Consider the more mundane WOLFHSM_CFG_LOCKS. Threadsafe may imply more than just locks, like cancelability.

yeah was kind of wishy washy on this. good point. Let me think on it.

test/wh_test_lock.c

test/wh_test_posix_threadsafe_stress.c

billphipps · 2026-01-25T16:47:35Z

test/wh_test_posix_threadsafe_stress.c

Consider adding posix into the name of this file since it relies heavily on POSIX to provide any real functionality.

Yeah it might be nice to organize our posix tests in one spot. maybe test/posix or port/posix/test/ so we can leave our wh_test_*.c stuff generic for all platforms

I really like that solution. +1

That is a good idea. Unfortunately, a lot of our generic test modules (e.g. wh_test_clientserver.c) contain both generic drivers and a POSIX harness (e.g. one that spins up the client + server threads). I think it might be best to push this out of the scope of this PR and refactor the tests to better split generic test drivers (e.g. whTest_XXXClientCfg(whClientConfig*) and whTest_XXXClientCtx(whClientCtx*)) from the actual underlying test harness. I'd wager we could eliminate a lot of code that way, with one or two unified harnesses that the drivers just run on top of.

rizlik

I didn't look into tests yet.
Great work.
Is this lock enough to properly synchronize client requests?
For example, in _HandleNvmRead:

    rc = wh_Nvm_GetMetadata(server->nvm, id, &meta);
    if (rc != WH_ERROR_OK) {
        return rc;
    }

    if (offset >= meta.len)
        return WH_ERROR_BADARGS;

    /* Clamp length to object size */
    if ((offset + len) > meta.len) {
        len = meta.len - offset;
    }

    rc = wh_Nvm_ReadChecked(server->nvm, id, offset, len, out_data);
    if (rc != WH_ERROR_OK)

The metadata can change between GetMetadata and ReadChecked.
Also, when handling a key request:

            /* get a new id if one wasn't provided */
            if (WH_KEYID_ISERASED(meta->id)) {
                ret     = wh_Server_KeystoreGetUniqueId(server, &meta->id);
                resp.rc = ret;
            }
            /* write the key */
            if (ret == WH_ERROR_OK) {
                ret     = wh_Server_KeystoreCacheKeyChecked(server, meta, in);
                resp.rc = ret;
            }

The id might no longer be unique by the time wh_Server_KeystoreCacheKeyChecked runs.

Would coarser-grained locking at the request level simplify the design?
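
As an illustration of the second race and of the coarser locking being asked about, the sketch below picks an unused id and inserts it within a single critical section, then has four threads hammer that path. The cache structure and all function names are hypothetical stand-ins, not the wolfHSM keystore.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DEMO_SLOTS 64
#define DEMO_THREADS 4

/* Hypothetical key cache: a list of ids in use, guarded by one lock. */
typedef struct {
    pthread_mutex_t lock;
    uint16_t        ids[DEMO_SLOTS];
    int             count;
} DemoCache;

/* Unlocked helper: returns the lowest unused id, or 0 if none is free. */
static uint16_t demoGetUniqueId(const DemoCache* c)
{
    uint16_t id;
    int      i;

    for (id = 1; id <= DEMO_SLOTS; id++) {
        int used = 0;
        for (i = 0; i < c->count; i++) {
            if (c->ids[i] == id) {
                used = 1;
                break;
            }
        }
        if (!used) {
            return id;
        }
    }
    return 0;
}

/* Unlocked helper: records an id as in use. */
static int demoInsert(DemoCache* c, uint16_t id)
{
    if (c->count >= DEMO_SLOTS) {
        return -1;
    }
    c->ids[c->count++] = id;
    return 0;
}

/* Request level: id selection and insertion share one critical section. */
uint16_t demoCacheNewKey(DemoCache* c)
{
    uint16_t id;

    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    id = demoGetUniqueId(c);
    if (id != 0 && demoInsert(c, id) != 0) {
        id = 0;
    }
    pthread_mutex_unlock(&c->lock);
    return id;
}

static void* demoWorker(void* arg)
{
    int i;
    for (i = 0; i < DEMO_SLOTS / DEMO_THREADS; i++) {
        (void)demoCacheNewKey((DemoCache*)arg);
    }
    return NULL;
}

/* Returns the number of duplicate ids, or -1 if the run was incomplete. */
int demoCountDuplicates(void)
{
    DemoCache c = {.count = 0};
    pthread_t t[DEMO_THREADS];
    int       started = 0;
    int       dups    = 0;
    int       i, j;

    pthread_mutex_init(&c.lock, NULL);
    for (i = 0; i < DEMO_THREADS; i++) {
        if (pthread_create(&t[i], NULL, demoWorker, &c) != 0) {
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
    }
    for (i = 0; i < c.count; i++) {
        for (j = i + 1; j < c.count; j++) {
            if (c.ids[i] == c.ids[j]) {
                dups++;
            }
        }
    }
    pthread_mutex_destroy(&c.lock);
    return (c.count == DEMO_SLOTS) ? dups : -1;
}
```

If the lock were instead taken separately inside each helper, two threads could both observe the same id as free before either inserted it, which is the scenario described above.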

src/wh_server_keystore.c

API/Error handling:
- Add initialized flag to whLock structure to distinguish init states
- Enhance error handling: acquire/release check initialized flag
- Make wh_Lock_Cleanup zero structure for clear post-cleanup state
- Document init/cleanup must be single-threaded (no atomics)
- Document cleanup preconditions (no active contention required)
- Update all API docs with precise return codes and error conditions
- Change blocking acquire failure from ERROR_LOCKED to ERROR_ABORTED
- Add comment explaining why non-blocking acquire is not provided

POSIX port improvements:
- Enhanced errno mapping in posix_lock.c (EINVAL→BADARGS, etc)
- Trap PTHREAD_MUTEX_ERRORCHECK errors (EDEADLK, EPERM)

Test coverage:
- Add testUninitializedLock to validate error handling
- Enhance testLockLifecycle with post-cleanup validation tests

Misc:
- Apply consistent critical section style pattern in wh_nvm.c
- Update copyright years to 2026
- Rename stress test files to wh_test_posix_threadsafe_stress.*
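
For readers unfamiliar with PTHREAD_MUTEX_ERRORCHECK, the sketch below shows the kind of errno mapping this commit message refers to. The DEMO_ERR_* codes and the demoMapErrno function are hypothetical, and the choice to map EDEADLK and EPERM to an "aborted" code is an assumption made for illustration; only the EINVAL to bad-args mapping is stated in the message. The pthread calls themselves are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical error codes standing in for the library's own. */
enum {
    DEMO_ERR_OK      = 0,
    DEMO_ERR_BADARGS = -1,
    DEMO_ERR_ABORTED = -2
};

/* Maps a pthread return value to a hypothetical library error code. */
int demoMapErrno(int err)
{
    switch (err) {
        case 0:
            return DEMO_ERR_OK;
        case EINVAL:
            return DEMO_ERR_BADARGS;
        case EDEADLK: /* errorcheck mutex: relock by the owning thread */
        case EPERM:   /* errorcheck mutex: unlock by a non-owner */
        default:
            return DEMO_ERR_ABORTED; /* assumed mapping, for illustration */
    }
}

/* Locks an errorcheck mutex twice from one thread and returns the raw
 * result of the second lock attempt. */
int demoErrorcheckRelock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t     m;
    int                 rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);
    rc = pthread_mutex_lock(&m); /* reports an error instead of deadlocking */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

With a default mutex type the second lock attempt could deadlock silently; the errorcheck type turns that misuse into a return code the port can map and report.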

bigbrett · 2026-01-27T18:04:23Z

@rizlik great catch, thanks. I thought I fixed all of those but clearly there are some non-atomic compound operations still lurking. I will make another pass to ensure I make them all atomic.

rizlik · 2026-01-27T18:24:08Z

> @rizlik great catch, thanks. I thought I fixed all of those but clearly there are some non-atomic compound operations still lurking. I will make another pass to ensure I make them all atomic.

I wonder: if we are going to use a single lock, can't we just acquire it at the start of wh_Server_HandleKeyRequest and release it at the end (and likewise for wh_Server_HandleNvmRequest)?

It's probably a tradeoff: we gain simplicity since we don't need locked vs. unlocked APIs, but there is a risk that other parts of the code misuse the NVM API and introduce races in the future.

bigbrett · 2026-01-27T19:37:47Z

> It's probably a tradeoff: we gain simplicity since we don't need locked vs. unlocked APIs, but there is a risk that other parts of the code misuse the NVM API and introduce races in the future.

@rizlik yep that is what I was worried about and why I didn't initially try it that way ¯\_(ツ)_/¯

I'm not 100% sold on which is better

wolfhsm/wh_lock.h

src/wh_nvm.c

src/wh_server_keystore.c

…nter, img_mgr, and nvm modules

Adds proper thread-safety locking discipline to additional server modules that perform compound NVM operations. This prevents TOCTOU (Time-Of-Check-Time-Of-Use) issues where metadata could become stale between check and use/writeback.

Changes:
- wh_server_cert.c: Add NVM locking for atomic GetMetadata + Read operations in certificate read and export paths
- wh_server_counter.c: Add NVM locking for atomic read-modify-write counter increment operations
- wh_server_img_mgr.c: Add NVM locking for atomic signature load operations
- wh_server_keystore.c: Refactor to use unlocked internal variants for compound operations (GetUniqueId + CacheKey, policy check + erase, freshen + export). Add locking discipline documentation.
- wh_server_nvm.c: Add NVM locking for DMA read operations to ensure metadata remains valid throughout transfer. Add locking discipline documentation.
- wh_test_posix_threadsafe_stress.c: Add new stress test phases for counter concurrent increment, counter increment vs read, NVM read vs resize, NVM concurrent resize, and NVM read DMA vs resize. Add counter atomicity validation.

All compound operations now follow the pattern:
1. Acquire server->nvm->lock
2. Use only *Unlocked() variants internally
3. Keep lock held for entire operation including DMA
4. Release lock after all metadata-dependent operations complete
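
A small sketch of the four-step pattern applied to a counter increment, together with the kind of atomicity check a stress phase could perform. The DemoCounter type and function names are hypothetical; the real stress test and counter module are not reproduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DEMO_THREADS 4
#define DEMO_ITERS 10000

/* Hypothetical shared counter guarded by a single lock. */
typedef struct {
    pthread_mutex_t lock;
    uint32_t        value;
} DemoCounter;

/* Unlocked variants: the caller must already hold c->lock. */
static uint32_t demoReadUnlocked(const DemoCounter* c)
{
    return c->value;
}
static void demoWriteUnlocked(DemoCounter* c, uint32_t v)
{
    c->value = v;
}

/* Compound read-modify-write performed under one lock hold. */
uint32_t demoIncrement(DemoCounter* c)
{
    uint32_t next;

    assert(c != NULL);
    pthread_mutex_lock(&c->lock);     /* 1. acquire */
    next = demoReadUnlocked(c) + 1;   /* 2. unlocked variants only */
    demoWriteUnlocked(c, next);       /* 3. lock held across the whole op */
    pthread_mutex_unlock(&c->lock);   /* 4. release */
    return next;
}

static void* demoWorker(void* arg)
{
    int i;
    for (i = 0; i < DEMO_ITERS; i++) {
        (void)demoIncrement((DemoCounter*)arg);
    }
    return NULL;
}

/* Runs concurrent increments and returns the final counter value,
 * or 0 if a worker thread could not be started. */
uint32_t demoStressCounter(void)
{
    DemoCounter c = {.value = 0};
    pthread_t   t[DEMO_THREADS];
    int         started = 0;
    int         i;
    uint32_t    finalValue;

    pthread_mutex_init(&c.lock, NULL);
    for (i = 0; i < DEMO_THREADS; i++) {
        if (pthread_create(&t[i], NULL, demoWorker, &c) != 0) {
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
    }
    finalValue = (started == DEMO_THREADS) ? c.value : 0;
    pthread_mutex_destroy(&c.lock);
    return finalValue;
}
```

If any increment were lost to a race, the final value would fall short of DEMO_THREADS * DEMO_ITERS, which is the property an atomicity check can assert.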

AlexLanzano

Looks really good so far!

My main concern is the addition of the *Unlocked functions. I feel like there has to be a way to remove those and still use the top-level API functions, either by checking whether the current thread has already acquired the NVM lock or by creating a lock for both the keystore and the NVM.
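
One standard way to get the "has this thread already acquired the lock" behavior on POSIX is a recursive mutex. The sketch below is only an illustration of that idea with hypothetical names (DemoStore and friends); it is not something this PR implements, and a non-POSIX port would need an equivalent primitive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical store whose public API always takes the lock itself. */
typedef struct {
    pthread_mutex_t lock;
    int             value;
} DemoStore;

/* Initializes the store with a recursive mutex; returns 0 on success. */
int demoStoreInit(DemoStore* s)
{
    pthread_mutexattr_t attr;
    int                 rc;

    assert(s != NULL);
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) {
        rc = pthread_mutex_init(&s->lock, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    s->value = 0;
    return rc;
}

int demoStoreGet(DemoStore* s)
{
    int v;
    pthread_mutex_lock(&s->lock);
    v = s->value;
    pthread_mutex_unlock(&s->lock);
    return v;
}

void demoStoreSet(DemoStore* s, int v)
{
    pthread_mutex_lock(&s->lock);
    s->value = v;
    pthread_mutex_unlock(&s->lock);
}

/* Compound operation: holds the lock and reuses the public API, which
 * re-acquires the same recursive mutex without deadlocking. */
int demoStoreAdd(DemoStore* s, int delta)
{
    int v;
    pthread_mutex_lock(&s->lock);
    v = demoStoreGet(s) + delta;
    demoStoreSet(s, v);
    pthread_mutex_unlock(&s->lock);
    return v;
}

/* Adds 5 then -2 to a fresh store; returns the final value, or -1 if
 * initialization failed. */
int demoStoreExample(void)
{
    DemoStore s;
    int       v;

    if (demoStoreInit(&s) != 0) {
        return -1;
    }
    (void)demoStoreAdd(&s, 5);
    v = demoStoreAdd(&s, -2);
    pthread_mutex_destroy(&s.lock);
    return v;
}
```

The tradeoff is that re-entrant locking removes the need for *Unlocked variants but also makes lock scope less visible, since any call may or may not be the outermost acquisition.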

test/Makefile

test/wh_test_lock.c

AlexLanzano · 2026-01-28T15:54:51Z

test/wh_test_posix_threadsafe_stress.c

Yeah it might be nice to organize our posix tests in one spot. maybe test/posix or port/posix/test/ so we can leave our wh_test_*.c stuff generic for all platforms

wolfhsm/wh_nvm_internal.h


WOLFHSM_CFG_THREADSAFE: Adds framework for internal server thread safety, serializing access to shared global resources like NVM and global keycache

2cfc0e4

bigbrett requested review from AlexLanzano and billphipps January 22, 2026 23:26

bigbrett assigned billphipps and AlexLanzano and unassigned billphipps Jan 22, 2026

bigbrett requested review from JacobBarthelmeh and rizlik January 22, 2026 23:27

bigbrett mentioned this pull request Jan 23, 2026

authentication manager feature addition #270

Open

billphipps requested changes Jan 25, 2026

rizlik requested changes Jan 26, 2026

AlexLanzano reviewed Jan 27, 2026

AlexLanzano reviewed Jan 27, 2026

AlexLanzano reviewed Jan 27, 2026

AlexLanzano requested changes Jan 28, 2026

Massive refactor to locking integration. Pull locking out of lower level server module APIs (keystore, NVM, counter, etc.) and acquire lock in request handling functions (e.g. wh_Server_HandleXXXRequest())

a58ca2b

bigbrett assigned bigbrett and unassigned AlexLanzano and billphipps Jan 28, 2026

bigbrett added 3 commits January 28, 2026 14:54

simplify trailing comment

9e7cfac

cleanups

af1b59f

Test housekeeping fixes, mostly macro protection

03719fa

bigbrett force-pushed the server-thread-safe branch from 07aebaf to 03719fa January 28, 2026 22:41

rename CI action

2c446fa

bigbrett force-pushed the server-thread-safe branch from 667996d to 2c446fa January 28, 2026 22:53

TSAN options to fail-fast in CI on error

968d8cd
