Skip to content

Make env* functions thread safe#5735

Open
pbarejko wants to merge 2 commits into
isaac-sim:developfrom
pbarejko:pbarejko/thread-safe-env
Open

Make env* functions thread safe#5735
pbarejko wants to merge 2 commits into
isaac-sim:developfrom
pbarejko:pbarejko/thread-safe-env

Conversation

@pbarejko
Copy link
Copy Markdown
Collaborator

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
List any dependencies that are required for this change.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (existing functionality will not work without user modification)
  • Documentation update

Screenshots

Please attach before and after screenshots of the change if applicable.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions Bot added documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team infrastructure labels May 21, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Thanks for tackling this intermittent CI crash! The LD_PRELOAD approach is the standard workaround for third-party code calling non-thread-safe glibc functions—ASAN uses the same technique.

🔴 Critical Issue

Invalid COPY path in Dockerfile.base

COPY ../docker/envguard/ /tmp/envguard/

Docker's COPY instruction cannot reference paths outside the build context using ../. This will fail with:

COPY failed: forbidden path outside the build context

Fix: Change to COPY docker/envguard/ /tmp/envguard/ (assuming build context is the repo root).


⚠️ Design Considerations

1. Return value lifetime after unlock

pthread_rwlock_rdlock(&g_env_lock);
char *value = fn(name);
pthread_rwlock_unlock(&g_env_lock);
return value;  // ← Pointer to internal storage

After releasing the lock, another thread could call setenv()/unsetenv() and invalidate the returned pointer. This is a fundamental POSIX limitation that the shim cannot fully solve—it reduces the race window but doesn't eliminate it. Consider adding a comment documenting this for future maintainers.

2. resolve_lazy concurrent initialization

Multiple threads could race through the NULL check and call dlsym() simultaneously:

if (sym == NULL) {
    sym = dlsym(RTLD_NEXT, name);  // Multiple threads can reach here

While glibc's dlsym is thread-safe, redundant lookups waste cycles. Using pthread_once would be more robust, though the current implementation is functionally correct.


💡 Minor Suggestions

  1. errno preservation: Lock operations may set errno; consider saving/restoring it:

    int saved_errno = errno;
    pthread_rwlock_wrlock(&g_env_lock);
    int rc = fn(name, value, overwrite);
    int fn_errno = errno;
    pthread_rwlock_unlock(&g_env_lock);
    errno = fn_errno;
    return rc;
  2. Test coverage gaps: The smoke test exercises getenv/setenv/unsetenv but not putenv, clearenv, or secure_getenv. Consider extending coverage.

  3. Verification helper: The README could include a snippet showing how to verify the shim is active at runtime:

    cat /proc/self/maps | grep libenvguard  # Inside container
  4. Build hardening: Consider adding -Werror and -D_FORTIFY_SOURCE=2 to CFLAGS for CI builds.


✅ What's Good

  • rwlock design: Using pthread_rwlock instead of a mutex allows concurrent reads—smart choice for the common case.
  • Lazy fallback: The atomic double-check handles edge cases where wrappers are called before the constructor.
  • Comprehensive README: Clear explanation of the problem, solution, and caveats.
  • BSD-3-Clause license: Consistent with the project.

Verdict: The approach is sound and solves a real problem. Please fix the COPY path issue—the Docker build will fail otherwise. The other items are suggestions for hardening that can land in follow-up PRs.


Update (1afb539): 👍 Good improvement switching from ENV LD_PRELOAD to /etc/ld.so.preload. This correctly handles Isaac Sim's python.sh overwriting LD_PRELOAD. Documentation updates are clear and thorough.

The COPY ../docker/envguard/ path issue remains—please address before merge.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR introduces libenvguard.so, a tiny LD_PRELOAD shim that serializes glibc's environment functions (getenv, setenv, putenv, unsetenv, secure_getenv, clearenv) behind a process-wide pthread_rwlock_t to fix intermittent SIGSEGV crashes at libc.so.6!getenv+0x56 in CI, caused by Carb/Kit/Omni plugins calling getenv() from worker threads while the environment is being mutated concurrently.

  • docker/envguard/envguard.c implements the rwlock shim with constructor-time symbol resolution and a lazy-init fallback; the locking strategy (readers share, writers exclude) is correct for the stated purpose.
  • docker/Dockerfile.base and source/isaaclab/test/install_ci/Dockerfile.installci both build, install, and activate the shim via ENV LD_PRELOAD=..., but set LD_PRELOAD unconditionally rather than prepending, which would silently drop any preloads already set by the Isaac Sim base image.

Confidence Score: 3/5

The shim logic is sound but both Dockerfiles unconditionally replace LD_PRELOAD rather than prepending, which could silently break Isaac Sim's own preloaded libraries at runtime.

The C shim correctly serializes all six env functions and the Makefile and test are well structured. The integration points in both Dockerfiles share a common flaw: ENV LD_PRELOAD=/usr/local/lib/libenvguard.so overwrites any preload the Isaac Sim base image may have already established. If the base image ships its own shims, dropping them silently could break GPU/CUDA functionality or other Omniverse internals — something that would only surface at runtime, not at image build time.

docker/Dockerfile.base and source/isaaclab/test/install_ci/Dockerfile.installci both need the LD_PRELOAD assignment changed to prepend rather than replace.

Important Files Changed

Filename Overview
docker/Dockerfile.base Adds build-and-install of libenvguard.so and sets LD_PRELOAD globally; LD_PRELOAD is set unconditionally and may clobber existing preloads from the Isaac Sim base image.
docker/envguard/envguard.c New LD_PRELOAD shim wrapping all glibc env functions with a process-wide rwlock; locking is correct but the returned getenv pointer is only valid during the lock hold, not after.
docker/envguard/Makefile Standard Makefile for building and installing libenvguard.so; all targets (all, install, check, clean) are correctly defined.
docker/envguard/test_envguard.c Concurrency smoke test with 8 reader / 4 writer threads; covers the critical getenv+setenv race but the post-return pointer dereference (strlen on returned value after unlock) is technically still racy without the shim's guarantees.
source/isaaclab/test/install_ci/Dockerfile.installci Adds envguard install inline inside ISAACLAB_PATH (leaving build artifacts) and sets LD_PRELOAD unconditionally, sharing the same clobbering risk as Dockerfile.base.
docker/envguard/README.md Clear documentation of the problem, solution, build steps, and caveats; post-return pointer staleness could be called out more explicitly in the Caveats section.

Sequence Diagram

sequenceDiagram
    participant RT as Runtime Process
    participant EG as libenvguard.so (LD_PRELOAD)
    participant GL as glibc (RTLD_NEXT)

    Note over RT,GL: Library constructor runs at load time
    EG->>GL: "dlsym(RTLD_NEXT, getenv) -> real_getenv"
    EG->>GL: "dlsym(RTLD_NEXT, setenv) -> real_setenv"
    EG->>GL: "dlsym(RTLD_NEXT, clearenv) -> real_clearenv"

    par Reader thread (Carb/Kit/Omni plugin)
        RT->>EG: getenv(PATH)
        EG->>EG: pthread_rwlock_rdlock
        EG->>GL: "real_getenv(PATH) -> ptr"
        EG->>EG: pthread_rwlock_unlock
        EG-->>RT: return ptr (valid only until next writer)
    and Writer thread (Python startup / another plugin)
        RT->>EG: setenv(FOO, bar, 1)
        EG->>EG: pthread_rwlock_wrlock blocks until readers done
        EG->>GL: real_setenv(FOO, bar, 1)
        EG->>EG: pthread_rwlock_unlock
        EG-->>RT: return 0
    end
Loading

Reviews (1): Last reviewed commit: "Make env* functions thread safe" | Re-trigger Greptile

Comment thread docker/Dockerfile.base Outdated
COPY ../docker/envguard/ /tmp/envguard/
RUN make -C /tmp/envguard install PREFIX=/usr/local && \
rm -rf /tmp/envguard
ENV LD_PRELOAD=/usr/local/lib/libenvguard.so
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 LD_PRELOAD clobbers any existing preload from the base image

ENV LD_PRELOAD=/usr/local/lib/libenvguard.so unconditionally replaces the variable rather than prepending to it. The Isaac Sim base image (${ISAACSIM_BASE_IMAGE_ARG}) is a third-party NVIDIA container that may already export LD_PRELOAD for its own CUDA or Omniverse shims. Replacing the value would silently drop those libraries for every RUN step and in the final container, potentially breaking unrelated Isaac Sim functionality. The safe pattern is to prepend only: ENV LD_PRELOAD=/usr/local/lib/libenvguard.so:${LD_PRELOAD} — Docker expands ${LD_PRELOAD} from the layer environment at build time, so if the base image exported it, it will be preserved; if it was empty, a trailing : is harmless to the dynamic linker.

Comment on lines 62 to 63
# Fix line endings in .sh files (Win git may add \r)
RUN find ${ISAACLAB_PATH} -type f -name "*.sh" -exec sed -i 's/\r$//' {} +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 LD_PRELOAD clobbers existing preloads, and build artifacts left in-place

Same clobbering risk as Dockerfile.baseENV LD_PRELOAD=/usr/local/lib/libenvguard.so replaces rather than prepends. Additionally, unlike the Dockerfile.base variant (which copies to /tmp and removes the tree), make install here runs directly inside ${ISAACLAB_PATH}/docker/envguard/, leaving libenvguard.so and any object files as build artifacts inside the image's ISAACLAB_PATH. A trailing && make -C ${ISAACLAB_PATH}/docker/envguard clean would keep the image tidy.

Comment on lines +106 to +116
char *getenv(const char *name)
{
getenv_fn_t fn = (getenv_fn_t) resolve_lazy((void **) &real_getenv, "getenv");
if (fn == NULL) {
return NULL;
}
pthread_rwlock_rdlock(&g_env_lock);
char *value = fn(name);
pthread_rwlock_unlock(&g_env_lock);
return value;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 getenv return value can be invalidated the moment the lock drops

The read lock is released before the pointer is returned to the caller. Any writer thread that calls setenv(), unsetenv(), or clearenv() immediately after will acquire the write lock and may reallocate or free the __environ array that the returned char * points into. Callers that store the result and dereference it outside of a subsequent locked region are still racy. The README caveats section only mentions the __environ/environ direct-access case; it is worth adding a note stating explicitly that the returned pointer from getenv/secure_getenv is only valid until the next write-side env function call on another thread, and callers should strdup if they need it to outlive that window.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this design will still leave a race condition since it won't generally be possible to locate and lock around all of the write-side calls (ie: calls in third party libraries such as Python).

The read wrappers would instead need to return a copy of the string, not the raw pointer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, and the original comment undersold the severity. Documentation can't fix this because, as you point out, write-side calls in third-party code (Python's os.environ, putenv inside native extensions, etc.) will never be coordinated with callers of the read wrappers.

Returning a strdup'd copy is the correct fix, but it breaks the POSIX getenv API contract where callers assume they do not own the pointer. A practical middle ground is a thread-local rotating buffer, similar to how glibc implements strerror. Each thread gets its own small pool of slots; the wrapper copies the value in, returns that pointer, and the memory is only reclaimed when the same thread reuses the slot. This keeps the caller-doesn't-free contract intact for the common case while eliminating the cross-thread race:

#define TLS_BUF_SLOTS  4
#define TLS_BUF_SIZE   256

static char *getenv_tls_copy(const char *val) {
    static __thread char bufs[TLS_BUF_SLOTS][TLS_BUF_SIZE];
    static __thread int  idx = 0;
    if (val == NULL) return NULL;
    char *dst = bufs[idx++ % TLS_BUF_SLOTS];
    strncpy(dst, val, TLS_BUF_SIZE - 1);
    dst[TLS_BUF_SIZE - 1] = '\0';
    return dst;
}

Then in the wrapper:

char *getenv(const char *name) {
    getenv_fn_t fn = (getenv_fn_t) resolve_lazy((void **) &real_getenv, "getenv");
    if (fn == NULL) return NULL;
    pthread_rwlock_rdlock(&g_env_lock);
    char *value = getenv_tls_copy(fn(name));  // copy while lock is held
    pthread_rwlock_unlock(&g_env_lock);
    return value;
}

The fixed buffer size is the main caveat — PATH and similar variables can exceed 256 bytes. For production use you'd want to either make the slots heap-allocated (with a free on slot reuse) or fall back to strdup for oversized values and document that those do need to be freed. The secure_getenv wrapper needs the same treatment.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point — you're right that the returned char* can be invalidated after the read lock is released if a concurrent writer reallocates the environ array or overwrites the value in-place.

A strdup() before releasing the lock would close that window:

pthread_rwlock_rdlock(&g_env_lock);
char *value = fn(name);
char *copy = value ? strdup(value) : NULL;
pthread_rwlock_unlock(&g_env_lock);
return copy;

The tradeoff is that callers now own heap memory they won't free — POSIX getenv() returns a pointer that callers are not expected to free(), so this introduces a small per-call leak. In practice most getenv() call sites read the value once and never touch the pointer again, so the leak is bounded, but it's a semantic contract change worth noting.

That said, the practical risk profile here is:

  1. The shim already eliminates the SIGSEGV-during-lookup crash (the __environ array realloc race) which is the failure mode hitting CI.
  2. The use-after-release race you describe requires a writer to fire in the narrow window between the caller receiving the pointer and finishing its read — a much smaller window than the unprotected lookup, but still real.

Returning a copy is the more correct solution. Worth adding — I'll leave it to @pbarejko to decide if they want to take that on in this PR or as a follow-up.

Comment on lines +106 to +116
char *getenv(const char *name)
{
getenv_fn_t fn = (getenv_fn_t) resolve_lazy((void **) &real_getenv, "getenv");
if (fn == NULL) {
return NULL;
}
pthread_rwlock_rdlock(&g_env_lock);
char *value = fn(name);
pthread_rwlock_unlock(&g_env_lock);
return value;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this design will still leave a race condition since it won't generally be possible to locate and lock around all of the write-side calls (ie: calls in third party libraries such as Python).

The read wrappers would instead need to return a copy of the string, not the raw pointer.

Comment on lines +6 to +7
// envguard: preload shim that makes glibc's environment functions
// thread-safe relative to each other.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while this might work in some cases, it is still not guaranteed that even with this library preloaded that all libraries will pick up these wrapped versions of the env functions. This would also be error prone since it needs to be explicitly preloaded into the process for it to work.

Even hooking these functions directly (ie: with something like Detours on Windows) would still leave the same potential data race accessing the returned string pointer is another thread reallocates the env block before the get-caller accesses the returned pointer. Working around that would require the use of TLS to maintain a scratch buffer on each thread that calls any of the env read functions. The read functions would need to copy the returned string (from the real getenv() call) while still inside the lock, then return the pointer to the copied string instead. The TLS buffers would also need to be cleaned up on thread exit to avoid a memory leak.

@pbarejko pbarejko force-pushed the pbarejko/thread-safe-env branch from 1afb539 to 2ecbaac Compare May 26, 2026 21:16
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Update (2ecbaac)

This update includes significant changes that address CI/testing concerns and documentation improvements:

Key Changes Reviewed

CI/Workflow Improvements:

  • exclude-pattern input added to run-package-tests and run-tests actions, allowing comma-separated test exclusions
  • Rendering correctness tests (test_rendering_*.py) removed from TESTS_TO_SKIP in tools/test_settings.py, indicating these now run in standard CI jobs
  • Non-root verification jobs folded into test-isaaclab-ov and test-curobo jobs to reduce runner costs
  • JUnit XML reports now uploaded as artifacts for better test visibility

envguard Thread-Safety Shim (Core Addition):

  • New docker/envguard/ directory with an LD_PRELOAD shim (libenvguard.so) that serializes glibc environment functions (getenv/setenv/putenv/unsetenv/secure_getenv/clearenv) behind an rwlock
  • Addresses intermittent CI SIGSEGVs from Carb/Kit/Omni plugins calling getenv from worker threads while environment is mutated
  • Uses /etc/ld.so.preload activation (not LD_PRELOAD env var) because Isaac Sim's python.sh overwrites LD_PRELOAD
  • Includes comprehensive README, Makefile, and smoke test

Documentation Reorganization:

  • source/features/visualization.rst moved to source/overview/core-concepts/visualization.rst
  • Experimental features documentation restructured with new isaaclab_contrib package overview
  • RLinf VLA post-training docs moved to separate page (rlinf_vla_posttraining.rst)
  • Environment presets documentation clarified with labeled groups (physics=/renderer=/presets=)

Test Fixes:

  • test_noise.py: Fixed potential NaN issues when random data is exactly 0 with 'scale' operation
  • Rendering test utilities: Added OVRTX renderer log redirection to stdout for pytest capture

Minor Fixes:

  • OVRTXRendererCfg.log_file_path now uses cross-platform temp directory instead of Linux-specific /tmp
  • Changelog entries processed for isaaclab_ov (0.4.1) and isaaclab_teleop (0.5.1)

Assessment

The envguard shim is a well-documented, defensive solution to a genuine glibc thread-safety gap. The CI improvements reduce redundancy while maintaining coverage. Documentation changes improve organization.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants