Make env* functions thread safe#5735
Conversation
There was a problem hiding this comment.
Code Review Summary
Thanks for tackling this intermittent CI crash! The LD_PRELOAD approach is the standard workaround for third-party code calling non-thread-safe glibc functions—ASAN uses the same technique.
🔴 Critical Issue
Invalid COPY path in Dockerfile.base
COPY ../docker/envguard/ /tmp/envguard/Docker's COPY instruction cannot reference paths outside the build context using ../. This will fail with:
COPY failed: forbidden path outside the build context
Fix: Change to COPY docker/envguard/ /tmp/envguard/ (assuming build context is the repo root).
⚠️ Design Considerations
1. Return value lifetime after unlock
pthread_rwlock_rdlock(&g_env_lock);
char *value = fn(name);
pthread_rwlock_unlock(&g_env_lock);
return value; // ← Pointer to internal storageAfter releasing the lock, another thread could call setenv()/unsetenv() and invalidate the returned pointer. This is a fundamental POSIX limitation that the shim cannot fully solve—it reduces the race window but doesn't eliminate it. Consider adding a comment documenting this for future maintainers.
2. resolve_lazy concurrent initialization
Multiple threads could race through the NULL check and call dlsym() simultaneously:
if (sym == NULL) {
sym = dlsym(RTLD_NEXT, name); // Multiple threads can reach hereWhile glibc's dlsym is thread-safe, redundant lookups waste cycles. Using pthread_once would be more robust, though the current implementation is functionally correct.
💡 Minor Suggestions
-
errno preservation: Lock operations may set
errno; consider saving/restoring it:int saved_errno = errno; pthread_rwlock_wrlock(&g_env_lock); int rc = fn(name, value, overwrite); int fn_errno = errno; pthread_rwlock_unlock(&g_env_lock); errno = fn_errno; return rc;
-
Test coverage gaps: The smoke test exercises
getenv/setenv/unsetenvbut notputenv,clearenv, orsecure_getenv. Consider extending coverage. -
Verification helper: The README could include a snippet showing how to verify the shim is active at runtime:
cat /proc/self/maps | grep libenvguard # Inside container
-
Build hardening: Consider adding
-Werrorand-D_FORTIFY_SOURCE=2toCFLAGSfor CI builds.
✅ What's Good
- rwlock design: Using
pthread_rwlockinstead of a mutex allows concurrent reads—smart choice for the common case. - Lazy fallback: The atomic double-check handles edge cases where wrappers are called before the constructor.
- Comprehensive README: Clear explanation of the problem, solution, and caveats.
- BSD-3-Clause license: Consistent with the project.
Verdict: The approach is sound and solves a real problem. Please fix the COPY path issue—the Docker build will fail otherwise. The other items are suggestions for hardening that can land in follow-up PRs.
Update (1afb539): 👍 Good improvement switching from ENV LD_PRELOAD to /etc/ld.so.preload. This correctly handles Isaac Sim's python.sh overwriting LD_PRELOAD. Documentation updates are clear and thorough.
The COPY ../docker/envguard/ path issue remains—please address before merge.
Greptile SummaryThis PR introduces
Confidence Score: 3/5The shim logic is sound but both Dockerfiles unconditionally replace LD_PRELOAD rather than prepending, which could silently break Isaac Sim's own preloaded libraries at runtime. The C shim correctly serializes all six env functions and the Makefile and test are well structured. The integration points in both Dockerfiles share a common flaw: ENV LD_PRELOAD=/usr/local/lib/libenvguard.so overwrites any preload the Isaac Sim base image may have already established. If the base image ships its own shims, dropping them silently could break GPU/CUDA functionality or other Omniverse internals — something that would only surface at runtime, not at image build time. docker/Dockerfile.base and source/isaaclab/test/install_ci/Dockerfile.installci both need the LD_PRELOAD assignment changed to prepend rather than replace. Important Files Changed
Sequence DiagramsequenceDiagram
participant RT as Runtime Process
participant EG as libenvguard.so (LD_PRELOAD)
participant GL as glibc (RTLD_NEXT)
Note over RT,GL: Library constructor runs at load time
EG->>GL: "dlsym(RTLD_NEXT, getenv) -> real_getenv"
EG->>GL: "dlsym(RTLD_NEXT, setenv) -> real_setenv"
EG->>GL: "dlsym(RTLD_NEXT, clearenv) -> real_clearenv"
par Reader thread (Carb/Kit/Omni plugin)
RT->>EG: getenv(PATH)
EG->>EG: pthread_rwlock_rdlock
EG->>GL: "real_getenv(PATH) -> ptr"
EG->>EG: pthread_rwlock_unlock
EG-->>RT: return ptr (valid only until next writer)
and Writer thread (Python startup / another plugin)
RT->>EG: setenv(FOO, bar, 1)
EG->>EG: pthread_rwlock_wrlock blocks until readers done
EG->>GL: real_setenv(FOO, bar, 1)
EG->>EG: pthread_rwlock_unlock
EG-->>RT: return 0
end
Reviews (1): Last reviewed commit: "Make env* functions thread safe" | Re-trigger Greptile |
| COPY ../docker/envguard/ /tmp/envguard/ | ||
| RUN make -C /tmp/envguard install PREFIX=/usr/local && \ | ||
| rm -rf /tmp/envguard | ||
| ENV LD_PRELOAD=/usr/local/lib/libenvguard.so |
There was a problem hiding this comment.
LD_PRELOAD clobbers any existing preload from the base image
ENV LD_PRELOAD=/usr/local/lib/libenvguard.so unconditionally replaces the variable rather than prepending to it. The Isaac Sim base image (${ISAACSIM_BASE_IMAGE_ARG}) is a third-party NVIDIA container that may already export LD_PRELOAD for its own CUDA or Omniverse shims. Replacing the value would silently drop those libraries for every RUN step and in the final container, potentially breaking unrelated Isaac Sim functionality. The safe pattern is to prepend only: ENV LD_PRELOAD=/usr/local/lib/libenvguard.so:${LD_PRELOAD} — Docker expands ${LD_PRELOAD} from the layer environment at build time, so if the base image exported it, it will be preserved; if it was empty, a trailing : is harmless to the dynamic linker.
| # Fix line endings in .sh files (Win git may add \r) | ||
| RUN find ${ISAACLAB_PATH} -type f -name "*.sh" -exec sed -i 's/\r$//' {} + |
There was a problem hiding this comment.
LD_PRELOAD clobbers existing preloads, and build artifacts left in-place
Same clobbering risk as Dockerfile.base — ENV LD_PRELOAD=/usr/local/lib/libenvguard.so replaces rather than prepends. Additionally, unlike the Dockerfile.base variant (which copies to /tmp and removes the tree), make install here runs directly inside ${ISAACLAB_PATH}/docker/envguard/, leaving libenvguard.so and any object files as build artifacts inside the image's ISAACLAB_PATH. A trailing && make -C ${ISAACLAB_PATH}/docker/envguard clean would keep the image tidy.
| char *getenv(const char *name) | ||
| { | ||
| getenv_fn_t fn = (getenv_fn_t) resolve_lazy((void **) &real_getenv, "getenv"); | ||
| if (fn == NULL) { | ||
| return NULL; | ||
| } | ||
| pthread_rwlock_rdlock(&g_env_lock); | ||
| char *value = fn(name); | ||
| pthread_rwlock_unlock(&g_env_lock); | ||
| return value; | ||
| } |
There was a problem hiding this comment.
getenv return value can be invalidated the moment the lock drops
The read lock is released before the pointer is returned to the caller. Any writer thread that calls setenv(), unsetenv(), or clearenv() immediately after will acquire the write lock and may reallocate or free the __environ array that the returned char * points into. Callers that store the result and dereference it outside of a subsequent locked region are still racy. The README caveats section only mentions the __environ/environ direct-access case; it is worth adding a note stating explicitly that the returned pointer from getenv/secure_getenv is only valid until the next write-side env function call on another thread, and callers should strdup if they need it to outlive that window.
There was a problem hiding this comment.
this design will still leave a race condition since it won't generally be possible to locate and lock around all of the write-side calls (ie: calls in third party libraries such as Python).
The read wrappers would instead need to return a copy of the string, not the raw pointer.
There was a problem hiding this comment.
You're right, and the original comment undersold the severity. Documentation can't fix this because, as you point out, write-side calls in third-party code (Python's os.environ, putenv inside native extensions, etc.) will never be coordinated with callers of the read wrappers.
Returning a strdup'd copy is the correct fix, but it breaks the POSIX getenv API contract where callers assume they do not own the pointer. A practical middle ground is a thread-local rotating buffer, similar to how glibc implements strerror. Each thread gets its own small pool of slots; the wrapper copies the value in, returns that pointer, and the memory is only reclaimed when the same thread reuses the slot. This keeps the caller-doesn't-free contract intact for the common case while eliminating the cross-thread race:
#define TLS_BUF_SLOTS 4
#define TLS_BUF_SIZE 256
static char *getenv_tls_copy(const char *val) {
static __thread char bufs[TLS_BUF_SLOTS][TLS_BUF_SIZE];
static __thread int idx = 0;
if (val == NULL) return NULL;
char *dst = bufs[idx++ % TLS_BUF_SLOTS];
strncpy(dst, val, TLS_BUF_SIZE - 1);
dst[TLS_BUF_SIZE - 1] = '\0';
return dst;
}Then in the wrapper:
char *getenv(const char *name) {
getenv_fn_t fn = (getenv_fn_t) resolve_lazy((void **) &real_getenv, "getenv");
if (fn == NULL) return NULL;
pthread_rwlock_rdlock(&g_env_lock);
char *value = getenv_tls_copy(fn(name)); // copy while lock is held
pthread_rwlock_unlock(&g_env_lock);
return value;
}The fixed buffer size is the main caveat — PATH and similar variables can exceed 256 bytes. For production use you'd want to either make the slots heap-allocated (with a free on slot reuse) or fall back to strdup for oversized values and document that those do need to be freed. The secure_getenv wrapper needs the same treatment.
Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.
There was a problem hiding this comment.
Valid point — you're right that the returned char* can be invalidated after the read lock is released if a concurrent writer reallocates the environ array or overwrites the value in-place.
A strdup() before releasing the lock would close that window:
pthread_rwlock_rdlock(&g_env_lock);
char *value = fn(name);
char *copy = value ? strdup(value) : NULL;
pthread_rwlock_unlock(&g_env_lock);
return copy;The tradeoff is that callers now own heap memory they won't free — POSIX getenv() returns a pointer that callers are not expected to free(), so this introduces a small per-call leak. In practice most getenv() call sites read the value once and never touch the pointer again, so the leak is bounded, but it's a semantic contract change worth noting.
That said, the practical risk profile here is:
- The shim already eliminates the SIGSEGV-during-lookup crash (the
__environarray realloc race) which is the failure mode hitting CI. - The use-after-release race you describe requires a writer to fire in the narrow window between the caller receiving the pointer and finishing its read — a much smaller window than the unprotected lookup, but still real.
Returning a copy is the more correct solution. Worth adding — I'll leave it to @pbarejko to decide if they want to take that on in this PR or as a follow-up.
| char *getenv(const char *name) | ||
| { | ||
| getenv_fn_t fn = (getenv_fn_t) resolve_lazy((void **) &real_getenv, "getenv"); | ||
| if (fn == NULL) { | ||
| return NULL; | ||
| } | ||
| pthread_rwlock_rdlock(&g_env_lock); | ||
| char *value = fn(name); | ||
| pthread_rwlock_unlock(&g_env_lock); | ||
| return value; | ||
| } |
There was a problem hiding this comment.
this design will still leave a race condition since it won't generally be possible to locate and lock around all of the write-side calls (ie: calls in third party libraries such as Python).
The read wrappers would instead need to return a copy of the string, not the raw pointer.
| // envguard: preload shim that makes glibc's environment functions | ||
| // thread-safe relative to each other. |
There was a problem hiding this comment.
while this might work in some cases, it is still not guaranteed that even with this library preloaded that all libraries will pick up these wrapped versions of the env functions. This would also be error prone since it needs to be explicitly preloaded into the process for it to work.
Even hooking these functions directly (ie: with something like Detours on Windows) would still leave the same potential data race accessing the returned string pointer is another thread reallocates the env block before the get-caller accesses the returned pointer. Working around that would require the use of TLS to maintain a scratch buffer on each thread that calls any of the env read functions. The read functions would need to copy the returned string (from the real getenv() call) while still inside the lock, then return the pointer to the copied string instead. The TLS buffers would also need to be cleaned up on thread exit to avoid a memory leak.
1afb539 to
2ecbaac
Compare
There was a problem hiding this comment.
Review Update (2ecbaac)
This update includes significant changes that address CI/testing concerns and documentation improvements:
Key Changes Reviewed
CI/Workflow Improvements:
exclude-patterninput added to run-package-tests and run-tests actions, allowing comma-separated test exclusions- Rendering correctness tests (
test_rendering_*.py) removed fromTESTS_TO_SKIPintools/test_settings.py, indicating these now run in standard CI jobs - Non-root verification jobs folded into test-isaaclab-ov and test-curobo jobs to reduce runner costs
- JUnit XML reports now uploaded as artifacts for better test visibility
envguard Thread-Safety Shim (Core Addition):
- New
docker/envguard/directory with an LD_PRELOAD shim (libenvguard.so) that serializes glibc environment functions (getenv/setenv/putenv/unsetenv/secure_getenv/clearenv) behind an rwlock - Addresses intermittent CI SIGSEGVs from Carb/Kit/Omni plugins calling getenv from worker threads while environment is mutated
- Uses
/etc/ld.so.preloadactivation (notLD_PRELOADenv var) because Isaac Sim'spython.shoverwritesLD_PRELOAD - Includes comprehensive README, Makefile, and smoke test
Documentation Reorganization:
source/features/visualization.rstmoved tosource/overview/core-concepts/visualization.rst- Experimental features documentation restructured with new
isaaclab_contribpackage overview - RLinf VLA post-training docs moved to separate page (
rlinf_vla_posttraining.rst) - Environment presets documentation clarified with labeled groups (physics=/renderer=/presets=)
Test Fixes:
test_noise.py: Fixed potential NaN issues when random data is exactly 0 with 'scale' operation- Rendering test utilities: Added OVRTX renderer log redirection to stdout for pytest capture
Minor Fixes:
OVRTXRendererCfg.log_file_pathnow uses cross-platform temp directory instead of Linux-specific/tmp- Changelog entries processed for isaaclab_ov (0.4.1) and isaaclab_teleop (0.5.1)
Assessment
The envguard shim is a well-documented, defensive solution to a genuine glibc thread-safety gap. The CI improvements reduce redundancy while maintaining coverage. Documentation changes improve organization.
LGTM ✅
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
List any dependencies that are required for this change.
Fixes # (issue)
Type of change
Screenshots
Please attach before and after screenshots of the change if applicable.
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there