Decouple backend-specific cloning from InteractiveScene#5770

Open
ooctipus wants to merge 6 commits into isaac-sim:develop from ooctipus:feature/clone_plan_new_representation

@ooctipus
Collaborator

Summary

This PR removes backend-specific cloning logic from InteractiveScene and
replaces the previous implicit replicate_session_defaults /
replicate_session machinery with three orthogonal, explicit primitives that
compose identically whether you drive cloning through InteractiveScene or
by hand in a DirectRLEnv or standalone script.

The result is that InteractiveScene no longer knows anything about USD vs.
PhysX vs. Newton — it just enters a ReplicateSession and lets each asset's
constructor register the backend(s) it needs. This is foundational for two
follow-on capabilities the project has wanted for a while: flexible backend
cloning and skip-cloning workflows.

What changed

New core primitives in isaaclab.cloner

  • REPLICATION_QUEUE — module-level list that asset constructors append
    (cfg, BackendCtxCls) pairs to via tiny per-backend helpers
    (queue_usd_replication, queue_physx_replication,
    queue_newton_replication). Backends are no longer special-cased inside
    InteractiveScene; each one self-registers.
  • ClonePlan — self-contained dataclass describing the world layout
    (sources, destinations, clone_mask, env_ids, positions,
    cfg_rows). Stage-agnostic by design; the USD stage is now passed
    explicitly to consumers so the same plan can be replayed, inspected, or
    serialized.
  • replicate(plan, *, stage) — free function that drains
    REPLICATION_QUEUE against a plan, groups queued cfgs by backend context
    class, runs each context in ascending replicate_priority order (physics
    before USD), publishes the plan to SimulationContext, and clears the
    queue. The queue is snapshotted and cleared up front so a backend failure
    cannot leak stale entries into the next call.
  • ReplicateSession is now a thin context manager that calls
    make_clone_plan in __enter__ and replicate in __exit__. The
    state-bag version with plan / stage / cfg_rows /
    replicate_on_exit fields is gone.
  • ClonePlan.from_env_0 — classmethod that builds the single-source
    homogeneous plan most direct envs need by auto-populating cfg_rows
    from REPLICATION_QUEUE filtered by env-root prefix.
  • CloneCfg.clone_regex (default "/World/envs/env_.*") — single
    source of truth for the env-namespace convention. InteractiveScene
    reads it directly when expanding {ENV_REGEX_NS} cfg macros.
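
As a rough illustration of how these primitives fit together, the sketch below models the queue, a plan, and the drain function in plain Python. Every name in it (ToyPlan, ToyPhysicsContext, ToyUsdContext, the queue_toy_* helpers) is a hypothetical stand-in rather than the isaaclab.cloner API, and the context constructor signature is an assumption. Only the snapshot-and-clear step and the ascending replicate_priority ordering are taken from the description above.

```python
from dataclasses import dataclass

# Hypothetical stand-ins; none of these are the real isaaclab.cloner objects.
REPLICATION_QUEUE: list[tuple[object, type]] = []
CALL_LOG: list[str] = []


@dataclass
class ToyPlan:
    """Stage-agnostic layout description (a tiny subset of the real fields)."""
    sources: list[str]
    destinations: list[str]


class ToyPhysicsContext:
    replicate_priority = 0  # lower values run first

    def __init__(self, cfgs, plan, stage):
        self.cfgs = cfgs

    def replicate(self):
        CALL_LOG.append(f"physics:{len(self.cfgs)}")


class ToyUsdContext:
    replicate_priority = 100

    def __init__(self, cfgs, plan, stage):
        self.cfgs = cfgs

    def replicate(self):
        CALL_LOG.append(f"usd:{len(self.cfgs)}")


def queue_toy_physics_replication(cfg):
    REPLICATION_QUEUE.append((cfg, ToyPhysicsContext))


def queue_toy_usd_replication(cfg):
    REPLICATION_QUEUE.append((cfg, ToyUsdContext))


def replicate(plan, *, stage):
    # Snapshot and clear first so a failing backend cannot leave stale entries.
    queued = list(REPLICATION_QUEUE)
    REPLICATION_QUEUE.clear()
    # Group queued cfgs by the backend context class that registered them.
    by_backend = {}
    for cfg, ctx_cls in queued:
        by_backend.setdefault(ctx_cls, []).append(cfg)
    # Run each backend once, in ascending priority order.
    for ctx_cls in sorted(by_backend, key=lambda c: c.replicate_priority):
        ctx_cls(by_backend[ctx_cls], plan, stage).replicate()


plan = ToyPlan(sources=["/World/envs/env_0"], destinations=["/World/envs/env_{}"])
queue_toy_usd_replication("robot_cfg")
queue_toy_physics_replication("robot_cfg")
queue_toy_usd_replication("table_cfg")
replicate(plan, stage=None)
print(CALL_LOG)
```

Although the USD helper was called first, the physics context runs first because of its lower priority value, and the queue is empty once the call returns.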

Two equivalent invocation paths

# InteractiveScene path (what the scene runs under the hood)
with cloner.ReplicateSession(cfgs, num_clones=N, env_spacing=2.0,
                             device=device, stage=stage):
    for cfg in cfgs:
        cfg.class_type(cfg)

# Direct env / script path
plan = cloner.ClonePlan.from_env_0(src, dest, num_envs, device, positions)
cloner.replicate(plan, stage=scene.stage)

Both end in the same cloner.replicate(plan, stage=...) call. The only
difference is how the plan was built and how asset construction was
interleaved.
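
To make the equivalence concrete, here is a minimal sketch of a thin session wrapper next to the manual path. ToyReplicateSession, make_toy_plan, and toy_replicate are hypothetical names, not the real classes. Skipping the drain when the with-body raises is an assumption of this sketch, and the manual path constructs assets before building the plan because from_env_0 reads the queue.

```python
events = []


def make_toy_plan(num_clones):
    events.append("plan")
    return {"num_clones": num_clones}


def toy_replicate(plan, *, stage):
    events.append(f"replicate:{plan['num_clones']}")


class ToyReplicateSession:
    """Builds a plan on entry and drains on a clean exit (hypothetical)."""

    def __init__(self, num_clones, stage):
        self.num_clones = num_clones
        self.stage = stage
        self.plan = None

    def __enter__(self):
        self.plan = make_toy_plan(self.num_clones)
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:  # assumption: skip the drain if the body raised
            toy_replicate(self.plan, stage=self.stage)
        return False  # never swallow exceptions


# Session path: asset construction happens inside the with-body.
with ToyReplicateSession(num_clones=4, stage=None):
    events.append("construct")

# Manual path: the same calls, made explicitly by the caller.
events.append("construct")
plan = make_toy_plan(4)
toy_replicate(plan, stage=None)
print(events)
```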

What got removed from InteractiveScene

  • clone_environments(...) deprecated shim. The scene now replicates
    inside __init__ via ReplicateSession.
  • env_ns / env_regex_ns properties (used only internally).
  • _build_clone_plan_from_cfg and _default_env_origins internals.
    Cfg-driven plan construction now lives in make_clone_plan; per-env
    positions are read from the published ClonePlan.
  • InteractiveScene.env_origins now reads from the plan published to
    SimulationContext, making the plan the single source of truth for
    env placement.

Why this matters (the actual point of the PR)

This refactor is foundational for two capabilities the current scene
coupling blocks:

  • Flexible backend cloning. Backends now plug in by shipping a
    <Backend>ReplicateContext class + a one-line queue helper. Swapping
    PhysX ↔ Newton no longer requires InteractiveScene to change; cfgs
    and user code stay untouched, and a third-party backend can register
    itself without modifying core.
  • Skip-cloning workflows. Because plan construction, asset
    registration, and drain are three independent primitives, callers
    that want to author env-0 prims by hand and skip the cloner — or
    drive replication out-of-band from a visualizer, replay tool, or
    test fixture — can do so without fighting InteractiveScene.

Migration notes

  • with cloner.ReplicateSession(): (no-arg) →
    cloner.replicate(cloner.ClonePlan.from_env_0(...), stage=...).
  • InteractiveScene.clone_environments(...) → removed; the scene
    replicates inside __init__.
  • make_clone_plan(sources, destinations, ...) →
    make_clone_plan(cfgs, num_clones, env_spacing, device, ...).
  • Pass stage=... explicitly to replicate() and ReplicateSession().
  • Read CloneCfg.clone_regex if you previously used
    InteractiveScene.env_ns / env_regex_ns.

About 17 direct envs were migrated to the new pattern in this PR.

Test plan

  • ./isaaclab.sh -p -m pytest source/isaaclab/test/sim/test_cloner.py
  • ./isaaclab.sh -p -m pytest source/isaaclab/test/scene/test_interactive_scene.py
  • ./isaaclab.sh -p -m pytest source/isaaclab_physx/test/sim/test_cloner.py
  • Migrated direct envs (cartpole, anymal_c, franka_cabinet,
    factory, humanoid_amp, inhand_manipulation, locomotion,
    quadcopter, shadow_hand_*, automate/*, cart_double_pendulum,
    cartpole_warp, inhand_manipulation_warp, locomotion_warp)
    spawn and step on PhysX
  • Same envs spawn and step on Newton
  • ./isaaclab.sh -f passes
  • Docs build (cd docs && make html)

@ooctipus ooctipus changed the base branch from main to develop May 25, 2026 11:10
@github-actions github-actions Bot added the documentation, enhancement, isaac-sim, isaac-mimic, and infrastructure labels May 25, 2026
@isaaclab-review-bot isaaclab-review-bot Bot left a comment

🤖 Isaac Lab Review Bot — Multi-Perspective Analysis

PR: Decouple backend-specific cloning from InteractiveScene
SHA: e75f1bc8e6d3cdc07d5bfcbdff33b4c2f2afc14d


Architecture Assessment (Isaac Lab Expert)

This is an excellent structural improvement. The PR replaces InteractiveScene's hard-coded backend dispatching (if physx... elif newton... elif ovphysx...) with a self-registering queue pattern where each backend ships a <Backend>ReplicateContext + a queue_<backend>_replication() helper. The three orthogonal primitives (REPLICATION_QUEUE, ClonePlan, replicate()) compose well regardless of whether you're in InteractiveScene, a DirectRLEnv, or a standalone script.

Design wins:

  • Clean separation of concerns — InteractiveScene now only cares about what to clone, not how
  • Backend extensibility without core modification (Open/Closed principle)
  • Both invocation paths (ReplicateSession context manager vs. manual ClonePlan.from_env_0 + replicate) converge on the same drain function — no hidden divergence
  • The replicate_priority ordering (physics=0, USD=100) ensures correct execution order

Findings

1. 🟡 Performance Regression Risk: disabled_fabric_change_notifies Removed

The old InteractiveScene.__init__ wrapped both the initial usd_replicate and clone_environments calls inside cloner.disabled_fabric_change_notifies(self.stage, restore=False). The new code calls cloner.usd_replicate(...) and the ReplicateSession without this optimization.

The old code's own comments documented that this suspension provides a measurable speedup when cloned prims carry PhysX rigid-body schemas and total Sdf.CopySpec firings reach ~32K. For large scenes (128+ envs, many rigid bodies), this could regress scene-init time.

Suggestion: Consider wrapping the UsdReplicateContext.replicate() method (or the entire ReplicateSession.__exit__ drain) in disabled_fabric_change_notifies so the optimization is preserved in the new architecture. Since the context manager was already moved to _fabric_notices.py, it could be imported from there.
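
The shape of that suggestion can be sketched with a toy context manager. ToyStage and toy_disabled_change_notifies are hypothetical and do not touch Fabric or USD; the sketch only shows change notifications being suspended around a bulk copy, with a restore flag modelled on the restore=False argument mentioned above.

```python
from contextlib import contextmanager


class ToyStage:
    """Hypothetical stage that counts change notifications."""

    def __init__(self):
        self.notify_enabled = True
        self.notifications = 0

    def copy_spec(self):
        # Each copy fires a notification unless notifications are suspended.
        if self.notify_enabled:
            self.notifications += 1


@contextmanager
def toy_disabled_change_notifies(stage, restore=True):
    previous = stage.notify_enabled
    stage.notify_enabled = False
    try:
        yield stage
    finally:
        if restore:
            stage.notify_enabled = previous


stage = ToyStage()
with toy_disabled_change_notifies(stage):
    for _ in range(1000):
        stage.copy_spec()  # bulk copy work, no notifications fired
suppressed = stage.notifications
stage.copy_spec()  # notifications are back on after the block
after = stage.notifications
print(suppressed, after)
```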

2. 🟡 Module-Level REPLICATION_QUEUE — Implicit Global State

REPLICATION_QUEUE is a module-level mutable list that asset constructors append to and replicate() drains. While the PR handles the "exception during session" case by clearing the queue in ReplicateSession.__exit__, there are still scenarios where the queue could accumulate stale entries:

  • If replicate() is called without a ReplicateSession and a backend raises mid-drain, the queue is already cleared up front (good), but the partial work from earlier backends cannot be rolled back.
  • If user code calls queue_<backend>_replication(cfg) outside of any session/replicate cycle (e.g., in a test or standalone script that forgets to call replicate()), entries accumulate silently until the next replicate() call — possibly in a completely unrelated context.

Suggestion: Consider adding a warnings.warn() or debug log in replicate() when queued contains entries whose cfg is not in plan.cfg_rows (currently silently skipped). This would help developers catch mismatched queue/plan scenarios.
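
A minimal sketch of such a check, assuming cfg_rows can be iterated as (cfg, rows) pairs and that cfgs are compared by identity. Both are assumptions, since the real layout of plan.cfg_rows is not shown here, and warn_on_unplanned is a hypothetical helper name.

```python
import warnings


def warn_on_unplanned(queued, cfg_rows):
    """Return queued cfgs that have no row in the plan, warning if any exist."""
    planned_ids = {id(cfg) for cfg, _rows in cfg_rows}
    skipped = [cfg for cfg, _ctx in queued if id(cfg) not in planned_ids]
    if skipped:
        warnings.warn(
            f"{len(skipped)} queued cfg(s) have no row in the plan and will be skipped",
            stacklevel=2,
        )
    return skipped


robot, table, stray = object(), object(), object()
cfg_rows = [(robot, [0]), (table, [0])]
queued = [(robot, "usd"), (stray, "usd")]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    skipped = warn_on_unplanned(queued, cfg_rows)
print(len(skipped), len(caught))
```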

3. 🟡 ClonePlan.from_env_0 Couples to Live Queue State

ClonePlan.from_env_0 reads REPLICATION_QUEUE at call time to auto-populate cfg_rows. This means the return value depends on when it's called relative to asset construction. If called too early (before all assets register) or too late (after a previous replicate() drained the queue), the plan will be incomplete.

The docstring mentions filtering by prefix, but doesn't document this timing constraint. For the InteractiveScene path this is fine (construction happens inside ReplicateSession), but direct-env users calling from_env_0 need to know that queue population must precede this call.

Suggestion: Add a note to the from_env_0 docstring: "Must be called after all asset constructors have registered their replication entries into REPLICATION_QUEUE."
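
The timing constraint can be reproduced with a toy model. QUEUE and toy_from_env_0 are hypothetical stand-ins that only mimic the "read the live queue, filter by env-root prefix" behaviour described above.

```python
QUEUE = []  # hypothetical stand-in for REPLICATION_QUEUE, holding prim paths


def toy_from_env_0(env_root):
    # Reads the live queue at call time, keeping entries under the env root.
    return [path for path in QUEUE if path.startswith(env_root + "/")]


# Too early: no asset constructor has registered anything yet.
early = toy_from_env_0("/World/envs/env_0")

# Asset constructors run and register their entries.
QUEUE.append("/World/envs/env_0/Robot")
QUEUE.append("/World/ground")  # outside the env root, so filtered out
on_time = toy_from_env_0("/World/envs/env_0")

# Too late: a previous drain has already emptied the queue.
QUEUE.clear()
late = toy_from_env_0("/World/envs/env_0")
print(early, on_time, late)
```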

4. 🟢 _collect_asset_cfgs vs _add_entities_from_cfg — Duplicated ENV_REGEX_NS Resolution

Both _collect_asset_cfgs() and _add_entities_from_cfg() resolve {ENV_REGEX_NS} macros in prim_path. _collect_asset_cfgs does it for all children with a prim_path, while _add_entities_from_cfg also does asset_cfg.prim_path.format(ENV_REGEX_NS=env_regex_ns). Since _collect_asset_cfgs already mutates the cfg objects (it writes back to child.prim_path), the second resolution in _add_entities_from_cfg either double-formats (safe if no {ENV_REGEX_NS} remains) or is redundant.

This isn't a bug (double-format of an already-resolved string is a no-op), but it's confusing for maintainers. Consider removing the redundant format in _add_entities_from_cfg or adding a comment explaining the ordering.
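
The no-op claim is easy to check with plain str.format, using the default clone_regex value quoted in the PR description. The "/Robot" suffix and the env_12 path are illustrative only.

```python
import re

CLONE_REGEX = "/World/envs/env_.*"  # default quoted in the PR description

template = "{ENV_REGEX_NS}/Robot"
resolved_once = template.format(ENV_REGEX_NS=CLONE_REGEX)
# A second pass finds no replacement fields, so the string is unchanged.
resolved_twice = resolved_once.format(ENV_REGEX_NS=CLONE_REGEX)

# The resolved path is itself a regex that matches concrete env prims.
matches_env = re.fullmatch(resolved_once, "/World/envs/env_12/Robot") is not None
print(resolved_once, resolved_twice == resolved_once, matches_env)
```

This holds only while the resolved path contains no literal braces; a prim path with other {...} fields would raise KeyError on the second pass.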

5. 🟡 SimulationContext.instance() Crash in replicate()

In replicate(), the final line calls SimulationContext.instance().set_clone_plan(plan). If SimulationContext hasn't been initialized yet (e.g., in a unit test that manually creates a stage), .instance() returns None and this line raises AttributeError: 'NoneType' object has no attribute 'set_clone_plan'.

The direct env / script path example in the PR description shows cloner.replicate(plan, stage=scene.stage) which implies a SimulationContext exists, but the function has no guard.

Suggestion: Add a defensive check: sim = SimulationContext.instance(); if sim is not None: sim.set_clone_plan(plan) — or document that replicate() requires an active SimulationContext.
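
A sketch of the guard, using a toy singleton in place of SimulationContext. ToySimContext and publish_plan are hypothetical; only the instance() and set_clone_plan() names mirror the calls mentioned above.

```python
class ToySimContext:
    """Hypothetical singleton standing in for SimulationContext."""

    _instance = None

    def __init__(self):
        self.clone_plan = None
        ToySimContext._instance = self

    @classmethod
    def instance(cls):
        return cls._instance  # None until a context has been created

    def set_clone_plan(self, plan):
        self.clone_plan = plan


def publish_plan(plan):
    """Publish the plan if a context exists; report whether it was published."""
    sim = ToySimContext.instance()
    if sim is None:
        return False
    sim.set_clone_plan(plan)
    return True


published_before = publish_plan({"num_envs": 4})  # no context yet, no crash
sim = ToySimContext()
published_after = publish_plan({"num_envs": 4})
print(published_before, published_after, sim.clone_plan)
```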

6. 🟢 Strong Test Coverage for New Primitives

The test additions are solid:

  • test_usd_replicate_context_queue_and_replicate — validates the new UsdReplicateContext
  • test_queue_usd_replication_only_appends — confirms queue-only semantics
  • test_make_clone_plan_homogeneous_returns_env_root_plan — validates the fast-path
  • test_make_clone_plan_heterogeneous_mutates_spawn_paths — validates multi-variant behavior
  • _drain_replication_queue fixture prevents test cross-contamination

Gap: No integration test verifies the full ReplicateSession context-manager flow (enter → construct assets → exit → verify plan published). The unit tests cover individual primitives but not the composed lifecycle that InteractiveScene relies on. Consider adding a test that exercises ReplicateSession end-to-end with a mock backend.


Summary

| Category | Rating |
| --- | --- |
| Architecture | ✅ Excellent: clean decoupling, extensible backends |
| Error Handling | ⚠️ Minor gaps (SimulationContext guard, silent queue skipping) |
| Performance | ⚠️ Fabric notice suspension removed; verify no regression |
| Test Coverage | ✅ Good unit tests; integration test for composed lifecycle would strengthen |
| Migration Safety | ✅ Good: legacy free-function wrappers preserved for backward compat |

Verdict: Strong architectural improvement. The findings above are advisory — the design is sound and the implementation is clean. The main risk to validate is the Fabric notice suspension removal under real workloads.


Update (f774d4d): Reviewed incremental changes. All new commits are cosmetic: docstring trimming, line reformatting, and removing string-quoted type annotations (e.g. "ClonePlan" → ClonePlan). No logic changes, no new issues introduced. Previous findings still apply as-is.


Update (788d71a): Two previous concerns addressed:

  • Finding #1 (Fabric notice suspension): replicate() now wraps copy work in disabled_fabric_change_notifies(self.stage) — performance optimization restored.
  • Finding #3 (ClonePlan.from_env_0 timing): Docstring now documents the ordering constraint (must be called after asset constructors have run).

No new issues introduced. The refactor extracting _apply_queue() is clean.


Update (034506d): Cosmetic changes only — line reformatting in usd.py (multi-line statements condensed to single lines) and added changelog skip marker. No logic changes, no new issues.


Update (de00120): Functional improvements with good test coverage:

  • replicate_session.py: Refactored to group cfgs by backend and deduplicate shared rows before dispatching — prevents redundant instantiation when multiple cfgs share the same row (common in homogeneous plans)
  • sensor_base.py: Now uses sim.get_clone_plan().env_ids for env count when available, handling backends like Newton that clone solver-side without authoring per-env USD specs
  • Added test_replicate_dedupes_shared_rows_across_cfgs regression test validating the dedup logic

No new issues. These are solid bug fixes / edge case handling.
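
One way to picture the grouping and dedup step, as a sketch only: the helper below collects each backend's rows once, in first-seen order. The (backend, rows) tuples and the function name are assumptions and do not reflect the actual data structures in replicate_session.py.

```python
def unique_rows_per_backend(queued):
    """Group rows by backend, keeping each row once in first-seen order."""
    grouped = {}
    for backend, rows in queued:
        seen = grouped.setdefault(backend, {})
        for row in rows:
            seen.setdefault(row, None)  # dict keys preserve insertion order
    return {backend: list(rows) for backend, rows in grouped.items()}


# Two cfgs share row 0 on the "physx" backend; "usd" sees rows 0 and 1.
queued = [("physx", (0,)), ("physx", (0,)), ("usd", (0, 1)), ("usd", (1,))]
result = unique_rows_per_backend(queued)
print(result)
```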


Update (fb61d98): CI config change only — added threedworld.org to the link-checker exclusion list in .github/workflows/check-links.yml. Unrelated to the core PR logic; no code changes.

@github-actions github-actions Bot added the isaac-lab label May 25, 2026
ooctipus added 4 commits May 25, 2026 04:41
- ruff-format collapsed two multi-line expressions in cloner/usd.py
  back to single lines.
- isaaclab_ovphysx gained changes in this PR (cloner module rewrite)
  but had no changelog fragment; add a .skip mirroring the sibling
  backends.

Host returns HTTP 403 to lychee's user agent on every run, blocking the
"Check for Broken Links" job on unrelated docs (`ecosystem.rst:212`).
Same pattern used for other bot-blocking hosts already in the exclude
list (stackoverflow.com, helm.ngc.nvidia.com, etc.).
Labels

documentation Improvements or additions to documentation enhancement New feature or request infrastructure isaac-lab Related to Isaac Lab team isaac-mimic Related to Isaac Mimic team isaac-sim Related to Isaac Sim team


1 participant