2 changes: 1 addition & 1 deletion configs/agents/rl/basic/cart_pole/train_config.json
@@ -9,7 +9,7 @@
"gpu_id": 0,
"num_envs": 64,
"iterations": 1000,
- "rollout_steps": 1024,
+ "buffer_size": 1024,
"eval_freq": 200,
"save_freq": 200,
"use_wandb": false,
2 changes: 1 addition & 1 deletion configs/agents/rl/basic/cart_pole/train_config_grpo.json
@@ -9,7 +9,7 @@
"gpu_id": 0,
"num_envs": 64,
"iterations": 1000,
- "rollout_steps": 1024,
+ "buffer_size": 1024,
"eval_freq": 200,
"save_freq": 200,
"use_wandb": true,
2 changes: 1 addition & 1 deletion configs/agents/rl/push_cube/train_config.json
@@ -9,7 +9,7 @@
"gpu_id": 0,
"num_envs": 64,
"iterations": 1000,
- "rollout_steps": 1024,
+ "buffer_size": 1024,
"enable_eval": true,
"num_eval_envs": 16,
"num_eval_episodes": 3,
37 changes: 12 additions & 25 deletions docs/source/overview/rl/algorithm.md
@@ -5,20 +5,17 @@ This module contains the core implementations of reinforcement learning algorith
## Main Classes and Functions

### BaseAlgorithm
- - Abstract base class for RL algorithms, defining common interfaces such as buffer initialization, data collection, and update.
+ - Abstract base class for RL algorithms, defining a single update interface over a collected rollout.
- Key methods:
-   - `initialize_buffer(num_steps, num_envs, obs_dim, action_dim)`: Initialize the trajectory buffer.
-   - `collect_rollout(env, policy, obs, num_steps, on_step_callback)`: Collect interaction data.
-   - `update()`: Update the policy based on collected data.
- - Designed to be algorithm-agnostic; Trainer only depends on this interface to support various RL algorithms.
- - Supports multi-environment parallel collection, compatible with Gymnasium/IsaacGym environments.
+   - `update(rollout)`: Update the policy based on a shared rollout `TensorDict`.
+ - Designed to be algorithm-agnostic; `Trainer` handles collection while algorithms focus on loss computation and optimization.
+ - Supports multi-environment parallel collection through a shared `[N, T]` rollout `TensorDict`.

### PPO
- Mainstream on-policy algorithm, supports Generalized Advantage Estimation (GAE), policy update, and hyperparameter configuration.
- Key methods:
-   - `_compute_gae(rewards, values, dones)`: Generalized Advantage Estimation.
-   - `collect_rollout`: Collect trajectories and compute advantages/returns.
-   - `update`: Multi-epoch minibatch optimization, including entropy, value, and policy loss, with gradient clipping.
+   - `compute_gae(rollout, gamma, gae_lambda)`: Generalized Advantage Estimation over a shared rollout `TensorDict`.
+   - `update(rollout)`: Multi-epoch minibatch optimization, including entropy, value, and policy loss, with gradient clipping.
- Supports custom callbacks, detailed logging, and GPU acceleration.
- Typical training flow: collect rollout → compute advantage/return → multi-epoch minibatch optimization.
- Supports advantage normalization, entropy regularization, value loss weighting, etc.
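The GAE step in this flow can be sketched as follows. This is a NumPy stand-in for illustration only, not the PR's `compute_gae`: the real implementation operates on a GPU rollout `TensorDict` with `next.done`/`next.value` fields, and the signature here is an assumption.

```python
import numpy as np

def compute_gae(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a [N, T] rollout.

    rewards, values, next_values, dones: arrays of shape [N, T].
    Returns (advantages, returns), both [N, T].
    """
    N, T = rewards.shape
    advantages = np.zeros((N, T), dtype=np.float64)
    last_adv = np.zeros(N, dtype=np.float64)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[:, t]
        # TD error: r_t + gamma * V(s_{t+1}) * (1 - done) - V(s_t)
        delta = rewards[:, t] + gamma * next_values[:, t] * not_done - values[:, t]
        # GAE recursion: A_t = delta_t + gamma * lambda * (1 - done) * A_{t+1}
        last_adv = delta + gamma * gae_lambda * not_done * last_adv
        advantages[:, t] = last_adv
    returns = advantages + values
    return advantages, returns
```

With `gamma = gae_lambda = 1` and zero value estimates this reduces to plain reward-to-go sums, which is a handy sanity check.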
@@ -31,8 +28,7 @@ This module contains the core implementations of reinforcement learning algorith
- Key methods:
- `_compute_step_returns_and_mask(rewards, dones)`: Step-wise discounted returns and valid-step mask.
- `_compute_step_group_advantages(step_returns, seq_mask)`: Per-step group normalization with masked mean/std.
-   - `collect_rollout`: Collect trajectories and compute step-wise advantages.
-   - `update`: Multi-epoch minibatch optimization with optional KL penalty.
+   - `update(rollout)`: Multi-epoch minibatch optimization with optional KL penalty.
- Supports both **Embodied AI** (dense reward, from-scratch training) and **VLA** (sparse reward, fine-tuning) modes via `kl_coef` configuration.
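The per-step group normalization that `_compute_step_group_advantages` performs can be sketched as follows. This is an illustrative NumPy stand-in under the masked mean/std description above; the actual method works on rollout tensors and its exact signature may differ.

```python
import numpy as np

def step_group_advantages(step_returns, seq_mask, eps=1e-8):
    """Per-step group normalization: at each step t, normalize returns
    across the N parallel rollouts using only valid (unmasked) entries.

    step_returns, seq_mask: arrays of shape [N, T].
    Returns advantages of shape [N, T], zeroed where the mask is invalid.
    """
    mask = seq_mask.astype(np.float64)
    # Number of valid group members at each step (clamped to avoid div-by-zero)
    count = mask.sum(axis=0, keepdims=True).clip(min=1.0)
    mean = (step_returns * mask).sum(axis=0, keepdims=True) / count
    var = (((step_returns - mean) ** 2) * mask).sum(axis=0, keepdims=True) / count
    adv = (step_returns - mean) / np.sqrt(var + eps)
    return adv * mask
```

Because the baseline is the group mean at each step, no critic is needed, which is what allows GRPO to pair with an actor-only policy.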

### Config Classes
@@ -43,19 +39,11 @@ This module contains the core implementations of reinforcement learning algorith
## Code Example
```python
class BaseAlgorithm:
-     def initialize_buffer(self, num_steps, num_envs, obs_dim, action_dim):
-         ...
-     def collect_rollout(self, env, policy, obs, num_steps, on_step_callback=None):
-         ...
-     def update(self):
+     def update(self, rollout):
        ...

class PPO(BaseAlgorithm):
-     def _compute_gae(self, rewards, values, dones):
-         ...
-     def collect_rollout(self, ...):
-         ...
-     def update(self):
+     def update(self, rollout):
        ...
```

@@ -71,10 +59,9 @@
- Typical usage:
```python
algo = PPO(cfg, policy)
- buffer = algo.initialize_buffer(...)
for _ in range(num_iterations):
-     algo.collect_rollout(...)
-     algo.update()
+     rollout = collector.collect(buffer_size, rollout=buffer.start_rollout())
+     buffer.add(rollout)
+     algo.update(buffer.get(flatten=False))
```

---
73 changes: 38 additions & 35 deletions docs/source/overview/rl/buffer.md
@@ -5,61 +5,64 @@ This module implements the data buffer for RL training, responsible for storing
## Main Classes and Structure

### RolloutBuffer
- - Used for on-policy algorithms (such as PPO, GRPO), efficiently stores observations, actions, rewards, dones, values, and logprobs for each step.
- - Supports multi-environment parallelism (shape: [T, N, ...]), all data allocated on GPU.
+ - Used for on-policy algorithms (such as PPO, GRPO), storing a shared rollout `TensorDict` for collector and algorithm stages.
+ - Supports multi-environment parallelism with rollout batch shape `[N, T]`, all data allocated on GPU.
- Structure fields:
-   - `obs`: Observation tensor, float32, shape [T, N, obs_dim]
-   - `actions`: Action tensor, float32, shape [T, N, action_dim]
-   - `rewards`: Reward tensor, float32, shape [T, N]
-   - `dones`: Done flags, bool, shape [T, N]
-   - `values`: Value estimates, float32, shape [T, N]
-   - `logprobs`: Action log probabilities, float32, shape [T, N]
-   - `_extras`: Algorithm-specific fields (e.g., advantages, returns), dict[str, Tensor]
+   - `obs`: Flattened observation tensor, float32, shape `[N, T, obs_dim]`
+   - `action`: Action tensor, float32, shape `[N, T, action_dim]`
+   - `sample_log_prob`: Action log probabilities, float32, shape `[N, T]`
+   - `value`: Value estimates, float32, shape `[N, T]`
+   - `next.reward`: Reward tensor, float32, shape `[N, T]`
+   - `next.done`: Done flags, bool, shape `[N, T]`
+   - `next.terminated`: Termination flags, bool, shape `[N, T]`
+   - `next.truncated`: Truncation flags, bool, shape `[N, T]`
+   - `next.value`: Bootstrap next-state values, float32, shape `[N, T]`
+   - Algorithm-added fields such as `advantage`, `return`, `seq_mask`, and `seq_return`

## Main Methods
- - `add(obs, action, reward, done, value, logprob)`: Add one step of data.
- - `set_extras(extras)`: Attach algorithm-related tensors (e.g., advantages, returns).
- - `iterate_minibatches(batch_size)`: Randomly sample minibatches, returns dict (including all fields and extras).
- - Supports efficient GPU shuffle and indexing for large-scale training.
+ - `start_rollout()`: Returns the shared preallocated rollout `TensorDict` for the collector to write into.
+ - `add(rollout)`: Marks the shared rollout as ready for consumption.
+ - `get(flatten=True)`: Returns the stored rollout, optionally flattened over `[N, T]`.
+ - `iterate_minibatches(rollout, batch_size, device)`: Shared batching utility in `buffer/utils.py`.

## Usage Example
```python
- buffer = RolloutBuffer(num_steps, num_envs, obs_dim, action_dim, device)
- for t in range(num_steps):
-     buffer.add(obs, action, reward, done, value, logprob)
- buffer.set_extras({"advantages": adv, "returns": ret})
- for batch in buffer.iterate_minibatches(batch_size):
-     # batch["obs"], batch["actions"], batch["advantages"] ...
+ buffer = RolloutBuffer(num_envs, rollout_len, obs_dim, action_dim, device)
+ rollout = collector.collect(num_steps=rollout_len, rollout=buffer.start_rollout())
+ buffer.add(rollout)
+
+ rollout = buffer.get(flatten=False)
+ for batch in iterate_minibatches(rollout.reshape(-1), batch_size, device):
+     # batch["obs"], batch["action"], batch["advantage"] ...
+     pass
```

## Design and Extension
- - Supports multi-environment parallel collection, compatible with Gymnasium/IsaacGym environments.
- - All data is allocated on GPU to avoid frequent CPU-GPU copying.
- - The extras field can be flexibly extended to meet different algorithm needs (e.g., GAE, TD-lambda, distributional advantages).
- - The iterator automatically shuffles to improve training stability.
- - Compatible with various RL algorithms (PPO, GRPO, A2C, SAC, etc.), custom fields and sampling logic supported.
+ - Supports multi-environment parallel collection, compatible with Gymnasium-style vectorized environments.
+ - All tensors are preallocated on device to avoid frequent CPU-GPU copying.
+ - Algorithm-specific fields are attached directly onto the shared rollout `TensorDict` during optimization.
+ - The shared minibatch iterator automatically shuffles flattened rollout entries for PPO/GRPO-style updates.
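The shared iterator in `buffer/utils.py` can be pictured as below. This sketch models the flattened rollout as a dict of arrays for self-containedness; the actual utility indexes a `TensorDict` and its exact signature (e.g., the `device` argument) may differ.

```python
import numpy as np

def iterate_minibatches(rollout, batch_size, rng=None):
    """Shuffle flattened rollout entries and yield minibatch dicts.

    rollout: dict mapping field name -> array with a shared leading
    dimension of N * T flattened samples.
    """
    rng = rng or np.random.default_rng()
    n = len(next(iter(rollout.values())))
    perm = rng.permutation(n)  # one shuffle per epoch over all samples
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        yield {k: v[idx] for k, v in rollout.items()}
```

Keeping this in one shared utility is what lets PPO and GRPO reuse identical batching behavior instead of each algorithm reimplementing the shuffle.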

## Code Example
```python
class RolloutBuffer:
-     def __init__(self, num_steps, num_envs, obs_dim, action_dim, device):
-         # Initialize tensors
+     def __init__(self, num_envs, rollout_len, obs_dim, action_dim, device):
+         # Preallocate rollout TensorDict
        ...
-     def add(self, obs, action, reward, done, value, logprob):
-         # Add data
+     def start_rollout(self):
+         # Return shared rollout storage
        ...
-     def set_extras(self, extras):
-         # Attach algorithm-related tensors
+     def add(self, rollout):
+         # Mark rollout as full
        ...
-     def iterate_minibatches(self, batch_size):
-         # Random minibatch sampling
+     def get(self, flatten=True):
+         # Consume rollout
        ...
```

## Practical Tips
- - It is recommended to call set_extras after each rollout to ensure advantage/return tensors align with main data.
- - When using iterate_minibatches, set batch_size appropriately for training stability.
- - Extend the extras field as needed for custom sampling and statistics.
+ - The rollout buffer stores flattened RL observations; structured observations should be flattened or encoded before entering this buffer.
+ - `next.value` is kept for bootstrap convenience, while `next.obs` is intentionally not stored to reduce duplicated memory.
+ - Use `buffer/utils.py` for shared minibatch iteration instead of duplicating batching logic in each algorithm.
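The first tip — flattening structured observations before they enter the buffer — can be sketched like this. The field names and helper are hypothetical, purely for illustration:

```python
import numpy as np

def flatten_obs(obs_dict, keys):
    """Concatenate structured observation fields into one flat vector per env.

    A fixed `keys` order keeps the flat layout identical across collection,
    update, and evaluation.
    """
    return np.concatenate(
        [obs_dict[k].reshape(obs_dict[k].shape[0], -1) for k in keys], axis=-1
    )
```

The resulting `[num_envs, obs_dim]` array is what the buffer's `obs` field expects.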

---
26 changes: 19 additions & 7 deletions docs/source/overview/rl/models.md
@@ -7,9 +7,10 @@ This module contains RL policy networks and related model implementations, suppo
### Policy
- Abstract base class for RL policies; all policies must inherit from it.
- Unified interface:
-   - `get_action(obs, deterministic=False)`: Sample or output actions.
-   - `get_value(obs)`: Estimate state value.
-   - `evaluate_actions(obs, actions)`: Evaluate action probabilities, entropy, and value.
+   - `get_action(tensordict, deterministic=False)`: Sample actions into a `TensorDict` without gradients.
+   - `forward(tensordict, deterministic=False)`: Low-level action/value write path used by policy implementations.
+   - `get_value(tensordict)`: Estimate state value into a `TensorDict`.
+   - `evaluate_actions(tensordict)`: Return optimization-time policy outputs from a `TensorDict`.
- Supports GPU deployment and distributed training.

### ActorCritic
@@ -19,8 +20,8 @@ This module contains RL policy networks and related model implementations, suppo
- Actor-only policy without Critic. Used with GRPO (Group Relative Policy Optimization), which estimates advantages via group-level return comparison instead of a value function.
- Supports Gaussian action distributions, learnable log_std, suitable for continuous action spaces.
- Key methods:
-   - `get_action`: Actor network outputs mean, samples action, returns log_prob and critic value.
-   - `evaluate_actions`: Used for loss calculation in PPO/SAC algorithms.
+   - `forward`: Actor network outputs mean, samples action, and writes policy outputs into a `TensorDict`.
+   - `evaluate_actions`: Used for loss calculation in PPO/GRPO algorithms.
- Custom actor/critic network architectures supported (e.g., MLP/CNN/Transformer).
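The Gaussian sampling path in `forward` can be sketched as follows. This is a NumPy stand-in with a hypothetical function name: the real policy samples with torch distributions and writes `action`/`sample_log_prob` into the `TensorDict`.

```python
import numpy as np

def gaussian_sample_and_logprob(mean, log_std, rng=None):
    """Sample from a diagonal Gaussian policy head and return the summed
    per-dimension log-probability.

    mean, log_std: arrays of shape [B, action_dim].
    Returns (action [B, action_dim], log_prob [B]).
    """
    rng = rng or np.random.default_rng()
    std = np.exp(log_std)  # learnable log_std keeps std positive
    action = mean + std * rng.standard_normal(mean.shape)
    # log N(a | mean, std) summed over independent action dimensions
    logp = -0.5 * (((action - mean) / std) ** 2 + 2 * log_std + np.log(2 * np.pi))
    return action, logp.sum(axis=-1)
```

Summing per-dimension log-probabilities is what makes the diagonal-covariance assumption explicit: action dimensions are treated as independent.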

### MLP
@@ -36,8 +37,19 @@ This module contains RL policy networks and related model implementations, suppo
```python
actor = build_mlp_from_cfg(actor_cfg, obs_dim, action_dim)
critic = build_mlp_from_cfg(critic_cfg, obs_dim, 1)
- policy = build_policy(policy_block, obs_space, action_space, device, actor=actor, critic=critic)
- action, log_prob, value = policy.get_action(obs)
+ policy = build_policy(
+     policy_block,
+     env.flattened_observation_space,
+     env.action_space,
+     device,
+     actor=actor,
+     critic=critic,
+ )
+ step_td = TensorDict({"obs": obs}, batch_size=[obs.shape[0]], device=obs.device)
+ step_td = policy.get_action(step_td)
+ action = step_td["action"]
+ log_prob = step_td["sample_log_prob"]
+ value = step_td["value"]
```

## Extension and Customization
Expand Down
5 changes: 3 additions & 2 deletions docs/source/overview/rl/trainer.md
@@ -20,7 +20,7 @@ This module implements the main RL training loop, logging management, and event-

## Main Methods
- `train(total_timesteps)`: Main training loop, automatically collects data, updates policy, and logs.
- - `_collect_rollout()`: Collect one rollout, supports custom callback statistics.
+ - `_collect_rollout()`: Collect one rollout through `SyncCollector`, supports custom callback statistics.
- `_log_train(losses)`: Log training loss, reward, sampling speed, etc.
- `_eval_once()`: Periodic evaluation, records evaluation metrics.
- `save_checkpoint()`: Save model parameters and training state.
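The methods above compose roughly as follows. This is a sketch, not the actual `Trainer.train()`; attribute names such as `buffer_size`, `num_envs`, `eval_freq`, and `save_freq` are assumptions based on the configs touched by this PR.

```python
def train(trainer, total_timesteps):
    """Sketch of the main loop: collect a rollout, update the policy, log,
    and periodically evaluate/checkpoint."""
    steps, iteration = 0, 0
    while steps < total_timesteps:
        rollout = trainer._collect_rollout()        # collector fills the shared rollout
        losses = trainer.algorithm.update(rollout)  # algorithm consumes the rollout
        trainer._log_train(losses)
        steps += trainer.buffer_size * trainer.num_envs
        iteration += 1
        if iteration % trainer.eval_freq == 0:
            trainer._eval_once()
        if iteration % trainer.save_freq == 0:
            trainer.save_checkpoint()
```

Note that sample count advances by `buffer_size * num_envs` per iteration, which is why the configs in this PR rename `rollout_steps` to `buffer_size`.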
@@ -35,7 +35,7 @@ This module implements the main RL training loop, logging management, and event-

## Usage Example
```python
- trainer = Trainer(policy, env, algorithm, num_steps, batch_size, writer, ...)
+ trainer = Trainer(policy, env, algorithm, buffer_size, batch_size, writer, ...)
trainer.train(total_steps)
trainer.save_checkpoint()
```
@@ -44,6 +44,7 @@
- Custom event modules can be implemented for environment reset, data collection, evaluation, etc.
- Supports multi-environment parallelism and distributed training.
- Training process can be flexibly adjusted via config files.
+ - The current trainer uses a shared rollout `TensorDict`: collector writes policy-side fields and `EmbodiedEnv` writes environment-side `next.*` fields through `set_rollout_buffer()`.

## Practical Tips
- It is recommended to perform periodic evaluation and model saving to prevent loss of progress during training.