2 changes: 1 addition & 1 deletion configs/agents/rl/basic/cart_pole/train_config.json
@@ -9,7 +9,7 @@
"gpu_id": 0,
"num_envs": 64,
"iterations": 1000,
- "rollout_steps": 1024,
+ "buffer_size": 1024,
"eval_freq": 200,
"save_freq": 200,
"use_wandb": false,
2 changes: 1 addition & 1 deletion configs/agents/rl/basic/cart_pole/train_config_grpo.json
@@ -9,7 +9,7 @@
"gpu_id": 0,
"num_envs": 64,
"iterations": 1000,
- "rollout_steps": 1024,
+ "buffer_size": 1024,
"eval_freq": 200,
"save_freq": 200,
"use_wandb": true,
2 changes: 1 addition & 1 deletion configs/agents/rl/push_cube/train_config.json
@@ -9,7 +9,7 @@
"gpu_id": 0,
"num_envs": 64,
"iterations": 1000,
- "rollout_steps": 1024,
+ "buffer_size": 1024,
"enable_eval": true,
"num_eval_envs": 16,
"num_eval_episodes": 3,
37 changes: 12 additions & 25 deletions docs/source/overview/rl/algorithm.md
@@ -5,20 +5,17 @@ This module contains the core implementations of reinforcement learning algorith
## Main Classes and Functions

### BaseAlgorithm
- - Abstract base class for RL algorithms, defining common interfaces such as buffer initialization, data collection, and update.
+ - Abstract base class for RL algorithms, defining a single update interface over a collected rollout.
- Key methods:
-   - `initialize_buffer(num_steps, num_envs, obs_dim, action_dim)`: Initialize the trajectory buffer.
-   - `collect_rollout(env, policy, obs, num_steps, on_step_callback)`: Collect interaction data.
-   - `update()`: Update the policy based on collected data.
- - Designed to be algorithm-agnostic; Trainer only depends on this interface to support various RL algorithms.
- - Supports multi-environment parallel collection, compatible with Gymnasium/IsaacGym environments.
+   - `update(rollout)`: Update the policy based on a shared rollout `TensorDict`.
+ - Designed to be algorithm-agnostic; `Trainer` handles collection while algorithms focus on loss computation and optimization.
+ - Supports multi-environment parallel collection through a shared `[N, T]` rollout `TensorDict`.

### PPO
- Mainstream on-policy algorithm, supports Generalized Advantage Estimation (GAE), policy update, and hyperparameter configuration.
- Key methods:
-   - `_compute_gae(rewards, values, dones)`: Generalized Advantage Estimation.
-   - `collect_rollout`: Collect trajectories and compute advantages/returns.
-   - `update`: Multi-epoch minibatch optimization, including entropy, value, and policy loss, with gradient clipping.
+   - `compute_gae(rollout, gamma, gae_lambda)`: Generalized Advantage Estimation over a shared rollout `TensorDict`.
+   - `update(rollout)`: Multi-epoch minibatch optimization, including entropy, value, and policy loss, with gradient clipping.
- Supports custom callbacks, detailed logging, and GPU acceleration.
- Typical training flow: collect rollout → compute advantage/return → multi-epoch minibatch optimization.
- Supports advantage normalization, entropy regularization, value loss weighting, etc.
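The GAE step in this flow can be sketched as follows. This is a NumPy stand-in for illustration only, not the PR's `compute_gae`: the real implementation operates on a GPU rollout `TensorDict` with `next.done`/`next.value` fields, and the signature here is an assumption.

```python
import numpy as np

def compute_gae(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a [N, T] rollout.

    rewards, values, next_values, dones: arrays of shape [N, T].
    Returns (advantages, returns), both [N, T].
    """
    N, T = rewards.shape
    advantages = np.zeros((N, T), dtype=np.float64)
    last_adv = np.zeros(N, dtype=np.float64)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[:, t]
        # TD error: r_t + gamma * V(s_{t+1}) * (1 - done) - V(s_t)
        delta = rewards[:, t] + gamma * next_values[:, t] * not_done - values[:, t]
        # GAE recursion: A_t = delta_t + gamma * lambda * (1 - done) * A_{t+1}
        last_adv = delta + gamma * gae_lambda * not_done * last_adv
        advantages[:, t] = last_adv
    returns = advantages + values
    return advantages, returns
```

With `gamma = gae_lambda = 1` and zero value estimates this reduces to plain reward-to-go sums, which is a handy sanity check.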
@@ -31,8 +28,7 @@ This module contains the core implementations of reinforcement learning algorith
- Key methods:
- `_compute_step_returns_and_mask(rewards, dones)`: Step-wise discounted returns and valid-step mask.
- `_compute_step_group_advantages(step_returns, seq_mask)`: Per-step group normalization with masked mean/std.
-   - `collect_rollout`: Collect trajectories and compute step-wise advantages.
-   - `update`: Multi-epoch minibatch optimization with optional KL penalty.
+   - `update(rollout)`: Multi-epoch minibatch optimization with optional KL penalty.
- Supports both **Embodied AI** (dense reward, from-scratch training) and **VLA** (sparse reward, fine-tuning) modes via `kl_coef` configuration.
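The per-step group normalization that `_compute_step_group_advantages` performs can be sketched as follows. This is an illustrative NumPy stand-in under the masked mean/std description above; the actual method works on rollout tensors and its exact signature may differ.

```python
import numpy as np

def step_group_advantages(step_returns, seq_mask, eps=1e-8):
    """Per-step group normalization: at each step t, normalize returns
    across the N parallel rollouts using only valid (unmasked) entries.

    step_returns, seq_mask: arrays of shape [N, T].
    Returns advantages of shape [N, T], zeroed where the mask is invalid.
    """
    mask = seq_mask.astype(np.float64)
    # Number of valid group members at each step (clamped to avoid div-by-zero)
    count = mask.sum(axis=0, keepdims=True).clip(min=1.0)
    mean = (step_returns * mask).sum(axis=0, keepdims=True) / count
    var = (((step_returns - mean) ** 2) * mask).sum(axis=0, keepdims=True) / count
    adv = (step_returns - mean) / np.sqrt(var + eps)
    return adv * mask
```

Because the baseline is the group mean at each step, no critic is needed, which is what allows GRPO to pair with an actor-only policy.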

### Config Classes
@@ -43,19 +39,11 @@ This module contains the core implementations of reinforcement learning algorith
## Code Example
```python
class BaseAlgorithm:
-     def initialize_buffer(self, num_steps, num_envs, obs_dim, action_dim):
-         ...
-     def collect_rollout(self, env, policy, obs, num_steps, on_step_callback=None):
-         ...
-     def update(self):
+     def update(self, rollout):
        ...

class PPO(BaseAlgorithm):
-     def _compute_gae(self, rewards, values, dones):
-         ...
-     def collect_rollout(self, ...):
-         ...
-     def update(self):
+     def update(self, rollout):
        ...
```

@@ -71,10 +59,9 @@
- Typical usage:
```python
algo = PPO(cfg, policy)
- buffer = algo.initialize_buffer(...)
for _ in range(num_iterations):
-     algo.collect_rollout(...)
-     algo.update()
+     rollout = collector.collect(buffer_size, rollout=buffer.start_rollout())
+     buffer.add(rollout)
+     algo.update(buffer.get(flatten=False))
```

---
73 changes: 38 additions & 35 deletions docs/source/overview/rl/buffer.md
@@ -5,61 +5,64 @@ This module implements the data buffer for RL training, responsible for storing
## Main Classes and Structure

### RolloutBuffer
- - Used for on-policy algorithms (such as PPO, GRPO), efficiently stores observations, actions, rewards, dones, values, and logprobs for each step.
- - Supports multi-environment parallelism (shape: [T, N, ...]), all data allocated on GPU.
+ - Used for on-policy algorithms (such as PPO, GRPO), storing a shared rollout `TensorDict` for collector and algorithm stages.
+ - Supports multi-environment parallelism with rollout batch shape `[N, T]`, all data allocated on GPU.
- Structure fields:
-   - `obs`: Observation tensor, float32, shape [T, N, obs_dim]
-   - `actions`: Action tensor, float32, shape [T, N, action_dim]
-   - `rewards`: Reward tensor, float32, shape [T, N]
-   - `dones`: Done flags, bool, shape [T, N]
-   - `values`: Value estimates, float32, shape [T, N]
-   - `logprobs`: Action log probabilities, float32, shape [T, N]
-   - `_extras`: Algorithm-specific fields (e.g., advantages, returns), dict[str, Tensor]
+   - `obs`: Flattened observation tensor, float32, shape `[N, T, obs_dim]`
+   - `action`: Action tensor, float32, shape `[N, T, action_dim]`
+   - `sample_log_prob`: Action log probabilities, float32, shape `[N, T]`
+   - `value`: Value estimates, float32, shape `[N, T]`
+   - `next.reward`: Reward tensor, float32, shape `[N, T]`
+   - `next.done`: Done flags, bool, shape `[N, T]`
+   - `next.terminated`: Termination flags, bool, shape `[N, T]`
+   - `next.truncated`: Truncation flags, bool, shape `[N, T]`
+   - `next.value`: Bootstrap next-state values, float32, shape `[N, T]`
+   - Algorithm-added fields such as `advantage`, `return`, `seq_mask`, and `seq_return`

## Main Methods
- - `add(obs, action, reward, done, value, logprob)`: Add one step of data.
- - `set_extras(extras)`: Attach algorithm-related tensors (e.g., advantages, returns).
- - `iterate_minibatches(batch_size)`: Randomly sample minibatches, returns dict (including all fields and extras).
- - Supports efficient GPU shuffle and indexing for large-scale training.
+ - `start_rollout()`: Returns the shared preallocated rollout `TensorDict` for the collector to write into.
+ - `add(rollout)`: Marks the shared rollout as ready for consumption.
+ - `get(flatten=True)`: Returns the stored rollout, optionally flattened over `[N, T]`.
+ - `iterate_minibatches(rollout, batch_size, device)`: Shared batching utility in `buffer/utils.py`.

## Usage Example
```python
- buffer = RolloutBuffer(num_steps, num_envs, obs_dim, action_dim, device)
- for t in range(num_steps):
-     buffer.add(obs, action, reward, done, value, logprob)
- buffer.set_extras({"advantages": adv, "returns": ret})
- for batch in buffer.iterate_minibatches(batch_size):
-     # batch["obs"], batch["actions"], batch["advantages"] ...
+ buffer = RolloutBuffer(num_envs, rollout_len, obs_dim, action_dim, device)
+ rollout = collector.collect(num_steps=rollout_len, rollout=buffer.start_rollout())
+ buffer.add(rollout)
+
+ rollout = buffer.get(flatten=False)
+ for batch in iterate_minibatches(rollout.reshape(-1), batch_size, device):
+     # batch["obs"], batch["action"], batch["advantage"] ...
+     pass
```

## Design and Extension
- - Supports multi-environment parallel collection, compatible with Gymnasium/IsaacGym environments.
- - All data is allocated on GPU to avoid frequent CPU-GPU copying.
- - The extras field can be flexibly extended to meet different algorithm needs (e.g., GAE, TD-lambda, distributional advantages).
- - The iterator automatically shuffles to improve training stability.
- - Compatible with various RL algorithms (PPO, GRPO, A2C, SAC, etc.), custom fields and sampling logic supported.
+ - Supports multi-environment parallel collection, compatible with Gymnasium-style vectorized environments.
+ - All tensors are preallocated on device to avoid frequent CPU-GPU copying.
+ - Algorithm-specific fields are attached directly onto the shared rollout `TensorDict` during optimization.
+ - The shared minibatch iterator automatically shuffles flattened rollout entries for PPO/GRPO-style updates.
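The shared iterator in `buffer/utils.py` can be pictured as below. This sketch models the flattened rollout as a dict of arrays for self-containedness; the actual utility indexes a `TensorDict` and its exact signature (e.g., the `device` argument) may differ.

```python
import numpy as np

def iterate_minibatches(rollout, batch_size, rng=None):
    """Shuffle flattened rollout entries and yield minibatch dicts.

    rollout: dict mapping field name -> array with a shared leading
    dimension of N * T flattened samples.
    """
    rng = rng or np.random.default_rng()
    n = len(next(iter(rollout.values())))
    perm = rng.permutation(n)  # one shuffle per epoch over all samples
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        yield {k: v[idx] for k, v in rollout.items()}
```

Keeping this in one shared utility is what lets PPO and GRPO reuse identical batching behavior instead of each algorithm reimplementing the shuffle.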

## Code Example
```python
class RolloutBuffer:
-     def __init__(self, num_steps, num_envs, obs_dim, action_dim, device):
-         # Initialize tensors
+     def __init__(self, num_envs, rollout_len, obs_dim, action_dim, device):
+         # Preallocate rollout TensorDict
        ...
-     def add(self, obs, action, reward, done, value, logprob):
-         # Add data
+     def start_rollout(self):
+         # Return shared rollout storage
        ...
-     def set_extras(self, extras):
-         # Attach algorithm-related tensors
+     def add(self, rollout):
+         # Mark rollout as full
        ...
-     def iterate_minibatches(self, batch_size):
-         # Random minibatch sampling
+     def get(self, flatten=True):
+         # Consume rollout
        ...
```

## Practical Tips
- - It is recommended to call set_extras after each rollout to ensure advantage/return tensors align with main data.
- - When using iterate_minibatches, set batch_size appropriately for training stability.
- - Extend the extras field as needed for custom sampling and statistics.
+ - The rollout buffer stores flattened RL observations; structured observations should be flattened or encoded before entering this buffer.
+ - `next.value` is kept for bootstrap convenience, while `next.obs` is intentionally not stored to reduce duplicated memory.
+ - Use `buffer/utils.py` for shared minibatch iteration instead of duplicating batching logic in each algorithm.
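The first tip — flattening structured observations before they enter the buffer — can be sketched like this. The field names and helper are hypothetical, purely for illustration:

```python
import numpy as np

def flatten_obs(obs_dict, keys):
    """Concatenate structured observation fields into one flat vector per env.

    A fixed `keys` order keeps the flat layout identical across collection,
    update, and evaluation.
    """
    return np.concatenate(
        [obs_dict[k].reshape(obs_dict[k].shape[0], -1) for k in keys], axis=-1
    )
```

The resulting `[num_envs, obs_dim]` array is what the buffer's `obs` field expects.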

---
26 changes: 19 additions & 7 deletions docs/source/overview/rl/models.md
@@ -7,9 +7,10 @@ This module contains RL policy networks and related model implementations, suppo
### Policy
- Abstract base class for RL policies; all policies must inherit from it.
- Unified interface:
-   - `get_action(obs, deterministic=False)`: Sample or output actions.
-   - `get_value(obs)`: Estimate state value.
-   - `evaluate_actions(obs, actions)`: Evaluate action probabilities, entropy, and value.
+   - `get_action(tensordict, deterministic=False)`: Sample actions into a `TensorDict` without gradients.
+   - `forward(tensordict, deterministic=False)`: Low-level action/value write path used by policy implementations.
+   - `get_value(tensordict)`: Estimate state value into a `TensorDict`.
+   - `evaluate_actions(tensordict)`: Return optimization-time policy outputs from a `TensorDict`.
- Supports GPU deployment and distributed training.

### ActorCritic
@@ -19,8 +20,8 @@ This module contains RL policy networks and related model implementations, suppo
- Actor-only policy without Critic. Used with GRPO (Group Relative Policy Optimization), which estimates advantages via group-level return comparison instead of a value function.
- Supports Gaussian action distributions, learnable log_std, suitable for continuous action spaces.
- Key methods:
-   - `get_action`: Actor network outputs mean, samples action, returns log_prob and critic value.
-   - `evaluate_actions`: Used for loss calculation in PPO/SAC algorithms.
+   - `forward`: Actor network outputs mean, samples action, and writes policy outputs into a `TensorDict`.
+   - `evaluate_actions`: Used for loss calculation in PPO/GRPO algorithms.
- Custom actor/critic network architectures supported (e.g., MLP/CNN/Transformer).
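The Gaussian sampling path in `forward` can be sketched as follows. This is a NumPy stand-in with a hypothetical function name: the real policy samples with torch distributions and writes `action`/`sample_log_prob` into the `TensorDict`.

```python
import numpy as np

def gaussian_sample_and_logprob(mean, log_std, rng=None):
    """Sample from a diagonal Gaussian policy head and return the summed
    per-dimension log-probability.

    mean, log_std: arrays of shape [B, action_dim].
    Returns (action [B, action_dim], log_prob [B]).
    """
    rng = rng or np.random.default_rng()
    std = np.exp(log_std)  # learnable log_std keeps std positive
    action = mean + std * rng.standard_normal(mean.shape)
    # log N(a | mean, std) summed over independent action dimensions
    logp = -0.5 * (((action - mean) / std) ** 2 + 2 * log_std + np.log(2 * np.pi))
    return action, logp.sum(axis=-1)
```

Summing per-dimension log-probabilities is what makes the diagonal-covariance assumption explicit: action dimensions are treated as independent.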

### MLP
@@ -36,8 +37,19 @@ This module contains RL policy networks and related model implementations, suppo
```python
actor = build_mlp_from_cfg(actor_cfg, obs_dim, action_dim)
critic = build_mlp_from_cfg(critic_cfg, obs_dim, 1)
- policy = build_policy(policy_block, obs_space, action_space, device, actor=actor, critic=critic)
- action, log_prob, value = policy.get_action(obs)
+ policy = build_policy(
+     policy_block,
+     env.flattened_observation_space,
+     env.action_space,
+     device,
+     actor=actor,
+     critic=critic,
+ )
+ step_td = TensorDict({"obs": obs}, batch_size=[obs.shape[0]], device=obs.device)
+ step_td = policy.get_action(step_td)
+ action = step_td["action"]
+ log_prob = step_td["sample_log_prob"]
+ value = step_td["value"]
```

## Extension and Customization
Expand Down
5 changes: 3 additions & 2 deletions docs/source/overview/rl/trainer.md
@@ -20,7 +20,7 @@ This module implements the main RL training loop, logging management, and event-

## Main Methods
- `train(total_timesteps)`: Main training loop, automatically collects data, updates policy, and logs.
- - `_collect_rollout()`: Collect one rollout, supports custom callback statistics.
+ - `_collect_rollout()`: Collect one rollout through `SyncCollector`, supports custom callback statistics.
- `_log_train(losses)`: Log training loss, reward, sampling speed, etc.
- `_eval_once()`: Periodic evaluation, records evaluation metrics.
- `save_checkpoint()`: Save model parameters and training state.
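The methods above compose roughly as follows. This is a sketch, not the actual `Trainer.train()`; attribute names such as `buffer_size`, `num_envs`, `eval_freq`, and `save_freq` are assumptions based on the configs touched by this PR.

```python
def train(trainer, total_timesteps):
    """Sketch of the main loop: collect a rollout, update the policy, log,
    and periodically evaluate/checkpoint."""
    steps, iteration = 0, 0
    while steps < total_timesteps:
        rollout = trainer._collect_rollout()        # collector fills the shared rollout
        losses = trainer.algorithm.update(rollout)  # algorithm consumes the rollout
        trainer._log_train(losses)
        steps += trainer.buffer_size * trainer.num_envs
        iteration += 1
        if iteration % trainer.eval_freq == 0:
            trainer._eval_once()
        if iteration % trainer.save_freq == 0:
            trainer.save_checkpoint()
```

Note that sample count advances by `buffer_size * num_envs` per iteration, which is why the configs in this PR rename `rollout_steps` to `buffer_size`.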
@@ -35,7 +35,7 @@ This module implements the main RL training loop, logging management, and event-

## Usage Example
```python
- trainer = Trainer(policy, env, algorithm, num_steps, batch_size, writer, ...)
+ trainer = Trainer(policy, env, algorithm, buffer_size, batch_size, writer, ...)
trainer.train(total_steps)
trainer.save_checkpoint()
```
@@ -44,6 +44,7 @@
- Custom event modules can be implemented for environment reset, data collection, evaluation, etc.
- Supports multi-environment parallelism and distributed training.
- Training process can be flexibly adjusted via config files.
+ - The current trainer uses a shared rollout `TensorDict`: collector writes policy-side fields and `EmbodiedEnv` writes environment-side `next.*` fields through `set_rollout_buffer()`.

## Practical Tips
- It is recommended to perform periodic evaluation and model saving to prevent loss of progress during training.