From 7199459d129e92d9545e0f3b3a06ab3c8d0259c6 Mon Sep 17 00:00:00 2001 From: gokulp01 Date: Wed, 15 Oct 2025 07:28:02 -0500 Subject: [PATCH 1/5] fix small typos Signed-off-by: gokulp01 --- sections/01_introduction.tex | 2 +- sections/02_classic_robotics.tex | 6 +++--- sections/04_imitation_learning.tex | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/sections/01_introduction.tex b/sections/01_introduction.tex index f440de3..cc59038 100644 --- a/sections/01_introduction.tex +++ b/sections/01_introduction.tex @@ -83,7 +83,7 @@ \subsection{Code Example: Batching a (Streaming) Dataset} In practice, most reinforcement learning (RL) and behavioral cloning (BC) algorithms tend to operate on stacks of observations and actions. For the sake of brevity, we will refer to joint states and camera frames with the single term \emph{frame}. For instance, RL algorithms may use a history of previous frames \(o_{t-H_o:t} \) to mitigate partial observability, and BC algorithms are in practice trained to regress chunks of multiple actions (\(a_{t:t+H_a} \)) rather than single controls. -To accommodate for these specifics of robot learning training, \lerobotdataset~provides a native windowing operation, whereby users can define the \emph{seconds} of a given window (before and after) around any given frame, by using the \texttt{delta\_timestemps} functionality. +To accommodate these specifics of robot learning training, \lerobotdataset~provides a native windowing operation, whereby users can define the \emph{seconds} of a given window (before and after) around any given frame, by using the \texttt{delta\_timestamps} functionality. Unavailable frames are appropriately padded, and a padding mask is also returned to filter out the padded frames. Notably, this all happens within the \lerobotdataset, and is entirely transparent to higher-level wrappers commonly used in training ML models such as \texttt{torch.utils.data.DataLoader}.
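The windowing-and-padding behavior described in this hunk can be sketched in a few lines. Note this is a minimal, hypothetical helper for illustration only, not the actual \lerobotdataset\ implementation: it converts `delta_timestamps` (in seconds) to frame offsets via the dataset's fps, clamps out-of-range indices, and returns the padding mask alongside the window.

```python
import numpy as np

def window_with_padding(frames, t, delta_timestamps, fps):
    """Gather the frames at times t + delta (in seconds) around index t.

    Out-of-range indices are clamped to the episode boundaries (padding),
    and a boolean mask marks which returned frames are padded copies.
    """
    n = len(frames)
    indices = [t + round(d * fps) for d in delta_timestamps]
    clamped = [min(max(i, 0), n - 1) for i in indices]
    is_pad = np.array([i != c for i, c in zip(indices, clamped)])
    window = np.stack([frames[c] for c in clamped])
    return window, is_pad

# A 5-frame episode at 10 fps; request 0.2 s and 0.1 s before frame 1.
frames = np.arange(5.0).reshape(5, 1)
window, is_pad = window_with_padding(frames, t=1,
                                     delta_timestamps=[-0.2, -0.1, 0.0], fps=10)
```

The frame 0.2 s before index 1 does not exist, so it is padded with the episode's first frame and flagged in the mask, mirroring the behavior the paragraph describes.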
diff --git a/sections/02_classic_robotics.tex b/sections/02_classic_robotics.tex index ea21f44..0fac978 100644 --- a/sections/02_classic_robotics.tex +++ b/sections/02_classic_robotics.tex @@ -12,15 +12,15 @@ \subsection{Explicit and Implicit Models} \begin{figure} \centering \includegraphics[width=0.5\linewidth]{figures/ch2/ch2-approaches.pdf} - \caption{Overview of methods to generate motion (clearly non-exhausitve, see~\citet{bekrisStateRobotMotion2024}). The different methods can be grouped based on whether they explicitly (\emph{dynamics-based}) or implicitly (\emph{learning-based}) model robot-environment interactions.} + \caption{Overview of methods to generate motion (clearly non-exhaustive, see~\citet{bekrisStateRobotMotion2024}). The different methods can be grouped based on whether they explicitly (\emph{dynamics-based}) or implicitly (\emph{learning-based}) model robot-environment interactions.} \label{fig:generating-motion-atlas} \end{figure} Robotics is concerned with producing artificial motion in the physical world in a useful, reliable and safe fashion. -Thus, robotics is an inherently multi-disciplinar domain: producing autonomous motion in the physical world requires, to the very least, interfacing different software (motion planners) and hardware (motion executioners) components. +Thus, robotics is an inherently multidisciplinary domain: producing autonomous motion in the physical world requires, at the very least, interfacing different software (motion planners) and hardware (motion executors) components. Further, knowledge of mechanical, electrical, and software engineering, as well as rigid-body mechanics and control theory, has proven essential in robotics since the field first developed in the 1950s. More recently, Machine Learning (ML) has also proved effective in robotics, complementing these more traditional disciplines~\citep{connellRobotLearning1993}.
-As a direct consequence of its multi-disciplinar nature, robotics has developed as a rather wide array of methods, all concerned with the main purpose of \highlight{producing artificial motion in the physical world}. +As a direct consequence of its multidisciplinary nature, robotics has developed a rather wide array of methods, all concerned with the main purpose of \highlight{producing artificial motion in the physical world}. Methods to produce robot motion range from traditional \emph{explicit} models---\highlight{dynamics-based}\footnote{Here, we refer to both \emph{kinematics} and \emph{dynamics}-based control.} methods, leveraging precise descriptions of the mechanics of robots' rigid bodies and their interactions with any obstacles in the environment---to \emph{implicit} models---\highlight{learning-based} methods, treating artificial motion as a statistical pattern to learn given multiple sensorimotor readings~\citep{agrawalComputationalSensorimotorLearning,bekrisStateRobotMotion2024}. A variety of methods have been developed between these two extrema.
diff --git a/sections/04_imitation_learning.tex b/sections/04_imitation_learning.tex index 86c6b55..1aa33cf 100644 --- a/sections/04_imitation_learning.tex +++ b/sections/04_imitation_learning.tex @@ -42,7 +42,7 @@ \section{Robot (Imitation) Learning} \label{fig:ch4-observation-action-mapping} \end{figure} -Behavioral Cloning (BC)~\citep{pomerleauALVINNAutonomousLand1988} aims at producing synthetic behaviors by learning the mapping from observations to actions, and in its most natural formulation can be effectively tackled as a \emph{supevised} learning problem, consisting of learning the (deterministic) mapping \(f: \obsspace \mapsto \actionspace, \ a_t = f(o_t) \) by solving +Behavioral Cloning (BC)~\citep{pomerleauALVINNAutonomousLand1988} aims at producing synthetic behaviors by learning the mapping from observations to actions, and in its most natural formulation can be effectively tackled as a \emph{supervised} learning problem, consisting of learning the (deterministic) mapping \(f: \obsspace \mapsto \actionspace, \ a_t = f(o_t) \) by solving \begin{equation}\label{eq:loss-minimization-SL} \min_{f} \mathbb{E}_{(o_t, a_t) \sim p(\bullet)} \mathcal L(a_t, f(o_t)), \end{equation} From 90f9c48ba77c1b25ddda60d24680b022d6be867e Mon Sep 17 00:00:00 2001 From: Gokul <43350089+gokulp01@users.noreply.github.com> Date: Wed, 15 Oct 2025 07:41:43 -0500 Subject: [PATCH 2/5] fixed (few more) small typos --- sections/03_reinforcement_learning.tex | 2 +- sections/04_imitation_learning.tex | 6 +++--- sections/05_foundation_models.tex | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/sections/03_reinforcement_learning.tex b/sections/03_reinforcement_learning.tex index 92791fe..b78aff8 100644 --- a/sections/03_reinforcement_learning.tex +++ b/sections/03_reinforcement_learning.tex @@ -151,7 +151,7 @@ \subsection{Real-world RL for Robotics} First, especially early in training, \highlight{actions are typically exploratory, and thus may be erratic}.
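For a deterministic policy and a squared-error choice of \( \mathcal L \), the BC objective in eq.~\ref{eq:loss-minimization-SL} reduces to plain regression. A minimal sketch on made-up synthetic \((o, a)\) pairs (a toy linear "expert", not a real demonstration dataset):

```python
import numpy as np

# Toy "demonstrations": actions generated by a linear expert a = W* o.
rng = np.random.default_rng(0)
W_true = np.array([[2.0, -1.0]])
obs = rng.normal(size=(256, 2))      # observations o_t
act = obs @ W_true.T                 # expert actions a_t = f*(o_t)

# Behavioral cloning with a squared-error loss: fit f(o) = W o by
# gradient descent on the empirical counterpart of the BC objective.
W = np.zeros((1, 2))
for _ in range(500):
    grad = 2.0 * (obs @ W.T - act).T @ obs / len(obs)
    W -= 0.1 * grad
```

Because the toy demonstrations are noiseless and unimodal, the regressor recovers the expert exactly; the failure modes discussed later (covariate shift, multimodality) only appear once these idealized assumptions are dropped.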
On physical systems, untrained policies may command high velocities, self-colliding configurations, or torques exceeding joint limits, leading to wear and potential hardware damage. -Mitigating these risks requires external safeguards (e.g., watchdogs, safety monitors, emergency stops), often incuring in a high degree of human supervision. +Mitigating these risks requires external safeguards (e.g., watchdogs, safety monitors, emergency stops), often demanding a high degree of human supervision. Further, in the typical episodic setting considered in most robotics problems, experimentation is substantially slowed down by the need to manually reset the environment over the course of training, a time-consuming and error-prone process. Second, learning efficiently remains problematic in RL, \highlight{limiting the applicability of RL in real-world robotics due to consequently prohibitive timescales of training}. Even strong algorithms such as SAC~\citep{haarnojaSoftActorCriticOffPolicy2018} typically require a large number of transitions \( \{ \sars \}_{t=1}^N \). diff --git a/sections/04_imitation_learning.tex b/sections/04_imitation_learning.tex index 1aa33cf..dae6c6a 100644 --- a/sections/04_imitation_learning.tex +++ b/sections/04_imitation_learning.tex @@ -67,13 +67,13 @@ \section{Robot (Imitation) Learning} \begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/ch4/ch4-issues-with-bc.pdf} - \caption{Point-wise policies suffer from limitations due to (A) covariate shifts and (B) poor approximation of multimodal demonstrations. (A) Small errors may drive the policy out of distribution, incuring in a vicious circle ultimately resulting in failure.
(B) Both modes of reaching for a target object in the scene---either left or right-first---are equally as good and thus equally as likely to be present in a dataset of human demonstrations, ultimately resulting in multimodal demonstrations.} + \caption{Point-wise policies suffer from limitations due to (A) covariate shifts and (B) poor approximation of multimodal demonstrations. (A) Small errors may drive the policy out of distribution, entering a vicious circle ultimately resulting in failure. (B) Both modes of reaching for a target object in the scene---either left or right-first---are equally good and thus equally likely to be present in a dataset of human demonstrations, ultimately resulting in multimodal demonstrations.} \label{fig:ch4-issues-with-bc} \end{figure} While conceptually elegant, \emph{point-estimate policies} \( f : \obsspace \mapsto \actionspace \) learned by solving eq.~\ref{eq:loss-minimization-SL} have been observed to suffer from (1) compounding errors~\citep{rossReductionImitationLearning2011} and (2) poor fit to multimodal distributions~\citep{florenceImplicitBehavioralCloning2022, keGraspingChopsticksCombating2020}. Figure~\ref{fig:ch4-issues-with-bc} illustrates these two key issues related to learning \emph{explicit policies}~\citep{florenceImplicitBehavioralCloning2022}. -Besides sequentiality in \( \mathcal D \), compounding errors due to \emph{covariate shift} may also prove catastrophic, as even small \( \epsilon \)-prediction errors \( 0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon \) can quickly drive the policy into out-of-distribution states, incuring in less confident generations and thus compounding errors (Figure~\ref{fig:ch4-issues-with-bc}, left).
+Besides sequentiality in \( \mathcal D \), compounding errors due to \emph{covariate shift} may also prove catastrophic, as even small \( \epsilon \)-prediction errors \( 0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon \) can quickly drive the policy into out-of-distribution states, yielding less confident generations and thus compounding errors (Figure~\ref{fig:ch4-issues-with-bc}, left). Moreover, point-estimate policies typically fail to learn \emph{multimodal} targets, which are very common in human demonstrations solving real-world robotics problems, as multiple trajectories can be equally good towards the accomplishment of a goal (e.g., symmetric grasps, Figure~\ref{fig:ch4-issues-with-bc}, right). In particular, unimodal regressors tend to average across modes, yielding indecisive or even unsafe commands~\citep{florenceImplicitBehavioralCloning2022}. To address poor multimodal fitting,~\citet{florenceImplicitBehavioralCloning2022} propose learning the \emph{generative model} \( p(o, a) \) underlying the samples in \( \mathcal D \), rather than explicitly learning a prediction function \( f: a = f(o) \). @@ -198,7 +198,7 @@ \subsubsection{Diffusion Models} DMs are a particular instantiation of HMLV models for which the posterior is fixed to \( q( z_t \vert z_{t-1}) = \mathcal N(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t \mathbf{I}) \), for a given \( \beta_t \in \mathbb R^+ \). In practice, \( \beta_t \) is used to iteratively reduce the signal-to-noise ratio along the latents' hierarchy, similarly to how a diffusion process influences the information of a physical system. -Just like VAEs, DMs attemp to learn to reproduce an underlying data distribution \( p (o,a) \) given a collection of i.i.d. samples approximating the model posited to have generated the data in the first place (eq.~\ref{eq:BC-multi-latent-model-1}). +Just like VAEs, DMs attempt to learn to reproduce an underlying data distribution \( p (o,a) \) given a collection of i.i.d.
samples approximating the model posited to have generated the data in the first place (eq.~\ref{eq:BC-multi-latent-model-1}). Similarly to VAEs, DMs approximate the process of sampling from the unknown \( p(o,a) \) by (1) sampling from an easy-to-sample distribution (e.g., Gaussian) and (2) learning to reconstruct high-likelihood samples under the unknown distribution. However, in stark contrast with VAEs, the easy-to-sample distribution contains \emph{no mutual information} regarding the data distribution \( p(o,a) \). Crucially, as no information from the sample \( (o,a) \) (denoted as \( z_0 \equiv (o,a) \) for simplicity of notation) is assumed to be propagated throughout the chain of latents, the posterior \( q(z_t \vert z_{t-1})\) assumes a relatively tractable structure in DMs, reducing complexity. diff --git a/sections/05_foundation_models.tex b/sections/05_foundation_models.tex index 884d4ac..6572843 100644 --- a/sections/05_foundation_models.tex +++ b/sections/05_foundation_models.tex @@ -137,7 +137,7 @@ \subsection{\( \pi_0 \)} \end{equation*} Note how \emph{intra}-block directional attention allows tokens to communicate freely, while \emph{inter}-block communication is mediated by the attention mask \(\mathbf{A} \). \emph{Blockwise causal masking} effectively prevents the pre-trained perception-language tokens from attending to robotics-tokens, likely out of distribution for VLM backbones traditionally trained on large corpora of internet, non-robotics, data. -Crucially, because communication is obstructed between image-language tokens, proprioperceptive tokens and action tokens, one can cache keys and values across denoising steps at runtime time, incuring in a reduced computational footprint and faster inference.
+Crucially, because communication is obstructed between image-language tokens, proprioceptive tokens and action tokens, one can cache keys and values across denoising steps at runtime, yielding a reduced computational footprint and faster inference. In \pizero, both the VLM backbone and action expert are updated using a \emph{flow matching} loss, in particular minimizing: \begin{align} From 4d59b3ce523bee0a0aeb1fe14fbbe6ab790bf366 Mon Sep 17 00:00:00 2001 From: Loy van Beek Date: Wed, 15 Oct 2025 13:56:58 +0200 Subject: [PATCH 3/5] Fixed some small typos I spotted during a quick browse --- sections/03_reinforcement_learning.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sections/03_reinforcement_learning.tex b/sections/03_reinforcement_learning.tex index b78aff8..1f4c99d 100644 --- a/sections/03_reinforcement_learning.tex +++ b/sections/03_reinforcement_learning.tex @@ -17,7 +17,7 @@ \section{Robot (Reinforcement) Learning} \end{figure} Learning-based techniques for robotics naturally address the limitations presented in Section~\ref{sec:classical} (Figure~\ref{fig:robot-learning-upsides}). -In particular, learning-based techniques typically rely on monolithich prediction-to-action pipelines (\emph{visuomotor policies}) which do directly map sensorimotor inputs to predicted actions, streamlining control policies by removing the need to interface multiple components. +In particular, learning-based techniques typically rely on monolithic prediction-to-action pipelines (\emph{visuomotor policies}) which directly map sensorimotor inputs to predicted actions, streamlining control policies by removing the need to interface multiple components. Mapping sensory inputs to actions also makes it possible to incorporate diverse input modalities, leveraging the automatic feature extraction capabilities of modern learning systems.
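The blockwise-causal structure described in the \pizero\ hunk above can be sketched as a boolean mask over token blocks. The block sizes below are made up for illustration, and this is a toy construction rather than the actual \pizero\ code:

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """Boolean attention mask (True = query may attend to key).

    Tokens attend freely within their own block and to all earlier
    blocks; attention to later blocks is masked out, which is what
    allows prefix keys/values to be cached across denoising steps.
    """
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_ids[:, None] >= block_ids[None, :]

# Blocks: 3 image-language tokens, 1 proprioceptive token, 2 action tokens.
mask = blockwise_causal_mask([3, 1, 2])
```

Because the rows belonging to the image-language prefix never attend to later blocks, their keys and values are unaffected by the denoising iterations over the action tokens, so they can be computed once and cached.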
Moreover, learning-based approaches can, in principle, bypass explicit modeling altogether and instead rely solely on interaction data---an advantage that proves transformative when dynamics are difficult to model or entirely unknown. Lastly, learning for robotics (\emph{robot learning}) is naturally well positioned to leverage the growing amount of robotics data openly available, just as computer vision and natural language processing historically benefited from large-scale corpora of data, in great part overlooked by dynamics-based approaches. @@ -25,7 +25,7 @@ \section{Robot (Reinforcement) Learning} Being a field at a relatively nascent stage, no prevalent technique(s) proves distinctly better than any other in the domain of robot learning. Still, two major classes of methods have gained prominence: \highlight{Reinforcement Learning (RL)} and \highlight{Behavioral Cloning (BC)} (Figure~\ref{fig:robot-learning-atlas}). In this section, we provide a conceptual overview of applications of RL to robotics, as well as introduce practical examples of how to use RL within \lerobot. -We then introduce the major limitations RL suffers from, to introduce BC techniques in Section~\ref{sec:learning-imitation} and Section~{sec:learning-foundation}.
\begin{wrapfigure}[23]{r}{0.3\textwidth} \vspace{-\intextsep} From bef82463e9cc0a664482a5265c001747b51969be Mon Sep 17 00:00:00 2001 From: Damien LaRocque Date: Mon, 10 Nov 2025 22:46:10 +0100 Subject: [PATCH 4/5] Fix some typos in RL --- sections/03_reinforcement_learning.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sections/03_reinforcement_learning.tex b/sections/03_reinforcement_learning.tex index 1f4c99d..63dc50e 100644 --- a/sections/03_reinforcement_learning.tex +++ b/sections/03_reinforcement_learning.tex @@ -73,7 +73,7 @@ \subsection{A (Concise) Introduction to RL} MDPs allowing for an unbounded number of interactions (\( T \to + \infty \)) are termed \emph{infinite-horizon}, and opposed to \emph{finite-horizon} MDPs in which \( T \) is finite. Unless diversely specified, we will only be referring to discrete-time finite-horizon (\emph{episodic}) MDPs. -Formally, a lenght-\(T\) Markov Decision Process (MDP) is a tuple \( \mathcal M = \langle \statespace, \actionspace, \dynamics, r, \gamma, \rho, T \rangle \), where: +Formally, a length-\(T\) Markov Decision Process (MDP) is a tuple \( \mathcal M = \langle \statespace, \actionspace, \dynamics, r, \gamma, \rho, T \rangle \), where: \begin{itemize} \item \(\statespace\) is the \emph{state space}; \(\state \in \statespace\) denotes the (possibly non-directly observable) environment state at time \(t\). In robotics, states often comprise robot configuration and velocities (\(q_t, \dot q_t\)), and can also accomodate sensor readings such as camera or audio streams. % @@ -295,7 +295,7 @@ \subsection{Real-world RL for Robotics} In their technical report,~\citet{luoSERLSoftwareSuite2025} empirically address the needs (1) to define a reward function and (2) to use it starting from unstructured, image observations. 
In particular,~\citet[SERL]{luoSERLSoftwareSuite2025} introduces a suite of tools streamlining the training of \emph{reward classifiers} \( c \), as well as jointly learning forward-backward controllers to speed up real-world RL. -Reward classifiers are particularly useful in treating complex, dynamic tasks---e.g., folding a t-shirt---for which a precise reward formulation is arbitrarily complex to obtain, or that do require significant shaping and are more easily learned directly from demonstrations of success (\(e^+\)) or failure (\(e^-\)) states, rather than from a precise formulation of \( r_t \), with a natural target for the reward classifier being \( r(s) = \log c(e^+ \ vert s ) \). +Reward classifiers are particularly useful in treating complex, dynamic tasks---e.g., folding a t-shirt---for which a precise reward formulation is arbitrarily complex to obtain, or that require significant shaping and are more easily learned directly from demonstrations of success (\(e^+\)) or failure (\(e^-\)) states, rather than from a precise formulation of \( r_t \), with a natural target for the reward classifier being \( r(s) = \log c(e^+ \vert s ) \). Furthermore,~\citet{luoSERLSoftwareSuite2025} demonstrate the benefits of learning separate (1) \emph{forward} and (2) \emph{backward} controllers---parametrized by separate policies---where (1) the former learns to execute a task to completion and (2) the latter learns to reset the environment to its initial state from terminal states, thereby aiding training in real-world episodic settings. Lastly, in order to improve the robustness of their approach to different goals while maintaining practical scalability,~\citet{luoSERLSoftwareSuite2025} introduced a modified state and action space, expressing proprioceptive configurations \( q \) and actions \( \dot q \) in the frame of the end-effector pose at \( t=0 \).
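The reward-classifier target \( r(s) = \log c(e^+ \vert s) \) from the hunk above can be sketched with a logistic classifier over toy two-dimensional states standing in for image features. The data and training loop below are made up for illustration (SERL trains classifiers on camera observations):

```python
import numpy as np

# Toy data: "success" states e+ cluster near a goal; "failure" states e- do not.
rng = np.random.default_rng(0)
goal = np.array([1.0, 1.0])
s_pos = goal + 0.1 * rng.normal(size=(200, 2))   # e+ examples
s_neg = rng.normal(size=(200, 2))                # e- examples
X = np.vstack([s_pos, s_neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Logistic classifier c(e+ | s), trained by gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

def reward(s):
    """r(s) = log c(e+ | s), the natural classifier-based reward target."""
    return np.log(1.0 / (1.0 + np.exp(-(s @ w + b))))
```

States resembling the demonstrated successes receive rewards near zero, while states far from them receive large negative rewards, giving dense shaping without a hand-written \( r_t \).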
From f77d4943d8c56aec1db0e225e9e8dd77d9474407 Mon Sep 17 00:00:00 2001 From: Damien LaRocque Date: Mon, 10 Nov 2025 22:50:15 +0100 Subject: [PATCH 5/5] Correct 'proprioperceptive' as 'proprioceptive' --- sections/03_reinforcement_learning.tex | 70 +++++++++++++------------- sections/04_imitation_learning.tex | 16 +++--- sections/05_foundation_models.tex | 16 +++--- 3 files changed, 51 insertions(+), 51 deletions(-) diff --git a/sections/03_reinforcement_learning.tex b/sections/03_reinforcement_learning.tex index 63dc50e..154893c 100644 --- a/sections/03_reinforcement_learning.tex +++ b/sections/03_reinforcement_learning.tex @@ -4,21 +4,21 @@ \section{Robot (Reinforcement) Learning} \epigraph{\textit{Approximate the solution, not the problem} [...]}{Richard Sutton} \begin{tldr} -The need for expensive, high-fidelity simulators can be obviated learning from real-world data, using sample-efficient algorithms that can safely train directly on hardware. + The need for expensive, high-fidelity simulators can be obviated by learning from real-world data, using sample-efficient algorithms that can safely train directly on hardware. \end{tldr} \begin{figure} \centering \includegraphics[width=0.9\linewidth]{figures/ch3/ch3-learning-benefits.pdf} \caption{Learning-based robotics streamlines perception-to-action by learning a (1) unified high-level controller capable of taking (2) high-dimensional, unstructured sensorimotor information. Learning (3) does not require a dynamics model and instead focuses on interaction data, and (4) empirically correlates with - the scale of the data used. + the scale of the data used. } \label{fig:robot-learning-upsides} \end{figure} Learning-based techniques for robotics naturally address the limitations presented in Section~\ref{sec:classical} (Figure~\ref{fig:robot-learning-upsides}).
In particular, learning-based techniques typically rely on monolithic prediction-to-action pipelines (\emph{visuomotor policies}) which directly map sensorimotor inputs to predicted actions, streamlining control policies by removing the need to interface multiple components. -Mapping sensory inputs to actions also makes it possible to incorporate diverse input modalities, leveraging the automatic feature extraction capabilities of modern learning systems. +Mapping sensory inputs to actions also makes it possible to incorporate diverse input modalities, leveraging the automatic feature extraction capabilities of modern learning systems. Moreover, learning-based approaches can, in principle, bypass explicit modeling altogether and instead rely solely on interaction data---an advantage that proves transformative when dynamics are difficult to model or entirely unknown. Lastly, learning for robotics (\emph{robot learning}) is naturally well positioned to leverage the growing amount of robotics data openly available, just as computer vision and natural language processing historically benefited from large-scale corpora of data, in great part overlooked by dynamics-based approaches. @@ -76,11 +76,11 @@ \subsection{A (Concise) Introduction to RL} Formally, a length-\(T\) Markov Decision Process (MDP) is a tuple \( \mathcal M = \langle \statespace, \actionspace, \dynamics, r, \gamma, \rho, T \rangle \), where: \begin{itemize} \item \(\statespace\) is the \emph{state space}; \(\state \in \statespace\) denotes the (possibly non-directly observable) environment state at time \(t\). In robotics, states often comprise robot configuration and velocities (\(q_t, \dot q_t\)), and can also accommodate sensor readings such as camera or audio streams. - % - \item \(\actionspace\) is the \emph{action space}; \(\action \in \actionspace\) may represent joint torques, joint velocities, or even end-effector commands at timestep \( t \).
In general, actions correspond to commands intervenings on the configuration of the robot. - % + % + \item \(\actionspace\) is the \emph{action space}; \(\action \in \actionspace\) may represent joint torques, joint velocities, or even end-effector commands at timestep \( t \). In general, actions correspond to commands intervening on the configuration of the robot. + % \item \(\dynamics\) represents the (possibly non-deterministic) environment dynamics, with \(\dynamics: \statespace \times \actionspace \times \statespace \mapsto [0, 1] \), \( \dynamics \, \transition = \transitionprob \). For instance, for a planar manipulator dynamics could be considered deterministic when the environment is fully described (Figure~\ref{fig:planar-manipulation-simple}), and stochastic when unmodeled disturbances depending on non-observable parameters intervene (Figure~\ref{fig:planar-manipulator-box-velocity}). - % + % \item \(r: \statespace \times \actionspace \times \statespace \to \mathbb R\) is the \emph{reward function}, weighing the transition \( \transition \) in the context of the achievement of an arbitrary goal. For instance, a simple reward function for quickly moving along the \( x \) axis (Figure~\ref{fig:robotics-with-rl-examples}) could be based on the absolute position of the robot along the \( x \) axis~(\(p_{x_t}\)), include negative penalties for falling over (measured from \( p_{z_t} \)), and introduce bonuses \( \dot p_{x_t} \) for speed, \(r \transition \equiv r(\state) = p_{x_t} \cdot \dot p_{x_t} - \tfrac{1}{p_{z_t}} \). \end{itemize} Lastly, \(\gamma \in [0,1) \) represents the discount factor regulating preference for immediate versus long-term reward (with an effective horizon equal to \( \tfrac{1}{1-\gamma} \)), and \( \rho \) is the distribution over \(\statespace \) for the MDP's \emph{initial state}, \( s_0 \sim \rho \).
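The discounted return and the effective-horizon reading of \( \gamma \) in the itemized MDP definition above can be checked numerically with toy rewards (illustrative values only):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G(tau) = sum_t gamma^t r_t over one episodic trajectory."""
    return float(np.sum(gamma ** np.arange(len(rewards)) * np.asarray(rewards)))

# A constant reward of 1 per step: the long-horizon limit is 1 / (1 - gamma),
# matching the effective-horizon interpretation of the discount factor.
G = discounted_return([1.0] * 1000, gamma=0.99)
```

With \( \gamma = 0.99 \), the sum approaches \( 1/(1-\gamma) = 100 \): rewards beyond roughly 100 steps contribute negligibly, which is what "effective horizon" means.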
@@ -93,8 +93,8 @@ \subsection{A (Concise) Introduction to RL} Interestingly, assuming both the environment dynamics and conditional distribution over actions given states---i.e., the \emph{policy}---to be \emph{Markovian}: % \begin{align} -\mathbb P(\stateplusone \vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) &= \mathbb P \transitiongiven \label{eq:dynamics_markovian} \\ -\mathbb P(\action \vert \state, a_{t-1}, s_{t-1}, s_0, a_0) &= \mathbb P(\action \vert \state), \label{eq:policy_markovian} + \mathbb P(\stateplusone \vert s_t, a_t, s_{t-1}, a_{t-1}, \dots s_0, a_0 ) & = \mathbb P \transitiongiven \label{eq:dynamics_markovian} \\ + \mathbb P(\action \vert \state, a_{t-1}, s_{t-1}, s_0, a_0) & = \mathbb P(\action \vert \state), \label{eq:policy_markovian} \end{align} % the probability of observing a given trajectory \( \tau \) factorizes into: \begin{equation} \mathbb P(\tau) = \mathbb P (s_0) \prod_{t=0}^{T-1} \mathbb P \transitiongiven \ \mathbb P(\action \vert \state). \end{equation} -Policies \( \mathbb P(\action \vert \state) \) are typically indicated as \( \pi(\action \vert \state) \), often parametrized via \( \theta \), yielding \( \pi_\theta (\action \vert \state )\), and are traine by optimizing the (discounted) \emph{return} associated to a given \( \tau \), i.e. the (random) sum of measured rewards over an arbitrary trajectory, +Policies \( \mathbb P(\action \vert \state) \) are typically indicated as \( \pi(\action \vert \state) \), often parametrized via \( \theta \), yielding \( \pi_\theta (\action \vert \state )\), and are trained by optimizing the (discounted) \emph{return} associated with a given \( \tau \), i.e., the (random) sum of measured rewards over an arbitrary trajectory, \[ G(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r_t. \] -In that, agents seek to learn control strategies (\emph{policies}, \( \pi_\theta \)) maximizing the expected return \( \mathbb E_{\tau \sim \pi_\theta} G(\tau) \).
+In this framework, agents seek to learn control strategies (\emph{policies}, \( \pi_\theta \)) maximizing the expected return \( \mathbb E_{\tau \sim \pi_\theta} G(\tau) \). For a given dynamics \( \mathcal D \)---i.e., for a given problem---taking the expectation over the (possibly random) trajectories resulting from acting according to a certain policy provides a direct, goal-conditioned ordering in the space of all the possible policies \( \Pi \), yielding the (maximization) target \( J : \Pi \mapsto \mathbb R \) \begin{align} - J(\pi_\theta) &= \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} \left[ G(\tau) \right], \label{eq:RL-j-function} \\ - \mathbb P_{\theta; \mathcal D} (\tau) &= \rho \prod_{t=0}^{T-1} \mathcal D \transition \ \pi_\theta (\action \vert \state).\label{eq:traj-probabilities-for-policies} + J(\pi_\theta) & = \mathbb E_{\tau \sim \mathbb P_{\theta; \mathcal D}} \left[ G(\tau) \right], \label{eq:RL-j-function} \\ + \mathbb P_{\theta; \mathcal D} (\tau) & = \rho \prod_{t=0}^{T-1} \mathcal D \transition \ \pi_\theta (\action \vert \state).\label{eq:traj-probabilities-for-policies} \end{align} Crucially, in the RL framework the agent is assumed to only \emph{observe} the environment dynamics and not to intervene on them, and thus eq.~\ref{eq:RL-j-function} varies exclusively with the policy followed.
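The ordering over policies induced by \( J(\pi_\theta) \) in eq.~\ref{eq:RL-j-function} can be made concrete on a toy two-state MDP; the dynamics, reward, and policies below are made up purely for illustration:

```python
import numpy as np

def rollout(pi, T=10, gamma=0.9):
    """Sample one trajectory in a toy 2-state MDP and return G(tau).

    State 1 is a rewarding goal state (r = 1); action 1 moves the agent
    there, action 0 leaves it where it is.  Dynamics are deterministic.
    """
    s, G = 0, 0.0
    for t in range(T):
        a = pi(s)
        s = 1 if (a == 1 or s == 1) else 0
        G += gamma ** t * float(s == 1)
    return G

def J(pi, n_rollouts=100):
    """Monte-Carlo estimate of J(pi) = E_tau[G(tau)]."""
    return float(np.mean([rollout(pi) for _ in range(n_rollouts)]))

go = lambda s: 1     # heads straight to the goal state
stay = lambda s: 0   # never moves
```

Evaluating `J(go)` and `J(stay)` orders the two policies exactly as eq.~\ref{eq:RL-j-function} prescribes: the dynamics are fixed, so the estimate varies only with the policy followed.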
@@ -127,9 +127,9 @@ \subsection{A (Concise) Introduction to RL} \] Importantly, value functions are interrelated: \begin{align} -Q_\pi(s_t, a_t) &= \mathbb{E}_{\stateplusone \sim \mathbb P(\bullet \vert \state, \action)} \left[ r_t + \gamma V_\pi(\stateplusone) \right] \label{eq:q-as-v} \\ -V_\pi(\state) &= \mathbb E_{\action \sim \pi(\bullet \vert \state)} \left[ Q_\pi (\state, \action) \right], -\label{eq:v-as-q} + Q_\pi(s_t, a_t) & = \mathbb{E}_{\stateplusone \sim \mathbb P(\bullet \vert \state, \action)} \left[ r_t + \gamma V_\pi(\stateplusone) \right] \label{eq:q-as-v} \\ + V_\pi(\state) & = \mathbb E_{\action \sim \pi(\bullet \vert \state)} \left[ Q_\pi (\state, \action) \right], + \label{eq:v-as-q} \end{align} inducing an ordering over states and state-action pairs under \( \pi \), and value functions are thus central to most RL algorithms. A variety of algorithms have been developed in RL attempting to find (approximate) solutions to the problem of maximizing cumulative reward (we report some in Figure~\ref{fig:rl-algos-atlas}). @@ -212,15 +212,15 @@ \subsection{Real-world RL for Robotics} \paragraph{Sample-efficient RL} In an MDP, the optimal policy \( \pi^* \) can be derived from its associated \qfunction, \( Q^* \equiv Q_{\pi^*} \), and in particular the optimal action(s) \(\mu(\state)\) can be selected by maximizing the optimal \qfunction \ over the action space, \[ -\mu(\state) = \max_{\action \in \mathcal A} Q^*(\state, \action). + \mu(\state) = \arg\max_{\action \in \mathcal A} Q^*(\state, \action). \] Interestingly, the \qopt-function satisfies a recursive relationship (\emph{Bellman equation}) based on a very natural intuition% \footnote{Quote from~\citet{mnihPlayingAtariDeep2013}. The notation used has been slightly adapted for consistency with the rest of this tutorial.}: \begin{quote} [...]
If the optimal value \( Q^*(\stateplusone, a_{t+1}) \) of the [state] \(\stateplusone \) was known for all possible actions \(a_{t+1} \), then the optimal strategy is to select the action \( a_{t+1}\) maximizing the expected value of \( r_t + \gamma Q^*(s_{t+1}, a_{t+1}) \) -\[ -Q^*(s_t, a_t) = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \left[ r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q^*(s_{t+1}, a_{t+1}) \big\vert s_t, a_t \right] -\] + \[ + Q^*(s_t, a_t) = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \left[ r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q^*(s_{t+1}, a_{t+1}) \big\vert s_t, a_t \right] + \] \end{quote} In turn, the optimal \qfunction \ % @@ -229,21 +229,21 @@ \subsection{Real-world RL for Robotics} \[ Q_{i+1}(s_t, a_t) \leftarrow \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \left[ r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q_i (s_{t+1}, a_{t+1}) \big\vert s_t, a_t \right], \quad i=0,1,2,\dots,K \] -Then, one can derive the (ideally, near-optimal) policy by explicitly maximizing over the action space the final (ideally, near-optimal) estimate \( Q_K \approx Q^* \) at each timestep. +Then, one can derive the (ideally, near-optimal) policy by explicitly maximizing the final (ideally, near-optimal) estimate \( Q_K \approx Q^* \) over the action space at each timestep. Indeed, one can show that under certain assumptions on the MDP considered, \( Q_K \to Q^* \, \text{as } K \to \infty \). -Effective in its early applications to small-scale discrete problems, vanilla Q-learning was found complicated to scale to large \( \statespace \times \actionspace \) problems, in which storing \( Q : \statespace \times \actionspace \mapsto \mathbb R \) alone might result prohibitive.
+Effective in its early applications to small-scale discrete problems, vanilla Q-learning proved difficult to scale to large \( \statespace \times \actionspace \) problems, in which storing \( Q : \statespace \times \actionspace \mapsto \mathbb R \) alone might be prohibitive. Also, vanilla Q-learning is not directly usable for \emph{continuous}, unstructured state-action space MDPs, such as those considered in robotics. In their seminal work on \emph{Deep Q-Learning} (DQN),~\citet{mnihPlayingAtariDeep2013} propose learning Q-values using deep convolutional neural networks, thereby accommodating large and even unstructured \emph{state} spaces. DQN parametrizes the Q-function using a neural network with parameters \( \theta \), updating the parameters by sequentially minimizing the expected squared temporal-difference error (TD-error, \( \delta_i \)): \begin{align} -\mathcal L(\theta_i) &= \mathbb E_{(s_t, a_t) \sim \chi(\bullet)} - \big[ - (\underbrace{y_i - Q_{\theta_i}(s_t, a_t)}_{\delta_i})^2 - \big], \label{eq:dqn-loss} \\ - y_i &= \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma \max_{\action \in \mathcal A} Q_{\theta_{i-1}} (\stateplusone, a_{t+1}) \big], \label{eq:TD-target} + \mathcal L(\theta_i) & = \mathbb E_{(s_t, a_t) \sim \chi(\bullet)} + \big[ + (\underbrace{y_i - Q_{\theta_i}(s_t, a_t)}_{\delta_i})^2 + \big], \label{eq:dqn-loss} \\ + y_i & = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma \max_{a_{t+1} \in \mathcal A} Q_{\theta_{i-1}} (\stateplusone, a_{t+1}) \big], \label{eq:TD-target} \end{align} -where \( \chi \) represents a behavior distribution over state-action pairs. +where \( \chi \) represents a behavior distribution over state-action pairs.
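The tabular value-iteration recursion above can be made concrete with a small, self-contained sketch (a hypothetical two-state, two-action MDP with known deterministic dynamics, purely illustrative and not part of the tutorial's codebase): repeatedly applying the Bellman optimality update drives \( Q_i \) toward \( Q^* \), and the greedy policy \( \mu \) follows by maximizing over actions.

```python
# Toy Bellman optimality iteration: Q_{i+1} <- r + gamma * max_a' Q_i(s', a').
# The 2-state, 2-action MDP below is hypothetical and deterministic, chosen
# only so that the fixed point Q* is easy to verify by hand.
gamma = 0.9
# P[s][a] = (next_state, reward)
P = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 0.0), 1: (1, 1.0)}}

Q = {(s, a): 0.0 for s in P for a in (0, 1)}
for _ in range(200):  # Q_K -> Q* as K grows (contraction with factor gamma)
    Q = {(s, a): r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
         for s in P for a, (s2, r) in P[s].items()}

# Greedy policy mu(s) = argmax_a Q(s, a): action 1 is optimal in both states
mu = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in P}
```

Here \( Q^*(s, 1) = 1/(1-\gamma) = 10 \) in both states, so the iterates converge to it geometrically; the prohibitive part in realistic problems is precisely that this table grows with \( \vert \statespace \times \actionspace \vert \).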
Crucially, \( \chi \) can in principle be different from the policy being followed, effectively allowing the reuse of prior data stored in a \emph{replay buffer} \( D \) in the form of \( \sars \) transitions, used to form the TD-target \( y_i \), TD-error \( \delta_i \) and loss function eq.~\ref{eq:dqn-loss} via Monte-Carlo (MC) estimates. While effective in handling large, unstructured state spaces for discrete action-space problems, DQN's application to continuous control problems proved challenging. @@ -256,7 +256,7 @@ \subsection{Real-world RL for Robotics} ~\citet{lillicrapContinuousControlDeep2019a} extended DPG to the case of (1) high-dimensional unstructured observations and (2) continuous action spaces, introducing Deep Deterministic Policy Gradient (DDPG), an important algorithm in RL and its applications to robotics. DDPG adopts a modified TD-target compared to eq.~\ref{eq:TD-target}, by maintaining a policy network used to select actions, yielding \begin{equation}\label{eq:TD-target-ddpg} -y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma Q_{\theta_{i-1}} (\stateplusone, \mu_\phi(\stateplusone)) \big] . + y_i = \mathbb E_{s_{t+1} \sim \mathbb P(\bullet \vert s_t, a_t)} \big[ r_t + \gamma Q_{\theta_{i-1}} (\stateplusone, \mu_\phi(\stateplusone)) \big] . \end{equation} Similarly to DQN, DDPG also employs the same replay buffer mechanism, reusing past transitions over training for increased sample efficiency and estimating the loss function via MC-estimates. @@ -264,7 +264,7 @@ \subsection{Real-world RL for Robotics} MaxEnt RL~\citep{haarnojaReinforcementLearningDeep2017b} has proven particularly robust thanks to the development of diverse behaviors, incentivized by its entropy-regularization formulation.
In that, MaxEnt revisits the RL objective \( J (\pi) \) to specifically account for the policy entropy \( \mathcal H(\pi (\bullet \vert s_t)) \), \begin{align} - J(\pi) &= \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \chi} \left[ r_t + \alpha \mathcal H(\pi (\bullet \vert s_t)) \right]. + J(\pi) & = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \chi} \left[ r_t + \alpha \mathcal H(\pi (\bullet \vert s_t)) \right]. \label{eq:J-soft} \end{align} This modified objective results in the \emph{soft} TD-target: @@ -298,7 +298,7 @@ \subsection{Real-world RL for Robotics} Reward classifiers are particularly useful in treating complex, dynamic tasks---e.g., folding a t-shirt---for which a precise reward formulation is arbitrarily complex to obtain, or that do require significant shaping and are more easily learned directly from demonstrations of success (\(e^+\)) or failure (\(e^-\)) states, rather than from a precise formulation of \( r_t \), with a natural target for the reward classifier being \( r(s) = \log c(e^+ \vert s ) \). Furthermore,~\citet{luoSERLSoftwareSuite2025} demonstrate the benefits of learning separate (1) \emph{forward} and (2) \emph{backward} controllers---parametrized by separate policies---where (1) the former learns to execute a task to completion and (2) the latter learns to reset the environment to its initial state from terminal states, thereby aiding training in real-world episodic settings. -Lastly, in order to improve on the robustness of their approach to different goals while maintaing practical scalability,~\citet{luoSERLSoftwareSuite2025} introduced a modified state and action space, expressing proprioperceptive configurations \( q \) and actions \( \dot q \) in the frame of the end-effector pose at \( t=0 \). 
+Lastly, in order to improve the robustness of their approach to different goals while maintaining practical scalability,~\citet{luoSERLSoftwareSuite2025} introduced a modified state and action space, expressing proprioceptive configurations \( q \) and actions \( \dot q \) in the frame of the end-effector pose at \( t=0 \). By randomizing the initial pose of the end-effector (\( s_0 \)),~\citet{luoSERLSoftwareSuite2025} achieved a similar result to manually randomizing the environment at every timestep, but with the benefit of maintaining the environment in the same condition across multiple training episodes, making their method more practical and scalable. \begin{figure} @@ -332,7 +332,7 @@ \subsubsection{Code Example: Real-world RL} At a higher level, the HIL-SERL architecture (Figure~\ref{fig:ch3-hil-serl-architecture}) relies on two main components: \begin{itemize} \item An \texttt{Actor}, running a frozen policy network used to interact with the environment and obtain observations. Observations are used both to condition the frozen actor in selecting the action to enact, and to form \( \sars \) transitions that are shared with the \texttt{Learner}. Rewards are inferred using a custom, learned reward classifier trained on a dataset of offline demonstrations. - % + % \item A \texttt{Learner}, used to optimize the policy's parameters \( \theta \) for maximum expected return. The learner samples batches of data from online and offline buffers in equal proportion~\citep{ballEfficientOnlineReinforcement2023}, and shares updated parameters with the \texttt{Actor}.
\end{itemize} @@ -364,10 +364,10 @@ \subsubsection{Limitations of RL in Real-World Robotics: Simulators and Reward D Despite the advancements in real-world RL training, training RL agents for real-world tasks still suffers from the following limitations: \begin{itemize} -\item In those instances where real-world training experience is prohibitively expensive to gather (e.g., Tokamak control~\citep{degraveMagneticControlTokamak2022}, Autonomous Stratospehere Navigation~\citep{bellemareAutonomousNavigationStratospheric2020})in-simulation training is often the only viable option. -However, high-fidelity simulators for real-world problems can be difficult to build and maintain, especially for contact-rich manipulation and tasks involving deformable or soft materials. + \item In those instances where real-world training experience is prohibitively expensive to gather (e.g., Tokamak control~\citep{degraveMagneticControlTokamak2022}, Autonomous Stratosphere Navigation~\citep{bellemareAutonomousNavigationStratospheric2020}), in-simulation training is often the only viable option. + However, high-fidelity simulators for real-world problems can be difficult to build and maintain, especially for contact-rich manipulation and tasks involving deformable or soft materials. -\item Reward design is a fundamental source of brittleness in real-world RL pipelines. While shaping dense rewards is often necessary to guide exploration in long-horizon tasks, the process is error-prone and heavily reliant on human expertise and intuition. Poorly tuned terms can lead to specification gaming or convergence to local optima, making reward shaping a critical challenge for applying RL in practice. Sparse rewards that only signal successful trajectories can avoid these pitfalls but typically result in much slower learning due to reduced supervision. + \item Reward design is a fundamental source of brittleness in real-world RL pipelines.
While shaping dense rewards is often necessary to guide exploration in long-horizon tasks, the process is error-prone and heavily reliant on human expertise and intuition. Poorly tuned terms can lead to specification gaming or convergence to local optima, making reward shaping a critical challenge for applying RL in practice. Sparse rewards that only signal successful trajectories can avoid these pitfalls but typically result in much slower learning due to reduced supervision. \end{itemize} Advances in learning to act from potentially large corpora of human demonstrations via Behavioral Cloning (BC) address both of these concerns. diff --git a/sections/04_imitation_learning.tex b/sections/04_imitation_learning.tex index dae6c6a..705b4ed 100644 --- a/sections/04_imitation_learning.tex +++ b/sections/04_imitation_learning.tex @@ -15,7 +15,7 @@ \section{Robot (Imitation) Learning} \begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/ch4/ch4-bc-trajectories.pdf} - \caption{(A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded episodes in \url{lerobot/svla_so101_pickplace}. Proprioperceptive states provide invaluable to determine the robot's state during an episode. (B) Camera frames are also recorded alongside measurements on the robot's state, capturing information about the robot's interaction with its environment.} + \caption{(A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded episodes in \url{lerobot/svla_so101_pickplace}. Proprioceptive states prove invaluable in determining the robot's state during an episode.
(B) Camera frames are also recorded alongside measurements on the robot's state, capturing information about the robot's interaction with its environment.} \label{fig:ch4-bc-trajectories} \end{figure} @@ -26,11 +26,11 @@ \section{Robot (Imitation) Learning} Most notably, by \emph{learning-to-imitate}, autonomous systems naturally adhere to the objectives, preferences, and success criteria implicitly encoded in the data, which reduces early-stage exploratory failures and obviates hand-crafted reward shaping altogether. Formally, let \( \mathcal D = \{ \tau^{(i)} \}_{i=1}^N \) be a set of expert trajectories, with \( \tau^{(i)} = \{(o_t^{(i)}, a_t^{(i)})\}_{t=0}^{T_i} \) representing the \(i\)-th length-\(T_i\) trajectory in \( \mathcal D \), \(o_t \in \obsspace \) denoting observations (e.g., images and proprioception altogether), and \(a_t \in \actionspace \) the expert actions. -Typically, observations \( o \in \obsspace \) consist of both image and proprioperceptive information, while actions \( a \in \actionspace \) represent control specifications for the robot to execute, e.g. a joint configuration. +Typically, observations \( o \in \obsspace \) consist of both image and proprioceptive information, while actions \( a \in \actionspace \) represent control specifications for the robot to execute, e.g. a joint configuration. Note that differently from Section~\ref{sec:learning-rl}, in the imitation learning context \( \mathcal D \) denotes an offline dataset collecting \( N \) length-\( T_i \) reward-free (expert) human trajectories \( \tau^{(i)} \), and \emph{not} the environment dynamics. Similarly, in this section \( \tau^{(i)} \) represents a length-\(T_i\) trajectory of observation-action pairs, which crucially \emph{omits any reward information} entirely. Figure~\ref{fig:ch4-bc-trajectories} graphically shows trajectories in terms of the average evolution of the actuation on the 6 joints of a teleoperated SO-100 manipulator.
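The observation-action regression at the core of BC can be sketched on synthetic data (a hedged toy: a scalar observation, a linear policy \( \pi_w(o) = w \cdot o \), and a made-up expert mapping \( a = 2o \), not the tutorial's dataset): minimizing the MSE imitation loss over \( \mathcal D \) recovers the expert mapping.

```python
# Hedged sketch of behavioral cloning as supervised regression: fit a scalar
# linear policy pi_w(o) = w * o to (observation, action) pairs from a
# hypothetical expert a = 2 * o by gradient descent on the MSE loss.
import random

random.seed(0)
D = [(o, 2.0 * o) for o in (random.uniform(-1.0, 1.0) for _ in range(256))]

w = 0.0   # policy parameter
lr = 0.1
for _ in range(200):  # gradient descent on mean((w*o - a)^2)
    grad = sum(2.0 * (w * o - a) * o for o, a in D) / len(D)
    w -= lr * grad
```

Real pipelines replace the scalar model with a neural network over image and proprioceptive inputs, but the supervised-regression structure is the same.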
-Notice how proprioperceptive states are captured jointly with camera frames over the course of the recorded episodes, providing a unified high-frame rate collection of both image and joint teleoperation data. +Notice how proprioceptive states are captured jointly with camera frames over the course of the recorded episodes, providing a unified high-frame rate collection of both image and joint teleoperation data. Figure~\ref{fig:ch4-observation-action-mapping} shows \( (o_t, a_t) \)-pairs for the same dataset, with the actions performed by the human expert illustrated alongside the corresponding observation. In principle, (expert) trajectories \( \tau^{(i)} \) can have different lengths since demonstrations might exhibit multi-modal strategies to attain the same goal, resulting in multiple, different behaviors. @@ -38,7 +38,7 @@ \section{Robot (Imitation) Learning} \begin{figure} \centering \includegraphics[width=0.9\textwidth]{figures/ch4/ch4-observation-action-mapping.pdf} - \caption{Sample observations and action pairs over the course of a given trajectory recorded in \url{lerobot/svla_so101_pickplace}. Observations, comprising of both proprioperceptive and visual information, are recorded alongside the configuration of a second, leader robot controlled by a human expert, providing complete information for regressing actions given observations.} + \caption{Sample observations and action pairs over the course of a given trajectory recorded in \url{lerobot/svla_so101_pickplace}. 
Observations, comprising both proprioceptive and visual information, are recorded alongside the configuration of a second, leader robot controlled by a human expert, providing complete information for regressing actions given observations.} \label{fig:ch4-observation-action-mapping} \end{figure} @@ -385,19 +385,19 @@ \subsection{Action Chunking with Transformers} \begin{figure} \centering \includegraphics[width=0.75\textwidth]{figures/ch4/ch4-act-encoder.pdf} - \caption{The CVAE encoder used in ACT. Input action chunks are first embedded and aggregated with positional embeddings, before being processed alongside embedded proprioperceptive information, and a learned \texttt{[CLS]} token used to aggregate input level information, and predict the style variable \( z \). The encoder is exclusively used to \emph{train} the decoder, and it is entirely disregarded at inference time.} + \caption{The CVAE encoder used in ACT. Input action chunks are first embedded and aggregated with positional embeddings, before being processed alongside embedded proprioceptive information, and a learned \texttt{[CLS]} token used to aggregate input-level information and predict the style variable \( z \). The encoder is exclusively used to \emph{train} the decoder, and it is entirely disregarded at inference time.} \label{fig:ch4-act-encoder} \end{figure} However, the authors claim that using a deterministic procedure to sample \( z \) benefits policy evaluation, and thus avoid using the conditional prior at all at inference time, effectively using the CVAE framework exclusively to train a more expressive decoder. At test time,~\citet{zhaoLearningFineGrainedBimanual2023} propose simply using \( z = \mathbf{0} \), as the conditional prior on \( z \) used in training is set to be a standard Gaussian. -Further, conditioning on the observation \( o \) is achieved through explicitly feeding proprioperceptive and visual observations to the decoder, \( p_\theta(a \vert z, o) \) at test time.
-If at inference \( z \) is sampled from a standard Gaussian, during training \( z \) is sampled from an approximate posterior distribution \(q_\phi(z \vert o, a)\), which, however, disregards image observations and exclusively uses proprioperceptive states to form \( o \) for efficiency reasons. +Further, conditioning on the observation \( o \) is achieved through explicitly feeding proprioceptive and visual observations to the decoder, \( p_\theta(a \vert z, o) \) at test time. +While at inference \( z \) is sampled from a standard Gaussian, during training \( z \) is sampled from an approximate posterior distribution \(q_\phi(z \vert o, a)\), which, however, disregards image observations and exclusively uses proprioceptive states to form \( o \) for efficiency reasons. \begin{figure} \centering \includegraphics[width=0.75\textwidth]{figures/ch4/ch4-act-decoder.pdf} - \caption{The CVAE decoder used in ACT, comprising of a full encoder-decoder Transformer architecture. Camera observations from all \( n \) camera views are first embedded using pre-trained visual encoders, and then aggregated with the corresponding positional embeddings. Then, the proprioperceptive information and style variable \( z \) retrieved from the CVAE encoder, are fed to the encoder-decoder Transformer for inference. The encoder shares the matrices \( K,V \) with the decoder, and is trained to decode fixed position embeddings into action chunks.} + \caption{The CVAE decoder used in ACT, comprising a full encoder-decoder Transformer architecture. Camera observations from all \( n \) camera views are first embedded using pre-trained visual encoders, and then aggregated with the corresponding positional embeddings. Then, the proprioceptive information and style variable \( z \) retrieved from the CVAE encoder are fed to the encoder-decoder Transformer for inference.
The encoder shares the matrices \( K,V \) with the decoder, and is trained to decode fixed position embeddings into action chunks.} \label{fig:ch4-act-decoder} \end{figure} diff --git a/sections/05_foundation_models.tex b/sections/05_foundation_models.tex index 6572843..9498779 100644 --- a/sections/05_foundation_models.tex +++ b/sections/05_foundation_models.tex @@ -113,16 +113,16 @@ \subsection{\( \pi_0 \)} \begin{figure} \centering \includegraphics[width=0.9\textwidth]{figures/ch5/ch5-pi0.pdf} - \caption{The \pizero~architecture, as in~\citet{black$p_0$VisionLanguageActionFlow2024}. Vision and language tokens are routed to a VLM backbone which is prevented from attending robot proprioperceptive states and action tokens, which are instead routed to a smaller subset of weights within the architecture referred to as "action expert". The architecture is trained with Flow Matching on 10M+ trajectories from a mixture of closed and openly available datasets.} + \caption{The \pizero~architecture, as in~\citet{black$p_0$VisionLanguageActionFlow2024}. Vision and language tokens are routed to a VLM backbone which is prevented from attending to robot proprioceptive states and action tokens, which are instead routed to a smaller subset of weights within the architecture referred to as the ``action expert''. The architecture is trained with Flow Matching on 10M+ trajectories from a mixture of closed and openly available datasets.} \label{fig:ch5-pi0} \end{figure} Concretely, \( \pi_0 \) is a single, unified transformer with two disjoint sets of weights \( \phi, \theta\). A larger VLM backbone \( f_\phi \) initialized from Gemma 2.6B processes multiple image frames obtained from multiple camera viewpoints \( [\{ I_t \}_{t=1}^n] \), as well as a language instruction \([\ell_t]\) used to describe the task considered.
-Concurrently, a 300M-parameter \emph{action expert} based on a similar transformer architecture is used to process both the robot proprioperceptive state \(q_t\) and an action chunk \(a_{t:t+H_a}\) (Figure~\ref{fig:ch5-pi0}). +Concurrently, a 300M-parameter \emph{action expert} based on a similar transformer architecture is used to process both the robot proprioceptive state \(q_t\) and an action chunk \(a_{t:t+H_a}\) (Figure~\ref{fig:ch5-pi0}). The different expert networks operate separately in processing the respective inputs and turn them into query, key and value matrices, and only share information between each other via self-attention layers. The outputs from the VLM backbone are disregarded, while the vector field regressed by the action expert is used to iteratively refine the action process. -In particular, \pizero~uses a \emph{blockwise causal attention mask} over tokens belonging to three separate blocks: (1) image and language tokens \(\mathcal T_i \) obtained from \([\{ I_t \}_{t=1}^n, \ell_t]\), (2) proprioperceptive tokens \(\mathcal T_q \) obtained from \(q_t\), and (3) the action tokens \( \mathcal T_a \) for items in the chunk \(a^{\tau}_{t:t+H_a}\) at time \( \tau \) in the flow-matching process. +In particular, \pizero~uses a \emph{blockwise causal attention mask} over tokens belonging to three separate blocks: (1) image and language tokens \(\mathcal T_i \) obtained from \([\{ I_t \}_{t=1}^n, \ell_t]\), (2) proprioceptive tokens \(\mathcal T_q \) obtained from \(q_t\), and (3) the action tokens \( \mathcal T_a \) for items in the chunk \(a^{\tau}_{t:t+H_a}\) at time \( \tau \) in the flow-matching process. Notably, \emph{within} each block the attention operations are bidirectional, while \emph{across} blocks, future blocks are masked out. 
Formally, this corresponds to using an attention mask like: \begin{equation*} @@ -137,7 +137,7 @@ \subsection{\( \pi_0 \)} \end{equation*} Note how \emph{intra}-block bidirectional attention allows tokens to communicate freely, while \emph{inter}-block communication is mediated by the attention mask \(\mathbf{A} \). \emph{Blockwise causal masking} effectively prevents the pre-trained perception-language tokens from attending to robotics-tokens, likely out of distribution for VLM backbones traditionally trained on large corpora of internet, non-robotics, data. -Crucially, because communication is obstructed between image-language tokens, proprioperceptive tokens and action tokens, one can cache keys and values across denoising steps at runtime time, incurring in a reduced computational footprint and faster inference. +Crucially, because communication is obstructed between image-language tokens, proprioceptive tokens and action tokens, one can cache keys and values across denoising steps at runtime, incurring a reduced computational footprint and faster inference. In \pizero, both the VLM backbone and action expert are updated using a \emph{flow matching} loss, in particular minimizing: \begin{align} @@ -165,7 +165,7 @@ \subsection{\( \pi_0 \)} In turn, the application of flow matching to large-scale datasets of multiple human behaviors across tasks and embodiments appears rather consequential, particularly considering how it can enable faster inference via a limited number of denoising steps at test time---as few as 10, in \pizero. In particular, the action expert is implemented as a conditional flow matching model. Each action token embeds a noisy action \(a_i^{\tau} \in a^\tau_{t:t+H_a}\), alongside a sinusoidal encoding of the \emph{flow process} timestep \(\tau\). -The action expert then leverages full bidirectional attention across the \(H_a\) action tokens provided, and also attends to previous proprioperceptive and image-language tokens.
+The action expert then leverages full bidirectional attention across the \(H_a\) action tokens provided, and also attends to previous proprioceptive and image-language tokens. Interestingly, unlike in a standard flow matching pipeline~\citep{lipmanFlowMatchingGenerative2023}, \(\tau\) is \emph{not} sampled from a uniform distribution \(\tau \sim \mathcal U([0,1]) \), but rather obtained from \(\tau \sim \textrm{Beta}(1.5,1) \) defined on the \( [0,s], s<1 \) support (Figure~\ref{fig:ch5-pi0-sampling-timesteps}). \begin{wrapfigure}{r}{0.4\textwidth} @@ -204,7 +204,7 @@ \subsection{SmolVLA} \begin{figure} \centering \includegraphics[width=0.9\textwidth]{figures/ch5/ch5-smolvla.pdf} - \caption{The SmolVLA architecture, as in~\citet{shukorSmolVLAVisionLanguageActionModel2025}. SmolVLA is a compact MoE model trained with flow matching to denoise action chunks. Vision and language tokens are fed to a VLM backbone, and share information with the proprioperceptive and action tokens via the attention mechanism. The attention expert interleaves SA and CA layers for further conditioning on the visual features from the VLM backbone. SmolVLA skips computations and reduces the visual tokens, resulting in 7x less memory usage than \pizero~(450M parameters vs. \pizero's 3.3B).} + \caption{The SmolVLA architecture, as in~\citet{shukorSmolVLAVisionLanguageActionModel2025}. SmolVLA is a compact MoE model trained with flow matching to denoise action chunks. Vision and language tokens are fed to a VLM backbone, and share information with the proprioceptive and action tokens via the attention mechanism. The action expert interleaves SA and CA layers for further conditioning on the visual features from the VLM backbone. SmolVLA skips computations and reduces the number of visual tokens, resulting in 7x less memory usage than \pizero~(450M parameters vs.
\pizero's 3.3B).} \label{fig:ch5-smolvla} \end{figure} @@ -218,11 +218,11 @@ \subsection{SmolVLA} \citet{shukorSmolVLAVisionLanguageActionModel2025}'s design choices thus result in a much smaller model compared to \pizero, consisting of ca. 450M parameters versus \pizero's 3.3B parameters. In practice, SmolVLA consumes multi-view RGB images, a natural-language instruction, and a projected sensorimotor state token as inputs, together with the noised \emph{action chunk} \( \tilde{a}_{t:t+H_a} \) the action expert \( v_\theta \) is trained to denoise. -The robot proprioperceptive states are projected to a shared token space with the VLM to match \( d_{\text{VLM}} \), and successively projected into the expert's token space. +The robot proprioceptive states are projected to a shared token space with the VLM to match \( d_{\text{VLM}} \), and subsequently projected into the expert's token space. Similarly to \pizero, SmolVLA adopts separate experts communicating exclusively through self-attention layers, which however do not employ blockwise causal attention masking and rather favour simple causal masking. In contrast with \pizero, the action expert interleaves \emph{cross-attention} (CA) and \emph{self-attention} (SA) layers, a choice shown to yield higher success and smoother action chunks in practice. -While in the expert SA layers tokens are used to obtain queries, keys and values, CA layers use action tokens only as queries, and instead project visual, language and proprioperceptive tokens from the VLM backbone to a shared embedding space to then obtain keys and values. +While in the expert SA layers tokens are used to obtain queries, keys and values, CA layers use action tokens only as queries, and instead project visual, language and proprioceptive tokens from the VLM backbone to a shared embedding space to then obtain keys and values. Notably, keys and values can be cached here as well, resulting in performance gains at inference time.
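The CA pattern described above---action tokens acting as queries against keys and values projected once from the VLM tokens, which is what makes KV caching across denoising steps possible---can be sketched in plain Python (toy dimensions and values, not SmolVLA's actual projections):

```python
# Hedged sketch of cross-attention with cached K/V: the keys/values come from
# VLM tokens and do not depend on the noised action chunk, so they can be
# computed once and reused at every denoising step; only the queries change.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    d = len(keys[0])
    out = []
    for q in queries:  # one attention row per action token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy cached K/V from two "VLM tokens"; one noisy action token as the query
vlm_keys = [[1.0, 0.0], [0.0, 1.0]]
vlm_values = [[2.0, 0.0], [0.0, 2.0]]
chunk = cross_attention([[1.0, 0.0]], vlm_keys, vlm_values)
```

Because `vlm_keys`/`vlm_values` are query-independent, a real implementation caches them once per observation and re-runs only the query path as the action chunk is iteratively denoised.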
SmolVLA also trims down both token and layer compute.
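As a closing illustration, the blockwise causal mask described earlier for \pizero~can be sketched as follows (illustrative block sizes, a hedged sketch rather than the model's actual implementation): attention is bidirectional \emph{within} a block, and across blocks a token may only attend to \emph{earlier} blocks.

```python
# Hedged sketch of a blockwise causal attention mask: tokens attend freely
# within their block and only to earlier blocks across blocks. Block sizes
# are illustrative (2 image-language, 1 proprioceptive, 2 action tokens).
def blockwise_causal_mask(block_sizes):
    block_of = [b for b, n in enumerate(block_sizes) for _ in range(n)]
    n = len(block_of)
    # mask[i][j] == 1 iff token i may attend to token j
    return [[1 if block_of[j] <= block_of[i] else 0 for j in range(n)]
            for i in range(n)]

A = blockwise_causal_mask([2, 1, 2])
```

In this toy mask, the image-language rows attend only within their own block, the proprioceptive token additionally attends to them, and the action tokens attend to everything, matching the intra-block bidirectional, inter-block causal structure.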