
Recent Advances in RL for Continuous Control

Antonin RAFFIN (@araffin.bsky.social)
German Aerospace Center (DLR)
https://araffin.github.io/

RL 101

Two axes of improvement

Sample efficiency
Ex: real robot, slow simulation

Speed
Ex: fast simulation on GPU, slow algorithm

Outline

  1. Advances in Algorithms
  2. Advances in Software
  3. Advances in Simulators

From DQN to SAC (in 10 minutes)

Deep Q-Network (DQN)

DQN

RL Objective

Maximize the expected sum of discounted rewards

\[\begin{aligned} J(\pi) = \mathop{\mathbb{E}}[r_0 + \gamma r_{1} + \gamma^2 r_{2} + ...]. \end{aligned} \]
Action-Value Function: $Q$-Value

How good is it to take action $a$ in state $s$?

\[\begin{aligned} Q^\pi(s, a) = \mathop{\mathbb{E}}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ... | s_t=s, a_t=a]. \end{aligned} \]
Bellman equation (practical): \[\begin{aligned} Q^{\pi}(s, a) &= \mathbb{E}[r_t + \gamma \mathbb{E}_{a'\sim \pi}[Q^{\pi}(s_{t+1},a')]]. \end{aligned}\]

Greedy policy: \[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]

DQN Components

The training loop

DQN
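
For reference, the update at the heart of this loop (standard DQN): regress $Q_\theta$ towards a bootstrapped target built with the frozen target network $Q_{\theta^-}$:

\[\begin{aligned} y_t = r_t + \gamma \max_{a' \in A} Q_{\theta^-}(s_{t+1}, a'), \quad \mathcal{L}(\theta) = \mathop{\mathbb{E}}[(Q_\theta(s_t, a_t) - y_t)^2]. \end{aligned} \]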

Extending DQN to Continuous Actions (DDPG)

Learn an actor $\pi_{\phi}$ that maximizes the $Q$-function.

\[\begin{aligned} \max_{a \in A} Q_\theta(s, a) \approx Q_\theta(s, \pi_{\phi}(s)). \end{aligned} \]

Discrete actions: \[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]
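
A minimal PyTorch-style sketch of the resulting actor update (illustrative only; the critic q_net(s, a), the actor and its optimizer are assumed to be defined elsewhere):

    def ddpg_actor_update(actor, q_net, actor_optimizer, states):
        # Deterministic actor: maximize Q_theta(s, pi_phi(s)) over phi
        # by doing gradient descent on its negation.
        actor_loss = -q_net(states, actor(states)).mean()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()
        return actor_loss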

Deep Deterministic Policy Gradient (DDPG)

Overestimation bias

TD3: take the minimum of the two critics $Q^1_\theta$ and $Q^2_\theta$
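
A hedged PyTorch sketch of the clipped double-$Q$ target (the target networks, replay batch and TD3's target-policy smoothing noise are assumed to be handled elsewhere):

    import torch

    @torch.no_grad()
    def td3_target(q1_target, q2_target, actor_target,
                   rewards, next_states, dones, gamma=0.99):
        # Clipped double-Q: use the minimum of the two target critics
        # to counteract the overestimation bias of a single Q-network.
        next_actions = actor_target(next_states)
        q_next = torch.min(q1_target(next_states, next_actions),
                           q2_target(next_states, next_actions))
        return rewards + gamma * (1.0 - dones) * q_next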

Soft Actor-Critic (SAC)

SAC $\approx$ DQN + DDPG + TD3 + Maximum entropy RL

Maximum entropy RL: encourage exploration while still solving the task
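
For reference, the entropy-regularized objective maximized by SAC (the temperature $\alpha$ trades off reward and exploration):

\[\begin{aligned} J(\pi) = \mathop{\mathbb{E}}\left[\sum_t \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right)\right]. \end{aligned} \]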

Annotated DQN Algorithm

Beyond SAC: TQC, DroQ, SimBa, ...

Distributional RL

TQC: SAC + quantile regression (truncated)

TQC Results

Higher replay ratio (REDQ, DroQ)

Idea: re-use samples from the replay buffer more often (more gradient updates per environment step)

Issue: Naive scaling doesn't work (overestimation, extrapolation errors, ...)

Solution? Explicit (REDQ) / implicit (DroQ) ensembles, regularization, ...

$Q$-value Network and Replay Ratio

SAC (RR=1)

Note: policy delay = replay ratio (RR) for both SAC and DroQ

DroQ (RR=20)

Hiraoka, Takuya, et al. "Dropout q-functions for doubly efficient reinforcement learning."

DroQ Results

RL from scratch in 10 minutes (DroQ)

Using SB3 + Jax = SBX: https://github.com/araffin/sbx
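
A minimal training sketch with SBX; the DroQ-style settings (dropout and layer norm on the $Q$-networks via policy_kwargs, replay ratio via gradient_steps) follow the SBX README, so double-check the keyword names against the installed version:

    from sbx import SAC

    # DroQ = SAC + dropout & layer norm in the Q-networks + high replay ratio
    model = SAC(
        "MlpPolicy",
        "Pendulum-v1",
        policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),
        gradient_steps=20,  # replay ratio: 20 gradient updates per env step
        policy_delay=20,    # policy delay = replay ratio (see note above)
        learning_starts=1_000,
        verbose=1,
    )
    model.learn(total_timesteps=20_000)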

Bigger net (BRO, SimBa, ...)

[Figure: SAC vs. SimBa network architectures]

Lee, Hojoon, et al. "Simba: Simplicity bias for scaling up parameters in deep reinforcement learning."

Note: can be combined with TQC/DroQ (see also CrossQ, TD7, SimBaV2, ...)

SimBa Results

Outline

  1. Advances in Algorithms
  2. Advances in Software
  3. Advances in Simulators

JIT compilation

Stable-Baselines3 (PyTorch) vs SBX (Jax)

PyTorch compile: LeanRL (5x boost)
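
A toy jax.jit example (not the SBX training step) showing where the speed-up comes from: the function is traced once, compiled by XLA, and later calls skip the Python-level op dispatch:

    import jax
    import jax.numpy as jnp

    def mse_loss(params, x, y):
        pred = x @ params["w"] + params["b"]
        return jnp.mean((pred - y) ** 2)

    # First call traces and compiles, subsequent calls run the fused kernel.
    grad_fn = jax.jit(jax.grad(mse_loss))

    params = {"w": jnp.zeros((3, 1)), "b": jnp.zeros((1,))}
    x, y = jnp.ones((32, 3)), jnp.ones((32, 1))
    grads = grad_fn(params, x, y)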

Outline

  1. Advances in Algorithms
  2. Advances in Software
  3. Advances in Simulators

Massive Parallel Sim

Thousands of robots in parallel, learn in minutes

Ex: MJX (MuJoCo), Isaac Sim, Genesis, ...
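
A minimal MJX sketch, assuming the public mujoco.mjx API (put_model, make_data, step); the model file and batch size are placeholders:

    import jax
    import mujoco
    from mujoco import mjx

    mj_model = mujoco.MjModel.from_xml_path("humanoid.xml")  # placeholder model file
    mjx_model = mjx.put_model(mj_model)  # copy the model to the accelerator
    mjx_data = mjx.make_data(mj_model)

    # Simulate 4096 robots in parallel: vmap over a batch of states, jit the step.
    n_envs = 4096
    rng = jax.random.split(jax.random.PRNGKey(0), n_envs)
    batch = jax.vmap(
        lambda key: mjx_data.replace(qpos=jax.random.uniform(key, (mj_model.nq,)))
    )(rng)
    jit_step = jax.jit(jax.vmap(mjx.step, in_axes=(None, 0)))
    batch = jit_step(mjx_model, batch)  # one physics step for all environments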

Optimizing for speed

Getting SAC to Work on a Massive Parallel Simulator

Conclusion

  • More sample-efficient algorithms (TQC, DroQ, ...)
  • Faster software (Jax, Torch compile)
  • Faster simulators (MJX, Isaac Sim, ...)

Questions?