Recent Advances in RL for Continuous Control

Early 2026 update | Model-free RL

Antonin RAFFIN (@araffin.bsky.social)
German Aerospace Center (DLR)
https://araffin.github.io/

Mannheim RL Workshop · 06.02.2026

RL 101

Two lines of improvement

Sample efficiency
Ex: real robot, slow simulation

Speed
Ex: fast simulation on GPU, slow algorithm

Outline

  1. RL 103 (from DQN to SAC)
  2. Improving Sample-Efficiency
  3. Faster Training

From DQN to SAC (in ~10 minutes)

From Tabular Q-Learning to Deep Q-Learning (DQN) https://araffin.github.io/post/rl102/

From Deep Q-Learning (DQN) to Soft Actor-Critic (SAC) and Beyond https://araffin.github.io/post/rl103/

Deep Q-Network (DQN)

DQN

Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." (2013).

RL Objective

Maximize the sum of discounted rewards

\[\begin{aligned} J(\pi) = \mathop{\mathbb{E}}[r_0 + \gamma r_{1} + \gamma^2 r_{2} + ...]. \end{aligned} \]
Action-Value Function: $Q$-Value

How good is it to take action $a$ in state $s$?

\[\begin{aligned} Q^\pi(s, a) = \mathop{\mathbb{E}}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ... | s_t=s, a_t=a]. \end{aligned} \]

\[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]

Bellman equation (practical): \[\begin{aligned} Q^{\pi}(s, a) &= \mathbb{E}[r_t + \gamma \mathbb{E}_{a'\sim \pi}{Q^{\pi}(s_{t+1},a')}]. \end{aligned}\]
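For concreteness, a minimal NumPy sketch of the one-step TD target that DQN regresses the $Q$-network towards (function and variable names are illustrative, not from the slides):

```python
import numpy as np

# One-step TD target for Q-learning / DQN (illustrative sketch):
# y = r + gamma * (1 - done) * max_a' Q_target(s', a')
def td_target(rewards, dones, next_q_values, gamma=0.99):
    # next_q_values: shape (batch_size, n_actions), from the target network
    return rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)

# Toy batch of 3 transitions with 2 discrete actions
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([0.0, 0.0, 1.0])  # last transition is terminal
next_q_values = np.array([[0.5, 1.5], [2.0, 0.1], [3.0, 3.0]])
print(td_target(rewards, dones, next_q_values))  # [2.485, 1.98, -1.0]
```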

DQN Components

The training loop

DQN
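Below is a toy, self-contained sketch of that training loop (replay buffer, epsilon-greedy exploration, TD update, target-network sync). The "environment" is just random noise and the $Q$-network is linear; only the structure of the loop is meant to match DQN:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_actions, gamma, epsilon, lr = 4, 2, 0.99, 0.1, 1e-3

q_weights = rng.normal(scale=0.1, size=(n_obs, n_actions))  # online Q-network
target_weights = q_weights.copy()                           # target network
replay_buffer = []                                          # (obs, action, reward, next_obs, done)

obs = rng.normal(size=n_obs)
for step in range(1, 1001):
    # 1. Epsilon-greedy action selection
    action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(obs @ q_weights))
    # 2. Step the (toy) environment and store the transition in the replay buffer
    next_obs, reward, done = rng.normal(size=n_obs), float(rng.normal()), False
    replay_buffer.append((obs, action, reward, next_obs, done))
    obs = next_obs
    # 3. Sample a mini-batch and take a gradient step on the TD error
    if len(replay_buffer) >= 32:
        for idx in rng.integers(len(replay_buffer), size=32):
            o, a, r, o2, d = replay_buffer[idx]
            td_target = r + gamma * (1.0 - d) * np.max(o2 @ target_weights)
            td_error = td_target - (o @ q_weights)[a]
            q_weights[:, a] += lr * td_error * o  # minimizes 0.5 * td_error**2
    # 4. Periodically copy the online weights to the target network
    if step % 100 == 0:
        target_weights = q_weights.copy()
```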

Extending DQN to Continuous Actions (DDPG)

Discrete actions: \[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]

Continuous actions: learn an actor $\pi_{\phi}$ that maximizes the $Q$-function.

\[\begin{aligned} \max_{a \in A} Q_\theta(s, a) \approx Q_\theta(s, \pi_{\phi}(s)). \end{aligned} \]

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." (2015).
Korkmaz, Yigit, et al. "Actor-Free Continuous Control via Structurally Maximizable Q-Functions." (2025).
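A minimal PyTorch sketch of the resulting DDPG-style actor update (network sizes and names are illustrative): the actor is trained by gradient ascent on $Q_\theta(s, \pi_{\phi}(s))$.

```python
import torch
import torch.nn as nn

obs_dim, action_dim = 8, 2
# Illustrative networks (the real ones are deeper MLPs)
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

obs = torch.randn(256, obs_dim)  # batch of observations from the replay buffer
# Deterministic policy gradient: ascend Q_theta(s, pi_phi(s)) w.r.t. the actor parameters phi
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()
```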

Deep Deterministic Policy Gradient (DDPG)

Overestimation bias

TD3: take the minimum of two critics $Q^1_\theta$ and $Q^2_\theta$ when computing the target

Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing function approximation error in actor-critic methods. (TD3)" (2018).
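A sketch of the clipped double-$Q$ target (the noise added to the target action in TD3 is omitted here):

```python
import torch

# Clipped double-Q target (TD3): use the minimum of two critics to limit overestimation.
# q1_next, q2_next: Q-values of the next state-action pair from the two target critics.
def td3_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    min_q_next = torch.min(q1_next, q2_next)
    return rewards + gamma * (1.0 - dones) * min_q_next

rewards = torch.tensor([1.0, 0.0])
dones = torch.tensor([0.0, 1.0])
q1_next = torch.tensor([2.0, 5.0])
q2_next = torch.tensor([1.5, 4.0])
print(td3_target(rewards, dones, q1_next, q2_next))  # tensor([2.4850, 0.0000])
```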

Soft Actor-Critic (SAC)

SAC $\approx$ DQN + DDPG + TD3 + Maximum entropy RL

Maximum entropy RL: encourage exploration while still solving the task

\[\begin{aligned} J(\pi) = \mathop{\mathbb{E}}[\sum_{t}{\textcolor{darkblue}{\gamma^t r(s_t, a_t)} + \textcolor{darkgreen}{\alpha\mathcal{H}(\pi({\,\cdot\,}|s_t))}}]. \end{aligned} \]

Ex: prevents the variance of the Gaussian policy from collapsing too early

Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." (2018).
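A sketch of the corresponding actor objective: minimize $\alpha \log\pi(a|s) - \min(Q^1, Q^2)(s, a)$ for reparameterized samples $a \sim \pi(\cdot|s)$. In practice the entropy coefficient $\alpha$ is tuned automatically; the numbers below are placeholders.

```python
import torch

# SAC actor loss (sketch): minimize  alpha * log_prob - min(Q1, Q2),
# i.e. maximize the Q-value while keeping the policy entropy high.
def sac_actor_loss(log_prob, q1_pi, q2_pi, ent_coef=0.2):
    # log_prob: log pi(a|s) of actions sampled (with reparameterization) from the current policy
    # q1_pi, q2_pi: critic values for those sampled actions
    return (ent_coef * log_prob - torch.min(q1_pi, q2_pi)).mean()

log_prob = torch.tensor([-1.2, -0.3])
q1_pi, q2_pi = torch.tensor([2.0, 1.0]), torch.tensor([1.8, 1.1])
print(sac_actor_loss(log_prob, q1_pi, q2_pi))
```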

Questions?

Annotated DQN Algorithm

Outline

  1. RL 103 (from DQN to SAC)
  2. Improving Sample-Efficiency
  3. Faster Training

Beyond SAC: TQC, DroQ, SimBa, ...

Stochastic Environments

Same state $s_t$, same action $a_t$, different outcome $r(s_t, a_t)$

Distributional RL

TQC $\approx$ SAC + quantile regression (truncated)

Kuznetsov, Arsenii, et al. "Controlling overestimation bias with truncated mixture of continuous distributional quantile critics." (2020).
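A simplified sketch of the truncation step: pool the quantile estimates of all critics for the next state, sort them, and drop the largest ones before building the target (in TQC the number of dropped quantiles is specified per critic; here it is a single total for brevity):

```python
import numpy as np

# TQC-style target (sketch): pool quantiles from all critics, sort them,
# and drop the largest ones to control overestimation.
def truncated_target(rewards, dones, next_quantiles, top_quantiles_to_drop=2, gamma=0.99):
    # next_quantiles: shape (batch, n_critics * n_quantiles) for the next state/action
    sorted_q = np.sort(next_quantiles, axis=1)
    truncated = sorted_q[:, : sorted_q.shape[1] - top_quantiles_to_drop]
    return rewards[:, None] + gamma * (1.0 - dones[:, None]) * truncated

rewards, dones = np.array([1.0]), np.array([0.0])
next_quantiles = np.array([[0.1, 0.5, 0.9, 1.3, 5.0, 9.0]])  # the two largest atoms get dropped
print(truncated_target(rewards, dones, next_quantiles))
```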

Higher replay ratio (REDQ, DroQ)

Idea: re-use each sample from the replay buffer more often (more gradient updates per environment step)

Issue: naively increasing the replay ratio doesn't work (overestimation, extrapolation errors, loss of plasticity, ...)

Solution(s): explicit (REDQ) or implicit (DroQ) ensembles, regularization, ...

Chen, Xinyue, et al. "Randomized ensembled double q-learning: Learning fast without a model." (2021).
Hiraoka, Takuya, et al. "Dropout q-functions for doubly efficient reinforcement learning." (2021).
D'Oro, Pierluca, et al. "Sample-efficient reinforcement learning by breaking the replay ratio barrier." (2022).
Hussing, Marcel, et al. "Dissecting deep rl with high update ratios: Combatting value overestimation and divergence." (2024).
Voelcker, Claas A., et al. "MAD-TD: Model-augmented data stabilizes high update ratio rl." (2025)
Lee, Hojoon, et al. "Hyperspherical normalization for scalable deep reinforcement learning. (SimBaV2)" (2025)
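For concreteness, the replay ratio maps directly onto the standard SB3/SBX hyperparameters `train_freq` and `gradient_steps` (sketch; as noted above, RR=20 without extra regularization is likely to diverge):

```python
from stable_baselines3 import SAC

# Replay ratio (RR) = gradient_steps / train_freq
# = 20 gradient updates per environment step in this configuration.
model = SAC("MlpPolicy", "Pendulum-v1", train_freq=1, gradient_steps=20, verbose=1)
model.learn(total_timesteps=5_000)
```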

$Q$-value Network and Replay Ratio

SAC (RR=1)

Note: policy delay = replay ratio (RR) for both SAC and DroQ

DroQ (RR=20)

Hiraoka, Takuya, et al. "Dropout q-functions for doubly efficient reinforcement learning." (2021).

RL from scratch in 10 minutes (DroQ)

Using SB3 + Jax = SBX: https://github.com/araffin/sbx
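A minimal training script with SBX in the DroQ configuration above (RR=20, dropout + layer norm on the critics). Keyword names such as `dropout_rate`, `layer_norm` and `policy_delay` follow the SBX README at the time of writing and may differ between versions:

```python
from sbx import SAC

# DroQ = SAC + dropout/layer-norm critics + high replay ratio (sketch)
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    gradient_steps=20,  # replay ratio = 20 gradient steps per env step
    policy_delay=20,    # delayed policy updates (policy delay = replay ratio)
    policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),  # implicit ensemble via dropout
    verbose=1,
)
model.learn(total_timesteps=20_000)
```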

Bigger net (BRO, SimBa, ...)

[Figure: SAC vs. SimBa network architectures]

Nauman, Michal, et al. "Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control." (2024)
Lee, Hojoon, et al. "Simba: Simplicity bias for scaling up parameters in deep reinforcement learning." (2024).
Lee, Hojoon, et al. "Hyperspherical normalization for scalable deep reinforcement learning. (SimBaV2)" (2025)
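A sketch of the main SimBa ingredient, a pre-layer-norm residual block with an inverted-bottleneck MLP (widths, depth and the paper's running observation normalization are simplified away here):

```python
import torch
import torch.nn as nn

# SimBa-style residual block (sketch): LayerNorm -> inverted-bottleneck MLP -> skip connection.
class ResidualBlock(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, expansion * dim),
            nn.ReLU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))

encoder = nn.Sequential(
    nn.Linear(8, 256),   # observation embedding (obs_dim=8 as a toy example)
    ResidualBlock(256),
    ResidualBlock(256),
    nn.LayerNorm(256),   # post-norm before the policy/value heads
)
print(encoder(torch.randn(32, 8)).shape)  # torch.Size([32, 256])
```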

Additional readings

  • CrossQ (BN, removes the target network)
  • XQC (BN, weight norm, C51 critic)
  • TD7, MR.Q (representation learning)

Fujimoto, Scott, et al. "For sale: State-action representation learning for deep reinforcement learning. (TD7)" (2023)
Bhatt, Aditya, et al. "Crossq: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity." (2024).
Fujimoto, Scott, et al. "Towards general-purpose model-free reinforcement learning (MR.Q)" (2025).
Palenicek, Daniel, et al. "XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning." (2026).

Outline

  1. RL 103 (from DQN to SAC)
  2. Improving Sample-Efficiency
  3. Faster Training

JIT compilation

Stable-Baselines3 (PyTorch) vs SBX (Jax)

PyTorch compile: LeanRL (5x boost)
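A toy example of what JIT compilation buys: the whole gradient update is traced and compiled once, then re-used. The linear model below is just a stand-in for an actor/critic update:

```python
import jax
import jax.numpy as jnp

# Compile the full update step once with jax.jit; subsequent calls run the compiled version.
@jax.jit
def mse_update(params, obs, targets, lr=1e-3):
    def loss_fn(p):
        pred = obs @ p["w"] + p["b"]
        return jnp.mean((pred - targets) ** 2)
    grads = jax.grad(loss_fn)(params)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros(1)}
obs, targets = jnp.ones((256, 4)), jnp.ones((256, 1))
params = mse_update(params, obs, targets)  # first call compiles, later calls are fast
```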

Massive Parallel Sim

Thousands of robots in parallel, learn in minutes

Ex: MJX (MuJoCo), Isaac Sim, Genesis, ...

Optimizing for speed

Getting SAC to Work on a Massive Parallel Simulator (2025).
https://araffin.github.io/post/sac-massive-sim/

FastTD3

  • Large mini-batch size (similar to PPO)
  • Bigger network
  • Distributional critic
  • and more ... (ex: n-step return, see the sketch below)

Li, Zechu, et al. "Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning." (2023)
Seo, Younggyo, et al. "FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control " (2025)
Voelcker, Claas, et al. "Relative Entropy Pathwise Policy Optimization. (REPPO)" (2026)
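As referenced in the list above, a sketch of the n-step return target used by FastTD3-style recipes (episode termination within the n steps is ignored here for brevity):

```python
# n-step return target (sketch): bootstrap after n steps instead of 1.
# G_t = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * Q(s_{t+n}, a_{t+n})
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    # rewards: the n rewards following s_t; bootstrap_value: target-critic value at s_{t+n}
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

print(n_step_target([1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.9))  # 3.268
```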

Conclusion

  • More sample-efficient algorithms (TQC, DroQ, ...)
  • Faster software (Jax, Torch compile)
  • Faster simulators (MJX, Isaac Sim, ...)
  • Optimizing for speed (SAC, FastTD3, ...)
  • Current trend: more complex algorithms (combining improvements)

Questions?

Learning from human feedback

Backup slides

TQC Results

Adapting quickly: Retrained from Space

DroQ Results

SimBa Results

Note: can be combined with TQC/DroQ

PPO recipe

  • Large mini-batch size (6400 - 25600 transitions)
  • Bigger network
  • KL adaptive learning rate schedule (see the sketch below)
  • Unbounded action space

Getting SAC to Work on a Massive Parallel Simulator (2025).
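Sketch of the KL-adaptive learning rate schedule mentioned in the recipe, in the spirit of rl_games / RSL-RL style PPO implementations (thresholds and factors are illustrative):

```python
# KL-adaptive learning rate (sketch): shrink the LR when the policy moves too fast,
# grow it when the policy barely moves.
def adapt_learning_rate(lr, kl_divergence, target_kl=0.01, factor=1.5,
                        min_lr=1e-5, max_lr=1e-2):
    if kl_divergence > 2.0 * target_kl:
        lr = max(lr / factor, min_lr)   # policy changed too much: slow down
    elif kl_divergence < 0.5 * target_kl:
        lr = min(lr * factor, max_lr)   # policy barely changed: speed up
    return lr

lr = 3e-4
for kl in [0.005, 0.001, 0.05]:
    lr = adapt_learning_rate(lr, kl)
    print(f"kl={kl:.3f} -> lr={lr:.2e}")
```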