Sample efficiency*
Ex: real robot, slow simulation
Speed
Ex: fast simulation on GPU, slow algorithm
From Tabular Q-Learning to Deep Q-Learning (DQN) https://araffin.github.io/post/rl102/
From Deep Q-Learning (DQN) to Soft Actor-Critic (SAC) and Beyond https://araffin.github.io/post/rl103/
Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." (2013).
Maximize the sum of discounted rewards
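In standard notation (with discount factor $\gamma \in [0, 1)$, not spelled out on the slide), this objective is the expected discounted return:
\[\begin{aligned} J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right] \end{aligned} \]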
How good is it to take action $a$ in state $s$?
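The $Q$-function answers exactly this question (standard definition, using the same discounted sum as the objective above):
\[\begin{aligned} Q^\pi(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \mid s_0 = s, a_0 = a \right] \end{aligned} \]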
\[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]
Discrete actions: \[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]
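A minimal sketch of this greedy policy for discrete actions (the Q-values below are made up for illustration, e.g. the output of a DQN for one state):

```python
import numpy as np

def greedy_action(q_values: np.ndarray) -> int:
    # Discrete actions: pick the action with the highest estimated Q-value.
    return int(np.argmax(q_values))

# Hypothetical Q-value estimates for 4 discrete actions in state s.
q_s = np.array([0.1, 1.3, -0.4, 0.7])
print(greedy_action(q_s))  # 1
```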
Continuous actions: learn an actor $\pi_{\phi}$ that maximizes the $Q$-function (see the sketch below).
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." (2015).
Korkmaz, Yigit, et al. "Actor-Free Continuous Control via Structurally Maximizable Q-Functions." (2025).
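A minimal sketch of the corresponding actor update (not taken from any library; network sizes and the batch are placeholders): the actor is trained by gradient ascent on the critic's value of its own actions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_optim = torch.optim.Adam(actor.parameters(), lr=3e-4)

obs = torch.randn(32, obs_dim)  # stands in for a batch sampled from the replay buffer
# Maximize Q_theta(s, pi_phi(s))  <=>  minimize its negative mean.
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
actor_optim.zero_grad()
actor_loss.backward()
actor_optim.step()
```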
TD3: select the min of $Q^1_\theta$ and $Q^2_\theta$
Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing function approximation error in actor-critic methods (TD3)." (2018).
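A sketch of the resulting TD target (clipped double Q-learning); the batch values below are placeholders:

```python
import torch

def clipped_double_q_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    # Use the minimum of the two target critics to limit overestimation.
    return rewards + gamma * (1.0 - dones) * torch.min(q1_next, q2_next)

rewards = torch.tensor([1.0, 0.0])
dones = torch.tensor([0.0, 1.0])
q1_next = torch.tensor([10.0, 5.0])
q2_next = torch.tensor([8.0, 7.0])
print(clipped_double_q_target(rewards, dones, q1_next, q2_next))  # tensor([8.9200, 0.0000])
```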
SAC $\approx$ DQN + DDPG + TD3 + Maximum entropy RL
Maximum entropy RL: encourage exploration while still solving the task
Ex: prevent the variance of the Gaussian policy from collapsing too early
Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." (2018).
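A sketch of the entropy-regularized (soft) TD target, reusing the clipped double-Q idea from TD3; `alpha` is the entropy temperature (tuned automatically in SAC):

```python
import torch

def soft_q_target(rewards, dones, q1_next, q2_next, next_log_prob, alpha=0.2, gamma=0.99):
    # Entropy bonus -alpha * log pi keeps the policy stochastic (exploration),
    # on top of the TD3-style minimum over the two target critics.
    soft_value = torch.min(q1_next, q2_next) - alpha * next_log_prob
    return rewards + gamma * (1.0 - dones) * soft_value
```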
Same state $s_t$, same action $a_t$, different outcome $r(s_t, a_t)$
TQC $\approx$ SAC + quantile regression (truncated)
Kuznetsov, Arsenii, et al. "Controlling overestimation bias with truncated mixture of continuous distributional quantile critics." (2020).
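A rough sketch of the truncation step (shapes and numbers are made up): the quantile estimates of all critics are pooled, sorted, and the largest atoms are dropped before building the target.

```python
import numpy as np

def truncated_atoms(quantiles: np.ndarray, drop_per_critic: int = 2) -> np.ndarray:
    # quantiles: shape (n_critics, n_quantiles), one row of atoms per critic.
    n_critics, n_quantiles = quantiles.shape
    pooled = np.sort(quantiles.reshape(-1))
    keep = n_critics * (n_quantiles - drop_per_critic)
    return pooled[:keep]  # these atoms (shifted by r + gamma * ...) form the target distribution

quantiles = np.array([[0.1, 0.4, 0.9, 1.5, 3.0],
                      [0.0, 0.3, 1.0, 1.7, 2.5]])
print(truncated_atoms(quantiles))  # keeps the 6 smallest of the 10 pooled atoms
```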
Idea: re-use samples from the replay buffer more often, i.e. a higher replay ratio / update-to-data ratio (see the sketch after the references below)
Issue: Naive scaling doesn't work (overestimation, extrapolation errors, loss of plasticity, ...)
Solution(s): explicit (REDQ) / implicit (DroQ) ensembles, regularization, ...
Chen, Xinyue, et al. "Randomized ensembled double q-learning: Learning fast without a model." (2021).
Hiraoka, Takuya, et al. "Dropout q-functions for doubly efficient reinforcement learning." (2021).
D'Oro, Pierluca, et al. "Sample-efficient reinforcement learning by breaking the replay ratio barrier." (2022).
Hussing, Marcel, et al. "Dissecting deep RL with high update ratios: Combatting value overestimation and divergence." (2024).
Voelcker, Claas A., et al. "MAD-TD: Model-augmented data stabilizes high update ratio RL." (2025).
Lee, Hojoon, et al. "Hyperspherical normalization for scalable deep reinforcement learning (SimBaV2)." (2025).
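A minimal illustration of raising the replay ratio with Stable-Baselines3; this is exactly the "naive scaling" that the papers above show to be problematic on its own:

```python
from stable_baselines3 import SAC

# Replay ratio ~4: four gradient steps per environment step.
# On its own this tends to amplify overestimation and loss of plasticity,
# which is what REDQ / DroQ-style regularization is meant to fix.
model = SAC("MlpPolicy", "Pendulum-v1", train_freq=1, gradient_steps=4, verbose=0)
model.learn(total_timesteps=5_000)
```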
Note: policy delay = replay ratio (RR) for both SAC and DroQ, i.e. the actor is updated once per environment step
Hiraoka, Takuya, et al. "Dropout q-functions for doubly efficient reinforcement learning." (2021).
Using SB3 + Jax = SBX: https://github.com/araffin/sbx
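A DroQ-style configuration with SBX, loosely following the SBX README (keyword names such as `dropout_rate` and `layer_norm` may differ between versions): dropout + layer norm in the critic, replay ratio 20, and `policy_delay` equal to the replay ratio so the actor is still updated once per environment step.

```python
from sbx import SAC  # SBX: SB3-like API with Jax implementations

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    train_freq=1,
    gradient_steps=20,  # replay ratio (RR) = 20
    policy_delay=20,    # actor updated once per env step (policy delay = RR)
    policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),  # DroQ critic regularization
)
model.learn(total_timesteps=20_000)
```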
Nauman, Michal, et al. "Bigger, regularized, optimistic: Scaling for compute and sample efficient continuous control." (2024).
Lee, Hojoon, et al. "Simba: Simplicity bias for scaling up parameters in deep reinforcement learning." (2024).
Lee, Hojoon, et al. "Hyperspherical normalization for scalable deep reinforcement learning (SimBaV2)." (2025).
Fujimoto, Scott, et al. "For SALE: State-action representation learning for deep reinforcement learning (TD7)." (2023).
Bhatt, Aditya, et al. "CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity." (2024).
Fujimoto, Scott, et al. "Towards general-purpose model-free reinforcement learning (MR.Q)." (2025).
Palenicek, Daniel, et al. "XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning." (2026).
Stable-Baselines3 (PyTorch) vs SBX (Jax)
PyTorch compile: LeanRL (5x boost)
Thousands of robots in parallel, learn in minutes
Ex: MJX (MuJoCo), Isaac Sim, Genesis, ...
Li, Zechu, et al. "Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning." (2023).
Seo, Younggyo, et al. "FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control." (2025).
Voelcker, Claas, et al. "Relative Entropy Pathwise Policy Optimization (REPPO)." (2026).
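For intuition, a small CPU-scale example of batched stepping with Gymnasium's vector API (assuming Gymnasium ≥ 1.0 for `make_vec`); GPU simulators such as MJX or Isaac Sim expose a similar batched step, but with thousands of environments living on the accelerator:

```python
import gymnasium as gym

# 8 copies of the environment stepped in lock-step; observations, actions
# and rewards all gain a leading (num_envs,) batch dimension.
envs = gym.make_vec("Pendulum-v1", num_envs=8)
obs, _ = envs.reset(seed=0)
for _ in range(10):
    actions = envs.action_space.sample()  # batched actions
    obs, rewards, terminated, truncated, infos = envs.step(actions)
envs.close()
```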
Learning from human feedback



Note: can be combined with TQC/DroQ