Sample efficiency
Ex: real robot, slow simulation
Speed
Ex: fast simulation on GPU, slow algorithm
Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." (2013).
Maximize the sum of discounted rewards
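In symbols (a standard definition, not from the original slides), with discount factor $\gamma \in [0, 1)$:
\[\begin{aligned} J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right] \end{aligned} \]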
How good is it to take action $a$ in state $s$?
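The action-value function makes this precise (standard definition):
\[\begin{aligned} Q^\pi(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \,\middle|\, s_0 = s,\, a_0 = a \right] \end{aligned} \]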
Discrete actions: \[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]
Continuous actions: learn an actor $\pi_{\phi}$ that maximizes the learned $Q$-function.
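As a sketch (standard deterministic actor objective, with $\mathcal{D}$ the replay buffer; not the slides' exact notation):
\[\begin{aligned} \max_{\phi} \; \mathbb{E}_{s \sim \mathcal{D}}\left[ Q_\theta\left(s, \pi_\phi(s)\right) \right] \end{aligned} \]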
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." (2015).
TD3: select the min of $Q^1_\theta$ and $Q^2_\theta$
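In the critic target this reads (a sketch of clipped double Q-learning, target networks denoted by $\theta'$ and $\phi'$; target policy smoothing omitted):
\[\begin{aligned} y = r + \gamma \, \min_{i=1,2} Q^i_{\theta'}\left(s', \pi_{\phi'}(s')\right) \end{aligned} \]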
Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing function approximation error in actor-critic methods." (2018).
SAC $\approx$ DQN + DDPG + TD3 + Maximum entropy RL
Maximum entropy RL: encourage exploration while still solving the task
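The standard maximum entropy objective augments the reward with an entropy bonus weighted by a temperature $\alpha$:
\[\begin{aligned} J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t} \gamma^t \left( r(s_t, a_t) + \alpha \, \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right) \right] \end{aligned} \]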
Ex: prevent the variance of the Gaussian distribution from collapsing too early
Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." (2018).
Same state $s_t$, same action $a_t$, different outcome $r(s_t, a_t)$
TQC $\approx$ SAC + quantile regression (truncated)
Kuznetsov, Arsenii, et al. "Controlling overestimation bias with truncated mixture of continuous distributional quantile critics." (2020).
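A minimal usage sketch, assuming the TQC implementation from sb3-contrib (the truncation parameter name `top_quantiles_to_drop_per_net` reflects my reading of that library, not the slides):

```python
# Hedged sketch: TQC from sb3-contrib on a toy task.
import gymnasium as gym
from sb3_contrib import TQC

env = gym.make("Pendulum-v1")
# Each critic predicts a set of return quantiles; the highest target quantiles
# are dropped (truncated) to control overestimation.
model = TQC("MlpPolicy", env, top_quantiles_to_drop_per_net=2, verbose=1)
model.learn(total_timesteps=20_000)
```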
Idea: re-use samples from the replay buffer more (higher replay ratio / update-to-data ratio)
Issue: Naive scaling doesn't work (overestimation, extrapolation errors, ...)
Solution? Explicit (REDQ) / implicit (DroQ) ensembles, regularization, ...
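In practice, the replay ratio is raised by doing more gradient updates per collected step. A hedged sketch with Stable-Baselines3's SAC (hyperparameter values are illustrative, not the slides' config):

```python
# Hedged sketch: increase the number of gradient updates per environment step.
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    train_freq=1,       # collect one environment step...
    gradient_steps=20,  # ...then do 20 gradient updates (replay ratio = 20)
    verbose=1,
)
model.learn(total_timesteps=20_000)
# Pushed naively to high ratios, training tends to diverge (overestimation,
# extrapolation errors), hence the ensembles / regularization mentioned above.
```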
Chen, Xinyue, et al. "Randomized ensembled double q-learning: Learning fast without a model." (2021).
Hiraoka, Takuya, et al. "Dropout q-functions for doubly efficient reinforcement learning." (2021).
D'Oro, Pierluca, et al. "Sample-efficient reinforcement learning by breaking the replay ratio barrier." (2022).
Hussing, Marcel, et al. "Dissecting deep rl with high update ratios: Combatting value overestimation and divergence." (2024).
Note: set the policy delay equal to the replay ratio (RR) for both SAC and DroQ (one actor update per env step)
Using SB3 + JAX = SBX: https://github.com/araffin/sbx
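A usage sketch with SBX, configuring SAC in a DroQ-like way (dropout + layer norm in the critics) and setting the policy delay equal to the replay ratio as in the note above; the parameter names follow my reading of the SBX README and should be checked against the current API:

```python
# Hedged sketch, assuming SBX's SAC accepts policy_delay and DroQ-style policy_kwargs.
import gymnasium as gym
from sbx import SAC

env = gym.make("Pendulum-v1")
model = SAC(
    "MlpPolicy",
    env,
    gradient_steps=20,  # replay ratio (RR) = 20 critic updates per env step
    policy_delay=20,    # one actor update per env step (policy delay = RR)
    policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),  # DroQ-style critics
    learning_starts=1_000,
    verbose=1,
)
model.learn(total_timesteps=20_000)
```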
Lee, Hojoon, et al. "Simba: Simplicity bias for scaling up parameters in deep reinforcement learning." (2024).
Note: can be combined with TQC/DroQ (see also CrossQ, TD7, SimBaV2, ...)
Stable-Baselines3 (PyTorch) vs SBX (JAX)
PyTorch compile: LeanRL (5x boost)
Thousands of robots in parallel, learn in minutes
Ex: MJX (MuJoCo), Isaac Sim, Genesis, ...