Sample efficiency
Ex: real robot, slow simulation
Speed
Ex: fast simulation on GPU, slow algorithm
Maximize the sum of discounted rewards
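For reference, the quantity being maximized is the expected discounted return (standard definition, with discount factor $\gamma$):
\[\begin{aligned} G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad \gamma \in [0, 1) \end{aligned} \]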
How good is it to take action $a$ in state $s$?
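This is exactly what the $Q$-function measures: the expected discounted return when taking action $a$ in state $s$ and then following the policy $\pi$ (standard definition):
\[\begin{aligned} Q^\pi(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\; a_t = a \right] \end{aligned} \]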
\[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]
Continuous actions: learn a policy $\pi_{\phi}$ that maximizes the $Q$-function (the actor replaces the argmax).
Discrete actions: \[\begin{aligned} \pi(s) = \argmax_{a \in A} Q^\pi(s, a) \end{aligned} \]
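For the continuous-action case, a minimal PyTorch sketch of the deterministic actor update (toy networks and random data, only to illustrate the loss; not the SB3 implementation):

```python
import torch
import torch.nn as nn

# Toy dimensions and a random batch, just to make the sketch runnable.
obs_dim, act_dim, batch_size = 3, 1, 256
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

obs = torch.randn(batch_size, obs_dim)
# The actor pi_phi is trained by gradient ascent on Q(s, pi_phi(s)),
# i.e. by minimizing -Q(s, pi_phi(s)): it replaces the argmax over actions.
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
actor_loss.backward()
```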
TD3: select the min of $Q^1_\theta$ and $Q^2_\theta$
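A sketch of the clipped double-Q target used by TD3 (toy shapes and random data; the real update would also add clipped noise to the next actions for target policy smoothing):

```python
import torch
import torch.nn as nn

# Toy shapes and a random batch, just to make the sketch runnable.
obs_dim, act_dim, batch_size, gamma = 3, 1, 256, 0.99
q1_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q2_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

rewards = torch.randn(batch_size, 1)
dones = torch.zeros(batch_size, 1)
next_obs = torch.randn(batch_size, obs_dim)
next_actions = torch.randn(batch_size, act_dim)  # would come from the target actor (+ clipped noise)

with torch.no_grad():
    next_sa = torch.cat([next_obs, next_actions], dim=1)
    # Take the minimum of the two target critics to limit Q-value overestimation.
    next_q = torch.min(q1_target(next_sa), q2_target(next_sa))
    td_target = rewards + gamma * (1.0 - dones) * next_q
```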
SAC $\approx$ DQN + DDPG + TD3 + Maximum entropy RL
Maximum entropy RL: encourage exploration while still solving the task
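The maximum entropy objective augments the return with an entropy bonus (standard SAC formulation, with temperature $\alpha$ trading off reward and exploration):
\[\begin{aligned} J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right] \end{aligned} \]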
TQC: SAC + quantile regression (truncated)
Idea: re-use samples from the replay buffer more, i.e. increase the replay ratio (RR, gradient updates per environment step)
Issue: naively increasing the replay ratio doesn't work (overestimation, extrapolation errors, ...)
Solution: explicit (REDQ) or implicit (DroQ) ensembles, regularization, ...
Note: the policy delay is set equal to the replay ratio (RR) for both SAC and DroQ, so the actor is still updated once per environment step
Hiraoka, Takuya, et al. "Dropout Q-functions for doubly efficient reinforcement learning."
Using SB3 + Jax = SBX: https://github.com/araffin/sbx
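A minimal usage sketch, written from memory of the SBX README: the SB3-style API with DroQ-like settings (dropout + layer norm on the critics, high replay ratio with matching policy delay). Parameter names such as dropout_rate, layer_norm, gradient_steps and policy_delay may differ between SBX versions, so check the repository.

```python
from sbx import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),  # implicit ensemble (DroQ)
    gradient_steps=20,  # replay ratio: 20 gradient steps per environment step
    policy_delay=20,    # the actor is updated once per 20 critic updates
    verbose=1,
)
model.learn(total_timesteps=20_000)
```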
Lee, Hojoon, et al. "SimBa: Simplicity bias for scaling up parameters in deep reinforcement learning."
Note: can be combined with TQC/DroQ (see also CrossQ, TD7, SimBaV2, ...)
Stable-Baselines3 (PyTorch) vs SBX (Jax)
PyTorch compile: LeanRL (5x boost)
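A generic torch.compile sketch (not LeanRL's actual code): compiling the policy or the update function removes Python overhead, which is where speedups of this kind come from.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
compiled_policy = torch.compile(policy)  # requires PyTorch 2.x
out = compiled_policy(torch.randn(8, 3))  # first call triggers compilation, later calls are fast
```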
Thousands of robots in parallel, learn in minutes
Ex: MJX (MuJoCo), Isaac Sim, Genesis, ...