Learning directly on real robots
Rudin, Nikita, et al. "Learning to walk in minutes using massively parallel deep reinforcement learning." CoRL, 2021.
Credit: ANYbotics
Before (3kg)
After, with the 1kg arm
Which algorithm is better?
The only difference: the epsilon value used to avoid division by zero in the optimizer
(one uses eps=1e-7, the other eps=1e-5).
Only one line of code differs between the two runs.
https://github.com/DLR-RM/stable-baselines3
Raffin, Antonin, et al. "Stable-baselines3: Reliable reinforcement learning implementations." JMLR (2021)
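As a rough illustration, the optimizer epsilon can be set in SB3 through policy_kwargs. A minimal sketch, assuming PPO on Pendulum-v1 as a placeholder setup (not the exact experiment from the slide):

from stable_baselines3 import PPO

# Two otherwise identical runs; only the Adam epsilon differs (1e-7 vs 1e-5)
model_a = PPO("MlpPolicy", "Pendulum-v1", policy_kwargs=dict(optimizer_kwargs=dict(eps=1e-7)), seed=0)
model_b = PPO("MlpPolicy", "Pendulum-v1", policy_kwargs=dict(optimizer_kwargs=dict(eps=1e-5)), seed=0)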
Stable-Baselines3 (PyTorch) vs SBX (JAX)
Using SB3 + JAX = SBX: https://github.com/araffin/sbx
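Since SBX mirrors the SB3 API, switching backends is essentially an import change. A minimal sketch (environment and step budget are placeholders):

from sbx import SAC  # JAX implementation, same interface as stable_baselines3.SAC

model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=20_000)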
Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.
Periodic Policy
Raffin et al. "An Open-Loop Baseline for Reinforcement Learning Locomotion Tasks", RLJ 2024.
Outstanding Paper Award on Empirical Resourcefulness in RL
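The open-loop baseline boils down to one sine oscillator per joint, independent of any observation. A minimal sketch of the idea (parameter names and values are assumptions, not the paper's exact parametrization):

import numpy as np

def open_loop_action(t, amplitude, phase, offset, frequency=1.5):
    # Desired joint positions depend only on time, not on any observation
    return amplitude * np.sin(2.0 * np.pi * frequency * t + phase) + offset

# Example: 8 joints with hand-picked amplitudes and phases (hypothetical values)
amplitude = np.full(8, 0.3)
phase = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
offset = np.zeros(8)
action = open_loop_action(t=0.01, amplitude=amplitude, phase=phase, offset=offset)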
RL from scratch
0.14 m/s
Open-Loop Oscillators Hand-Tuned
0.16 m/s
Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" 2023.
Open-Loop Oscillators Hand-Tuned
0.16 m/s
Open-Loop Oscillators Hand-Tuned + RL
0.19 m/s
Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" 2023.
Open-Loop Oscillators Optimized
0.26 m/s
Open-Loop Oscillators Optimized + RL
0.34 m/s
Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" 2023.
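One way to read the "+ RL" rows: a learned policy adds corrections on top of the open-loop oscillator targets. A hedged sketch of that residual idea (not necessarily the exact scheme of the paper; all names and values are assumptions):

import numpy as np

def combined_action(t, policy, observation, amplitude, phase, offset, frequency=1.5, residual_scale=0.1):
    # Open-loop oscillator target plus a small learned correction (hypothetical scheme)
    open_loop = amplitude * np.sin(2.0 * np.pi * frequency * t + phase) + offset
    residual, _ = policy.predict(observation, deterministic=True)
    return open_loop + residual_scale * residual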
...in reality.
Duclusaud, Marc, et al. "Extended Friction Models for the Physics Simulation of Servo Actuators." (2024)
Before
After, new arm position + magnet
Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.
Raffin, Antonin, Bastian Deutschmann, and Freek Stulp. "Fault-tolerant six-DoF pose estimation for tendon-driven continuum mechanisms." Frontiers in Robotics and AI, 2021.
# Aggregate SAC results for HalfCheetah and Ant from the logs/ folder
python -m rl_zoo3.cli all_plots -a sac -e HalfCheetah Ant -f logs/ -o sac_results
# Plot the aggregated results (rliable metrics, LaTeX style, label "SAC")
python -m rl_zoo3.cli plot_from_file -i sac_results.pkl -latex -l SAC --rliable
# Train an SAC agent on Pendulum using tuned hyperparameters,
# evaluate the agent every 1k steps and save a checkpoint every 10k steps
# Pass custom hyperparams to the algo/env
python -m rl_zoo3.train --algo sac --env Pendulum-v1 --eval-freq 1000 \
--save-freq 10000 -params train_freq:2 --env-kwargs g:9.8
sac/
└── Pendulum-v1_1 # One folder per experiment
├── 0.monitor.csv # episodic return
├── best_model.zip # best model according to evaluation
├── evaluations.npz # evaluation results
├── Pendulum-v1
│ ├── args.yml # custom cli arguments
│ ├── config.yml # hyperparameters
│ └── vecnormalize.pkl # normalization
├── Pendulum-v1.zip # final model
└── rl_model_10000_steps.zip # checkpoint
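The saved artifacts can be reloaded directly with SB3. A minimal sketch, with paths following the tree above (the rest is a placeholder setup):

import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

log_dir = "sac/Pendulum-v1_1"
# Restore the observation/reward normalization statistics used during training
venv = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
venv = VecNormalize.load(f"{log_dir}/Pendulum-v1/vecnormalize.pkl", venv)
venv.training = False
venv.norm_reward = False

# Load the best checkpoint found during evaluation
model = SAC.load(f"{log_dir}/best_model.zip", env=venv)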