www.dlr.de · Antonin RAFFIN · Enabling Reinforcement Learning on Real Robots · PhD Defense · 31.01.2025

Enabling Reinforcement Learning
on Real Robots

Antonin RAFFIN (@araffin.bsky.social)
German Aerospace Center (DLR)
https://araffin.github.io/

RL 101

Motivation

Learning directly on real robots

Simulation to reality

In the papers...

Rudin, Nikita, et al. "Learning to walk in minutes using massively parallel deep reinforcement learning." CoRL, 2021.

Simulation to reality (2)

...in reality.

Duclusaud, Marc, et al. "Extended Friction Models for the Physics Simulation of Servo Actuators." (2024)

ISS Experiment (1)

Credit: ESA/NASA

ISS Experiment (2)

DLR bert

Before

DLR bert

After, with the 1kg arm

Can it turn?

Can it still turn?

Outdoor

Challenges of real robot training

  1. (Exploration-induced) wear and tear
  2. Sample efficiency
    ➜ one robot, manual resets
  3. Real-time constraints
    ➜ no pause, no acceleration, multi-day training
  4. Computational resource constraints

Outline

  1. Reliable Software Tools for RL
  2. Smooth Exploration
  3. Integrating Pose Estimation
  4. Combining Oscillators and RL

Stable-Baselines3 (SB3)

Reliable RL Implementations

https://github.com/DLR-RM/stable-baselines3

Raffin, Antonin, et al. "Stable-baselines3: Reliable reinforcement learning implementations." JMLR (2021)

Reliable Implementations?

  • Performance checked
  • Software best practices (96% code coverage, type checked, ...)
  • Active community (9000+ stars, 2700+ citations, 7M+ downloads)
  • Fully documented
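A minimal usage sketch of the SB3 API (hyperparameter values are illustrative):

import gymnasium as gym
from stable_baselines3 import SAC

# Create the environment and train a SAC agent on it
env = gym.make("Pendulum-v1")
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)

# Save the trained agent and reload it later
model.save("sac_pendulum")
model = SAC.load("sac_pendulum", env=env)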

Reproducible Reliable RL: SB3 + RL Zoo

RL Zoo: Reproducible Experiments

https://github.com/DLR-RM/rl-baselines3-zoo

  • Training, loading, plotting, hyperparameter optimization (example below)
  • Everything that is needed to reproduce the experiment is logged
  • 200+ trained models with tuned hyperparameters
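For instance, hyperparameter optimization with Optuna is available from the same entry point (flags as documented in the RL Zoo; the trial budget and sampler/pruner choices below are illustrative):

python -m rl_zoo3.train --algo sac --env Pendulum-v1 -optimize \
    --n-trials 100 --n-jobs 2 --sampler tpe --pruner median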

SBX: A Faster Version of SB3

SB3 vs SBX

Stable-Baselines3 (PyTorch) vs SBX (Jax)

Recent Advances: DroQ

More gradient steps: 4x more sample efficient!

DroQ vs SAC

RL from scratch in 10 minutes

Using SB3 + Jax = SBX: https://github.com/araffin/sbx
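As a sketch, a DroQ-style configuration with the Jax SAC implementation from SBX (assuming the sbx-rl package; the dropout/layer-norm policy kwargs and update ratios are illustrative):

from sbx import SAC

# DroQ ~= SAC with dropout + layer norm in the Q-networks
# and many more gradient steps per environment step
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),
    gradient_steps=20,  # more critic updates per env step => better sample efficiency
    policy_delay=20,    # update the actor less frequently than the critics
    learning_starts=5_000,
    verbose=1,
)
model.learn(total_timesteps=20_000)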

  1. Reliable Software Tools for RL
  2. Smooth Exploration
  3. Integrating Pose Estimation
  4. Combining Oscillators and RL

Smooth Exploration for Robotic RL

gSDE vs Independent noise

Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.

generalized State-Dependent Exploration (gSDE)

Independent Gaussian noise: \[ \epsilon_t \sim \mathcal{N}(0, \sigma) \] \[ a_t = \mu(s_t; \theta_{\mu}) + \epsilon_t \]
State-dependent exploration: \[ \theta_{\epsilon} \sim \mathcal{N}(0, \sigma_{\epsilon}) \] \[ a_t = \mu(s_t; \theta_{\mu}) + \epsilon(s_t; \theta_{\epsilon}) \]
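In SB3, gSDE is enabled with a single flag (the resampling frequency below is illustrative):

from stable_baselines3 import SAC

# Use state-dependent exploration instead of independent Gaussian noise;
# resample the noise parameters theta_eps every 64 environment steps
model = SAC("MlpPolicy", "Pendulum-v1", use_sde=True, sde_sample_freq=64, verbose=1)
model.learn(total_timesteps=20_000)
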
gSDE vs Independent noise

Trade-off between return and continuity cost

Results

  1. Reliable Software Tools for RL
  2. Smooth Exploration
  3. Integrating Pose Estimation
  4. Combining Oscillators and RL

Fault-tolerant Pose Estimation

Raffin, Antonin, Bastian Deutschmann, and Freek Stulp. "Fault-tolerant six-DoF pose estimation for tendon-driven continuum mechanisms." Frontiers in Robotics and AI, 2021.

Method

Integrating Pose Estimation with RL

Feedforward controller (FF) + RL

Neck Control Results

Pose Prediction Results

  1. Reliable Software Tools for RL
  2. Smooth Exploration
  3. Integrating Pose Estimation
  4. Combining Oscillators and RL

An Open-Loop Baseline for RL Locomotion Tasks

Raffin et al. "An Open-Loop Baseline for Reinforcement Learning Locomotion Tasks", RLJ 2024.

Periodic Policy

\[\begin{aligned} q^{\text{des}}_i(t) &= \textcolor{#006400}{a_i} \cdot \sin(\theta_i(t) + \textcolor{#5f3dc4}{\varphi_i}) + \textcolor{#6d071a}{b_i} \\ \dot{\theta}_i(t) &= \begin{cases} \textcolor{#0b7285}{\omega_\text{swing}} & \text{if } \sin(\theta_i(t) + \textcolor{#5f3dc4}{\varphi_i}) > 0\\ \textcolor{#862e9c}{\omega_\text{stance}} & \text{otherwise.} \end{cases} \end{aligned} \]
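A minimal Python sketch of this periodic policy for a single joint (function and parameter names are illustrative):

import numpy as np

def open_loop_trajectory(duration, dt, a, b, phi, omega_swing, omega_stance):
    """Desired joint positions q_des(t) with phase-dependent frequency."""
    theta = 0.0
    q_des = []
    for _ in np.arange(0.0, duration, dt):
        # Swing phase when sin(theta + phi) > 0, stance phase otherwise
        omega = omega_swing if np.sin(theta + phi) > 0 else omega_stance
        theta += omega * dt  # Euler integration of the phase
        q_des.append(a * np.sin(theta + phi) + b)
    return np.array(q_des)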

Cost of generality vs prior knowledge

Combining Open-Loop Oscillators with RL

CPG RL

Learning to Exploit Elastic Actuators

Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" 2023.

Challenges of real robot training (2)

  1. (Exploration-induced) wear and tear
    smooth exploration, feedforward controller, open-loop oscillators
  2. Sample efficiency
    prior knowledge, recent algorithms
  3. Real-time constraints
    fast implementations, reproducible experiments
  4. Computational resource constraints
    open-loop oscillators, deploy with ONNX (sketch below), fast pose estimation
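For the ONNX deployment point, a minimal export sketch following the SB3 documentation (file names and opset version are illustrative):

import torch as th
from stable_baselines3 import SAC

class OnnxableActor(th.nn.Module):
    # Wrap the SAC actor so that tracing only returns the deterministic action
    def __init__(self, actor: th.nn.Module):
        super().__init__()
        self.actor = actor

    def forward(self, observation: th.Tensor) -> th.Tensor:
        return self.actor(observation, deterministic=True)

model = SAC.load("sac_pendulum")  # hypothetical trained model
dummy_input = th.randn(1, *model.observation_space.shape)
th.onnx.export(OnnxableActor(model.policy.actor), dummy_input, "sac_actor.onnx",
               opset_version=17, input_names=["input"])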

Conclusion

  • High quality software
  • Safer exploration
  • Leverage prior knowledge
  • Future: pre-train in sim, fine-tune on real hardware?

Questions?

Backup Slides

Additional Video

2nd Mission

DLR bert

Before

DLR bert

After, new arm position + magnet

Broken leg

Elastic Neck

Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.

Pose Estimation Results

Simulation is all you need?

Parameter efficiency?

Plotting


python -m rl_zoo3.cli all_plots -a sac -e HalfCheetah Ant -f logs/ -o sac_results
python -m rl_zoo3.cli plot_from_file -i sac_results.pkl -latex -l SAC --rliable

RL Zoo: Reproducible Experiments

https://github.com/DLR-RM/rl-baselines3-zoo

  • Training, loading, plotting, hyperparameter optimization
  • W&B integration
  • 200+ trained models with tuned hyperparameters

In practice


# Train a SAC agent on Pendulum using tuned hyperparameters,
# evaluate the agent every 1k steps and save a checkpoint every 10k steps.
# Pass custom hyperparameters to the algo/env.
python -m rl_zoo3.train --algo sac --env Pendulum-v1 --eval-freq 1000 \
    --save-freq 10000 -params train_freq:2 --env-kwargs g:9.8

sac/
└── Pendulum-v1_1                    # One folder per experiment
    ├── 0.monitor.csv                # episodic return
    ├── best_model.zip               # best model according to evaluation
    ├── evaluations.npz              # evaluation results
    ├── Pendulum-v1
    │   ├── args.yml                 # custom cli arguments
    │   ├── config.yml               # hyperparameters
    │   └── vecnormalize.pkl         # normalization statistics
    ├── Pendulum-v1.zip              # final model
    └── rl_model_10000_steps.zip     # checkpoint

Learning to race in an hour

Hyperparameter Study - Learning to Race

Questions?