
Enabling Reinforcement Learning
on Real Robots

Antonin RAFFIN (@araffin.bsky.social)
German Aerospace Center (DLR)
https://araffin.github.io/

RL 101

Motivation

Learning directly on real robots

Simulation to reality

Rudin, Nikita, et al. "Learning to walk in minutes using massively parallel deep reinforcement learning." CoRL, 2021.

Credit: ANYbotics

Adapting quickly: Retrained from Space

Adapting quickly: different dynamics

DLR bert

Before (3kg)

DLR bert

After, with the 1kg arm

Turning without a hip joint

Turning without a hip joint with new dynamics

Outdoor

Retrained from space (extended)

Cavern Exploration

Challenges of real robot training

  1. (Exploration-induced) wear and tear
  2. Sample efficiency
    ➜ one robot, manual resets
  3. Real-time constraints
    ➜ no pause, no acceleration, multi-day training
  4. Computational resource constraints

Contributions

Outline

  1. Reliable Software Tools for RL
  2. Smooth Exploration for Robotic RL
  3. Combining Pose Estimation/Oscillators and RL

RL is Hard (Episode #4352)

A2C

Which algorithm is better?

The only difference: the epsilon value used to avoid division by zero in the optimizer (eps=1e-7 in one run, eps=1e-5 in the other)
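
The difference fits in one constructor argument. A minimal sketch of the comparison, assuming the epsilon in question is the RMSprop epsilon that SB3's A2C exposes as rms_prop_eps (environment and seed are illustrative):

    from stable_baselines3 import A2C

    # Two "different" algorithms that differ only in the optimizer epsilon
    run_a = A2C("MlpPolicy", "CartPole-v1", rms_prop_eps=1e-7, seed=0)
    run_b = A2C("MlpPolicy", "CartPole-v1", rms_prop_eps=1e-5, seed=0)
    # Train both with the same budget and compare the learning curves.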

RL is Hard (Episode #5623)

There is only one line of code that is different.

Stable-Baselines3 (SB3)

Reliable RL Implementations

https://github.com/DLR-RM/stable-baselines3

Raffin, Antonin, et al. "Stable-baselines3: Reliable reinforcement learning implementations." JMLR (2021)

Reliable Implementations?

  • Performance checked
  • Software best practices (96% code coverage, type checked, ...)
  • Active community (11k+ stars, 3500+ citations, 10M+ downloads)
  • Fully documented
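
For reference, a minimal SB3 training script (standard API; environment, budget, and file name are illustrative):

    from stable_baselines3 import SAC

    # Train a SAC agent on Pendulum, save it, and reload it later
    model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
    model.learn(total_timesteps=10_000)
    model.save("sac_pendulum")
    model = SAC.load("sac_pendulum")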

Reproducible Reliable RL: SB3 + RL Zoo

RL Zoo: Reproducible Experiments

https://github.com/DLR-RM/rl-baselines3-zoo

  • Training, loading, plotting, hyperparameter optimization
  • Everything that is needed to reproduce the experiment is logged
  • 200+ trained models with tuned hyperparameters
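
For instance, hyperparameter optimization is a single command; the flags below follow the RL Zoo documentation, with illustrative budget, sampler, and pruner choices:

    python -m rl_zoo3.train --algo ppo --env Pendulum-v1 -n 50000 \
        -optimize --n-trials 100 --n-jobs 2 --sampler tpe --pruner median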

SBX: A Faster Version of SB3

SB3 vs SBX

Stable-Baselines3 (PyTorch) vs SBX (JAX)

More gradient steps to improve sample efficiency

DroQ vs SAC

RL from scratch in 10 minutes

Using SB3 + JAX = SBX: https://github.com/araffin/sbx
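
SBX keeps the SB3 API, so switching mostly means changing the import. The DroQ-style configuration below (a high update-to-data ratio plus dropout and layer normalization in the critic) follows the SBX README; the exact values are illustrative:

    from sbx import SAC

    # SAC with DroQ-style settings: 20 gradient steps per environment step,
    # dropout and layer normalization in the critic networks
    model = SAC(
        "MlpPolicy",
        "Pendulum-v1",
        gradient_steps=20,
        policy_delay=20,
        policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),
    )
    model.learn(total_timesteps=10_000)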

  1. Reliable Software Tools for RL
  2. Smooth Exploration for Robotic RL
  3. Combining Pose Estimation/Oscillators and RL

generalized State-Dependent Exploration

gSDE vs Independent noise
Independent Gaussian noise: \[ \epsilon_t \sim \mathcal{N}(0, \sigma) \] \[ a_t = \mu(s_t; \theta_{\mu}) + \epsilon_t \]
gSDE: \[ \theta_{\epsilon} \sim \mathcal{N}(0, \sigma_{\epsilon}) \] \[ a_t = \mu(s_t; \theta_{\mu}) + \epsilon(z_t; \theta_{\epsilon}) \]

Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.
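
In SB3, gSDE is enabled through two constructor arguments; a minimal sketch with illustrative values:

    from stable_baselines3 import SAC

    model = SAC(
        "MlpPolicy",
        "Pendulum-v1",
        use_sde=True,       # state-dependent exploration instead of independent Gaussian noise
        sde_sample_freq=4,  # resample the noise parameters theta_eps every 4 steps
    )
    model.learn(total_timesteps=5_000)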

Trade-off between
return and continuity cost

Results

  1. Reliable Software Tools for RL
  2. Smooth Exploration for Robotic RL
  3. Combining Pose Estimation/Oscillators and RL

An Open-Loop Baseline for RL Locomotion Tasks

Periodic Policy

\[\begin{aligned} q^{\text{des}}_i(t) &= \textcolor{#006400}{a_i} \cdot \sin(\theta_i(t) + \textcolor{#5f3dc4}{\varphi_i}) + \textcolor{#6d071a}{b_i} \\ \dot{\theta_i}(t) &= \begin{cases} \textcolor{#0b7285}{\omega_\text{swing}} &\text{if $\sin(\theta_i(t) + \textcolor{#5f3dc4}{\varphi_i}) > 0$}\\ \textcolor{#862e9c}{\omega_\text{stance}} &\text{otherwise.} \end{cases} \end{aligned} \]

Raffin et al. "An Open-Loop Baseline for Reinforcement Learning Locomotion Tasks", RLJ 2024.
Outstanding Paper Award on Empirical Resourcefulness in RL
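
A sketch of this periodic policy for a single joint; the Euler integration of the phase and all parameter values are illustrative, not the ones from the paper:

    import numpy as np

    def periodic_policy(n_steps, dt, amplitude, offset, phase, omega_swing, omega_stance):
        """Open-loop oscillator: integrate the phase and output desired joint positions."""
        theta = 0.0
        q_des = np.zeros(n_steps)
        for k in range(n_steps):
            s = np.sin(theta + phase)
            q_des[k] = amplitude * s + offset
            # swing phase while sin(theta + phase) > 0, stance phase otherwise
            theta += (omega_swing if s > 0 else omega_stance) * dt
        return q_des

    # Example: one joint sampled at 50 Hz for two seconds (arbitrary values)
    q = periodic_policy(n_steps=100, dt=0.02, amplitude=0.3, offset=0.0,
                        phase=0.0, omega_swing=5.0, omega_stance=2.5)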

Cost of generality vs prior knowledge

Combining Open-Loop Oscillators and RL

CPG RL

Learning to Exploit Elastic Actuators

RL from scratch
0.14 m/s

Open-Loop Oscillators Hand-Tuned
0.16 m/s

Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" 2023.

Learning to Exploit Elastic Actuators (2)

Open-Loop Oscillators Hand-Tuned
0.16 m/s

Open-Loop Oscillators Hand-Tuned + RL
0.19 m/s

Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" 2023.

Learning to Exploit Elastic Actuators (3)

Open-Loop Oscillators Optimized
0.26 m/s

Open-Loop Oscillators Optimized + RL
0.34 m/s

Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" 2023.

Feedforward controller
from pose estimation

Integrating Pose Estimation and RL

FF + RL

Neck Control Results

Challenges of real robot training (2)

  1. (Exploration-induced) wear and tear
    smooth exploration, feedforward controller, open-loop oscillators
  2. Sample efficiency
    prior knowledge, recent algorithms
  3. Real-time constraints
    fast implementations, reproducible experiments
  4. Computational resource constraints
    open-loop oscillators, deploy with ONNX, fast pose estimation
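
As an illustration of the ONNX deployment path, a sketch of exporting a trained SB3 SAC actor with torch.onnx; the wrapper class, checkpoint name, and opset are assumptions, not the exact pipeline used on the robot:

    import torch as th
    from stable_baselines3 import SAC

    class OnnxableActor(th.nn.Module):
        """Wrap the actor so only the deterministic action path is exported."""
        def __init__(self, actor: th.nn.Module):
            super().__init__()
            self.actor = actor

        def forward(self, observation: th.Tensor) -> th.Tensor:
            return self.actor(observation, deterministic=True)

    model = SAC.load("sac_pendulum")  # hypothetical checkpoint
    dummy_obs = th.zeros(1, *model.observation_space.shape)
    th.onnx.export(OnnxableActor(model.policy.actor), dummy_obs,
                   "sac_actor.onnx", opset_version=17)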

Conclusion

  • High quality RL software
  • Safer exploration
  • Leverage prior knowledge
  • Future: pre-train in sim, fine-tune on real hardware?

Questions?

Backup Slides

Simulation to reality (2)

...in reality.

Duclusaud, Marc, et al. "Extended Friction Models for the Physics Simulation of Servo Actuators." (2024)

Additional Video

2nd Mission

DLR bert

Before

DLR bert

After, new arm position + magnet

Broken leg

Elastic Neck

Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.

Fault-tolerant Pose Estimation

Raffin, Antonin, Bastian Deutschmann, and Freek Stulp. "Fault-tolerant six-DoF pose estimation for tendon-driven continuum mechanisms." Frontiers in Robotics and AI, 2021.

Method

Pose Prediction Results

Pose Estimation Results

Simulation is all you need?

Parameter efficiency?

Plotting


    python -m rl_zoo3.cli all_plots -a sac -e HalfCheetah Ant -f logs/ -o sac_results
    python -m rl_zoo3.cli plot_from_file -i sac_results.pkl -latex -l SAC --rliable

RL Zoo: Reproducible Experiments

https://github.com/DLR-RM/rl-baselines3-zoo

  • Training, loading, plotting, hyperparameter optimization
  • W&B integration
  • 200+ trained models with tuned hyperparameters

In practice


    # Train a SAC agent on Pendulum using tuned hyperparameters,
    # evaluate the agent every 1k steps and save a checkpoint every 10k steps.
    # Pass custom hyperparameters to the algo/env.
    python -m rl_zoo3.train --algo sac --env Pendulum-v1 --eval-freq 1000 \
        --save-freq 10000 -params train_freq:2 --env-kwargs g:9.8

    sac/
    └── Pendulum-v1_1                 # one folder per experiment
        ├── 0.monitor.csv             # episodic return
        ├── best_model.zip            # best model according to evaluation
        ├── evaluations.npz           # evaluation results
        ├── Pendulum-v1/
        │   ├── args.yml              # custom CLI arguments
        │   ├── config.yml            # hyperparameters
        │   └── vecnormalize.pkl      # normalization statistics
        ├── Pendulum-v1.zip           # final model
        └── rl_model_10000_steps.zip  # checkpoint

Learning to race in an hour

Hyperparameter Study - Learning to Race

Questions?