www.dlr.de · Antonin RAFFIN · Ingredients for Learning Locomotion Directly on Real Hardware · Humanoids Workshop 2024 · 22.11.2024

Ingredients for Learning Locomotion
Directly on Real Hardware

Antonin RAFFIN (@araffin.bsky.social)
German Aerospace Center (DLR)
https://araffin.github.io/

Boring problems are important

Start simple!

Gall's Law

A complex system that works is invariably found to have evolved from a simple system that worked.

Motivation

Learning directly on real robots

Simulation to reality

Rudin, Nikita, et al. "Learning to walk in minutes using massively parallel deep reinforcement learning." CoRL, 2021.

ISS Experiment (1)

Credit: ESA/NASA

ISS Experiment (2)

DLR bert

Before

DLR bert

After, with the 1kg arm

Can it turn?

Can it still turn?

Additional Video

2nd Mission

DLR bert

Before

DLR bert

After, new arm position + magnet

Challenges of real robot training

  1. (Exploration-induced) wear and tear
  2. Sample efficiency
    ➜ one robot, manual resets
  3. Real-time constraints
    ➜ no pause, no acceleration, multi-day training
  4. Computational resource constraints

Outline

  1. Careful Task Design
  2. Using Prior Knowledge
  3. Safety Layers
  4. Robust Hardware
  5. RL Software

RL 101

RL in Practice: Tips and Tricks - Video

Task design

  • Observation space
  • Action space
  • Reward function
  • Termination conditions
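
A minimal sketch of how these four ingredients map onto the Gymnasium API, using a hypothetical robot environment (spaces, thresholds and reward terms are purely illustrative):

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class CustomRobotEnv(gym.Env):
        """Illustrative environment: the four task-design ingredients in one place."""

        def __init__(self):
            # Observation space: e.g. joint positions/velocities (6 dims, illustrative)
            self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32)
            # Action space: e.g. desired joint positions, normalized to [-1, 1]
            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            obs = np.zeros(6, dtype=np.float32)  # read sensors on the real robot
            return obs, {}

        def step(self, action):
            obs = np.zeros(6, dtype=np.float32)  # read sensors on the real robot
            # Reward function: e.g. forward velocity minus a small control cost
            reward = float(obs[0]) - 0.01 * float(np.square(action).sum())
            # Termination condition: e.g. the robot fell (illustrative threshold)
            terminated = bool(obs[2] < -0.5)
            # Truncation (timeout) is usually added by the TimeLimit wrapper
            truncated = False
            return obs, reward, terminated, truncated, {}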

Truncations for infinite horizon tasks

truncation vs termination

Example

\[\begin{aligned} \forall t, \quad r_t = 1, \quad \gamma = 0.98 \end{aligned} \]

Timeout: max_episode_steps=4

  • Without truncation handling:
    $V_\pi(s_0) = \sum_{t=0}^{\textcolor{#a61e4d}{3}} \gamma^t r_t = 1 + 0.98 + 0.98^2 + 0.98^3 \approx 3.9$
  • With truncation handling:
    $V_\pi(s_0) = \sum_{t=0}^{\textcolor{green}{\infty}} \gamma^t r_t = \sum_{t=0}^{\textcolor{green}{\infty}} \gamma^t = \frac{1}{1 - \gamma} \approx 50$
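
In Gymnasium, the timeout is reported through a separate truncated flag (set by the TimeLimit wrapper) while terminated stays False, so the agent can keep bootstrapping. A minimal sketch, with Pendulum only as a placeholder environment:

    import gymnasium as gym

    # TimeLimit sets truncated=True after max_episode_steps (here 4, as in the example above),
    # while terminated stays False: the episode is cut, not finished.
    env = gym.make("Pendulum-v1", max_episode_steps=4)

    obs, _ = env.reset()
    for t in range(8):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if truncated and not terminated:
            # Timeout: the state is NOT terminal, so the value target should still
            # bootstrap from it (SB3 replay buffers do this by default,
            # handle_timeout_termination=True).
            obs, _ = env.reset()
        elif terminated:
            obs, _ = env.reset()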

Recall: DQN Update

  1. DQN loss:
    \[\begin{aligned} \mathcal{L} = \mathop{\mathbb{E}}[(\textcolor{#a61e4d}{y_t} - \textcolor{#1864ab}{Q_\theta(s_t, a_t)} )^2] \end{aligned} \]
  2. Regression $ \textcolor{#1864ab}{f_\theta(x)} = \textcolor{#a61e4d}{y}$ with input $\textcolor{#1864ab}{x}$ and target $\textcolor{#a61e4d}{y}$:
    • input: $\textcolor{#1864ab}{x = (s_t, a_t)}$
    • if $s_{t+1}$ is non-terminal:    $y = r_t + \gamma \cdot \max_{a' \in A}(Q_\theta(s_{t+1}, a'))$
    • if $s_{t+1}$ is terminal:             $\textcolor{#a61e4d}{y = r_t}$
    • if the episode is truncated at $s_{t+1}$ (timeout):    $y = r_t + \gamma \cdot \max_{a' \in A}(Q_\theta(s_{t+1}, a'))$
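
The same distinction as a sketch in code, assuming a q_net that maps a batch of observations to Q-values over all actions and a terminated flag that is 0 for truncated (timeout) transitions:

    import torch as th

    def dqn_targets(rewards, next_obs, terminated, gamma, q_net):
        """TD targets: bootstrap unless s_{t+1} is a true terminal state.

        terminated: 1.0 only for true terminal states; truncated (timeout)
        transitions keep terminated = 0.0 and are therefore bootstrapped.
        """
        with th.no_grad():
            next_q, _ = q_net(next_obs).max(dim=1)
        # y = r_t                                    if terminal
        # y = r_t + gamma * max_a' Q(s_{t+1}, a')    otherwise (including truncation)
        return rewards + gamma * (1.0 - terminated) * next_q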

In Practice

  1. Careful Task Design
  2. Using Prior Knowledge
  3. Safety Layers
  4. Robust Hardware
  5. RL Software

Prior knowledge?

  • Generality in the algorithm vs. specificity in task design
  • Reduce search space
  • Safer

Example

An Open-Loop Baseline for Reinforcement Learning Locomotion Tasks

Raffin et al. "An Open-Loop Baseline for Reinforcement Learning Locomotion Tasks", RLJ 2024.

Periodic Policy

\[\begin{aligned} q^{\text{des}}_i(t) &= \textcolor{#006400}{a_i} \cdot \sin(\theta_i(t) + \textcolor{#5f3dc4}{\varphi_i}) + \textcolor{#6d071a}{b_i} \\ \dot{\theta_i}(t) &= \begin{cases} \textcolor{#0b7285}{\omega_\text{swing}} &\text{if $\sin(\theta_i(t) + \textcolor{#5f3dc4}{\varphi_i}) > 0$}\\ \textcolor{#862e9c}{\omega_\text{stance}} &\text{otherwise.} \end{cases} \end{aligned} \]
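
A minimal sketch of this oscillator as a stateful open-loop controller (parameter values and the control period dt are placeholders, not the tuned ones from the paper):

    import numpy as np

    class PeriodicPolicy:
        """Open-loop oscillator per joint: q_des_i = a_i * sin(theta_i + phi_i) + b_i."""

        def __init__(self, amplitudes, phases, offsets, omega_swing, omega_stance, dt=0.02):
            self.a = np.asarray(amplitudes, dtype=float)
            self.phi = np.asarray(phases, dtype=float)
            self.b = np.asarray(offsets, dtype=float)
            self.omega_swing = omega_swing
            self.omega_stance = omega_stance
            self.dt = dt
            self.theta = np.zeros_like(self.a)

        def step(self):
            s = np.sin(self.theta + self.phi)
            q_des = self.a * s + self.b
            # Phase velocity switches between swing (sin > 0) and stance
            theta_dot = np.where(s > 0, self.omega_swing, self.omega_stance)
            self.theta = self.theta + theta_dot * self.dt
            return q_des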

Cost of generality vs prior knowledge

Leverage Prior Knowledge

CPG RL

Learning to Exploit Elastic Actuators

Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" In preparation, 2023.

  1. Careful Task Design
  2. Using Prior Knowledge
  3. Safety Layers
  4. Robust Hardware
  5. RL Software

How not to break a robot?

1. Hard Constraints, safety layers

Padalkar, Abhishek, et al. "Guiding Reinforcement Learning with Shared Control Templates." ICRA 2023.

Cybathlon Challenge

Quere, Gabriel, et al. "Shared control templates for assistive robotics." ICRA, 2020.

2. Safer Exploration

gSDE vs Independent noise

Smooth Exploration for RL

Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.
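
In SB3, gSDE is enabled with a single flag; resampling the exploration matrix only every few steps keeps the motor commands smooth on the real robot (hyperparameter values below are illustrative):

    from stable_baselines3 import SAC

    # use_sde=True replaces independent per-step Gaussian noise with gSDE;
    # sde_sample_freq controls how often the exploration matrix is resampled.
    model = SAC(
        "MlpPolicy",
        "Pendulum-v1",
        use_sde=True,
        sde_sample_freq=4,  # resample the noise matrix every 4 steps (illustrative)
        verbose=1,
    )
    model.learn(total_timesteps=10_000)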

  1. Careful Task Design
  2. Using Prior Knowledge
  3. Safety Layers
  4. Robust Hardware
  5. RL Software

DLR David

MIT Mini-Cheetah

Failures

  1. Careful Task Design
  2. Using Prior Knowledge
  3. Safety Layers
  4. Robust Hardware
  5. RL Software

Stable-Baselines3 (SB3)

Reliable RL Implementations

https://github.com/DLR-RM/stable-baselines3
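
A minimal train / save / predict loop with SB3 (standard API; the environment is only a placeholder):

    import gymnasium as gym
    from stable_baselines3 import SAC

    env = gym.make("Pendulum-v1")
    model = SAC("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=20_000)
    model.save("sac_pendulum")

    # Deployment loop
    obs, _ = env.reset()
    for _ in range(200):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()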

Reproducible Reliable RL: SB3 + RL Zoo

RL Zoo: Reproducible Experiments

https://github.com/DLR-RM/rl-baselines3-zoo

  • Training, loading, plotting, hyperparameter optimization
  • W&B integration
  • 200+ trained models with tuned hyperparameters

Which algorithm to choose?

Algo flow

Recent Advances: Jax and JIT

Up to 20x faster!

SB3 vs SBX

Stable-Baselines3 (PyTorch) vs SBX (Jax)

Recent Off-policy RL Algorithms

  • TQC: distributional critic
  • DroQ: ensembling with dropout, higher replay ratio
  • CrossQ: batch normalization, no target network
  • Simba, BRO: bigger residual networks

Recent Advances: DroQ

More gradient steps: 4x more sample efficient!

DroQ vs SAC

Also have a look at TQC, TD7 and CrossQ.
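
A DroQ-style setup in SBX is essentially SAC with dropout and layer normalization in the critic plus a higher replay ratio; a sketch below (keyword names follow the SBX README and may differ between versions, values are illustrative):

    import gymnasium as gym
    from sbx import SAC

    env = gym.make("Pendulum-v1")
    # DroQ-style: critic regularized with dropout + layer norm,
    # many gradient steps per environment step (higher replay ratio).
    model = SAC(
        "MlpPolicy",
        env,
        policy_kwargs=dict(dropout_rate=0.01, layer_norm=True),
        gradient_steps=20,  # replay ratio of 20 (illustrative)
        learning_starts=1_000,
        verbose=1,
    )
    model.learn(total_timesteps=10_000)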

RL from scratch in 10 minutes

Using SB3 + Jax = SBX: https://github.com/araffin/sbx

Challenges of real robot training (2)

  1. (Exploration-induced) wear and tear
    smooth exploration
  2. Sample efficiency
    prior knowledge, recent algorithms
  3. Real-time constraints
    fast implementations, reproducible experiments
  4. Computational resource constraints
    simple controller, deploy with ONNX
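
For the last point, a sketch of exporting a trained SB3 actor to ONNX for lightweight on-board inference, loosely following the SB3 export docs (class and file names are illustrative):

    import torch as th
    from stable_baselines3 import SAC

    class OnnxActor(th.nn.Module):
        """Wrap the SAC actor so it only outputs deterministic actions."""

        def __init__(self, actor):
            super().__init__()
            self.actor = actor

        def forward(self, observation):
            return self.actor(observation, deterministic=True)

    # In practice: load a trained model instead of creating a fresh one
    model = SAC("MlpPolicy", "Pendulum-v1")
    onnx_actor = OnnxActor(model.policy.actor)

    dummy_obs = th.zeros(1, *model.observation_space.shape)
    th.onnx.export(
        onnx_actor,
        dummy_obs,
        "sac_actor.onnx",
        input_names=["observation"],
        output_names=["action"],
    )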

Conclusion

  • Ingredients for task design
  • Leverage prior knowledge
  • Safety layers
  • Robust hardware
  • Reliable and fast software
  • Start simple

Questions?

Backup Slides

1. Task Design (action space)

Ex: Controlling tendon forces instead of motor positions

Elastic Neck

Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.

Who am I?

Stable-Baselines

ENSTAR

bert

David (aka HASy)

German Aerospace Center (DLR)

Simulation is all you need?

Plotting


    python -m rl_zoo3.cli all_plots -a sac -e HalfCheetah Ant -f logs/ -o sac_results
    python -m rl_zoo3.cli plot_from_file -i sac_results.pkl -latex -l SAC --rliable

Best Practices for Empirical RL

It doesn't work!

  • Start simple/simplify, iterate quickly
  • Did you follow the best practices?
  • Use trusted implementations
  • Increase budget
  • Hyperparameter tuning (Optuna), see sketch below
  • Minimal implementation
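
A minimal Optuna sketch for the hyperparameter tuning step (the RL Zoo automates this end to end; the search space and budgets here are illustrative):

    import gymnasium as gym
    import optuna
    from stable_baselines3 import SAC
    from stable_baselines3.common.evaluation import evaluate_policy

    def objective(trial: optuna.Trial) -> float:
        # Sample a candidate learning rate (search space is illustrative)
        learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
        env = gym.make("Pendulum-v1")
        model = SAC("MlpPolicy", env, learning_rate=learning_rate, verbose=0)
        model.learn(total_timesteps=10_000)
        mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
        return mean_reward

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    print(study.best_params)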

RL Zoo: Reproducible Experiments

https://github.com/DLR-RM/rl-baselines3-zoo

  • Training, loading, plotting, hyperparameter optimization
  • W&B integration
  • 200+ trained models with tuned hyperparameters

In practice


    # Train an SAC agent on Pendulum using tuned hyperparameters,
    # evaluate the agent every 1k steps and save a checkpoint every 10k steps.
    # Pass custom hyperparams to the algo/env.
    python -m rl_zoo3.train --algo sac --env Pendulum-v1 --eval-freq 1000 \
        --save-freq 10000 -params train_freq:2 --env-kwargs g:9.8

    sac/
    └── Pendulum-v1_1                 # One folder per experiment
        ├── 0.monitor.csv             # episodic return
        ├── best_model.zip            # best model according to evaluation
        ├── evaluations.npz           # evaluation results
        ├── Pendulum-v1
        │   ├── args.yml              # custom cli arguments
        │   ├── config.yml            # hyperparameters
        │   └── vecnormalize.pkl      # normalization
        ├── Pendulum-v1.zip           # final model
        └── rl_model_10000_steps.zip  # checkpoint

Learning to race in an hour

Hyperparameters Study - Learning To Race

Questions?