import gymnasium as gym
import numpy as np
from gymnasium.envs.mujoco.mujoco_env import MujocoEnv
# --- Environment setup --------------------------------------------------
# Swimmer-v4: a 3-link planar swimmer with 2 actuated rotary joints.
env = gym.make("Swimmer-v4", render_mode="human")
# Wrap so that info["episode"] carries return/length stats when an episode ends.
env = gym.wrappers.RecordEpisodeStatistics(env)
mujoco_env = env.unwrapped
n_joints = 2
# We reach into MujocoEnv internals (.data.qpos/.qvel, .dt) below, so make
# sure unwrapping actually yielded a MuJoCo environment.
assert isinstance(mujoco_env, MujocoEnv)

# PD controller gains (proportional, derivative).
kp, kd = 10, 0.5

# Reset the environment; `t` is the simulated time driving the oscillators.
t, _ = 0.0, env.reset(seed=0)

# Oscillator parameters: one sinusoid per joint, shared frequency (0.62 Hz),
# distinct phase offsets to produce a traveling-wave swimming gait.
omega = 2 * np.pi * 0.62 * np.ones(n_joints)
phase = 2 * np.pi * np.array([0.00, 0.95])

# --- Control loop -------------------------------------------------------
while True:
    env.render()
    # Open-loop joint-angle reference from the oscillators.
    desired_qpos = np.sin(omega * t + phase)
    # PD control: position error -> torque; the desired joint velocity is zero,
    # so the derivative term damps the current qvel directly.
    desired_torques = (
        kp * (desired_qpos - mujoco_env.data.qpos[-n_joints:])
        - kd * mujoco_env.data.qvel[-n_joints:]
    )
    # Clip to the Swimmer action-space bounds [-1, 1].
    desired_torques = np.clip(desired_torques, -1.0, 1.0)
    _, reward, terminated, truncated, info = env.step(desired_torques)
    t += mujoco_env.dt  # advance the oscillator clock by one control step
    if terminated or truncated:
        # RecordEpisodeStatistics populates info["episode"] at episode end.
        print(f"Episode return: {float(info['episode']['r']):.2f}")
        t, _ = 0.0, env.reset()