Stable-Baselines
bert
David (aka HASy)
German Aerospace Center (DLR)
Which algorithm is better?
The only difference: the value of $\epsilon$ used to avoid division by zero in the optimizer
(one uses $\epsilon$ = 1e-7, the other $\epsilon$ = 1e-5)
Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." NeurIPS (2021)
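To make such a setting explicit rather than relying on framework defaults, a minimal sketch with Stable-Baselines3 (the exact value here is illustrative):

import torch as th
from stable_baselines3 import PPO

# Pass the Adam epsilon explicitly via policy_kwargs
model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(optimizer_class=th.optim.Adam, optimizer_kwargs=dict(eps=1e-5)),
)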
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Training budget (cap the max number of iterations)
N_STEPS = 1000

def test_ppo():
    agent = PPO("MlpPolicy", "CartPole-v1").learn(N_STEPS)
    # Evaluate the trained agent (mean return over 20 episodes)
    episodic_return, _ = evaluate_policy(agent, agent.get_env(), n_eval_episodes=20)
    # Check that the performance is above a given threshold
    assert episodic_return > 90
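Such a performance test can then be run like any other unit test, for instance with pytest (assuming it is installed):

pytest -k test_ppo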
Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.
# Train an SAC agent on Pendulum using tuned hyperparameters,
# evaluate the agent every 1k steps and save a checkpoint every 10k steps
# Pass custom hyperparams to the algo/env
python -m rl_zoo3.train --algo sac --env Pendulum-v1 --eval-freq 1000 \
--save-freq 10000 -params train_freq:2 --env-kwargs g:9.8
sac/
└── Pendulum-v1_1 # One folder per experiment
├── 0.monitor.csv # episodic return
├── best_model.zip # best model according to evaluation
├── evaluations.npz # evaluation results
├── Pendulum-v1
│ ├── args.yml # custom cli arguments
│ ├── config.yml # hyperparameters
│ └── vecnormalize.pkl # normalization
├── Pendulum-v1.zip # final model
└── rl_model_10000_steps.zip # checkpoint
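As a sketch (paths follow the layout above, assuming logs are written under logs/), the saved agent can be reloaded with the standard SB3 API:

from stable_baselines3 import SAC

# Load the best model found during periodic evaluation
model = SAC.load("logs/sac/Pendulum-v1_1/best_model.zip")
# If the env was normalized during training, also reload the
# VecNormalize statistics from Pendulum-v1/vecnormalize.pkl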
python -m rl_zoo3.cli all_plots -a sac -e HalfCheetah Ant -f logs/ -o sac_results
python -m rl_zoo3.cli plot_from_file -i sac_results.pkl -latex -l SAC --rliable
Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" In preparation, 2023.
The 37 Implementation Details of Proximal Policy Optimization
Iterate quickly!
Log useful values, ipdb, visualize
Easy ➤ Medium ➤ Hard
More in the backup slides | 7 mistakes challenge
Using SB3 + Jax = SBX: https://github.com/araffin/sbx
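A minimal sketch, assuming SBX keeps the same API as SB3 (as advertised in the repo):

from sbx import PPO  # JAX-based implementation, SB3-compatible API

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)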
A Simple Open-Loop Baseline for RL Locomotion Tasks
Raffin et al. "A Simple Open-Loop Baseline for RL Locomotion Tasks" In preparation, ICLR 2024.
The 37 Implementation Details of Proximal Policy Optimization
Nuts and Bolts of Deep RL Experimentation
# Note: done = terminated or truncated

# Off-policy algorithms (replay buffer)
# If the episode is terminated, the target is the reward only (no bootstrapping)
should_bootstrap = np.logical_not(replay_data.terminateds)
# 1-step TD target
td_target = replay_data.rewards + should_bootstrap * (gamma * next_q_values)

# On-policy algorithms (rollout buffer)
# If the episode is truncated (e.g. time limit), bootstrap using the value of the next state
if truncated:
    terminal_reward += gamma * next_value
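A minimal collection-loop sketch (plain Gymnasium, no SB3) showing where the two flags come from; only terminated should switch bootstrapping off:

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
for _ in range(1_000):
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)
    # Store `terminated` (not `terminated or truncated`) with the transition:
    # a time-limit truncation is not a real end of the episode,
    # so the value of the next state should still be bootstrapped.
    if terminated or truncated:
        next_obs, _ = env.reset()
    obs = next_obs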
import gymnasium as gym
import numpy as np
from gymnasium.envs.mujoco.mujoco_env import MujocoEnv
# Env initialization
env = gym.make("Swimmer-v4", render_mode="human")
# Wrap to have reward statistics
env = gym.wrappers.RecordEpisodeStatistics(env)
mujoco_env = env.unwrapped
n_joints = 2
assert isinstance(mujoco_env, MujocoEnv)
# PD Controller gains
kp, kd = 10, 0.5
# Reset the environment
t, _ = 0.0, env.reset(seed=0)
# Oscillator parameters: frequency (rad/s) and phase shift per joint
omega = 2 * np.pi * 0.62 * np.ones(n_joints)
phase = 2 * np.pi * np.array([0.00, 0.95])
while True:
    env.render()
    # Open-loop control using oscillators
    desired_qpos = np.sin(omega * t + phase)
    # PD control: convert to torque, desired qvel is zero
    desired_torques = (
        kp * (desired_qpos - mujoco_env.data.qpos[-n_joints:])
        - kd * mujoco_env.data.qvel[-n_joints:]
    )
    desired_torques = np.clip(desired_torques, -1.0, 1.0)  # clip to action bounds
    _, reward, terminated, truncated, info = env.step(desired_torques)
    t += mujoco_env.dt
    if terminated or truncated:
        print(f"Episode return: {float(info['episode']['r']):.2f}")
        t, _ = 0.0, env.reset()
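To run this snippet locally, the MuJoCo bindings are needed, e.g. (assuming a recent Gymnasium release):

pip install "gymnasium[mujoco]"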