Which algorithm is better?
The only difference: the epsilon $\epsilon$ value used to avoid division by zero in the optimizer (one run uses $\epsilon$ = 1e-7, the other $\epsilon$ = 1e-5).
Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." NeurIPS (2021).
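As an illustration, a minimal sketch (assuming Stable-Baselines3, where the optimizer epsilon can be passed through policy_kwargs/optimizer_kwargs) of how such a seemingly trivial difference enters the code:

from stable_baselines3 import PPO

# Same algorithm and hyperparameters, only the Adam epsilon differs
agent_a = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=dict(optimizer_kwargs=dict(eps=1e-7)))
agent_b = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=dict(optimizer_kwargs=dict(eps=1e-5)))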
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Training budget (cap the max number of iterations)
N_STEPS = 1000

def test_ppo():
    agent = PPO("MlpPolicy", "CartPole-v1").learn(N_STEPS)
    # Evaluate the trained agent (mean return over 20 episodes)
    episodic_return, _ = evaluate_policy(agent, agent.get_env(), n_eval_episodes=20)
    # Check that the performance is above a given threshold
    assert episodic_return > 90
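Such a performance test can run regularly, e.g. in CI; assuming the test lives in a file that pytest discovers, a typical invocation would be:

pytest -k test_ppo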
Raffin, Antonin, Jens Kober, and Freek Stulp. "Smooth exploration for robotic reinforcement learning." CoRL. PMLR, 2022.
# Train an SAC agent on Pendulum using tuned hyperparameters,
# evaluate the agent every 1k steps and save a checkpoint every 10k steps
# Pass custom hyperparams to the algo/env
python -m rl_zoo3.train --algo sac --env Pendulum-v1 --eval-freq 1000 \
--save-freq 10000 -params train_freq:2 --env-kwargs g:9.8
sac/
└── Pendulum-v1_1                     # One folder per experiment
    ├── 0.monitor.csv                 # episodic return
    ├── best_model.zip                # best model according to evaluation
    ├── evaluations.npz               # evaluation results
    ├── Pendulum-v1
    │   ├── args.yml                  # custom CLI arguments
    │   ├── config.yml                # hyperparameters
    │   └── vecnormalize.pkl          # normalization
    ├── Pendulum-v1.zip               # final model
    └── rl_model_10000_steps.zip      # checkpoint
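To replay the trained agent, the zoo also ships an enjoy script (a sketch, assuming the standard rl_zoo3 entry point and that logs were written to logs/):

python -m rl_zoo3.enjoy --algo sac --env Pendulum-v1 -f logs/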
python -m rl_zoo3.cli all_plots -a sac -e HalfCheetah Ant -f logs/ -o sac_results
python -m rl_zoo3.cli plot_from_file -i sac_results.pkl -latex -l SAC --rliable
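Before plotting, the saved evaluation results can also be inspected directly (a minimal sketch; the timesteps/results keys are the ones SB3's EvalCallback writes, and the path assumes the folder layout shown above):

import numpy as np

data = np.load("logs/sac/Pendulum-v1_1/evaluations.npz")
# "results" has shape (n_evaluations, n_eval_episodes)
mean_returns = data["results"].mean(axis=1)
for step, mean_return in zip(data["timesteps"], mean_returns):
    print(f"{int(step)}: {mean_return:.2f}")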
Raffin et al. "Learning to Exploit Elastic Actuators for Quadruped Locomotion" In preparation, 2023.
The 37 Implementation Details of Proximal Policy Optimization
Iterate quickly!
Log useful values, debug with ipdb, visualize
Easy ➤ Medium ➤ Hard
More in the backup slides | 7 mistakes challenge
Using SB3 + Jax = SBX: https://github.com/araffin/sbx
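SBX mirrors (part of) the SB3 API, so switching is mostly a one-line import change (a sketch, assuming the SBX PPO class follows the SB3 interface):

from sbx import PPO  # instead of: from stable_baselines3 import PPO

agent = PPO("MlpPolicy", "CartPole-v1")
agent.learn(total_timesteps=10_000)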
A Simple Open-Loop Baseline for RL Locomotion Tasks
Raffin et al. "A Simple Open-Loop Baseline for RL Locomotion Tasks" In preparation, ICLR 2024.
Nuts and Bolts of Deep RL Experimentation
# Note: done = terminated or truncated

# Off-policy algorithms:
# if the episode is terminated, the target is the reward only,
# otherwise bootstrap with the next Q-value
should_bootstrap = np.logical_not(replay_data.terminateds)
# 1-step TD target
td_target = replay_data.rewards + should_bootstrap * (gamma * next_q_values)

# On-policy algorithms:
# if the episode is truncated (time limit), bootstrap from the value of the last observation
if truncated:
    terminal_reward += gamma * next_value
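For context, a minimal Gymnasium collection loop (a sketch; the list-based buffer is just for illustration) that keeps terminated and truncated separate, so the TD target above can bootstrap correctly on time-limit truncations:

import gymnasium as gym

env = gym.make("CartPole-v1")
buffer = []  # placeholder replay buffer
obs, _ = env.reset(seed=0)
for _ in range(1_000):
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)
    # Store terminated (not terminated or truncated):
    # a time-limit truncation should still bootstrap from next_obs
    buffer.append((obs, action, reward, next_obs, terminated))
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()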
import gymnasium as gym
import numpy as np
from gymnasium.envs.mujoco.mujoco_env import MujocoEnv
# Env initialization
env = gym.make("Swimmer-v4", render_mode="human")
# Wrap to have reward statistics
env = gym.wrappers.RecordEpisodeStatistics(env)
mujoco_env = env.unwrapped
n_joints = 2
assert isinstance(mujoco_env, MujocoEnv)
# PD Controller gains
kp, kd = 10, 0.5
# Reset the environment
t, _ = 0.0, env.reset(seed=0)
# Oscillator parameters
omega = 2 * np.pi * 0.62 * np.ones(n_joints)
phase = 2 * np.pi * np.array([0.00, 0.95])
while True:
    env.render()
    # Open-loop control using oscillators
    desired_qpos = np.sin(omega * t + phase)
    # PD control: convert to torque, desired qvel is zero
    desired_torques = (
        kp * (desired_qpos - mujoco_env.data.qpos[-n_joints:])
        - kd * mujoco_env.data.qvel[-n_joints:]
    )
    desired_torques = np.clip(desired_torques, -1.0, 1.0)  # clip to action bounds
    _, reward, terminated, truncated, info = env.step(desired_torques)
    t += mujoco_env.dt
    if terminated or truncated:
        print(f"Episode return: {float(info['episode']['r']):.2f}")
        t, _ = 0.0, env.reset()