www.dlr.de · Antonin RAFFIN · Direct Policy Search Tutorial · RL Summer School Milano · 10.06.2026

Direct Policy Search
(Black-Box Optimization & Policy Gradient)

Antonin RAFFIN (@araffin.bsky.social)
German Aerospace Center (DLR)
https://araffin.github.io/

GitHub Repository

https://github.com/araffin/rlss26-pg-tutorial

Aims

  • Learn a policy directly (no intermediate $Q^\pi_\theta(s, a)$)
  • Policy from a control perspective
  • PG from a classification perspective

Outline

  1. From line following to autonomous racing
  2. Bang-bang and PD controller
  3. Black-box optimization
  4. From classification to policy gradient

Line following (2014)

Line following and autonomous racing

Learning to race in minutes

Learning to race in an hour

Differential-Drive Robot

Dynamics
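
As a rough illustration of differential-drive dynamics, here is a minimal kinematic sketch. It is not taken from the tutorial repository: the function name `diff_drive_step`, the `wheel_base` and `dt` values, and the explicit-Euler integration are all assumptions chosen for readability.

```python
import math


def diff_drive_step(x, y, heading, left_wheel_speed, right_wheel_speed,
                    wheel_base=0.2, dt=0.05):
    """One explicit-Euler step of differential-drive kinematics.

    Wheel speeds are linear speeds at the wheel contact point (m/s).
    wheel_base (distance between the wheels) and dt are assumed values.
    """
    # The robot moves forward at the average of the two wheel speeds
    forward_velocity = 0.5 * (left_wheel_speed + right_wheel_speed)
    # A speed difference between the wheels makes the robot rotate
    angular_velocity = (right_wheel_speed - left_wheel_speed) / wheel_base
    x += forward_velocity * math.cos(heading) * dt
    y += forward_velocity * math.sin(heading) * dt
    heading += angular_velocity * dt
    return x, y, heading
```

With equal wheel speeds the robot drives straight; with this sign convention, a faster right wheel increases the heading (counter-clockwise turn).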

Observation Space


    obs_high = np.array(
        [
            self.off_track_threshold,  # lateral error
            np.inf,  # lateral error derivative
        ],
        dtype=np.float32,
    )
    self.observation_space = spaces.Box(low=-obs_high, high=obs_high)
    # Later: [lateral_error, heading_error, forward_velocity,
    #         angular_velocity, left_wheel_speed, right_wheel_speed,
    #         curvature, lookahead_lat_2, lookahead_lat_4, lookahead_lat_6]

Action Space


    # Action: [left_wheel_speed, right_wheel_speed]
    left_wheel_speed = base_speed + steering
    right_wheel_speed = base_speed - steering
    # 1-D action  →  steering ∈ [-1, 1]
    self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,))

Reward

Stay close to the line while moving forward


    # Note: always normalize!
    lateral_penalty = -((lateral_error / self.off_track_threshold) ** 2)
    alive_bonus = 1.0  # otherwise might try to terminate early
    reward = alive_bonus + lateral_penalty + forward_velocity

What can be changed for racing?

Termination conditions?

Move forward and stay on the track


    off_track = abs(lateral_error) > self.off_track_threshold
    terminated = off_track or going_reverse
    truncated = self.step_count >= self.max_episode_steps  # timeout

Note: a timeout (truncation) is not a true terminal state, so it needs special handling in the algorithm
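
One common form of this special handling appears in algorithms that bootstrap from a value estimate. The sketch below is an illustration only: `td_target`, `next_value` and `gamma` are hypothetical names, not part of the tutorial code. The point is that only `terminated` cuts the bootstrap, while a truncated transition is treated like any other step.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step bootstrapped target (illustrative helper, assumed names).

    Only a true termination zeroes the next-state value. A truncated
    (timed-out) transition still bootstraps from next_value, because
    the episode could have continued.
    """
    bootstrap = 0.0 if terminated else next_value
    return reward + gamma * bootstrap
```

For example, `td_target(1.0, 10.0, terminated=False)` still includes the discounted next value even if the step was the last one before a timeout.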

Bang-Bang Control


    if lateral_error > 0:
        action = STEER_LEFT
    else:
        action = STEER_RIGHT

PD Control

PD Controller as Policy

$a_t = \textcolor{#1864ab}{K_p} \textcolor{#a61e4d}{e_t} + \textcolor{#1864ab}{K_d} \textcolor{#a61e4d}{\frac{e_t - e_{t -1}}{\Delta t}}$

$a_t = \begin{bmatrix} \textcolor{#1864ab}{K_p} & \textcolor{#1864ab}{K_d} \end{bmatrix} \cdot \begin{bmatrix} \textcolor{#a61e4d}{e_t} \\ \textcolor{#a61e4d}{\frac{e_t - e_{t -1}}{\Delta t}} \end{bmatrix}$

$a_t = \textcolor{#1864ab}{\theta}^\top \textcolor{#a61e4d}{s_t}$     

$\pi_{\textcolor{#1864ab}{\theta}}(\textcolor{#a61e4d}{s_t}) = \textcolor{#1864ab}{\theta}^\top \textcolor{#a61e4d}{s_t}$      a linear policy!
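
The equations above can be sketched as a small Python class. This is a minimal illustration, not the notebook's implementation: the class name `PDPolicy`, the zero derivative on the first step, and the clipping to $[-1, 1]$ (to match the steering range) are assumptions. The sign of the gains depends on the error convention of the environment.

```python
import numpy as np


class PDPolicy:
    """PD controller written as a linear policy a_t = theta^T s_t."""

    def __init__(self, k_p, k_d, dt):
        self.theta = np.array([k_p, k_d], dtype=np.float64)  # parameters
        self.dt = dt
        self.prev_error = None

    def reset(self):
        self.prev_error = None

    def __call__(self, error):
        # Assumption: on the first step there is no previous error,
        # so the derivative term is set to zero
        if self.prev_error is None:
            self.prev_error = error
        # State s_t = [e_t, (e_t - e_{t-1}) / dt]
        state = np.array([error, (error - self.prev_error) / self.dt])
        self.prev_error = error
        action = float(self.theta @ state)
        # Assumption: clip to the steering range [-1, 1]
        return float(np.clip(action, -1.0, 1.0))
```

Usage: `policy = PDPolicy(k_p=2.0, k_d=0.1, dt=0.1)` followed by `policy(lateral_error)` at every control step, and `policy.reset()` at the start of each episode.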

Questions?

How to find a good policy?

$\pi_{\textcolor{#1864ab}{\theta}}(\textcolor{#a61e4d}{s_t}) = \begin{bmatrix} \textcolor{#1864ab}{K_p} & \textcolor{#1864ab}{K_d} \end{bmatrix} \textcolor{#a61e4d}{s_t} = \textcolor{#1864ab}{\theta}^\top \textcolor{#a61e4d}{s_t}$

  • how to find $K_p$, $K_d$ automatically?
  • how to extend $s_t$?
    ex: $s_t = [e_t, \frac{e_t - e_{t -1}}{\Delta t}, v_t, \text{curv}_t \ldots]$
  • more complex policy $\pi_\theta$?

Black-Box Optimization

$\theta^* = \text{argmin}_{\theta}{J(\theta)}$

\[\begin{aligned} J(\theta) = -\mathop{\mathbb{E}}[\sum_{t=0}^T r_t]. \end{aligned} \]
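
In practice the expectation is estimated by running a few episodes and averaging their returns. The sketch below does this on a toy problem invented for illustration: the lateral error follows a double integrator driven by the action, and the reward is an alive bonus plus a normalized lateral penalty. The functions `rollout_return` and `estimate_cost`, the dynamics, and all constants are assumptions, not the tutorial's environment.

```python
import numpy as np


def rollout_return(theta, rng, n_steps=200, dt=0.05, threshold=1.0):
    """Return of one episode on a toy 'line following' task (assumed dynamics)."""
    error = rng.uniform(-0.5, 0.5)  # random initial lateral error
    error_rate = 0.0
    total = 0.0
    for _ in range(n_steps):
        state = np.array([error, error_rate])
        # Linear policy, clipped to the steering range
        action = float(np.clip(theta @ state, -1.0, 1.0))
        # Toy double integrator: the action accelerates the error
        error_rate += action * dt
        error += error_rate * dt
        if abs(error) > threshold:  # off track: the episode terminates
            break
        total += 1.0 - (error / threshold) ** 2  # alive bonus + penalty
    return total


def estimate_cost(theta, n_episodes=10, seed=0):
    """Monte Carlo estimate of J(theta) = -E[sum of rewards]."""
    rng = np.random.default_rng(seed)
    returns = [rollout_return(theta, rng) for _ in range(n_episodes)]
    return -float(np.mean(returns))
```

On this toy task, negative gains such as `np.array([-4.0, -2.0])` stabilize the error and give a lower cost than `np.zeros(2)`, while positive gains drive the robot off track. Fixing the seed means every candidate $\theta$ is evaluated on the same initial conditions, which reduces noise when comparing candidates.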

Episodic RL?

Transition: Finite Difference

Transition: Finite Difference (2)

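Idea: $\nabla_{\theta}J(\theta) \approx \frac{J(\theta + \delta\theta) - J(\theta)}{\delta\theta}$
\[ \nabla_{\theta} J(\theta) \approx \frac{1}{2 N \epsilon} \sum_{i=1}^N \left[ J(\theta^{+}_i) - J(\theta^{-}_i) \right] \mathbf{u}_i, \qquad \theta^{\pm}_i = \theta \pm \epsilon \mathbf{u}_i \]
where the $\mathbf{u}_i$ are random directions. Since $J$ is minimized, the update is a gradient descent step:
$$\theta^{t+1} = \theta^t - \eta \nabla_{\theta}J(\theta^t)$$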
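
A minimal sketch of this estimator, assuming Gaussian random directions $\mathbf{u}_i$ and perturbed parameters $\theta \pm \epsilon \mathbf{u}_i$. To keep the example self-contained, the cost is a simple quadratic stand-in (`quadratic_cost`, a hypothetical test function) rather than an episodic return. Since $J$ is a cost to minimize, the step goes against the estimated gradient.

```python
import numpy as np


def antithetic_gradient(cost_fn, theta, rng, n_directions=8, epsilon=0.1):
    """Estimate grad J(theta) from pairs of symmetric perturbations."""
    grad = np.zeros_like(theta)
    for _ in range(n_directions):
        u = rng.standard_normal(theta.shape)  # random direction u_i
        cost_plus = cost_fn(theta + epsilon * u)
        cost_minus = cost_fn(theta - epsilon * u)
        grad += (cost_plus - cost_minus) * u
    return grad / (2.0 * n_directions * epsilon)


def quadratic_cost(theta):
    """Stand-in for J(theta): minimum at theta = [1, -2]."""
    target = np.array([1.0, -2.0])
    return float(np.sum((theta - target) ** 2))


rng = np.random.default_rng(0)
theta = np.zeros(2)
learning_rate = 0.05
for _ in range(300):
    # Gradient descent: J is a cost, so move against the gradient
    theta = theta - learning_rate * antithetic_gradient(quadratic_cost, theta, rng)
```

Each gradient estimate costs `2 * n_directions` evaluations of $J$, that is, `2 * n_directions` full episodes when $J$ is an episodic return.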

Questions?

PD Control and Black Box Optimization (1st notebook)

https://github.com/araffin/rlss26-pg-tutorial

BBO Limitations

  • Uses only the episodic return $R(\tau)$, not the individual rewards $r_t$
  • Scales poorly when $\pi_\theta$ is more complex (more parameters)
  • Policy gradient?

Classification 101

Cross-Entropy Loss

RL as classification (1)

RL as classification (2)

  • $R$ acts as weight
  • $R > 0$ (win -> reinforce the taken action)
  • $R < 0$ (lose -> make the taken action less likely)
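
The weighting idea can be sketched with a softmax policy over two discrete actions. This is an illustrative example under assumptions: the helper names, the step size, and the use of raw logits as parameters are not from the tutorial. The loss is the usual cross-entropy for the taken action, multiplied by the return used as a weight.

```python
import numpy as np


def softmax(logits):
    shifted = logits - np.max(logits)  # for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)


def weighted_cross_entropy_grad(logits, action, weight):
    """Gradient w.r.t. the logits of  -weight * log softmax(logits)[action]."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return weight * (probs - one_hot)


logits = np.zeros(2)  # two actions, e.g. steer left / steer right
step_size = 0.5

# R > 0: one descent step makes the taken action (index 0) more likely
win_logits = logits - step_size * weighted_cross_entropy_grad(logits, action=0, weight=1.0)
# R < 0: the same step makes the taken action less likely
lose_logits = logits - step_size * weighted_cross_entropy_grad(logits, action=0, weight=-1.0)
```

With `weight=1.0` this is exactly the supervised classification gradient, with the taken action playing the role of the label.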

Policy Gradient Loss (reminder)

Episodic Policy Gradient
(in practice)

Policy Gradient (full)
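
For reference, a standard way to write the episodic policy gradient, assuming a stochastic policy $\pi_\theta(a_t \mid s_t)$ and the episode return $R(\tau) = \sum_{t=0}^T r_t$ (the symbol $G_t$ below is introduced here for illustration):

\[ \nabla_{\theta} \mathop{\mathbb{E}}[R(\tau)] = \mathop{\mathbb{E}}\left[ \sum_{t=0}^T \nabla_{\theta} \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right] \]

Because an action cannot influence rewards received before it, $R(\tau)$ can be replaced by the reward-to-go without changing the expectation:

\[ \nabla_{\theta} \mathop{\mathbb{E}}[R(\tau)] = \mathop{\mathbb{E}}\left[ \sum_{t=0}^T \nabla_{\theta} \log \pi_\theta(a_t \mid s_t) \, G_t \right], \qquad G_t = \sum_{k=t}^T r_k \]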

Some remarks

  • Noisy gradient
  • Scales with action/policy dim
  • Not a true gradient for the discounted case

Questions?

Policy Gradient (2nd notebook)

https://github.com/araffin/rlss26-pg-tutorial

Backup Slides