Most deep reinforcement learning (RL) algorithms are based on a deterministic notion of optimality, under which the optimal solution is always a deterministic policy (at least under full observability). However, a stochastic policy may be preferable for several reasons:

  • Better exploration: A stochastic policy naturally balances exploration and exploitation, in contrast to heuristic exploration methods like $\epsilon$-greedy.
  • Robustness and better generalization: A stochastic policy is more robust to adversarial perturbations and serves as a better initialization for learning new skills.

Maximum entropy RL optimizes the entropy of the policy in addition to the expected reward, which encourages the policy to be more stochastic. The new learning objective also has some interesting properties and leads to useful algorithms that bridge traditional value-based and policy-based methods (Soft Q-Learning, Path Consistency Learning, Soft Actor-Critic, etc.).

Functional Derivative

In this section, we provide a quick introduction to the functional derivative, which is crucial for proving some key theorems in maximum entropy RL.

Definition

A functional $J[q]$ is a mapping from a function $q(x)$ to a real value. Here, we focus on a simple class of functionals that can be expressed as an integral of a function of $q(x)$:

\[J[q] = \int_a^b F(q(x), x) \, dx\]

To measure “how a small change in $q(x)$ affects $J[q]$”, we define the functional derivative $\frac{\delta J}{\delta q(x)}$ as follows:

\[J[q + \epsilon \eta] - J[q] \approx \epsilon \int \frac{\delta J}{\delta q(x)} \eta(x) \, dx\]

Here, $\eta(x)$ is an arbitrary test function, playing a role similar to that of $\Delta x$ in the ordinary derivative.

Solution

It’s easy to calculate the functional derivative of $J$ in the simple form defined above. First, add a small perturbation $\epsilon \eta$ to $q$:

\[J[q + \epsilon \eta] = \int_a^b F(q(x) + \epsilon \eta(x), x) \, dx\]

Applying a Taylor expansion of $F$ around $q(x)$ and keeping only the first-order term in $\epsilon$:

\[J[q + \epsilon \eta] \approx \int_a^b \left[F(q(x), x) \, + \frac{\partial F}{\partial q} \epsilon \eta(x)\right] dx\] \[J[q + \epsilon \eta] - J[q] \approx \epsilon \int_a^b \frac{\partial F}{\partial q} \eta(x) dx\]

Comparing with the definition of the functional derivative, we find that $\frac{\delta J}{\delta q(x)}=\frac{\partial F}{\partial q}$.
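
As a sanity check, here is a small numerical experiment (a sketch with an arbitrary choice $F(q, x) = q(x)^2 \sin(x)$ on $[0, \pi]$, not anything from the theory above) verifying that the first-order change in $J$ matches $\epsilon \int \frac{\partial F}{\partial q}\eta(x)\,dx$:

```python
# Numerical check of the functional-derivative result, with F(q, x) = q(x)^2 * sin(x).
import numpy as np

a, b, n = 0.0, np.pi, 2001
x = np.linspace(a, b, n)

def J(q):
    # J[q] = \int_a^b F(q(x), x) dx with F(q, x) = q^2 * sin(x)
    return np.trapz(q**2 * np.sin(x), x)

q = np.cos(x) + 2.0          # some smooth function q(x)
eta = np.sin(3 * x)          # arbitrary test function eta(x)
eps = 1e-4

# Left-hand side: J[q + eps*eta] - J[q]
lhs = J(q + eps * eta) - J(q)

# Right-hand side: eps * \int (dF/dq) * eta dx, with dF/dq = 2*q*sin(x)
rhs = eps * np.trapz(2 * q * np.sin(x) * eta, x)

print(lhs, rhs)   # the two values agree up to O(eps^2)
```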

Application

The result above can be applied to the following entropy-regularized optimization problem, which maximizes an expectation plus an entropy bonus:

\[\mathcal{F}(q) = \int q(x) f(x) \, dx - \lambda \int q(x) \log q(x) \, dx\]

The functional derivative of $ \mathcal{F}(q) $ with respect to $ q(x) $ has two terms:

  • Expectation Term: \(\frac{\delta}{\delta q(x)} \left( \int q(x) f(x) \, dx \right) = f(x)\)

  • Entropy Term: \(\frac{\delta}{\delta q(x)} \left(- \lambda \int q(x) \log q(x) \, dx \right) = -\lambda (\log q(x) + 1)\)

Combining these:

\[\frac{\delta \mathcal{F}(q)}{\delta q(x)} = f(x) - \lambda (\log q(x) + 1)\]

Setting this to zero to find the optimal $ q(x)$:

\[f(x) - \lambda (\log q(x) + 1) = 0\]

Solving for $q(x)$:

\[q(x) = \exp\left( \frac{f(x)}{\lambda} - 1 \right)\]

Since we haven’t put any constraint on $q$, the resulting $q$ may not be a valid probability distribution. However, replacing $f(x)$ with $f(x) + C$ for any constant $C$ only rescales the solution by a constant factor $e^{C/\lambda}$, and there exists a $C$ that makes the solution integrate to $1$. The normalized solution therefore still satisfies:

\[q(x) \propto \exp\left( \frac{f(x)}{\lambda} \right)\]
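
To see this concretely, here is a minimal check on a discrete domain (so the integral becomes a sum), with an arbitrary $f$ and $\lambda$ chosen for illustration: the closed-form softmax solution attains a higher objective value than random distributions on the simplex.

```python
# Discrete check that q* ∝ exp(f/λ) maximizes E_q[f] + λ H(q).
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)   # f(x) on 5 discrete points
lam = 0.5                # entropy weight λ

def objective(q):
    # Discrete analogue of F(q) = ∫ q f dx - λ ∫ q log q dx
    return q @ f - lam * np.sum(q * np.log(q))

# Closed-form optimum: a softmax with temperature λ
q_star = np.exp(f / lam)
q_star /= q_star.sum()

# Compare against many random distributions on the simplex
best_random = max(objective(rng.dirichlet(np.ones(5))) for _ in range(2000))
print(objective(q_star), best_random)   # q_star attains the larger value
```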

Formulation

The objective of maximum entropy RL is:

\[\pi_{\text {MaxEnt }}^*=\arg \max _\pi \sum_t \mathbb{E}_{\left(\mathbf{s}_t, \mathbf{a}_t\right) \sim \rho_\pi}\left[r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_t\right)\right)\right]\]

Note that this objective not only maximizes entropy at the current time step, but also encourages the policy to reach states from which it can maintain high entropy in the future.

We can first define the soft Q-function and soft value function by incorporating the entropy bonus into the standard Q-function and value function:

\[Q_{\text {soft }}^{\pi}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r_t+ \mathbb{E}_{\left(\mathbf{s}_{t+1}, \ldots\right) \sim \rho_\pi}\left[\sum_{l=1}^{\infty} \gamma^l\left(r_{t+l}+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_{t+l}\right)\right)\right)\right]\] \[V_{\mathrm{soft}}^\pi\left(\mathbf{s}_t\right)=\mathbb{E}_{\mathbf{a}^{\prime}\sim \pi(\cdot \mid \mathbf{s}_t)} \left[Q_{\mathrm{soft}}^\pi\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right] + \alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_{t}\right)\right)\]

Note that the definition of the soft value function above has been written compactly in terms of the soft Q-function.
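
For a discrete action space, the soft value function can be computed directly from the soft Q-values and the policy; the numbers below are made up purely for illustration:

```python
# Discrete-action sketch: V^π_soft(s) = E_{a~π(.|s)}[Q^π_soft(s, a)] + α H(π(.|s)).
import numpy as np

alpha = 0.2
Q_row = np.array([1.0, 0.5, -0.3])           # Q^π_soft(s, a) for 3 actions (made up)
pi_row = np.array([0.6, 0.3, 0.1])           # π(a | s) (made up)

entropy = -np.sum(pi_row * np.log(pi_row))   # H(π(.|s))
V_soft = pi_row @ Q_row + alpha * entropy
print(V_soft)
```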

To improve a policy $\pi$, we can maximize the soft value function above over the action distribution at each state, which gives:

\[\hat\pi(\mathbf{a}^{\prime} \mid \mathbf{s}_t)\propto \exp\left(\frac{1}{\alpha}Q_{\mathrm{soft}}^\pi\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right)\]

This is exactly the entropy-regularized optimization problem we solved with the functional derivative, with $f = Q_{\mathrm{soft}}^\pi(\mathbf{s}_t, \cdot)$ and $\lambda = \alpha$.

So, the optimal soft policy must take the form below, because otherwise it could be further improved in this way.

\[\pi^*(\mathbf{a}_t \mid \mathbf{s}_t)\propto \exp\left(\frac{1}{\alpha}Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)\]

We can plug this directly into the definition of the soft value function and get the soft Bellman equation (Theorem 2 in Soft Q-Learning):

\[V_{\mathrm{soft}}^*\left(\mathbf{s}_t\right)=\alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}\]

As $\alpha \to 0$, this recovers the standard Bellman equation: the $\alpha \log \int \exp(\cdot / \alpha)$ operator is a soft maximum that approaches $\max$.
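
A quick numerical illustration of this limit, using an arbitrary Q-vector over a discrete action set (so the integral becomes a log-sum-exp):

```python
# As α shrinks, α·log Σ_a exp(Q(s,a)/α) approaches max_a Q(s,a).
import numpy as np
from scipy.special import logsumexp

Q = np.array([1.0, 2.0, 1.5])   # stand-in values for Q_soft(s, ·)
for alpha in [1.0, 0.1, 0.01]:
    print(alpha, alpha * logsumexp(Q / alpha))   # tends to 2.0, the hard max
```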

The integral in the equation above is the normalizing constant of the policy, so we can express that constant with the soft value function:

\[\int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}=\exp \left(\frac{1}{\alpha}V_{\mathrm{soft}}^*\left(\mathbf{s}_t\right)\right)\]

Then, the optimal soft policy can be written as below (Theorem 1 in Soft Q-Learning):

\[\pi_{\text {MaxEnt }}^*\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\exp \left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}_t\right)-V_{\mathrm{soft}}^*\left(\mathbf{s}_t\right)\right)\right)\]
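
In a discrete action space, this is just a softmax over $Q^*_{\mathrm{soft}} / \alpha$; the sketch below (with arbitrary stand-in Q-values) confirms the policy is properly normalized:

```python
# π*(a|s) = exp((Q*(s,a) - V*(s))/α) with V*(s) = α·log Σ_a exp(Q*(s,a)/α)
# is exactly a softmax over Q*/α for a discrete action space.
import numpy as np
from scipy.special import logsumexp, softmax

alpha = 0.5
Q = np.array([1.0, 2.0, 1.5])               # stand-in values for Q*_soft(s, ·)

V = alpha * logsumexp(Q / alpha)            # soft value function
pi = np.exp((Q - V) / alpha)                # optimal soft policy

print(pi, pi.sum())                          # a valid distribution (sums to 1)
print(np.allclose(pi, softmax(Q / alpha)))   # True: a softmax with temperature α
```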

Soft Q-Iteration

Soft Q-Learning is based on a fixed-point iteration that resembles standard Q-iteration:

\[\begin{aligned} Q_{\text {soft }}\left(\mathbf{s}_t, \mathbf{a}_t\right) & \leftarrow r_t+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\text {soft }}\left(\mathbf{s}_{t+1}\right)\right], \forall \mathbf{s}_t, \mathbf{a}_t \\ V_{\text {soft }}\left(\mathbf{s}_t\right) & \leftarrow \alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\text {soft }}\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}, \forall \mathbf{s}_t \end{aligned}\]

The iteration converges to \(Q^*_{\text{soft}}\) and \(V^*_{\text{soft}}\) (Theorem 3 in Soft Q-Learning). Intuitively, the policy is monotonically improved in this way, and the process stops only when the soft Bellman equation is satisfied everywhere.
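
Here is a tabular sketch of soft Q-iteration on a tiny randomly generated MDP (a toy example, not from the paper), with the integral over actions replaced by a log-sum-exp over a discrete action set:

```python
# Tabular soft Q-iteration on a small random MDP with discrete states and actions.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 3, 0.9, 0.2

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = alpha * logsumexp(Q / alpha, axis=1)   # soft value update (log-sum-exp over actions)
    Q_new = R + gamma * P @ V                  # soft Bellman backup: r + γ E_{s'}[V(s')]
    if np.max(np.abs(Q_new - Q)) < 1e-8:       # stop once the soft Bellman equation holds
        break
    Q = Q_new

# Recover the optimal maximum-entropy policy: π*(a|s) = exp((Q* - V*)/α)
V = alpha * logsumexp(Q / alpha, axis=1)
pi = np.exp((Q - V[:, None]) / alpha)
print(pi.round(3))   # each row sums to 1
```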

Learning the soft Q-function

Soft Q-Learning parametrizes the soft Q-function as $Q_{\text{soft}}^\theta(s,a)$ and represents the soft value function in terms of it. At each step of soft Q-iteration:

  • First, estimate $V^\theta_{\text{soft}}(s)$ with importance sampling under an arbitrary proposal distribution $q_{\mathbf{a}^{\prime}}$. Note that if the action space is finite, importance sampling is unnecessary, as we can directly enumerate all possible actions. \(V_{\mathrm{soft}}^\theta\left(\mathbf{s}_t\right)=\alpha \log \mathbb{E}_{q_{\mathbf{a}^{\prime}}}\left[\frac{\exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^\theta\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right)}{q_{\mathbf{a}^{\prime}}\left(\mathbf{a}^{\prime}\right)}\right]\)

  • Then, compute the target Q-value using the value function at the next state. \(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r_t+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathrm{s}}}\left[V_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\)

  • Finally, optimize the soft Q-function towards the target (see the sketch after this list): \(J_Q(\theta)=\mathbb{E}_{\mathbf{s}_t \sim q_{s_t}, \mathbf{a}_t \sim q_{a_t}}\left[\frac{1}{2}\left(\hat{Q}_{\text {soft }}^{\bar{\theta}}\left(\mathbf{s}_t, \mathbf{a}_t\right)-Q_{\text {soft }}^\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)^2\right]\)
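
The three steps above can be sketched as follows for a one-dimensional continuous action space. This is only an illustrative numpy sketch under several assumptions: the critic $Q_{\text{soft}}^\theta$ and the single transition are made-up stand-ins, and the proposal $q_{\mathbf{a}^{\prime}}$ is a standard Gaussian.

```python
# Sketch of one soft Q-learning target computation with an importance-sampled
# value estimate, for a single (made-up) transition.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

alpha, gamma = 0.2, 0.99

def Q_soft(s, a):
    # Stand-in for the parametric soft Q-network Q_soft^theta(s, a).
    return -((a - 0.3 * s) ** 2)

def V_soft(s, n_samples=256):
    # Importance-sampled estimate of V_soft(s) = α·log E_q[exp(Q/α) / q(a')],
    # with a standard Gaussian proposal q.
    a = norm.rvs(size=n_samples, random_state=0)            # a' ~ q
    log_w = Q_soft(s, a) / alpha - norm.logpdf(a)           # log of exp(Q/α) / q(a')
    return alpha * (logsumexp(log_w) - np.log(n_samples))   # log of the sample mean

# One transition (s_t, a_t, r_t, s_{t+1}) from the replay buffer (made-up numbers)
s_t, a_t, r_t, s_next = 1.0, 0.5, 0.1, 1.2

target = r_t + gamma * V_soft(s_next)        # \hat{Q}^{\bar θ}_soft(s_t, a_t)
td_error = target - Q_soft(s_t, a_t)
loss = 0.5 * td_error ** 2                   # J_Q(θ) for this single sample
print(loss)
```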

Sampling from the soft policy

The next challenge is to sample from the policy, which is non-trivial because the policy is an energy-based model defined by the soft Q-function, with an unknown normalizing constant.

The Soft Q-Learning paper uses amortized Stein variational gradient descent (SVGD) to sample from the policy. This involves:

  • Parametrizing a policy network $f^\phi\left(\xi ; \mathbf{s}_t\right)$ that maps a Gaussian noise sample $\xi$ to an action.

  • Minimizing the KL divergence from $\pi^\phi$ to the energy-based policy induced by the Q-function: first sample a set of actions, compute the update directions that most greedily decrease the KL divergence, and then backpropagate these directions into the policy network.

Because this part is relatively independent of the main formulation of maximum entropy RL, we skip the details of SVGD in this blog.


Acknowledgement: Thanks to Yingyu Lin for discussion and proofreading.