Most deep reinforcement learning (RL) methods are built on a deterministic notion of optimality, in which the optimal solution is always a deterministic policy (at least under full observability). However, a stochastic policy may be preferred for several reasons:

  • Better exploration: A stochastic policy strikes a balance between exploration and exploitation on its own, in contrast to heuristic exploration methods like $\epsilon$-greedy.
  • Robustness and better generalization: A stochastic policy is more robust to adversarial perturbations and serves as a better initialization for learning new skills.

Maximum entropy RL optimizes the entropy of the policy in addition to the expected reward, which encourages the policy to be more stochastic. The new learning objective also has some interesting properties and leads to useful algorithms (Soft Q-Learning, Path Consistency Learning, Soft Actor-Critic, etc.).

In this blog, we’ll first review some basics of functional derivatives, and use them to derive the soft Bellman equation, a core theorem in maximum entropy RL. Finally, we’ll discuss a concrete algorithm, Soft Q-Learning.

Functional Derivative

In this section, we provide a quick introduction to the functional derivative, which is needed to prove some key theorems in maximum entropy RL.

Definition

A functional $J[q]$ is a mapping from a function $q(x)$ to a real number. Here, we focus on a simple class of functionals that can be written as the integral of a function of $q(x)$:

\[J[q] = \int_a^b F(q(x), x) \, dx\]

To measure how a small change in $q(x)$ affects $J[q]$, we define the functional derivative $\frac{\delta J}{\delta q(x)}$ as follows:

\[J[q + \epsilon \eta] - J[q] \approx \epsilon \int \frac{\delta J}{\delta q(x)} \eta(x) \, dx\]

Here, $\eta(x)$ is an arbitrary test function and $\epsilon$ is a small scalar, playing a role similar to that of $\Delta x$ in the ordinary derivative.

Solution

It’s easy to calculate the functional derivative of $J$ for the simple form defined above. First, add a small perturbation $\epsilon \eta$ to $q$:

\[J[q + \epsilon \eta] = \int_a^b F(q(x) + \epsilon \eta(x), x) \, dx\]

Applying a first-order Taylor expansion of $F$ in its first argument around $q(x)$, and dropping $O(\epsilon^2)$ terms:

\[J[q + \epsilon \eta] \approx \int_a^b \left[F(q(x), x) + \frac{\partial F}{\partial q} \epsilon \eta(x)\right] dx\]

\[J[q + \epsilon \eta] - J[q] \approx \epsilon \int_a^b \frac{\partial F}{\partial q} \eta(x) \, dx\]

Comparing this with the definition of the functional derivative, we find that $\frac{\delta J}{\delta q(x)}=\frac{\partial F}{\partial q}$.
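
As a quick sanity check, this rule can be verified numerically on a discretized grid. Below is a minimal sketch, assuming the toy choice $F(q, x) = q(x)^2$, so the predicted functional derivative is $2q(x)$:

```python
import numpy as np

# Numerical sanity check of  delta J / delta q(x) = dF/dq,
# using the toy choice F(q, x) = q(x)^2, so dF/dq = 2 q(x).
x = np.linspace(0.0, 1.0, 10_001)       # discretized domain [a, b] = [0, 1]
dx = x[1] - x[0]
q = np.sin(2 * np.pi * x)               # some function q(x)
eta = np.cos(3 * np.pi * x)             # an arbitrary test function eta(x)
eps = 1e-4

J = lambda g: np.sum(g ** 2) * dx       # J[g] = \int g(x)^2 dx  (Riemann sum)

lhs = J(q + eps * eta) - J(q)           # actual change in the functional
rhs = eps * np.sum(2 * q * eta) * dx    # eps * \int (dF/dq) * eta(x) dx
print(lhs, rhs)                         # the two numbers agree up to O(eps^2)
```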

Application

The conclusion above can be applied to the problem of maximizing an entropy-regularized expectation over a distribution $q$:

\[J(q) = \int q(x) f(x) \, dx - \lambda \int q(x) \log q(x) \, dx\]

The functional derivative of $J(q)$ with respect to $q(x)$ has two terms:

  • Expectation Term: \(\frac{\delta}{\delta q(x)} \left( q(x) f(x) \, \right) = f(x)\)

  • Entropy Term: \(\frac{\delta}{\delta q(x)} \left(- \lambda q(x) \log q(x) \, \right) = -\lambda (\log q(x) + 1)\)

Combining these:

\[\frac{\delta J(q)}{\delta q(x)} = f(x) - \lambda (\log q(x) + 1)\]

Setting this to zero to find the optimal $ q(x)$:

\[f(x) - \lambda (\log q(x) + 1) = 0\]

Solving for $q(x)$:

\[q(x) = \exp\left( \frac{f(x)}{\lambda} - 1 \right)\]

Since we haven’t imposed any constraint on $q$, the resulting $q$ may not be a valid probability distribution. However, adding a constant $C$ to $f(x)$ only rescales the solution $q(x)$ by the constant factor $\exp(C/\lambda)$ without changing its shape, and there exists a $C$ for which the solution integrates to $1$. (Equivalently, enforcing $\int q(x)\,dx = 1$ with a Lagrange multiplier adds exactly such a constant to $f(x)$.) The normalized solution therefore satisfies:

\[q(x) \propto \exp\left( \frac{f(x)}{\lambda} \right)\]
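
A discrete analogue is easy to check numerically: for a finite set of outcomes, the objective becomes $\sum_x q(x) f(x) - \lambda \sum_x q(x) \log q(x)$, and the maximizer over the probability simplex should be the softmax of $f/\lambda$. A minimal sketch with made-up values of $f$:

```python
import numpy as np

# Discrete analogue of  max_q  E_q[f] + lambda * H(q):
# the optimum should be q(x) proportional to exp(f(x) / lambda), i.e. a softmax.
rng = np.random.default_rng(0)
f = rng.normal(size=5)        # made-up values f(x) over 5 outcomes
lam = 0.5                     # entropy weight lambda

def objective(q):
    return q @ f - lam * np.sum(q * np.log(q + 1e-12))

q_star = np.exp(f / lam)
q_star /= q_star.sum()        # closed-form softmax solution

# Compare against many random points on the probability simplex.
best_random = max(objective(rng.dirichlet(np.ones(5))) for _ in range(10_000))
print(objective(q_star), ">=", best_random)   # the softmax attains the maximum
```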

Soft Bellman Equation

The objective of maximum entropy RL is:

\[\pi_{\text {MaxEnt }}^\ast=\arg \max _\pi \sum_t \mathbb{E}_{\left(\mathbf{s}_t, \mathbf{a}_t\right) \sim \rho_\pi}\left[r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_t\right)\right)\right]\]

Note that this objective not only maximizes entropy at the current time step, but also encourages the policy to reach states from which it can maintain high entropy in the future.

We first define the soft Q-function and the soft value function by incorporating entropy into the standard Q-function and value function:

\[Q_{\text {soft }}^{\pi}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r_t+ \mathbb{E}_{\left(\mathbf{s}_{t+1}, \ldots\right) \sim \rho_\pi}\left[\sum_{l=1}^{\infty} \gamma^l\left(r_{t+l}+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_{t+l}\right)\right)\right)\right]\]

\[V_{\mathrm{soft}}^\pi\left(\mathbf{s}_t\right)=\mathbb{E}_{\mathbf{a}^{\prime}\sim \pi(\cdot \mid \mathbf{s}_t)} \left[Q_{\mathrm{soft}}^\pi\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right] + \alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_t\right)\right)\]

Note that the soft value function above is written compactly in terms of the soft Q-function.
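
For a finite action space, this compact form is just an expectation plus an entropy bonus. A tiny sketch with made-up numbers for $Q^\pi_{\text{soft}}(\mathbf{s}_t, \cdot)$ and $\pi(\cdot \mid \mathbf{s}_t)$:

```python
import numpy as np

# Soft value at a single state with a finite action space:
#   V_soft(s) = E_{a ~ pi}[Q_soft(s, a)] + alpha * H(pi(.|s))
alpha = 0.5
q_values = np.array([1.0, 2.0, 0.5])   # made-up Q_soft(s, a) for three actions
pi = np.array([0.2, 0.7, 0.1])         # some policy pi(a|s)

entropy = -np.sum(pi * np.log(pi))
v_soft = pi @ q_values + alpha * entropy
print(v_soft)
```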

Assume we have a policy $\pi$ along with its corresponding soft Q-function $Q_{\mathrm{soft}}^\pi$. We want to construct an improved policy $\hat{\pi}$ at a specific state $\mathbf{s}_t$. Formally, we seek

\[\hat{\pi} \;=\; \arg\max_{\pi'} \Bigl( \mathbb{E}_{\mathbf{a}_t \sim \pi'} \bigl[ Q_{\mathrm{soft}}^\pi(\mathbf{s}_t, \mathbf{a}_t) \bigr] \;+\; \alpha \,\mathcal{H}\bigl(\pi'(\cdot \mid \mathbf{s}_t)\bigr) \Bigr),\]

while leaving the policy unchanged at all other states.

By the functional-derivative argument (as shown in the previous section), maximizing the objective with respect to $\pi'(\cdot \mid \mathbf{s}_t)$ yields

\[\hat{\pi}(\mathbf{a}_t \mid \mathbf{s}_t) \;\propto\; \exp\Bigl(\tfrac{1}{\alpha} \,Q_{\mathrm{soft}}^\pi(\mathbf{s}_t, \mathbf{a}_t)\Bigr).\]

Hence, $\hat{\pi}$ strictly increases the objective at $\mathbf{s}_t$ unless $\pi$ already takes this exponential form. Consequently, if a policy $\pi^\ast$ is globally optimal, it must be impossible to improve it further by applying the same transformation—meaning $\pi^\ast$ must itself obey:

\[\pi^\ast(\mathbf{a}_t \mid \mathbf{s}_t) \;\propto\; \exp\Bigl(\tfrac{1}{\alpha}\, Q_{\mathrm{soft}}^\ast(\mathbf{s}_t, \mathbf{a}_t)\Bigr).\]

This shows that an optimal soft policy necessarily follows an exponential distribution over actions, weighted by the soft $Q$-function.

We can directly plug it into the definition of soft value function, and get the soft Bellman equation (Theorem 2 in Soft Q-Learning):

\[V_{\mathrm{soft}}^\ast\left(\mathbf{s}_t\right)=\alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^\ast\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}\]

In the limit $\alpha \to 0$, this reduces to the standard Bellman equation: the $\alpha \log \int \exp(\cdot/\alpha)$ operator is a soft maximum that approaches $\max$.

The integral in the equation above is the normalizing constant of the optimal policy, so we can express this constant using the soft value function:

\[\int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^\ast\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}=\exp \left(\frac{1}{\alpha}V_{\mathrm{soft}}^\ast\left(\mathbf{s}_t\right)\right)\]

Then, the optimal soft policy can be written as follows (Theorem 1 in Soft Q-Learning):

\[\pi_{\text {MaxEnt }}^\ast\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\exp \left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^\ast\left(\mathbf{s}_t, \mathbf{a}_t\right)-V_{\mathrm{soft}}^\ast\left(\mathbf{s}_t\right)\right)\right)\]
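
Both relations are easy to check numerically at a single state with a finite action space. The sketch below, with made-up optimal soft Q-values, computes the soft value as a log-sum-exp, forms the corresponding Boltzmann policy, and also shows the soft value approaching $\max_{\mathbf{a}} Q$ as $\alpha \to 0$:

```python
import numpy as np
from scipy.special import logsumexp

# Check Theorems 1 and 2 at a single state with a finite action space:
#   V*(s)    = alpha * log sum_a exp(Q*(s, a) / alpha)
#   pi*(a|s) = exp((Q*(s, a) - V*(s)) / alpha)
alpha = 0.5
q_star = np.array([1.0, 2.0, 0.5])              # made-up optimal soft Q-values

v_star = alpha * logsumexp(q_star / alpha)      # soft value: a log-sum-exp of Q
pi_star = np.exp((q_star - v_star) / alpha)     # Boltzmann policy over Q
print(pi_star, pi_star.sum())                   # a valid distribution (sums to 1)

# As alpha -> 0, the soft value approaches the hard maximum of Q.
for a in (1.0, 0.1, 0.01):
    print(a, a * logsumexp(q_star / a), q_star.max())
```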

Soft Q-Learning

Soft Q-Learning is based on a fixed-point iteration (soft Q-iteration) that resembles standard Q-iteration:

\[\begin{aligned} Q_{\text {soft }}\left(\mathbf{s}_t, \mathbf{a}_t\right) & \leftarrow r_t+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\text {soft }}\left(\mathbf{s}_{t+1}\right)\right], \forall \mathbf{s}_t, \mathbf{a}_t \\ V_{\text {soft }}\left(\mathbf{s}_t\right) & \leftarrow \alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\text {soft }}\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}, \forall \mathbf{s}_t \end{aligned}\]

This iteration converges to \(Q^\ast_{\text{soft}}\) and \(V^\ast_{\text{soft}}\) (Theorem 3 in Soft Q-Learning). Intuitively, the corresponding policy is monotonically improved (recall the policy-improvement argument above), and the process can only stop when the soft Bellman equation is satisfied everywhere.
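
For intuition, here is a minimal tabular sketch of soft Q-iteration on a small, randomly generated MDP (dynamics and rewards are made up); it alternates the two updates above until the soft Bellman equation holds to numerical precision:

```python
import numpy as np
from scipy.special import logsumexp

# Tabular soft Q-iteration on a small random MDP (made-up dynamics and rewards).
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
alpha, gamma = 0.5, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=(n_states, n_actions))                        # r(s, a)

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = alpha * logsumexp(Q / alpha, axis=1)   # soft value update (log-sum-exp over actions)
    Q_new = r + gamma * P @ V                  # soft Q backup with expected next-state value
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

V = alpha * logsumexp(Q / alpha, axis=1)
pi = np.exp((Q - V[:, None]) / alpha)          # optimal soft policy, one row per state
print(pi.sum(axis=1))                          # each row sums to 1
```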

Training

Soft Q-Learning parametrizes the soft Q-function $Q_{\text{soft}}^\theta(\mathbf{s},\mathbf{a})$ with a neural network and represents the soft value function in terms of it. Each step of (approximate) soft Q-iteration proceeds as follows:

  • First, estimate $V^\theta_{\text{soft}}(\mathbf{s}_t)$ with importance sampling under an arbitrary sampling distribution $q_{\mathbf{a}^{\prime}}$ (see the sketch after this list). Note that if the action space is finite, importance sampling is unnecessary, as we can simply enumerate all actions. \(V_{\mathrm{soft}}^\theta\left(\mathbf{s}_t\right)=\alpha \log \mathbb{E}_{q_{\mathbf{a}^{\prime}}}\left[\frac{\exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^\theta\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right)}{q_{\mathbf{a}^{\prime}}\left(\mathbf{a}^{\prime}\right)}\right]\)

  • Then, compute the target Q-value using the estimated value function at the next state, where $\bar{\theta}$ denotes the parameters of a target network. \(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r_t+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathrm{s}}}\left[V_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\)

  • Finally, fit the soft Q-function to the target by minimizing a squared error: \(J_Q(\theta)=\mathbb{E}_{\mathbf{s}_t \sim q_{s_t}, \mathbf{a}_t \sim q_{a_t}}\left[\frac{1}{2}\left(\hat{Q}_{\text {soft }}^{\bar{\theta}}\left(\mathbf{s}_t, \mathbf{a}_t\right)-Q_{\text {soft }}^\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)^2\right]\)
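
As a concrete illustration of the first step, the sketch below estimates $V_{\mathrm{soft}}^\theta(\mathbf{s}_t)$ by importance sampling for a one-dimensional action space, using a standard Gaussian as the sampling distribution $q_{\mathbf{a}^{\prime}}$ and a made-up quadratic stand-in for $Q_{\mathrm{soft}}^\theta$, then compares against numerical integration:

```python
import numpy as np
from scipy.stats import norm

# Importance-sampling estimate of
#   V_soft(s) = alpha * log \int exp(Q_soft(s, a) / alpha) da
# for a 1-D action space, with a standard Gaussian proposal q_{a'} = N(0, 1).
alpha = 0.5
q_soft = lambda a: -(a - 0.3) ** 2            # made-up stand-in for Q_soft^theta(s, .)

rng = np.random.default_rng(0)
actions = rng.normal(size=100_000)            # a' ~ q_{a'}
weights = np.exp(q_soft(actions) / alpha) / norm.pdf(actions)
v_estimate = alpha * np.log(weights.mean())   # alpha * log E_q[exp(Q/alpha) / q(a')]

# Ground truth by numerical integration on a grid, for comparison.
grid = np.linspace(-10.0, 10.0, 200_001)
dx = grid[1] - grid[0]
v_exact = alpha * np.log(np.sum(np.exp(q_soft(grid) / alpha)) * dx)
print(v_estimate, v_exact)                    # the two values agree closely
```

The target $\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}$ and the squared loss $J_Q(\theta)$ then follow directly from this value estimate at the sampled next states.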

Inference

The training procedure above yields a model of the (optimal) soft Q-function, which is closely connected to the optimal policy, as shown in the previous section. However, it’s still non-trivial to derive a usable policy directly from the soft Q-function: the induced policy is an energy-based distribution over a continuous action space, which is generally intractable to sample from.

The Soft Q-Learning paper leverages Stein Variational Gradient Descent (SVGD) to train an amortized sampler for this policy. This involves:

  • Parametrizing a sampling network $f^\phi\left(\xi ; \mathbf{s}_t\right)$ that maps a Gaussian noise sample $\xi$ to an action.

  • Minimizing the KL divergence from $\pi^\phi$ to the energy-based policy induced by the soft Q-function: first sample a set of actions from the network, compute the SVGD update directions that most steeply decrease the KL divergence, and then backpropagate these directions into the network parameters $\phi$.

Because this part is somewhat orthogonal to the main formulation of maximum entropy RL, we skip the details of SVGD in this blog.


Acknowledgement: Thanks to Yingyu Lin for discussion and proofreading.