Maximum Entropy RL (1): Soft QLearning
Most deep reinforcement learning (RL) is based on the deterministic notion of optimality, where the optimal solution is always a deterministic policy (at least under full observability). However, a stochastic policy may be preferred due to:
 Better exploration: Strike a balance between exploration and exploitation, in contrast to heuristic exploration methods like $\epsilon$greedy.
 Robustness and better generalization: A stochastic policy is more robust to adversarial perturbations and serves as a better initialization for learning new skills.
Maximum entropy RL attempts to optimize the entropy of policy in addition to the expected reward, which encourages the policy to be more stochastic. The new learning objective also has some interesting properties and leads to useful algorithms which bridge traditional value and policy based methods (Soft QLearning, Path Consistency Learning, Soft ActorCritic, etc.)
Functional Derivative
In this section, we provide a quick introduction of functional derivative, which is crucial to prove some key theorems in maximum entropy RL.
Definition
A functional $J(q)$ is a mapping from function $q(x)$ to a real value. Here, we focus on a simple case of functional, which can be expressed as an integration of a function of $q(x)$:
\[J[q] = \int_a^b F(q(x), x) \, dx\]To measure “how a small change in $q(x)$ would affect $J(q)$”, we define the functinoal derivative $\frac{\delta J}{\delta q(x)}$ as follows:
\[J[q + \epsilon \eta]  J[q] \approx \epsilon \int \frac{\delta J}{\delta q(x)} \eta(x) \, dx\]Here, $\eta(x)$ is an arbitrary test function, similar to the role of $\Delta x$ in function derivative.
Solution
It’s easy to calculate the functional derivative of $J$ in the simple form defined above. First, adding some perturbation to $J$:
\[J[q + \epsilon \eta] = \int_a^b F(q(x) + \epsilon \eta(x), x) \, dx\]Applying Taylor expansion to $F$ around $x$:
\[J[q + \epsilon \eta] = \int_a^b \left[F(q(x), x) \, + \frac{\partial F}{\partial q} \epsilon \eta(x)\right] dx\] \[J[q + \epsilon \eta]  J[q] = \epsilon \int_a^b \frac{\partial F}{\partial q} \eta(x) dx\]Looking back at the definition of functional derivative, we’ll find that $\frac{\delta J}{\delta q(x)}=\frac{\partial F}{\partial q}$
Application
The conclusion above can be applied to solve the problem of entropyregularized expectation optimization:
\[\mathcal{F}(q) = \int q(x) f(x) \, dx  \lambda \int q(x) \log q(x) \, dx\]The functional derivative of $ \mathcal{F}(q) $ with respect to $ q(x) $ involves:

Expectation Term: \(\frac{\delta}{\delta q(x)} \left( \int q(x) f(x) \, dx \right) = f(x)\)

Entropy Term: \(\frac{\delta}{\delta q(x)} \left( \lambda \int q(x) \log q(x) \, dx \right) = \lambda (\log q(x) + 1)\)
Combining these:
\[\frac{\delta \mathcal{F}(q)}{\delta q(x)} = f(x)  \lambda (\log q(x) + 1)\]Setting this to zero to find the optimal $ q(x)$:
\[f(x)  \lambda (\log q(x) + 1) = 0\]Solving for $q(x)$:
\[q(x) = \exp\left( \frac{f(x)}{\lambda}  1 \right)\]Since we haven’t put any constraint to $q$, the resulting $q$ may not be a probability distribution. However, we know that adding any constant $C$ to $f(x)$ will not affect the solution $q(x)$, and there must exist a $C$ that makes the solution $q(x)$ integrate to $1$. Then, the solution $q(x)$ to the new problem will still satisfy:
\[q(x) \propto \exp\left( \frac{f(x)}{\lambda} \right)\]Formulation
The objective of maximum entropy RL is:
\[\pi_{\text {MaxEnt }}^*=\arg \max _\pi \sum_t \mathbb{E}_{\left(\mathbf{s}_t, \mathbf{a}_t\right) \sim \rho_\pi}\left[r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_t\right)\right)\right]\]Note this objective not only maximizes entropy at the current time step, but also encourages the policy to reach states where they will have high entropy in the future.
We can first define soft Qfunction and soft value function by incorporating entropy into standard Qfunction and value function:
\[\begin{aligned} & Q_{\text {soft }}^{\pi}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r_t+ \mathbb{E}_{\left(\mathbf{s}_{t+1}, \ldots\right) \sim \rho_\pi}\left[\sum_{l=1}^{\infty} \gamma^l\left(r_{t+l}+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_{t+l}\right)\right)\right)\right] \end{aligned}\] \[V_{\mathrm{soft}}^\pi\left(\mathbf{s}_t\right)=\mathbb{E}_{a'\sim \pi(\mathbf{s}_t)} \left[Q_{\mathrm{soft}}^\pi\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right) + \alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_{t+l}\right)\right)\right]\]Note that the above definition of soft value function has been shortened by using $Q$.
To improve a policy $\pi$, we can optimize the soft value function above:
\[\hat\pi(\mathbf{s}_t)\propto \exp\left(\frac{1}{\alpha}Q_{\mathrm{soft}}^\pi\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right)\]That’s what we have proved with functional derivative.
So, the optimal soft policy has to satisfy the form below, because otherwise it can be further improved in this way.
\[\pi^*(\mathbf{s}_t)\propto \exp\left(\frac{1}{\alpha}Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right)\]We can directly plug it into the definition of soft value function, and get soft Bellman equation (Theorem 2 in Soft QLearning):
\[V_{\mathrm{soft}}^*\left(\mathbf{s}_t\right)=\alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}\]When $\alpha=0$, it becomes standard Bellman equation, where the $\log \int\exp$ operator becomes $\max$.
The integration in the equation above is the normalizing constant of policy, so we can express the constant with soft value function:
\[\int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}=\exp \frac{1}{\alpha}V_{\mathrm{soft}}^*\left(\mathbf{s}_t\right)\]Then, the optimal soft policy can be written as below (Theorem 1 in Soft QLearning):
\[\pi_{\text {MaxEnt }}^*\left(\mathbf{a}_t \mid \mathbf{s}_t\right)=\exp \left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^*\left(\mathbf{s}_t, \mathbf{a}_t\right)V_{\mathrm{soft}}^*\left(\mathbf{s}_t\right)\right)\right)\]Soft QIteration
Soft QLearning is based on fixedpoint iteration that resembles Qiteration.
\[\begin{aligned} Q_{\text {soft }}\left(\mathbf{s}_t, \mathbf{a}_t\right) & \leftarrow r_t+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\text {soft }}\left(\mathbf{s}_{t+1}\right)\right], \forall \mathbf{s}_t, \mathbf{a}_t \\ V_{\text {soft }}\left(\mathbf{s}_t\right) & \leftarrow \alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\text {soft }}\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}, \forall \mathbf{s}_t \end{aligned}\]The iteration will converge to \(Q^*_{\text{soft}}\) and \(V^*_{\text{soft}}\) (Theorem 3 in Soft QLearning). Apparently, the policy can be monotonically improved in this way, and the process would only stop when the soft Bellman equation is satisfied everywhere.
Learning soft Q function
Soft Q Learning parametrize soft Q function $Q_{\text{soft}}^\theta(s,a)$ and represent soft value function with it. At each step of soft Qiteration:

First estimate $V^\theta_{\text{soft}}(s)$ with importance sampling using any policy $q$: \(V_{\mathrm{soft}}^\theta\left(\mathbf{s}_t\right)=\alpha \log \mathbb{E}_{q_{\mathbf{a}^{\prime}}}\left[\frac{\exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^\theta\left(\mathbf{s}_t, \mathbf{a}^{\prime}\right)\right)}{q_{\mathbf{a}^{\prime}}\left(\mathbf{a}^{\prime}\right)}\right]\)

Then we could calculate the target Q function using the value function of next states. \(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r_t+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathrm{s}}}\left[V_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]\)

Optimize the soft Q function towards the target Q function: \(J_Q(\theta)=\mathbb{E}_{\mathbf{s}_t \sim q_{s_t}, \mathbf{a}_t \sim q_{a_t}}\left[\frac{1}{2}\left(\hat{Q}_{\text {soft }}^{\bar{\theta}}\left(\mathbf{s}_t, \mathbf{a}_t\right)Q_{\text {soft }}^\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)^2\right]\)
Sampling from soft policy
The next challenge is to sample from the policy, which is nontrivial because it’s an energybased model of the soft Q function, with an unknwon normalizing constant.
The paper introduces SVGD to sample from the policy. This involves:

Parametrize a policy network $f^\phi\left(\xi ; \mathbf{s}_t\right)$ that maps noise sampple $\xi$ from Gaussian to an action.

Minimizing the KL divergence of $\pi^\phi$ to the energybased policy of Q functions: First randomly ample a set of actions, and compute the most greedy directions for them to minimize the KL divergence, then backpropagate the gradient into the policy network.
Because this part is relatively apart from the main formulation of maximum entropy RL, we skip the details of SVGD in this blog.
Acknowledgement: Thanks to Yingyu Lin for disucssion and proofreading.