Policy Gradient Methods of Deep Reinforcement Learning (Part Two)

2022-07-03 10:13:00 【Most appropriate commitment】

Abstract

This article discusses the natural gradient in distribution space and then applies it to actor-critic methods. It also covers the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms.

Part One: Basic Knowledge in Natural Gradient

Disadvantages of updating in parameter space:

When we approximate a function with a finite set of parameters, even a small change to the parameters can sometimes cause a huge change in the function.

For example, suppose all the distributions in the figure are normal: the left panel shows N(0, 0.2) and N(1, 0.2), and the right panel shows N(0, 10) and N(1, 10). In parameter space, the Euclidean distance between the two curves is 1 in both panels ( $d=\sqrt{(\mu_0-\mu_1)^2+(\sigma_0-\sigma_1)^2}$ ). Yet the two curves on the left clearly overlap very little, while the two on the right are nearly identical. So changing parameters directly in parameter space, even by a small amount, can sometimes cause a large change in the policy, and a large policy change makes learning unstable. Moreover, in actor-critic methods, when the policy changes only slightly at each step, the value function can be updated at each step and keep matching the current policy. When the policy changes a lot, the value function struggles to track the current policy, the update direction becomes heavily biased, and the optimal solution may not be reached.

To solve this problem of parameter space, we introduce distribution space, that is, the space in which the actual distributions live: the distribution over the values of the quantity we care about. We also need a quantity that measures the similarity of two distributions.

KL (Kullback-Leibler) Divergence

To describe how similar two distributions p and q are, we introduce the KL divergence.

$D_{KL}(p||q)=\int p(x) \log \frac{p(x)}{q(x)} dx=E_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$

The integral runs over the whole space and accumulates the contribution of every point (in practice we sample from p and average). If p and q are identical, the KL divergence is 0. The KL divergence is also asymmetric, $D_{KL}(p||q) \neq D_{KL}(q||p)$ in general, but when p and q are very close the two are approximately equal.
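As a quick numerical illustration of the two points above, the sketch below evaluates the closed-form KL divergence between univariate normal distributions for the two pairs from the earlier example. It assumes that the second argument of N(·, ·) is the standard deviation (the distance formula earlier uses $\sigma$); the function name `kl_gauss` is only illustrative.

```python
import math

def kl_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) )."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

# Same Euclidean distance (1) in parameter space, very different KL.
kl_narrow = kl_gauss(0.0, 0.2, 1.0, 0.2)    # N(0, 0.2) vs N(1, 0.2)
kl_wide = kl_gauss(0.0, 10.0, 1.0, 10.0)    # N(0, 10)  vs N(1, 10)

# Asymmetry: swapping the arguments changes the value.
kl_pq = kl_gauss(0.0, 1.0, 1.0, 2.0)
kl_qp = kl_gauss(1.0, 2.0, 0.0, 1.0)

print(kl_narrow, kl_wide)
print(kl_pq, kl_qp)
```

Under this assumption the narrow pair has a KL divergence of 12.5 and the wide pair only 0.005, even though both pairs are the same distance apart in parameter space.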

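We can also remove this asymmetry by averaging the two directions, which gives a symmetrized KL divergence:

$D_{sym}(p||q)=\frac{D_{KL}(p||q)+D_{KL}(q||p)}{2}$

Note that this is not the Jensen-Shannon (JS) divergence proper, which is defined through the mixture $m=\frac{p+q}{2}$ as $D_{JS}(p||q)=\frac{1}{2}D_{KL}(p||m)+\frac{1}{2}D_{KL}(q||m)$. When p and q are very close, $D_{KL}(p||q)\approx D_{sym}(p||q)$.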

The Connection Between Parameter Space and Distribution Space: the Fisher Matrix

In practice we change the policy by applying $\Delta \theta$ in parameter space, but we need the policies of adjacent steps to stay highly similar in distribution space, i.e. their KL divergence must remain below a preset value $\delta$. So we need to establish the relationship between $\Delta \theta$ and the KL divergence.

Writing $\theta'=\theta+\Delta \theta$, a second-order Taylor expansion around $\Delta \theta=0$ gives:

$D_{KL}(p(x;\theta)||p(x;\theta+\Delta \theta))\\ \approx D_{KL}(p(x;\theta)||p(x;\theta)) + \bigtriangledown_{\theta'} D_{KL}(p(x;\theta)||p(x;\theta'))|_{\theta'=\theta}^T \ \Delta \theta + \frac{1}{2} \Delta \theta^T \bigtriangledown^2_{\theta'} D_{KL}(p(x;\theta)||p(x;\theta'))|_{\theta'=\theta}\ \Delta \theta \\= \bigtriangledown_{\theta'} D_{KL}(p(x;\theta)||p(x;\theta'))|_{\theta'=\theta}^T\ \Delta \theta +\frac{1}{2} \Delta \theta^T \bigtriangledown^2_{\theta'} D_{KL}(p(x;\theta)||p(x;\theta'))|_{\theta'=\theta}\ \Delta \theta$

For the first-order term, with $\theta'=\theta+\Delta \theta$:

$\bigtriangledown_{\theta'} D_{KL}[p(x;\theta)||p(x;\theta')] =\bigtriangledown_{\theta'} E_{p(x;\theta)}[\log p(x;\theta)]- \bigtriangledown_{\theta'}E_{p(x;\theta)}[\log p(x;\theta')]$

Both terms vanish at $\theta'=\theta$. The first does not depend on $\theta'$ at all, so

$\bigtriangledown_{\theta'}E_{p(x;\theta)}[\log p(x;\theta)]=0$

For the second, the expectation of the score is zero:

$\bigtriangledown_{\theta'}E_{p(x;\theta)}[\log p(x;\theta')]|_{\theta'=\theta}=E_{p(x;\theta)}[\bigtriangledown_{\theta}\log p(x;\theta)]=\int p(x;\theta) \frac{\bigtriangledown_{\theta}p(x;\theta)}{p(x;\theta)}dx = \int \bigtriangledown_{\theta}p(x;\theta)dx=\bigtriangledown_{\theta}\int p(x;\theta)dx=\bigtriangledown_{\theta} 1=0$

Hence

$\bigtriangledown_{\theta'} D_{KL}[p(x;\theta)||p(x;\theta')]|_{\theta'=\theta} =0$

Therefore, only the second-order term remains:

$D_{KL}(p(x;\theta)||p(x;\theta+\Delta \theta))\approx\frac{1}{2}\Delta \theta^T \bigtriangledown^2_{\theta'}D_{KL}(p(x;\theta)||p(x;\theta'))|_{\theta'=\theta} \ \Delta \theta$

This establishes the relationship between $\Delta \theta$ and the similarity of two adjacent policies in distribution space. We still need to know what the second derivative of the KL divergence means.

$\bigtriangledown^2_{\theta'}D_{KL}[p(x;\theta)||p(x;\theta')]|_{\theta'=\theta} = -\int p(x;\theta) \bigtriangledown^2_{\theta'} \log p(x;\theta')|_{\theta'=\theta} dx=-\int p(x;\theta) H_{\log p(x;\theta)}dx= -E_{p(x;\theta)}[H_{\log p(x;\theta)}]$

So the second derivative (Hessian) of $D_{KL}$ at $\theta'=\theta$ is the negative expectation of the Hessian matrix of $\log p(x;\theta)$.

This negative expected Hessian is the Fisher matrix. It can equivalently be written as the expected outer product of the score, which is estimated in practice from N samples $x_i \sim p(x;\theta)$:

$F=E_{p(x;\theta)}[\bigtriangledown \log p(x;\theta) \bigtriangledown \log p(x;\theta)^T]\approx\frac{1}{N}\sum_{i=1}^{N}\bigtriangledown \log p(x_i;\theta) \bigtriangledown \log p(x_i;\theta)^T$
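The sample estimate of the Fisher matrix can be checked on a case where the answer is known. The sketch below assumes a univariate normal distribution parameterized by $\theta=(\mu,\sigma)$, for which the exact Fisher matrix is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$; it also compares the quadratic form $\frac{1}{2}\Delta \theta^T F \Delta \theta$ with the exact KL divergence for a small step. All names and numerical values are illustrative.

```python
import numpy as np

def score(x, mu, sigma):
    """Gradient of log N(x; mu, sigma^2) w.r.t. (mu, sigma), one row per sample."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3
    return np.stack([d_mu, d_sigma], axis=1)

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Monte Carlo estimate: average outer product of the score vectors.
s = score(x, mu, sigma)
fisher_mc = s.T @ s / len(x)
fisher_exact = np.diag([1 / sigma ** 2, 2 / sigma ** 2])

# Second-order approximation of the KL divergence for a small step.
step = np.array([0.05, -0.03])
kl_quadratic = 0.5 * step @ fisher_exact @ step

# Exact KL( N(mu, sigma^2) || N(mu + d_mu, (sigma + d_sigma)^2) ).
mu2, sigma2 = mu + step[0], sigma + step[1]
kl_exact = (np.log(sigma2 / sigma)
            + (sigma ** 2 + (mu - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5)

print(fisher_mc)
print(kl_quadratic, kl_exact)
```

The two KL values agree to within a few percent here; the gap comes from the third-order terms that the Taylor expansion drops.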

Fisher Information Matrix - Agustinus Kristiadi's Blog

Natural Gradient Descent - Agustinus Kristiadi's Blog

Natural Gradient

Objective:

$\Delta \theta = \arg\max_{\Delta \theta} \ L(\theta+ \Delta \theta)$

To first order,

$L(\theta + \Delta \theta)\approx L(\theta)+\bigtriangledown_\theta L(\theta)^T \Delta \theta = L(\theta)+g^T \Delta \theta$

Constraint:

$D_{KL}(p(x;\theta)||p(x;\theta+\Delta \theta))\approx\frac{1}{2}\Delta \theta^T F \Delta \theta \leq \delta$

Solving this constrained problem (for example with a Lagrange multiplier) gives:

$\Delta \theta = \sqrt{\frac{2\delta}{g^TF^{-1}g}}F^{-1}g$
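The closed-form step can be sketched in a few lines of NumPy. The matrix `F`, the gradient `g` and the threshold `delta` below are made-up values for illustration, and `natural_gradient_step` is a hypothetical helper name. The check at the end confirms that the step uses exactly the allowed KL budget under the quadratic approximation.

```python
import numpy as np

def natural_gradient_step(g, F, delta):
    """Return sqrt(2*delta / (g^T F^-1 g)) * F^-1 g."""
    nat_dir = np.linalg.solve(F, g)              # F^-1 g without forming the inverse
    scale = np.sqrt(2.0 * delta / (g @ nat_dir))
    return scale * nat_dir

F = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # toy symmetric positive definite Fisher matrix
g = np.array([1.0, 2.0])        # toy gradient of the objective
delta = 0.01

step = natural_gradient_step(g, F, delta)
kl_used = 0.5 * step @ F @ step  # quadratic approximation of the KL divergence

print(step, kl_used)
```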

The specific form of the objective function is irrelevant to the natural gradient, because the natural gradient only ensures that adjacent policies stay highly similar (it changes the update direction of the parameter vector), which keeps the learning process stable. The natural gradient also scales with the inverse of the curvature: where the loss surface is flat the step is larger, and where it is steep the step is smaller. The drawback is that the Fisher matrix grows with the parameter vector: for N parameters its size is $N \times N$, so computing the inverse is expensive. For this reason, methods such as conjugate gradient or K-FAC are sometimes used instead.
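As a rough illustration of the conjugate gradient idea mentioned above, the sketch below solves $Fx=g$ using only matrix-vector products with F, so the inverse is never formed. The callback `fvp` (Fisher-vector product) and the small dense matrix are assumptions made for the example; in a real policy network the product would be computed without materializing F.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given a function fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x, with x = 0 initially
    p = r.copy()          # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

F = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # toy symmetric positive definite matrix
g = np.array([1.0, 2.0, 3.0])

x = conjugate_gradient(lambda v: F @ v, g)
print(x, F @ x)
```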

Part Two: Natural Actor Critic

Part Three: Trust Region Policy Optimization (TRPO)

Objective Function

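The objective function used in TRPO is the expected discounted return: $\eta(\pi_\theta)=E_{\tau \sim \pi_\theta}[\sum_{t=0}^{\infty}\gamma^t r_t]$

The return of the new policy can be expressed through the advantage function of the current policy. Here $\pi=\pi_i$ denotes the current policy and $\rho_{\pi_{i+1}}(s)=\sum_{t=0}^{\infty}\gamma^t P(s_t=s|\pi_{i+1})$ is the discounted state visitation frequency under the new policy:

$\eta(\pi_{i+1})\\= \eta(\pi_i) + \sum_s \rho_{\pi_{i+1}}(s)\sum_a \pi_{i+1}(a|s)A_{\pi}(s,a)$

Proof:

$E_{\tau \sim \pi_{i+1}}[\sum_{t=0}^\infty \gamma^t A_\pi(s_t,a_t)] \\= E_{\tau \sim \pi_{i+1}}[\sum_{t=0}^\infty \gamma^t( R(s_t,a_t,s_{t+1})+\gamma V^\pi (s_{t+1})-V^\pi(s_t) )] \\=\eta(\pi_{i+1})+ E_{\tau \sim \pi_{i+1}}[\sum_{t=1}^{\infty} \gamma^{t}V^\pi(s_t) - \sum_{t=0}^{\infty}\gamma^t V^{\pi}(s_t) ] \\= \eta(\pi_{i+1}) - E_{s_0}[V^\pi(s_0)] \\= \eta(\pi_{i+1}) - \eta(\pi)$

The last step holds because the distribution of the initial state $s_0$ does not depend on the policy. Rewriting the expectation over trajectories on the left-hand side as a sum over states and actions weighted by $\rho_{\pi_{i+1}}$ gives the identity above.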
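The identity can be verified numerically on a small random MDP. The sketch below is only an illustration under simplifying assumptions: tabular states and actions, a reward `R[s, a]` that is already averaged over the next state, a uniform initial state distribution `mu0`, and policies drawn at random. None of these names come from the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9

P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)     # P[s, a, s'] transition probabilities
R = rng.random((n_s, n_a))            # expected reward r(s, a)
mu0 = np.full(n_s, 1.0 / n_s)         # initial state distribution

def random_policy():
    pi = rng.random((n_s, n_a))
    return pi / pi.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Return V^pi, the discounted visitation rho_pi, and eta(pi)."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    M = np.eye(n_s) - gamma * P_pi
    V = np.linalg.solve(M, r_pi)            # Bellman equation V = r + gamma P V
    rho = np.linalg.solve(M.T, mu0)         # rho(s) = sum_t gamma^t Pr(s_t = s)
    return V, rho, mu0 @ V

pi_old, pi_new = random_policy(), random_policy()
V_old, _, eta_old = evaluate(pi_old)
_, rho_new, eta_new = evaluate(pi_new)

# Advantage of the old policy: A(s, a) = r(s, a) + gamma E[V(s')] - V(s).
A_old = R + gamma * P @ V_old - V_old[:, None]

rhs = eta_old + rho_new @ (pi_new * A_old).sum(axis=1)
print(eta_new, rhs)
```

Both printed numbers should coincide up to floating-point error, since the identity is exact rather than an approximation.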

Surrogate Function

TRPO uses a Minorize-Maximization (MM) algorithm. At each iteration it constructs a lower-bound function, the surrogate function, around the current parameter vector. The surrogate is also a function of the parameter vector: it equals $\eta(\pi_\theta)$ at the current parameters and is no greater than $\eta(\pi_\theta)$ everywhere else in parameter space. The lower bound should also be easier to optimize than the original objective.

Because the lower bound touches the objective at the current parameter vector, its ascent direction near that point approximates that of the objective. So at each iteration we choose a lower bound and maximize it, which in turn improves the objective. The key question is how to choose the lower bound.

Replacing the state distribution of the new policy with that of the current policy (reasonable when the two policies are close) and then rewriting the sum over actions with an importance ratio gives:

$\eta(\pi_{i+1}) = \eta(\pi_i) + \sum_s \rho_{\pi_{i+1}}(s)\sum_a \pi_{i+1}(a|s)A_\pi(s,a) \approx \eta(\pi_i) + \sum_s \rho_{\pi_i}(s) \sum_a \pi_{i+1}(a|s)A_\pi(s,a) = \eta(\pi_i) + \sum_s \rho_{\pi_i}(s) \sum_a \pi_i(a|s) \frac{\pi_{i+1}(a|s)}{\pi_i(a|s)}A_\pi(s,a)$

We define the function

$J_{\pi_{i}}(\pi_{i+1}) = \eta(\pi_i) + \sum_s \rho_{\pi_i}(s) \sum_a \pi_i(a|s) \frac{\pi_{i+1}(a|s)}{\pi_i(a|s)}A_\pi(s,a)$

which matches the objective to first order at the current parameters:

$\bigtriangledown_{\theta} J_{\pi_i}(\theta)|_{\theta=\theta_{old}} =\bigtriangledown_\theta \eta(\theta)|_{\theta = \theta_{old}}$

Then we define the surrogate function:

$L(\theta)=J_{\theta_{old}}(\pi_\theta)-C D_{KL}(\pi_{\theta_{old}}||\pi_\theta)$

where $C=\frac{4\epsilon \gamma}{(1-\gamma)^2}$ and $\epsilon=\max_{s,a}|A_\pi(s,a)|$. In the original TRPO bound, the KL divergence here is the maximum over states.

Reference: How to understand all the mathematical derivation details in TRPO? - Zhihu (zhihu.com)

Here $L(\theta)$ satisfies the following three conditions:

$L(\theta_{old})=\eta(\theta_{old})$
$\bigtriangledown_\theta L_{\theta_{old}}(\theta)|_{\theta=\theta_{old}}=\bigtriangledown_\theta \eta(\theta)|_{\theta=\theta_{old}}$
$\eta(\theta) \geq J_{\theta_{old}}(\theta)-C D_{KL}(\pi_{\theta_{old}}||\pi_\theta)$

Therefore $L(\theta)$ is a lower bound of the objective function, and maximizing this lower bound from the current parameter vector improves the objective.

Optimizing Directly with the Natural Gradient

Objective: $J_{\pi_{i}}(\pi_{i+1}) = \sum_s \rho_{\pi_i}(s) \sum_a \pi_i(a|s) \frac{\pi_{i+1}(a|s)}{\pi_i(a|s)}A_\pi(s,a)$

Constraint: $D_{KL}(\pi_{\theta_{old}}||\pi_\theta) < \delta$

The gradient of the objective at the current parameters, estimated from N sampled trajectories of length T, is:

$g \\=\bigtriangledown_{\theta}J_{\theta_i}(\theta)|_{\theta=\theta_i}\\=E_{s \sim \rho_{\theta_i},a \sim \pi_{\theta_i}}\left[\frac{\bigtriangledown_{\theta}\pi_{\theta}(a|s)}{\pi_{\theta_i}(a|s)}A_\pi(s,a)\right]\Big|_{\theta=\theta_i}\\=E_{s \sim \rho_{\theta_i},a \sim \pi_{\theta_i}}[\bigtriangledown_{\theta}\log \pi_\theta(a|s)|_{\theta=\theta_i}A_\pi(s,a)] \\ \approx\frac{1}{N}\sum_{l=1}^N \sum_{t=0}^{T}\bigtriangledown_{\theta} \log \pi_\theta(a_t^{(l)}|s_t^{(l)})|_{\theta=\theta_i}A_\pi(s_t^{(l)},a_t^{(l)})$

$\Delta \theta = \sqrt{\frac{2\delta}{g^TF^{-1}g}}F^{-1}g$

However, the quadratic approximation of the KL divergence is only local, so the step above may still make the $D_{KL}$ between the new and the original parameter vector exceed $\delta$. To prevent this, we update the parameter vector with a backtracking line search:

$\theta_{k+1}=\theta_k+\alpha^j \Delta \theta, \quad \alpha \in (0,1), \ j \in \left \{ 0,1,2,...,K \right \}$

Here j is the smallest nonnegative integer for which the similarity of adjacent policies meets the requirement, i.e. $D_{KL} \leq \delta$.

TRPO works well with fully connected networks, but performs poorly with CNNs or RNNs.
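The line search can be sketched as follows. This is a minimal illustration that only checks the KL condition described above; `kl_fn` is a hypothetical callback returning the divergence between the old and candidate parameters, and the toy quadratic KL used to exercise it is made up for the example.

```python
import numpy as np

def backtracking_update(theta, full_step, kl_fn, delta, alpha=0.5, max_backtracks=10):
    """Shrink the step by alpha**j until the KL constraint holds."""
    for j in range(max_backtracks + 1):
        candidate = theta + alpha ** j * full_step
        if kl_fn(theta, candidate) <= delta:
            return candidate, j
    return theta, None   # no acceptable step found: keep the old parameters

# Toy stand-in for the true KL: a quadratic form with a fixed matrix.
F_toy = np.diag([1.0, 4.0])

def toy_kl(theta_old, theta_new):
    d = theta_new - theta_old
    return 0.5 * d @ F_toy @ d

theta = np.zeros(2)
full_step = np.array([0.4, 0.2])   # deliberately too large for delta
new_theta, j = backtracking_update(theta, full_step, toy_kl, delta=0.015)

print(new_theta, j)
```

With these toy numbers the full step and the half step both violate the constraint, and the quarter step (j = 2) is the first one accepted.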

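Part Four: Proximal Policy Optimization (PPO)

Because TRPO is algorithmically complex and computationally expensive, PPO simplifies the optimization by changing the surrogate function.

$L^{CLIP}(\theta) = E_t[\min( \rho_t(\theta)A_{\pi_{\theta_{old}}}(s_t,a_t), \ \mathrm{clip}(\rho_t(\theta),1-\epsilon,1+\epsilon)A_{\pi_{\theta_{old}}}(s_t,a_t)) ]$

where $\rho_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio between the new and the old policy.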

When $A_{\pi_{\theta_{old}}}(s_t,a_t)$ is positive, the action is better than average and its probability should increase. To keep the probability from increasing too much, the objective is clipped, so the parameter vector moves in the right direction but only by a small step, and adjacent policies stay close. When $A_{\pi_{\theta_{old}}}(s_t,a_t)$ is negative, the probability of the action should decrease, and the step size is limited in the same way to prevent too large a reduction.

In addition, since clipping already limits how far adjacent policies can drift apart, the KL divergence no longer has to be computed, which saves a lot of computation, and the objective can be optimized with a first-order method such as SGD (stochastic gradient descent). Also, because adjacent policies stay close, the update does not have to be strictly on-policy: the same collected episodes can be reused for several updates, which greatly reduces the complexity of the algorithm.
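The behavior of the clipped objective in these cases can be sketched with NumPy. The function names, the log-probability inputs and the value $\epsilon=0.2$ are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def ppo_clip_terms(logp_new, logp_old, adv, eps=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio) * A)."""
    ratio = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Sample average of the clipped surrogate (to be maximized)."""
    return ppo_clip_terms(logp_new, logp_old, adv, eps).mean()

# Four samples with ratios 1.5, 0.5, 0.5, 1.5 and advantages +1, -1, +1, -1.
logp_old = np.zeros(4)
logp_new = np.log(np.array([1.5, 0.5, 0.5, 1.5]))
adv = np.array([1.0, -1.0, 1.0, -1.0])

terms = ppo_clip_terms(logp_new, logp_old, adv)
objective = ppo_clip_objective(logp_new, logp_old, adv)
print(terms, objective)
```

The first two samples are clipped (the ratio has already moved far enough in the favorable direction), while the last two are left unclipped because the minimum keeps the more pessimistic value.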