Reinforcement learning - proximal policy optimization algorithms
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article gives a brief summary of the paper 《Proximal Policy Optimization Algorithms》. If there are any mistakes, please feel free to point them out.
Why PPO
The mathematical expression of the stochastic policy gradient is
$$\nabla J(\theta)=E_S\big[E_{A\sim \pi(\cdot|S;\theta)}\big[Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\big]\big]\tag{1.0}$$
Unless otherwise stated, $A$ denotes an action, $S$ denotes a state, and $\pi$ denotes the policy. To update the policy network with Eq. 1.0, the policy network itself must control the agent's interaction with the environment in order to obtain $S$ and $A$, which is very time-consuming. If $S$ and $A$ could instead be sampled in advance, stored in an experience replay buffer, and then drawn from the buffer to train the policy network, training time would be greatly reduced. Proximal Policy Optimization (PPO) is motivated by this consideration. Before introducing PPO, let us first introduce TRPO.
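To make Eq. 1.0 concrete, here is a minimal sketch of the corresponding on-policy loss. The names `policy_net`, `states`, `actions`, and `q_values` are hypothetical placeholders (a policy network producing action logits, and action values estimated elsewhere), not code from the paper.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy_net, states, actions, q_values):
    """Sample estimate of Eq. 1.0, negated so that minimizing it
    performs gradient ascent on J(theta)."""
    logits = policy_net(states)                        # (batch, num_actions)
    log_probs = F.log_softmax(logits, dim=-1)          # log pi(.|s; theta)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(q_values.detach() * log_pi_a).mean()
```

Note that `states` and `actions` here must be collected by the current policy $\pi(\cdot|S;\theta)$, which is exactly why fresh interaction is needed after every update.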
TRPO
TRPO uses importance sampling to approximate the stochastic policy gradient, so that the experience replay buffer can be used to update the policy network quickly. From the definition of expectation, the importance-sampling identity can be derived:
$$E_{x \sim p}[f(x)]=E_{x \sim q}\Big[\frac{p(x)}{q(x)}f(x)\Big]\tag{2.0}$$
It is worth mentioning that although the two sides of Eq. 2.0 have the same expectation, their variances differ:
$$\begin{aligned} Var_{x\sim p}[f(x)]&=E_{x\sim p}[f(x)^2]-\big(E_{x\sim p}[f(x)]\big)^2\\ Var_{x\sim q}\Big[\frac{p(x)}{q(x)}f(x)\Big]&=E_{x\sim p}\Big[f(x)^2\frac{p(x)}{q(x)}\Big]-\big(E_{x\sim p}[f(x)]\big)^2 \end{aligned}$$
When the distributions $p(x)$ and $q(x)$ are close, the variances of the two sides of Eq. 2.0 are also close.
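As a quick sanity check of Eq. 2.0, the following sketch estimates the same expectation both directly under $p$ and by reweighting samples from $q$; the choices $p=\mathcal{N}(0,1)$, $q=\mathcal{N}(0.5,1)$, and $f(x)=x^2$ are arbitrary and only for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), evaluated elementwise."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
f = lambda x: x ** 2

x_p = rng.normal(0.0, 1.0, 100_000)                  # samples from p = N(0, 1)
direct = f(x_p).mean()                               # E_{x~p}[f(x)]

x_q = rng.normal(0.5, 1.0, 100_000)                  # samples from q = N(0.5, 1)
weights = gauss_pdf(x_q, 0.0, 1.0) / gauss_pdf(x_q, 0.5, 1.0)   # p(x)/q(x)
reweighted = (weights * f(x_q)).mean()               # E_{x~q}[(p/q) f(x)]

print(direct, reweighted)                            # both close to E_{x~p}[x^2] = 1
```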
Applying Eq. 2.0 to Eq. 1.0 and rearranging gives
$$\begin{aligned} \nabla J(\theta)&=E_S\big[E_{A\sim \pi(\cdot|S;\theta)}\big[Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\big]\big]\\ &=E_S\Big[E_{A\sim \pi'(\cdot|S;\theta_{old})}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\Big]\Big]\\ &\approx E_S\Big[E_{A\sim \pi'(\cdot|S;\theta_{old})}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\Big]\Big] \end{aligned}\tag{2.1}$$
When the policy $\pi$ and the policy $\pi'$ are close, $Q_\pi(S,A)\approx Q_{\pi'}(S,A)$, which justifies the approximate-equality sign in the third line of Eq. 2.1. To keep $\pi$ and $\pi'$ close, TRPO adds the KL divergence between $\pi$ and $\pi'$ as a regularization term. The overall objective is
$$\max L(\theta)=\max E_{A\sim \pi'(\cdot|S;\theta_{old}),\,S}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}V_{\pi}(S)\Big]-\beta\, KL\big(\pi(A|S;\theta),\pi'(A|S;\theta_{old})\big)$$
The derivative of $E_{A\sim \pi'(\cdot|S;\theta_{old}),\,S}\big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}V_{\pi}(S)\big]$ with respect to $\theta$ is Eq. 2.1, and $\beta$ is a hyperparameter. Before training, the policy $\pi'(A|S;\theta_{old})$ (a policy network with past parameters $\theta_{old}$) controls the agent's interaction with the environment, and the resulting state-action pairs $(s_t, a_t)$ are stored in the experience replay buffer. During training, actions and states are drawn from the buffer and the policy network is updated by gradient ascent on the objective above (the KL-divergence term can be computed in the same way as cross entropy).
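Below is a sketch of this KL-penalized surrogate for a discrete-action policy, written against the notation of this post rather than the TRPO paper's actual implementation. `old_log_probs_all` (the full log-distribution $\log\pi'(\cdot|s;\theta_{old})$ recorded when the buffer was filled), `values` (standing in for $V_\pi(S)$), and `beta` are assumed inputs; the KL term is computed here as $KL(\pi'\,\|\,\pi)$, one common choice of direction.

```python
import torch
import torch.nn.functional as F

def kl_penalized_surrogate(policy_net, states, actions, values,
                           old_log_probs_all, beta=1.0):
    """Negative of the KL-penalized surrogate, so a minimizer
    performs gradient ascent on L(theta)."""
    log_probs_all = F.log_softmax(policy_net(states), dim=-1)   # log pi(.|s; theta)
    log_pi_a = log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    old_log_pi_a = old_log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    ratio = torch.exp(log_pi_a - old_log_pi_a)          # pi / pi'
    surrogate = (ratio * values.detach()).mean()

    # KL(pi_old || pi), averaged over the states in the batch
    kl = (old_log_probs_all.exp()
          * (old_log_probs_all - log_probs_all)).sum(dim=-1).mean()

    return -(surrogate - beta * kl)
```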
PPO
PPO changes the optimization objective of TRPO. Let $r(\theta)=\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}$ and write $A=V_\pi(S)$ (the symbol $A$ is reused here for this value term); then the optimization objective (with some symbols omitted) becomes
$$\max L^{clip}(\theta)=\max E_{A\sim \pi',S}\Big[\min\big(r(\theta)A,\;clip\big(r(\theta),1-\epsilon,1+\epsilon\big)A\big)\Big]\tag{3.0}$$
$clip(r(\theta),1-\epsilon,1+\epsilon)$ means that when $r(\theta)<1-\epsilon$, the ratio is truncated to $1-\epsilon$, and when $r(\theta)>1+\epsilon$, it is truncated to $1+\epsilon$. $\epsilon$ is a hyperparameter. The value of Eq. 3.0 as a function of $r(\theta)$ is plotted in Figure 1 (figure not reproduced here).
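A minimal sketch of Eq. 3.0 as a training loss follows; `policy_net`, `advantages` (the term written as $A$ above), and `old_log_pi_a` ($\log\pi'(a|s;\theta_{old})$ stored at collection time) are assumed inputs, not code from the paper.

```python
import torch
import torch.nn.functional as F

def ppo_clip_loss(policy_net, states, actions, advantages, old_log_pi_a, eps=0.2):
    """Negative of the clipped surrogate in Eq. 3.0."""
    log_probs = F.log_softmax(policy_net(states), dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    ratio = torch.exp(log_pi_a - old_log_pi_a)                   # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # min(.) keeps the more pessimistic of the two terms
    return -torch.min(unclipped, clipped).mean()
```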
The basic idea is: when $r(\theta)$ lies in $[1-\epsilon,1+\epsilon]$, the policy $\pi$ and the policy $\pi'$ are close, the approximate equality in Eq. 2.1 holds, and the gradient is approximately equal to the stochastic policy gradient (Eq. 1.0). When $r(\theta)$ lies outside $[1-\epsilon,1+\epsilon]$, the gradient may deviate substantially from the stochastic policy gradient, the signal it provides may be wrong, and the parameter update should therefore be weakened.
Based on the above analysis, consider Figure 1. When $A>0$, if $r(\theta)$ is greater than $1+\epsilon$, the gradient may contain incorrect information, and the gradient used to update the parameters satisfies
$$0<(1+\epsilon)Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\ \text{(gradient used for the update)}<\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)$$
That is, a smaller gradient is used to update the parameters. If $r(\theta)$ is less than $1-\epsilon$, the gradient may contain incorrect information, and the gradient used to update the parameters satisfies
$$(1-\epsilon)Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)>\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\ \text{(gradient used for the update)}>0$$
That is, a smaller gradient is again used to update the parameters. When $A<0$, if $r(\theta)$ is greater than $1+\epsilon$, the gradient may contain incorrect information, and the gradient used to update the parameters satisfies
$$0>(1+\epsilon)Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)>\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\ \text{(gradient used for the update)}$$
Here we find that even though the gradient may contain incorrect information, PPO still uses it to update the policy network parameters. One way to understand this is that since $A<0$, the action is discouraged, so a larger gradient is still used to update the network so that the policy avoids this action as much as possible; the case where $r(\theta)$ is less than $1-\epsilon$ can be analyzed in the same way. The original paper explains it as follows:
"With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse."
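To make the four cases concrete, here is a small numeric check (with arbitrary illustrative numbers, $\epsilon=0.2$) of which term the $\min$ in Eq. 3.0 selects in each case:

```python
eps = 0.2
clip = lambda r: max(1 - eps, min(1 + eps, r))
objective = lambda r, A: min(r * A, clip(r) * A)

print(objective(1.5, +1.0))  # A>0, r>1+eps: clipped at 1.2  -> the increase in r is ignored
print(objective(0.5, +1.0))  # A>0, r<1-eps: unclipped, 0.5  -> the smaller ratio is kept
print(objective(1.5, -1.0))  # A<0, r>1+eps: unclipped, -1.5 -> the large ratio is kept
print(objective(0.5, -1.0))  # A<0, r<1-eps: clipped at -0.8 -> the decrease in r is ignored
```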
PPO's handling of the $A<0$ case is somewhat counter-intuitive. In fact, 《Mastering Complex Control in MOBA Games with Deep Reinforcement Learning》 points out that PPO can make the policy network difficult to converge when $A<0$, and therefore proposes Dual-clip PPO: when $A<0$, the value of Eq. 3.0 is modified as shown in figure (b) of that paper, and when the ratio exceeds a constant $c$, a smaller gradient is used for the update.
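The following sketch shows one way to write that dual-clipped objective (following the paper's idea rather than its released code); `ratio`, `advantages`, `eps`, and `c` follow the notation above, with $c>1$ an assumed hyperparameter.

```python
import torch

def dual_clip_ppo_loss(ratio, advantages, eps=0.2, c=3.0):
    """Negative of the dual-clipped objective: when A < 0, the standard
    clipped objective is additionally lower-bounded by c * A so that very
    large ratios no longer produce arbitrarily large "wrong" gradients."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    standard = torch.min(unclipped, clipped)           # ordinary PPO clip objective
    dual = torch.max(standard, c * advantages)         # extra lower bound for A < 0
    objective = torch.where(advantages < 0, dual, standard)
    return -objective.mean()
```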