Reinforcement learning - proximal policy optimization algorithms
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article gives a brief summary of the paper 《Proximal Policy Optimization Algorithms》. If there are any mistakes, please feel free to point them out.
Why PPO
The mathematical expression of the stochastic policy gradient is
$$\nabla J(\theta)=E_S\big[E_{A\sim \pi(\cdot|S;\theta)}\big[Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\big]\big]\tag{1.0}$$
Unless otherwise stated, $A$ denotes an action, $S$ denotes a state, and $\pi$ denotes the policy. To update the policy network with Eq. 1.0, the policy network itself must control the agent's interaction with the environment in order to obtain $S$ and $A$, which is very time-consuming. If $S$ and $A$ could instead be sampled in advance, stored in an experience replay buffer, and then drawn from the buffer to train the policy network, training time would be greatly reduced. Proximal Policy Optimization (PPO) is motivated by this consideration. Before introducing PPO, let us first introduce TRPO.
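To make Eq. 1.0 concrete, here is a minimal sketch of the corresponding on-policy loss. The names `policy_net`, `states`, `actions`, and `q_values` are hypothetical placeholders (a policy network producing action logits, and action values estimated elsewhere), not code from the paper.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy_net, states, actions, q_values):
    """Sample estimate of Eq. 1.0, negated so that minimizing it
    performs gradient ascent on J(theta)."""
    logits = policy_net(states)                        # (batch, num_actions)
    log_probs = F.log_softmax(logits, dim=-1)          # log pi(.|s; theta)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(q_values.detach() * log_pi_a).mean()
```

Note that `states` and `actions` here must be collected by the current policy $\pi(\cdot|S;\theta)$, which is exactly why fresh interaction is needed after every update.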
TRPO
TRPO uses importance sampling to approximate the stochastic policy gradient, so that the experience replay buffer can be used to update the policy network quickly. From the definition of expectation, the importance-sampling identity can be derived:
$$E_{x \sim p}[f(x)]=E_{x \sim q}\Big[\frac{p(x)}{q(x)}f(x)\Big]\tag{2.0}$$
It is worth mentioning that although the two sides of Eq. 2.0 have the same expectation, their variances differ:
$$\begin{aligned} Var_{x\sim p}[f(x)]&=E_{x\sim p}[f(x)^2]-\big(E_{x\sim p}[f(x)]\big)^2\\ Var_{x\sim q}\Big[\frac{p(x)}{q(x)}f(x)\Big]&=E_{x\sim p}\Big[f(x)^2\frac{p(x)}{q(x)}\Big]-\big(E_{x\sim p}[f(x)]\big)^2 \end{aligned}$$
When the distributions $p(x)$ and $q(x)$ are close, the variances of the two sides of Eq. 2.0 are also close.
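As a quick sanity check of Eq. 2.0, the following sketch estimates the same expectation both directly under $p$ and by reweighting samples from $q$; the choices $p=\mathcal{N}(0,1)$, $q=\mathcal{N}(0.5,1)$, and $f(x)=x^2$ are arbitrary and only for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), evaluated elementwise."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
f = lambda x: x ** 2

x_p = rng.normal(0.0, 1.0, 100_000)                  # samples from p = N(0, 1)
direct = f(x_p).mean()                               # E_{x~p}[f(x)]

x_q = rng.normal(0.5, 1.0, 100_000)                  # samples from q = N(0.5, 1)
weights = gauss_pdf(x_q, 0.0, 1.0) / gauss_pdf(x_q, 0.5, 1.0)   # p(x)/q(x)
reweighted = (weights * f(x_q)).mean()               # E_{x~q}[(p/q) f(x)]

print(direct, reweighted)                            # both close to E_{x~p}[x^2] = 1
```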
Applying Eq. 2.0 to Eq. 1.0 and rearranging gives
$$\begin{aligned} \nabla J(\theta)&=E_S\big[E_{A\sim \pi(\cdot|S;\theta)}\big[Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\big]\big]\\ &=E_S\Big[E_{A\sim \pi'(\cdot|S;\theta_{old})}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\Big]\Big]\\ &\approx E_S\Big[E_{A\sim \pi'(\cdot|S;\theta_{old})}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\Big]\Big] \end{aligned}\tag{2.1}$$
When the policy $\pi$ and the policy $\pi'$ are close, $Q_\pi(S,A)\approx Q_{\pi'}(S,A)$, which justifies the approximate-equality sign in the third line of Eq. 2.1. To keep $\pi$ and $\pi'$ close, TRPO adds the KL divergence between $\pi$ and $\pi'$ as a regularization term. The overall objective is
$$\max L(\theta)=\max E_{A\sim \pi'(\cdot|S;\theta_{old}),\,S}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}V_{\pi}(S)\Big]-\beta\, KL\big(\pi(A|S;\theta),\pi'(A|S;\theta_{old})\big)$$
The derivative of $E_{A\sim \pi'(\cdot|S;\theta_{old}),\,S}\big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}V_{\pi}(S)\big]$ with respect to $\theta$ is Eq. 2.1, and $\beta$ is a hyperparameter. Before training, the policy $\pi'(A|S;\theta_{old})$ (a policy network with past parameters $\theta_{old}$) controls the agent's interaction with the environment, and the resulting state-action pairs $(s_t, a_t)$ are stored in the experience replay buffer. During training, actions and states are drawn from the buffer and the policy network is updated by gradient ascent on the objective above (the KL-divergence term can be computed in the same way as cross entropy).
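Below is a sketch of this KL-penalized surrogate for a discrete-action policy, written against the notation of this post rather than the TRPO paper's actual implementation. `old_log_probs_all` (the full log-distribution $\log\pi'(\cdot|s;\theta_{old})$ recorded when the buffer was filled), `values` (standing in for $V_\pi(S)$), and `beta` are assumed inputs; the KL term is computed here as $KL(\pi'\,\|\,\pi)$, one common choice of direction.

```python
import torch
import torch.nn.functional as F

def kl_penalized_surrogate(policy_net, states, actions, values,
                           old_log_probs_all, beta=1.0):
    """Negative of the KL-penalized surrogate, so a minimizer
    performs gradient ascent on L(theta)."""
    log_probs_all = F.log_softmax(policy_net(states), dim=-1)   # log pi(.|s; theta)
    log_pi_a = log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    old_log_pi_a = old_log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    ratio = torch.exp(log_pi_a - old_log_pi_a)          # pi / pi'
    surrogate = (ratio * values.detach()).mean()

    # KL(pi_old || pi), averaged over the states in the batch
    kl = (old_log_probs_all.exp()
          * (old_log_probs_all - log_probs_all)).sum(dim=-1).mean()

    return -(surrogate - beta * kl)
```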
PPO
PPO changes the optimization objective of TRPO. Let $r(\theta)=\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}$ and write $A=V_\pi(S)$ (the symbol $A$ is reused here for this value term); then the optimization objective (with some symbols omitted) becomes
$$\max L^{clip}(\theta)=\max E_{A\sim \pi',S}\Big[\min\big(r(\theta)A,\;clip\big(r(\theta),1-\epsilon,1+\epsilon\big)A\big)\Big]\tag{3.0}$$
$clip(r(\theta),1-\epsilon,1+\epsilon)$ means that when $r(\theta)<1-\epsilon$, the ratio is truncated to $1-\epsilon$, and when $r(\theta)>1+\epsilon$, it is truncated to $1+\epsilon$. $\epsilon$ is a hyperparameter. The value of Eq. 3.0 as a function of $r(\theta)$ is plotted in Figure 1 (figure not reproduced here).
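A minimal sketch of Eq. 3.0 as a training loss follows; `policy_net`, `advantages` (the term written as $A$ above), and `old_log_pi_a` ($\log\pi'(a|s;\theta_{old})$ stored at collection time) are assumed inputs, not code from the paper.

```python
import torch
import torch.nn.functional as F

def ppo_clip_loss(policy_net, states, actions, advantages, old_log_pi_a, eps=0.2):
    """Negative of the clipped surrogate in Eq. 3.0."""
    log_probs = F.log_softmax(policy_net(states), dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    ratio = torch.exp(log_pi_a - old_log_pi_a)                   # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # min(.) keeps the more pessimistic of the two terms
    return -torch.min(unclipped, clipped).mean()
```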
The basic idea is: when $r(\theta)$ lies in $[1-\epsilon,1+\epsilon]$, the policy $\pi$ and the policy $\pi'$ are close, the approximate equality in Eq. 2.1 holds, and the gradient is approximately equal to the stochastic policy gradient (Eq. 1.0). When $r(\theta)$ lies outside $[1-\epsilon,1+\epsilon]$, the gradient may deviate substantially from the stochastic policy gradient, the signal it provides may be wrong, and the parameter update should therefore be weakened.
Based on the above analysis, consider Figure 1. When $A>0$, if $r(\theta)$ is greater than $1+\epsilon$, the gradient may contain incorrect information, and the gradient used to update the parameters satisfies
$$0<(1+\epsilon)Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\ \text{(gradient used for the update)}<\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)$$
That is, a smaller gradient is used to update the parameters. If $r(\theta)$ is less than $1-\epsilon$, the gradient may contain incorrect information, and the gradient used to update the parameters satisfies
$$(1-\epsilon)Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)>\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\ \text{(gradient used for the update)}>0$$
That is, a smaller gradient is again used to update the parameters. When $A<0$, if $r(\theta)$ is greater than $1+\epsilon$, the gradient may contain incorrect information, and the gradient used to update the parameters satisfies
$$0>(1+\epsilon)Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)>\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\ \text{(gradient used for the update)}$$
Here we find that even though the gradient may contain incorrect information, PPO still uses it to update the policy network parameters. One way to understand this is that since $A<0$, the action is discouraged, so a larger gradient is still used to update the network so that the policy avoids this action as much as possible; the case where $r(\theta)$ is less than $1-\epsilon$ can be analyzed in the same way. The original paper explains it as follows:
"With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse."
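To make the four cases concrete, here is a small numeric check (with arbitrary illustrative numbers, $\epsilon=0.2$) of which term the $\min$ in Eq. 3.0 selects in each case:

```python
eps = 0.2
clip = lambda r: max(1 - eps, min(1 + eps, r))
objective = lambda r, A: min(r * A, clip(r) * A)

print(objective(1.5, +1.0))  # A>0, r>1+eps: clipped at 1.2  -> the increase in r is ignored
print(objective(0.5, +1.0))  # A>0, r<1-eps: unclipped, 0.5  -> the smaller ratio is kept
print(objective(1.5, -1.0))  # A<0, r>1+eps: unclipped, -1.5 -> the large ratio is kept
print(objective(0.5, -1.0))  # A<0, r<1-eps: clipped at -0.8 -> the decrease in r is ignored
```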
PPO's handling of the $A<0$ case is somewhat counter-intuitive. In fact, 《Mastering Complex Control in MOBA Games with Deep Reinforcement Learning》 points out that PPO can make the policy network difficult to converge when $A<0$, and therefore proposes Dual-clip PPO: when $A<0$, the value of Eq. 3.0 is modified as shown in figure (b) of that paper, and when the ratio exceeds a constant $c$, a smaller gradient is used for the update.
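The following sketch shows one way to write that dual-clipped objective (following the paper's idea rather than its released code); `ratio`, `advantages`, `eps`, and `c` follow the notation above, with $c>1$ an assumed hyperparameter.

```python
import torch

def dual_clip_ppo_loss(ratio, advantages, eps=0.2, c=3.0):
    """Negative of the dual-clipped objective: when A < 0, the standard
    clipped objective is additionally lower-bounded by c * A so that very
    large ratios no longer produce arbitrarily large "wrong" gradients."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    standard = torch.min(unclipped, clipped)           # ordinary PPO clip objective
    dual = torch.max(standard, c * advantages)         # extra lower bound for A < 0
    objective = torch.where(advantages < 0, dual, standard)
    return -objective.mean()
```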