Reinforcement learning - proximal policy optimization algorithms
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article is a brief summary of the paper 《Proximal Policy Optimization Algorithms》. If there are any mistakes, corrections are welcome.
Why PPO
The stochastic policy gradient has the mathematical form
$$\nabla J(\theta)=E_S\big[E_{A\sim \pi(\cdot|S;\theta)}[Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)]\big]\tag{1.0}$$
Unless otherwise stated, $A$ denotes an action, $S$ denotes a state, and $\pi$ denotes the policy. Updating the policy network with Eq. 1.0 requires the policy network to control the agent while it interacts with the environment in order to obtain $S$ and $A$, which is very time-consuming. If $S$ and $A$ could instead be sampled in advance, stored in an experience replay buffer, and then drawn from the buffer to train the policy network, training time would be greatly reduced. Based on this consideration, Proximal Policy Optimization (PPO) was proposed. Before introducing PPO, let's first look at TRPO.
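To make the on-policy cost concrete, here is a minimal PyTorch sketch of one update based on Eq. 1.0. The function and tensor names (`policy_net`, `q_values`, etc.) are illustrative placeholders, not part of the original post.

```python
# A minimal sketch of one gradient-ascent step on Eq. 1.0 (illustrative names).
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, q_values):
    """states/actions must come from the *current* policy; q_values estimate Q_pi(S, A)."""
    # log pi(A|S; theta) for the actions actually taken
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Monte-Carlo estimate of Eq. 1.0; minimizing the negative performs gradient ascent.
    loss = -(q_values.detach() * log_pi_a).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # After this step the policy has changed, so fresh interaction with the environment
    # is needed before the next update -- exactly the cost PPO tries to avoid.
```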
TRPO
TRPO uses importance sampling to approximate the stochastic policy gradient, so the experience replay buffer can be used to update the policy network quickly. From the definition of expectation, the importance sampling identity is:
$$E_{x \sim p}[f(x)]=E_{x \sim q}\left[\frac{p(x)}{q(x)}f(x)\right]\tag{2.0}$$
It is worth noting that although the two sides of Eq. 2.0 have the same expectation, their variances differ:
$$\begin{aligned} \mathrm{Var}_{x\sim p}[f(x)]&=E_{x\sim p}[f(x)^2]-(E_{x\sim p}[f(x)])^2\\ \mathrm{Var}_{x\sim q}\left[\frac{p(x)}{q(x)}f(x)\right]&=E_{x\sim p}\left[f(x)^2\frac{p(x)}{q(x)}\right]-(E_{x\sim p}[f(x)])^2 \end{aligned}$$
When the distributions $p(x)$ and $q(x)$ are close, the variances of the two sides of Eq. 2.0 are also close.
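As a quick sanity check of Eq. 2.0 and the variance remark, the following NumPy snippet estimates $E_{x\sim p}[f(x)]$ with $f(x)=x^2$, $p=\mathcal{N}(0,1)$, $q=\mathcal{N}(0.5,1)$, both directly and via importance sampling from $q$. The particular distributions are just an illustrative choice.

```python
# Numerical check of the importance-sampling identity (Eq. 2.0).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def pdf(x, mu):
    # Gaussian density with unit variance and mean mu
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x_q = rng.normal(0.5, 1.0, size=100_000)      # samples from q
weights = pdf(x_q, 0.0) / pdf(x_q, 0.5)       # p(x)/q(x)

est_is = np.mean(weights * f(x_q))            # importance-sampling estimate of E_p[f]
x_p = rng.normal(0.0, 1.0, size=100_000)
est_mc = np.mean(f(x_p))                      # direct Monte-Carlo estimate

print(est_is, est_mc)                         # both close to E_p[x^2] = 1
print(np.var(weights * f(x_q)), np.var(f(x_p)))   # the variances differ, and the gap
                                                  # widens as p and q become less similar
```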
Applying Eq. 2.0 to Eq. 1.0 gives
$$\begin{aligned} \nabla J(\theta)&=E_S\big[E_{A\sim \pi(\cdot|S;\theta)}[Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)]\big]\\ &=E_S\Big[E_{A\sim \pi'(\cdot|S;\theta_{old})}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_\pi(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\Big]\Big]\\ &\approx E_S\Big[E_{A\sim \pi'(\cdot|S;\theta_{old})}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)\Big]\Big] \end{aligned}\tag{2.1}$$
When policy $\pi$ and policy $\pi'$ are close, $Q_\pi(S,A)\approx Q_{\pi'}(S,A)$, which gives the approximate equality in the third line of Eq. 2.1. To keep $\pi$ and $\pi'$ close, TRPO adds the KL divergence between $\pi$ and $\pi'$ as a regularization term. The overall objective is
$$\max L(\theta)=\max E_{A\sim \pi'(\cdot|S;\theta_{old}),\,S}\Big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}V_{\pi}(S)\Big]-\beta\, KL\big(\pi(A|S;\theta),\pi'(A|S;\theta_{old})\big)$$
The derivative of $E_{A\sim \pi'(\cdot|S;\theta_{old}),\,S}\big[\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}V_{\pi}(S)\big]$ with respect to $\theta$ is Eq. 2.1, and $\beta$ is a hyperparameter. Before training, the policy $\pi'(A|S;\theta_{old})$ (a policy network with past parameters $\theta_{old}$) controls the agent as it interacts with the environment, and the resulting state-action pairs $(s_t, a_t)$ are stored in the experience replay buffer. During training, actions and states are drawn from the buffer and the policy network is updated by gradient ascent on the objective above (the KL divergence term can be computed in the same way as cross entropy).
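For concreteness, here is a minimal PyTorch sketch of the KL-penalized surrogate above for a discrete action space. This is only the penalized objective described in this post, not the full TRPO algorithm (which uses a trust-region step); the tensor names and the stored old log-probabilities are assumptions for illustration.

```python
# Sketch of the KL-penalized surrogate loss (illustrative names, discrete actions).
import torch

def trpo_penalty_loss(policy_net, old_log_probs_all, states, actions, values, beta):
    """Negative of: E[ratio * V] - beta * KL(pi_old || pi_new)."""
    new_log_probs_all = torch.log_softmax(policy_net(states), dim=-1)   # log pi(.|S; theta)

    new_log_pi_a = new_log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    old_log_pi_a = old_log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    ratio = torch.exp(new_log_pi_a - old_log_pi_a)     # pi(A|S;theta) / pi'(A|S;theta_old)
    surrogate = (ratio * values.detach()).mean()

    # KL divergence between the old and new action distributions, averaged over states
    kl = (old_log_probs_all.exp() * (old_log_probs_all - new_log_probs_all)).sum(-1).mean()

    # Minimizing the negative performs gradient ascent on the penalized objective.
    return -(surrogate - beta * kl)
```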
PPO
PPO changes the optimization objective of TRPO. Let $r(\theta)=\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}$ and $A=V_\pi(S)$. Then the optimization objective (with some symbols omitted) becomes
$$\max L^{clip}(\theta)=\max E_{A\sim \pi',\,S}\big[\min\big(r(\theta)A,\ \mathrm{clip}(r(\theta),1-\epsilon,1+\epsilon)A\big)\big]\tag{3.0}$$
$\mathrm{clip}(r(\theta),1-\epsilon,1+\epsilon)$ means that when $r(\theta)<1-\epsilon$, the ratio is truncated to $1-\epsilon$, and when $r(\theta)>1+\epsilon$, it is truncated to $1+\epsilon$; $\epsilon$ is a hyperparameter. The value of Eq. 3.0 as a function of $r(\theta)$ is plotted in Figure 1 (image not reproduced here).
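The clipped surrogate in Eq. 3.0 maps directly to a few lines of code. Below is a minimal PyTorch sketch, assuming per-action log probabilities under the new and old policies and an advantage-like estimate standing in for $A$; all names are illustrative.

```python
# Sketch of the clipped surrogate objective of Eq. 3.0 (illustrative names).
import torch

def ppo_clip_loss(new_log_pi_a, old_log_pi_a, advantages, epsilon=0.2):
    """Negative of E[min(r * A, clip(r, 1-eps, 1+eps) * A)]."""
    ratio = torch.exp(new_log_pi_a - old_log_pi_a)                    # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                      # minimize => maximize Eq. 3.0
```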
The basic idea is: when $r(\theta)$ lies in $[1-\epsilon,1+\epsilon]$, policies $\pi$ and $\pi'$ are sufficiently close, the approximate equality in Eq. 2.1 holds, and the gradient is approximately equal to the stochastic policy gradient (Eq. 1.0). When $r(\theta)$ lies outside $[1-\epsilon,1+\epsilon]$, the gradient may deviate substantially from the stochastic policy gradient and provide a wrong signal, so the parameter update should be weakened.
Based on this analysis, consider Figure 1. When $A>0$, if $r(\theta)$ is greater than $1+\epsilon$, the gradient may contain wrong information, and the gradient used for the update satisfies
$$0<\underbrace{(1+\epsilon)\,Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)}_{\text{gradient used for the update}}<\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)$$
That is, a smaller gradient is used to update the parameters. If $r(\theta)$ is less than $1-\epsilon$, the gradient may again contain wrong information, and the gradient used for the update satisfies
$$(1-\epsilon)\,Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)>\underbrace{\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)}_{\text{gradient used for the update}}>0$$
Again, a smaller gradient is used to update the parameters. When $A<0$, if $r(\theta)$ is greater than $1+\epsilon$, the gradient may contain wrong information, and the gradient used for the update satisfies
$$0>(1+\epsilon)\,Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)>\underbrace{\frac{\pi(A|S;\theta)}{\pi'(A|S;\theta_{old})}Q_{\pi'}(S,A)\nabla_{\theta}\ln\pi(A|S;\theta)}_{\text{gradient used for the update}}$$
Here we see that even if the gradient may contain wrong information, PPO still uses it to update the policy network parameters. One way to understand this: since $A<0$, the action is discouraged, so a larger gradient is still used so that the policy avoids this action as much as possible; the same reasoning applies when $r(\theta)$ is less than $1-\epsilon$. The original paper explains it as follows:
"With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse."
The behavior of PPO when $A<0$ is somewhat counterintuitive to me. In fact, 《Mastering Complex Control in MOBA Games with Deep Reinforcement Learning》 points out that this part of PPO's objective can make the policy network hard to converge when $A<0$, and proposes Dual-clip PPO: when $A<0$, the value of Eq. 3.0 is modified as in figure (b) of that work (image not reproduced here), and when the value exceeds $c$, a smaller gradient is used for the update.
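Based on my reading of the dual-clip idea, a rough sketch of the modified objective might look as follows; the constant $c>1$ and all tensor names are assumptions for illustration, not the paper's code.

```python
# Rough sketch of a dual-clip PPO objective: when A < 0, the clipped surrogate is
# additionally bounded by c * A so very large ratios cannot dominate the update.
import torch

def dual_clip_ppo_loss(new_log_pi_a, old_log_pi_a, advantages, epsilon=0.2, c=3.0):
    ratio = torch.exp(new_log_pi_a - old_log_pi_a)                    # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    standard = torch.min(unclipped, clipped)                          # ordinary PPO clip (Eq. 3.0)
    dual = torch.max(standard, c * advantages)                        # extra bound, only active when A < 0
    objective = torch.where(advantages < 0, dual, standard)
    return -objective.mean()
```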