Reinforcement learning - continuous control
2022-07-28 06:10:00 【Food to doubt life】
Preface
This post summarizes the continuous-control chapter of Deep Reinforcement Learning. If there are any mistakes, corrections are welcome.
Continuous control
The reinforcement learning methods summarized in previous posts all assume a discrete, finite action space. The action space is not always discrete, however; it can also be continuous. When driving a vehicle, for example, the steering angle takes values in a continuous range. One feasible workaround is to discretize the action space; alternatively, reinforcement learning methods designed for continuous control can be used directly. This post summarizes the deterministic policy gradient (DPG) algorithm.
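As a quick illustration of the discretization workaround (not from the book; the angle range and bin count are arbitrary assumptions), a continuous steering angle can be mapped to a finite set of candidate actions:

```python
import numpy as np

# Hypothetical example: discretize a steering angle in [-30, 30] degrees
# into 9 evenly spaced candidate actions, so that discrete-action methods
# (e.g. DQN) can be applied to a continuous-control problem.
NUM_BINS = 9
candidate_angles = np.linspace(-30.0, 30.0, NUM_BINS)   # shape (9,)

def discrete_to_continuous(action_index: int) -> float:
    """Map a discrete action index chosen by the agent to a steering angle."""
    return float(candidate_angles[action_index])

print(candidate_angles)           # [-30., -22.5, -15., ..., 30.]
print(discrete_to_continuous(4))  # 0.0 (straight ahead)
```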
DPG
DPG is a policy-learning method. Specifically, DPG uses the Actor-Critic framework: a value network (critic) assists the training of the policy network (actor). The framework of the method is shown in the figure below.
The input of the policy network is the state $s$ and its output is the specific action the agent performs, $a=\mu(s;\theta)$. In the methods introduced earlier, the policy network outputs the probability of each action, whereas DPG outputs a single deterministic action. The state $s$ and the action $a=\mu(s;\theta)$ are then fed into the value network, which gives the action a score $q(s,a;w)$.
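A minimal PyTorch sketch of the two networks just described; the state/action dimensions, hidden sizes, and the tanh squashing are assumptions for illustration, not details from the original post:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2   # assumed dimensions

class PolicyNet(nn.Module):
    """Deterministic policy a = mu(s; theta)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh(),  # actions squashed to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class ValueNet(nn.Module):
    """Action-value estimate q(s, a; w)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

mu, q = PolicyNet(), ValueNet()
s = torch.randn(1, STATE_DIM)
a = mu(s)        # deterministic action
score = q(s, a)  # critic's score of that action
```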
The optimization objective of DPG
DPG optimizes the objective
$$\max_\theta J(\theta)=\max_\theta\,\mathbb{E}_S\big[q\big(S,\mu(S;\theta);w\big)\big]\tag{1.0}$$
where $\theta$ denotes the parameters of the policy network. The idea is that, whatever state is encountered, the action output by the policy network should receive as high a score as possible from the value network. The gradient of Equation (1.0) with respect to $\theta$ is
$$\nabla_\theta J(\theta)=\mathbb{E}_S\Big[\nabla_\theta\mu(S;\theta)\cdot\nabla_{a}q(S,a;w)\big|_{a=\mu(S;\theta)}\Big]\tag{1.1}$$
It is worth noting that DPG, as a policy-learning algorithm, should aim to maximize the state-value function, yet Equation (1.1) differs from the stochastic policy gradient $\nabla_\theta J'(\theta)$ derived in the post Reinforcement learning — Strategy learning, whose expression is
$$\nabla_\theta J'(\theta)=\mathbb{E}_S\Big[\mathbb{E}_{A\sim\pi(\cdot\mid S;\theta)}\big[Q_\pi(S,A)\,\nabla_\theta\ln\pi(A\mid S;\theta)\big]\Big]\tag{1.2}$$
Suppose the deterministic policy $\mu(s;\theta)$ outputs a $d$-dimensional vector whose $i$-th element is denoted $\mu_i$, and let the stochastic policy output the probability distribution
$$\pi(a\mid s;\theta,\delta)=\prod_{i=1}^{d}\frac{1}{\sqrt{2\pi}\,\delta_i}\exp\Big(-\frac{(a_i-\mu_i)^2}{2\delta_i^2}\Big)$$
When $\delta=[\delta_1,\delta_2,\dots,\delta_d]$ tends to the zero vector, it can be shown that (see the DPG paper, Deterministic Policy Gradient Algorithms, for the proof)
$$\lim_{\delta\to 0}\nabla_\theta J'(\theta)=\nabla_\theta J(\theta)$$
That is, the deterministic policy gradient (Equation 1.1) is a limiting special case of the stochastic policy gradient (Equation 1.2), so optimizing Equation (1.1) also maximizes the state-value function.
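In practice, Equation (1.1) is seldom coded by hand: maximizing $q(s,\mu(s;\theta);w)$ with automatic differentiation produces the same gradient via the chain rule. A hedged PyTorch sketch (network shapes and the Adam optimizer are assumptions):

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                      # assumed dimensions
mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())         # policy mu(s; theta)
q  = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                   nn.Linear(64, 1))                             # critic q(s, a; w)
policy_opt = torch.optim.Adam(mu.parameters(), lr=1e-3)

s = torch.randn(32, STATE_DIM)                    # a batch of states
a = mu(s)                                         # a = mu(s; theta)
# Ascend on q(s, mu(s; theta); w): autograd applies the chain rule of Eq. (1.1),
# i.e. grad_theta mu(s; theta) * grad_a q(s, a; w) evaluated at a = mu(s; theta).
policy_loss = -q(torch.cat([s, a], dim=-1)).mean()
policy_opt.zero_grad()
policy_loss.backward()
policy_opt.step()       # only the policy parameters theta are updated; w is left untouched here
```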
On-Policy DPG
On-Policy DPG trains the policy network and the value network with the Actor-Critic framework. The steps are as follows (a code sketch is given after the list):
- Observe the current state $s_t$ and feed it into the policy network to obtain the agent's action $\mu(s_t;\theta)$. After executing the action, the agent observes the new state $s_{t+1}$ and the reward $r_t$. Feed $s_{t+1}$ into the policy network to obtain the action $\mu(s_{t+1};\theta)$.
- Compute $\hat q_t=q\big(s_t,\mu(s_t;\theta);w_{now}\big)$ and $\hat q_{t+1}=q\big(s_{t+1},\mu(s_{t+1};\theta);w_{now}\big)$.
- Update the value network $q(s,a;w)$ using the Bellman equation: $w_{new}=w_{now}-\alpha\,\big[\hat q_t-(r_t+\hat q_{t+1})\big]\,\nabla_{w}q\big(s_t,\mu(s_t;\theta);w_{now}\big)$
- Update the policy network: $\theta_{new}=\theta_{now}+\beta\,\nabla_\theta\mu(s_t;\theta_{now})\cdot\nabla_{a}q(s_t,a;w_{now})\big|_{a=\mu(s_t;\theta_{now})}$
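Below is a minimal sketch of one on-policy DPG update, assuming small MLP networks and, to match the formulas above, no discount factor; all names and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions
mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())
q  = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                   nn.Linear(64, 1))
critic_opt = torch.optim.Adam(q.parameters(), lr=1e-3)   # learning rate alpha
actor_opt  = torch.optim.Adam(mu.parameters(), lr=1e-3)  # learning rate beta

def q_value(s, a):
    return q(torch.cat([s, a], dim=-1)).squeeze(-1)

def on_policy_dpg_step(s_t, r_t, s_t1):
    """One on-policy (SARSA-style) DPG update; tensors carry a batch dimension."""
    # 1) actions chosen by the current policy at s_t and s_{t+1}
    a_t, a_t1 = mu(s_t), mu(s_t1)
    # 2) TD target and value-network update (no discount factor, matching the formulas above)
    with torch.no_grad():
        td_target = r_t + q_value(s_t1, a_t1)
    critic_loss = ((q_value(s_t, a_t.detach()) - td_target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # 3) policy update: ascend on q(s_t, mu(s_t; theta); w)
    actor_loss = -q_value(s_t, mu(s_t)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# usage with dummy transitions
s_t, s_t1 = torch.randn(1, STATE_DIM), torch.randn(1, STATE_DIM)
on_policy_dpg_step(s_t, torch.tensor([0.5]), s_t1)
```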
Off-Policy DPG
The action output by the DPG policy network is deterministic, so on-policy DPG has difficulty exploring the environment sufficiently, and the network may converge to a poor local optimum. By contrast, the policy network of the stochastic policy gradient outputs a probability distribution and actions are sampled from it (even low-probability actions may be sampled), which lets the agent explore the environment more fully.
Off-Policy DPG addresses the insufficient exploration of on-policy DPG. Note that the value network of on-policy DPG approximates the action-value function, whereas the value network of off-policy DPG approximates the optimal action-value function: on-policy DPG trains the value network with the SARSA algorithm, while off-policy DPG trains it with Q-learning.
Off-policy DPG trains the policy network and the value network as follows (a code sketch follows the list):
- Before training starts, use some behavior policy to control the agent in the environment and collect a series of quadruples $(s_t,a_t,r_t,s_{t+1})$; all of these quadruples form an experience replay buffer.
- Draw a quadruple $(s_t,a_t,r_t,s_{t+1})$ from the replay buffer and use the policy network to compute $\hat a_t=\mu(s_t;\theta_{now})$ and $\hat a_{t+1}=\mu(s_{t+1};\theta_{now})$.
- Use the value network to compute (note which actions appear: $a_t$ comes from the replay buffer, $\hat a_{t+1}$ from the policy network) $\hat q_t=q(s_t,a_t;w_{now})$ and $\hat q_{t+1}=q(s_{t+1},\hat a_{t+1};w_{now})$.
- Update the parameters of the value network: $w_{new}=w_{now}-\alpha\,\big[\hat q_t-(r_t+\hat q_{t+1})\big]\,\nabla_{w}q(s_t,a_t;w_{now})$
- Update the parameters of the policy network: $\theta_{new}=\theta_{now}+\beta\,\nabla_\theta\mu(s_t;\theta_{now})\cdot\nabla_{a}q(s_t,a;w_{now})\big|_{a=\hat a_t}$
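A hedged sketch of one off-policy DPG update drawn from a replay buffer (essentially a DDPG-style update without target networks); the buffer size, batch size, and network shapes are assumptions:

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions
mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())
q  = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                   nn.Linear(64, 1))
critic_opt = torch.optim.Adam(q.parameters(), lr=1e-3)
actor_opt  = torch.optim.Adam(mu.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)              # stores (s, a, r, s') tuples

def q_value(s, a):
    return q(torch.cat([s, a], dim=-1)).squeeze(-1)

def off_policy_dpg_step(batch_size=32):
    """One off-policy (Q-learning style) DPG update from the replay buffer."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s1 = (torch.stack(x) for x in zip(*batch))
    # value network: the TD target uses the action proposed by the *current* policy at s'
    with torch.no_grad():
        td_target = r + q_value(s1, mu(s1))        # no discount factor, matching the text
    critic_loss = ((q_value(s, a) - td_target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # policy network: ascend on q(s, mu(s; theta); w)
    actor_loss = -q_value(s, mu(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# fill the buffer with dummy transitions, then do one update
for _ in range(64):
    replay_buffer.append((torch.randn(STATE_DIM), torch.randn(ACTION_DIM),
                          torch.tensor(0.0), torch.randn(STATE_DIM)))
off_policy_dpg_step()
```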
It is worth noting that off-policy DPG makes the value network approximate the optimal action-value function $Q_*(s,a)$, so it wants the action output by the policy network to satisfy
$$\mu(s;\theta)\approx\arg\max_a Q_*(s,a)$$
Because the value network of off-policy DPG is trained with the optimal Bellman equation, it suffers from overestimation caused by maximization and by bootstrapping (see the post Reinforcement learning — Value learning, DQN section). Twin Delayed Deep Deterministic Policy Gradient (TD3) addresses these problems. TD3 uses two value networks, two target value networks, one policy network, and one target policy network. The training procedure is as follows (a code sketch is given after the update equations below).
At initialization, randomly initialize the parameters of the two value networks, $w_1$ and $w_2$, and the parameters of the policy network, $\theta$. Then initialize the parameters of the two target value networks, $w_1^-$ and $w_2^-$, and of the target policy network, $\theta^-$, as
$$w_1^-=w_1,\qquad w_2^-=w_2,\qquad \theta^-=\theta$$
Before training starts, use some behavior policy to control the agent's interaction with the environment and collect a series of quadruples $(s_t,a_t,r_t,s_{t+1})$, which form an experience replay buffer.
During training, draw a quadruple $(s_j,a_j,r_j,s_{j+1})$ from the replay buffer and let the target policy network compute
$$\hat a_{j+1}^-=\mu(s_{j+1};\theta_{now}^-)+\epsilon$$
where $\epsilon$ is random noise drawn from a clipped (truncated) normal distribution, i.e. target policy smoothing. Then let the two target value networks make predictions: using target networks here alleviates the overestimation caused by bootstrapping, and taking the minimum of the two predictions in the next step (clipped double Q-learning) alleviates the overestimation caused by maximization:
$$\hat q_{1,j+1}^-=q(s_{j+1},\hat a_{j+1}^-;w_{1,now}^-),\qquad \hat q_{2,j+1}^-=q(s_{j+1},\hat a_{j+1}^-;w_{2,now}^-)$$
Compute the TD target
$$\hat y_j=r_j+\min\{\hat q_{1,j+1}^-,\;\hat q_{2,j+1}^-\}$$
Update the two value networks
$$\begin{aligned} w_{1,new}&=w_{1,now}-\alpha\,\big(\hat q_{1,j}-\hat y_j\big)\,\nabla_{w_1}q(s_j,a_j;w_{1,now})\\ w_{2,new}&=w_{2,now}-\alpha\,\big(\hat q_{2,j}-\hat y_j\big)\,\nabla_{w_2}q(s_j,a_j;w_{2,now}) \end{aligned}$$
where $\hat q_{1,j}=q(s_j,a_j;w_{1,now})$ and $\hat q_{2,j}=q(s_j,a_j;w_{2,now})$. Every $k$ iterations, update the policy network and the three target networks once:
- Let the policy network compute $\hat a_j=\mu(s_j;\theta_{now})$, then update the policy network: $\theta_{new}=\theta_{now}+\beta\,\nabla_\theta\mu(s_j;\theta_{now})\cdot\nabla_{a}q(s_j,a;w_{1,now})\big|_{a=\hat a_j}$
- Update the parameters of the three target networks with an exponential moving average, where $\gamma$ is a hyperparameter:
$$\begin{aligned} \theta_{new}^-&=\gamma\,\theta_{new}+(1-\gamma)\,\theta_{now}^-\\ w_{1,new}^-&=\gamma\,w_{1,new}+(1-\gamma)\,w_{1,now}^-\\ w_{2,new}^-&=\gamma\,w_{2,new}+(1-\gamma)\,w_{2,now}^- \end{aligned}$$
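The following sketch puts the TD3 steps together: clipped noise on the target action, the minimum over two target critics, delayed policy updates every $k$ steps, and exponential-moving-average target updates with rate $\gamma$. Hyperparameter values and network shapes are assumptions:

```python
import copy
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions
def make_actor():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, ACTION_DIM), nn.Tanh())
def make_critic():
    return nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 1))

mu, q1, q2 = make_actor(), make_critic(), make_critic()
mu_targ, q1_targ, q2_targ = (copy.deepcopy(m) for m in (mu, q1, q2))  # theta^-, w1^-, w2^-
actor_opt  = torch.optim.Adam(mu.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=1e-3)
GAMMA_EMA, NOISE_STD, NOISE_CLIP, K = 0.005, 0.2, 0.5, 2   # assumed hyperparameters

def qv(net, s, a):
    return net(torch.cat([s, a], dim=-1)).squeeze(-1)

def soft_update(target, online, rate):
    """EMA target update: theta^- <- rate * theta + (1 - rate) * theta^-."""
    for p_t, p in zip(target.parameters(), online.parameters()):
        p_t.data.mul_(1 - rate).add_(rate * p.data)

def td3_step(step, batch):
    s, a, r, s1 = batch
    with torch.no_grad():
        # target policy smoothing: add clipped noise to the target action
        eps = (torch.randn_like(a) * NOISE_STD).clamp(-NOISE_CLIP, NOISE_CLIP)
        a1 = (mu_targ(s1) + eps).clamp(-1.0, 1.0)
        # clipped double-Q: take the smaller of the two target critics (no discount, as in the text)
        y = r + torch.min(qv(q1_targ, s1, a1), qv(q2_targ, s1, a1))
    critic_loss = ((qv(q1, s, a) - y) ** 2).mean() + ((qv(q2, s, a) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    if step % K == 0:                              # delayed policy and target updates
        actor_loss = -qv(q1, s, mu(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for targ, online in ((mu_targ, mu), (q1_targ, q1), (q2_targ, q2)):
            soft_update(targ, online, GAMMA_EMA)

# one dummy update
batch = (torch.randn(32, STATE_DIM), torch.rand(32, ACTION_DIM) * 2 - 1,
         torch.zeros(32), torch.randn(32, STATE_DIM))
td3_step(step=0, batch=batch)
```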
Stochastic Gaussian policy
Besides DPG, continuous control problems can also be solved with a stochastic Gaussian policy, which assumes the policy function follows a Gaussian distribution:
$$\pi(a\mid s;\theta,\delta)=\prod_{i=1}^{d}\frac{1}{\sqrt{2\pi}\,\delta_i}\exp\Big(-\frac{(a_i-\mu_i)^2}{2\delta_i^2}\Big)$$
Two neural networks, $\mu(s;\theta)$ and $\rho(s;\theta)$, are used to fit the mean $\mu$ of the Gaussian distribution and its log-variance $\ln\delta^2$, respectively. The structure of these mean and variance networks (also called the auxiliary network) is shown in the figure below.
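A minimal sketch of the auxiliary network, assuming a small fully connected trunk (the original uses a convolutional trunk for image states) with two heads for the mean and the log-variance:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions

class GaussianPolicy(nn.Module):
    """Auxiliary network: shared trunk with two heads, mu(s; theta) for the mean
    and rho(s; theta) for the log-variance of each action dimension."""
    def __init__(self):
        super().__init__()
        # a small fully connected trunk is assumed here for simplicity
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, ACTION_DIM)      # mu(s; theta)
        self.log_var_head = nn.Linear(64, ACTION_DIM)   # rho(s; theta) = ln(variance)

    def forward(self, s):
        h = self.trunk(s)
        return self.mean_head(h), self.log_var_head(h)

policy = GaussianPolicy()
s = torch.randn(1, STATE_DIM)
mean, log_var = policy(s)
std = (0.5 * log_var).exp()                        # standard deviation from ln(variance)
dist = torch.distributions.Normal(mean, std)       # independent Gaussian per action dim
a = dist.sample()                                  # action drawn from pi(.|s; theta)
log_prob = dist.log_prob(a).sum(dim=-1)            # ln pi(a|s; theta)
```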
The training procedure of the stochastic Gaussian policy is as follows (a code sketch is given after the list):
- Observe the current state $s_t$, compute the mean $\mu(s_t;\theta)$ and the variance $\exp(\rho(s_t;\theta))$, and sample an action $a$ from the resulting Gaussian distribution.
- Compute (an estimate of) the action-value function $Q_\pi(s,a)$.
- Use backpropagation to compute the gradient of $\ln\pi(a\mid s;\theta)$ with respect to the auxiliary network's parameters $\theta$, i.e. $\nabla_\theta\ln\pi(a\mid s;\theta)$.
- Compute the stochastic policy gradient $Q_\pi(s,a)\,\nabla_\theta\ln\pi(a\mid s;\theta)$.
- Update the parameters of the auxiliary network by gradient ascent: $\theta_{new}=\theta_{now}+\beta\,Q_\pi(s,a)\,\nabla_\theta\ln\pi(a\mid s;\theta)$
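A hedged sketch of one such update, using the log-probability of the sampled action so that autograd computes $\nabla_\theta\ln\pi(a\mid s;\theta)$; the networks, learning rate, and the constant $Q_\pi(s,a)$ value are assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, BETA = 8, 2, 1e-3            # assumed dimensions and learning rate

trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
mean_head = nn.Linear(64, ACTION_DIM)               # mu(s; theta)
log_var_head = nn.Linear(64, ACTION_DIM)            # rho(s; theta)
params = (list(trunk.parameters()) + list(mean_head.parameters())
          + list(log_var_head.parameters()))
optimizer = torch.optim.SGD(params, lr=BETA)

def gaussian_policy_step(s_t, q_sa):
    """One update: theta <- theta + beta * Q_pi(s,a) * grad_theta ln pi(a|s; theta)."""
    h = trunk(s_t)
    mean, log_var = mean_head(h), log_var_head(h)
    dist = torch.distributions.Normal(mean, (0.5 * log_var).exp())
    a = dist.sample()                               # sample the executed action (no gradient path)
    log_prob = dist.log_prob(a).sum(dim=-1)         # ln pi(a|s; theta)
    # minimizing -Q_pi(s,a) * ln pi(a|s) performs gradient ascent on Q_pi(s,a) * ln pi(a|s)
    loss = -(q_sa * log_prob).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return a

# usage with a dummy state and a Q_pi(s,a) estimate (e.g. from REINFORCE returns or a critic)
a = gaussian_policy_step(torch.randn(1, STATE_DIM), q_sa=torch.tensor(1.7))
```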
In the procedure above, $Q_\pi(s,a)$ can be approximated with the REINFORCE method or the Actor-Critic method; see the post Reinforcement learning — Strategy learning.
At test time, the state $s$ is fed into a convolutional neural network, whose output passes through two parallel fully connected heads to produce the mean and log-variance of the Gaussian distribution, $\mu(s;\theta)$ and $\rho(s;\theta)$; the action to execute is then sampled from that Gaussian distribution.