Reinforcement learning - continuous control
2022-07-28 06:10:00 【Food to doubt life】
Preface
This post summarizes the continuous-control chapter of Deep Reinforcement Learning. If there are any mistakes, corrections are welcome.
Continuous control
The reinforcement learning methods summarized in previous posts all assume a discrete, finite action space. The action space is not always discrete, however; it can also be continuous. When driving a vehicle, for example, the steering angle takes values in a continuous range. One feasible workaround is to discretize the action space; alternatively, reinforcement learning methods designed for continuous control can be used directly. This post summarizes the deterministic policy gradient (DPG) algorithm.
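As a quick illustration of the discretization workaround (not from the book; the angle range and bin count are arbitrary assumptions), a continuous steering angle can be mapped to a finite set of candidate actions:

```python
import numpy as np

# Hypothetical example: discretize a steering angle in [-30, 30] degrees
# into 9 evenly spaced candidate actions, so that discrete-action methods
# (e.g. DQN) can be applied to a continuous-control problem.
NUM_BINS = 9
candidate_angles = np.linspace(-30.0, 30.0, NUM_BINS)   # shape (9,)

def discrete_to_continuous(action_index: int) -> float:
    """Map a discrete action index chosen by the agent to a steering angle."""
    return float(candidate_angles[action_index])

print(candidate_angles)           # [-30., -22.5, -15., ..., 30.]
print(discrete_to_continuous(4))  # 0.0 (straight ahead)
```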
DPG
DPG is a policy-learning method. Specifically, DPG uses the Actor-Critic framework: a value network (critic) assists the training of the policy network (actor). The framework of the method is shown in the figure below.
The input of the policy network is the state $s$ and its output is the specific action the agent performs, $a=\mu(s;\theta)$. In the methods introduced earlier, the policy network outputs the probability of each action, whereas DPG outputs a single deterministic action. The state $s$ and the action $a=\mu(s;\theta)$ are then fed into the value network, which gives the action a score $q(s,a;w)$.
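A minimal PyTorch sketch of the two networks just described; the state/action dimensions, hidden sizes, and the tanh squashing are assumptions for illustration, not details from the original post:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2   # assumed dimensions

class PolicyNet(nn.Module):
    """Deterministic policy a = mu(s; theta)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh(),  # actions squashed to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class ValueNet(nn.Module):
    """Action-value estimate q(s, a; w)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

mu, q = PolicyNet(), ValueNet()
s = torch.randn(1, STATE_DIM)
a = mu(s)        # deterministic action
score = q(s, a)  # critic's score of that action
```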
The optimization objective of DPG
DPG optimizes the objective
$$\max_\theta J(\theta)=\max_\theta\,\mathbb{E}_S\big[q\big(S,\mu(S;\theta);w\big)\big]\tag{1.0}$$
where $\theta$ denotes the parameters of the policy network. The idea is that, whatever state is encountered, the action output by the policy network should receive as high a score as possible from the value network. The gradient of Equation (1.0) with respect to $\theta$ is
$$\nabla_\theta J(\theta)=\mathbb{E}_S\Big[\nabla_\theta\mu(S;\theta)\cdot\nabla_{a}q(S,a;w)\big|_{a=\mu(S;\theta)}\Big]\tag{1.1}$$
It is worth noting that DPG, as a policy-learning algorithm, should aim to maximize the state-value function, yet Equation (1.1) differs from the stochastic policy gradient $\nabla_\theta J'(\theta)$ derived in the post Reinforcement learning — Strategy learning, whose expression is
$$\nabla_\theta J'(\theta)=\mathbb{E}_S\Big[\mathbb{E}_{A\sim\pi(\cdot\mid S;\theta)}\big[Q_\pi(S,A)\,\nabla_\theta\ln\pi(A\mid S;\theta)\big]\Big]\tag{1.2}$$
Suppose the deterministic policy $\mu(s;\theta)$ outputs a $d$-dimensional vector whose $i$-th element is denoted $\mu_i$, and let the stochastic policy output the probability distribution
$$\pi(a\mid s;\theta,\delta)=\prod_{i=1}^{d}\frac{1}{\sqrt{2\pi}\,\delta_i}\exp\Big(-\frac{(a_i-\mu_i)^2}{2\delta_i^2}\Big)$$
When $\delta=[\delta_1,\delta_2,\dots,\delta_d]$ tends to the zero vector, it can be shown that (see the DPG paper, Deterministic Policy Gradient Algorithms, for the proof)
$$\lim_{\delta\to 0}\nabla_\theta J'(\theta)=\nabla_\theta J(\theta)$$
That is, the deterministic policy gradient (Equation 1.1) is a limiting special case of the stochastic policy gradient (Equation 1.2), so optimizing Equation (1.1) also maximizes the state-value function.
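In practice, Equation (1.1) is seldom coded by hand: maximizing $q(s,\mu(s;\theta);w)$ with automatic differentiation produces the same gradient via the chain rule. A hedged PyTorch sketch (network shapes and the Adam optimizer are assumptions):

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                      # assumed dimensions
mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())         # policy mu(s; theta)
q  = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                   nn.Linear(64, 1))                             # critic q(s, a; w)
policy_opt = torch.optim.Adam(mu.parameters(), lr=1e-3)

s = torch.randn(32, STATE_DIM)                    # a batch of states
a = mu(s)                                         # a = mu(s; theta)
# Ascend on q(s, mu(s; theta); w): autograd applies the chain rule of Eq. (1.1),
# i.e. grad_theta mu(s; theta) * grad_a q(s, a; w) evaluated at a = mu(s; theta).
policy_loss = -q(torch.cat([s, a], dim=-1)).mean()
policy_opt.zero_grad()
policy_loss.backward()
policy_opt.step()       # only the policy parameters theta are updated; w is left untouched here
```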
On-Policy DPG
On-Policy DPG trains the policy network and the value network with the Actor-Critic framework. The steps are as follows (a code sketch is given after the list):
- Observe the current state $s_t$ and feed it into the policy network to obtain the agent's action $\mu(s_t;\theta)$. After executing the action, the agent observes the new state $s_{t+1}$ and the reward $r_t$. Feed $s_{t+1}$ into the policy network to obtain the action $\mu(s_{t+1};\theta)$.
- Compute $\hat q_t=q\big(s_t,\mu(s_t;\theta);w_{now}\big)$ and $\hat q_{t+1}=q\big(s_{t+1},\mu(s_{t+1};\theta);w_{now}\big)$.
- Update the value network $q(s,a;w)$ using the Bellman equation: $w_{new}=w_{now}-\alpha\,\big[\hat q_t-(r_t+\hat q_{t+1})\big]\,\nabla_{w}q\big(s_t,\mu(s_t;\theta);w_{now}\big)$
- Update the policy network: $\theta_{new}=\theta_{now}+\beta\,\nabla_\theta\mu(s_t;\theta_{now})\cdot\nabla_{a}q(s_t,a;w_{now})\big|_{a=\mu(s_t;\theta_{now})}$
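Below is a minimal sketch of one on-policy DPG update, assuming small MLP networks and, to match the formulas above, no discount factor; all names and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions
mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())
q  = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                   nn.Linear(64, 1))
critic_opt = torch.optim.Adam(q.parameters(), lr=1e-3)   # learning rate alpha
actor_opt  = torch.optim.Adam(mu.parameters(), lr=1e-3)  # learning rate beta

def q_value(s, a):
    return q(torch.cat([s, a], dim=-1)).squeeze(-1)

def on_policy_dpg_step(s_t, r_t, s_t1):
    """One on-policy (SARSA-style) DPG update; tensors carry a batch dimension."""
    # 1) actions chosen by the current policy at s_t and s_{t+1}
    a_t, a_t1 = mu(s_t), mu(s_t1)
    # 2) TD target and value-network update (no discount factor, matching the formulas above)
    with torch.no_grad():
        td_target = r_t + q_value(s_t1, a_t1)
    critic_loss = ((q_value(s_t, a_t.detach()) - td_target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # 3) policy update: ascend on q(s_t, mu(s_t; theta); w)
    actor_loss = -q_value(s_t, mu(s_t)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# usage with dummy transitions
s_t, s_t1 = torch.randn(1, STATE_DIM), torch.randn(1, STATE_DIM)
on_policy_dpg_step(s_t, torch.tensor([0.5]), s_t1)
```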
Off-Policy DPG
The action output by the DPG policy network is deterministic, so on-policy DPG has difficulty exploring the environment sufficiently, and the network may converge to a poor local optimum. By contrast, the policy network of the stochastic policy gradient outputs a probability distribution and actions are sampled from it (even low-probability actions may be sampled), which lets the agent explore the environment more fully.
Off-Policy DPG addresses the insufficient exploration of on-policy DPG. Note that the value network of on-policy DPG approximates the action-value function, whereas the value network of off-policy DPG approximates the optimal action-value function: on-policy DPG trains the value network with the SARSA algorithm, while off-policy DPG trains it with Q-learning.
Off-policy DPG trains the policy network and the value network as follows (a code sketch follows the list):
- Before training starts, use some behavior policy to control the agent in the environment and collect a series of quadruples $(s_t,a_t,r_t,s_{t+1})$; all of these quadruples form an experience replay buffer.
- Draw a quadruple $(s_t,a_t,r_t,s_{t+1})$ from the replay buffer and use the policy network to compute $\hat a_t=\mu(s_t;\theta_{now})$ and $\hat a_{t+1}=\mu(s_{t+1};\theta_{now})$.
- Use the value network to compute (note which actions appear: $a_t$ comes from the replay buffer, $\hat a_{t+1}$ from the policy network) $\hat q_t=q(s_t,a_t;w_{now})$ and $\hat q_{t+1}=q(s_{t+1},\hat a_{t+1};w_{now})$.
- Update the parameters of the value network: $w_{new}=w_{now}-\alpha\,\big[\hat q_t-(r_t+\hat q_{t+1})\big]\,\nabla_{w}q(s_t,a_t;w_{now})$
- Update the parameters of the policy network: $\theta_{new}=\theta_{now}+\beta\,\nabla_\theta\mu(s_t;\theta_{now})\cdot\nabla_{a}q(s_t,a;w_{now})\big|_{a=\hat a_t}$
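A hedged sketch of one off-policy DPG update drawn from a replay buffer (essentially a DDPG-style update without target networks); the buffer size, batch size, and network shapes are assumptions:

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions
mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())
q  = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                   nn.Linear(64, 1))
critic_opt = torch.optim.Adam(q.parameters(), lr=1e-3)
actor_opt  = torch.optim.Adam(mu.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)              # stores (s, a, r, s') tuples

def q_value(s, a):
    return q(torch.cat([s, a], dim=-1)).squeeze(-1)

def off_policy_dpg_step(batch_size=32):
    """One off-policy (Q-learning style) DPG update from the replay buffer."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s1 = (torch.stack(x) for x in zip(*batch))
    # value network: the TD target uses the action proposed by the *current* policy at s'
    with torch.no_grad():
        td_target = r + q_value(s1, mu(s1))        # no discount factor, matching the text
    critic_loss = ((q_value(s, a) - td_target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # policy network: ascend on q(s, mu(s; theta); w)
    actor_loss = -q_value(s, mu(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# fill the buffer with dummy transitions, then do one update
for _ in range(64):
    replay_buffer.append((torch.randn(STATE_DIM), torch.randn(ACTION_DIM),
                          torch.tensor(0.0), torch.randn(STATE_DIM)))
off_policy_dpg_step()
```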
It is worth noting that off-policy DPG makes the value network approximate the optimal action-value function $Q_*(s,a)$, so it wants the action output by the policy network to satisfy
$$\mu(s;\theta)\approx\arg\max_a Q_*(s,a)$$
Because the value network of off-policy DPG is trained with the optimal Bellman equation, it suffers from overestimation caused by maximization and by bootstrapping (see the post Reinforcement learning — Value learning, DQN section). Twin Delayed Deep Deterministic Policy Gradient (TD3) addresses these problems. TD3 uses two value networks, two target value networks, one policy network, and one target policy network. The training procedure is as follows (a code sketch is given after the update equations below).
At initialization, randomly initialize the parameters of the two value networks, $w_1$ and $w_2$, and the parameters of the policy network, $\theta$. Then initialize the parameters of the two target value networks, $w_1^-$ and $w_2^-$, and of the target policy network, $\theta^-$, as
$$w_1^-=w_1,\qquad w_2^-=w_2,\qquad \theta^-=\theta$$
Before training starts, use some behavior policy to control the agent's interaction with the environment and collect a series of quadruples $(s_t,a_t,r_t,s_{t+1})$, which form an experience replay buffer.
During training, draw a quadruple $(s_j,a_j,r_j,s_{j+1})$ from the replay buffer and let the target policy network compute
$$\hat a_{j+1}^-=\mu(s_{j+1};\theta_{now}^-)+\epsilon$$
where $\epsilon$ is random noise drawn from a clipped (truncated) normal distribution, i.e. target policy smoothing. Then let the two target value networks make predictions: using target networks here alleviates the overestimation caused by bootstrapping, and taking the minimum of the two predictions in the next step (clipped double Q-learning) alleviates the overestimation caused by maximization:
$$\hat q_{1,j+1}^-=q(s_{j+1},\hat a_{j+1}^-;w_{1,now}^-),\qquad \hat q_{2,j+1}^-=q(s_{j+1},\hat a_{j+1}^-;w_{2,now}^-)$$
Compute the TD target
$$\hat y_j=r_j+\min\{\hat q_{1,j+1}^-,\;\hat q_{2,j+1}^-\}$$
Update the two value networks
$$\begin{aligned} w_{1,new}&=w_{1,now}-\alpha\,\big(\hat q_{1,j}-\hat y_j\big)\,\nabla_{w_1}q(s_j,a_j;w_{1,now})\\ w_{2,new}&=w_{2,now}-\alpha\,\big(\hat q_{2,j}-\hat y_j\big)\,\nabla_{w_2}q(s_j,a_j;w_{2,now}) \end{aligned}$$
where $\hat q_{1,j}=q(s_j,a_j;w_{1,now})$ and $\hat q_{2,j}=q(s_j,a_j;w_{2,now})$. Every $k$ iterations, update the policy network and the three target networks once:
- Let the policy network compute $\hat a_j=\mu(s_j;\theta_{now})$, then update the policy network: $\theta_{new}=\theta_{now}+\beta\,\nabla_\theta\mu(s_j;\theta_{now})\cdot\nabla_{a}q(s_j,a;w_{1,now})\big|_{a=\hat a_j}$
- Update the parameters of the three target networks with an exponential moving average, where $\gamma$ is a hyperparameter:
$$\begin{aligned} \theta_{new}^-&=\gamma\,\theta_{new}+(1-\gamma)\,\theta_{now}^-\\ w_{1,new}^-&=\gamma\,w_{1,new}+(1-\gamma)\,w_{1,now}^-\\ w_{2,new}^-&=\gamma\,w_{2,new}+(1-\gamma)\,w_{2,now}^- \end{aligned}$$
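The following sketch puts the TD3 steps together: clipped noise on the target action, the minimum over two target critics, delayed policy updates every $k$ steps, and exponential-moving-average target updates with rate $\gamma$. Hyperparameter values and network shapes are assumptions:

```python
import copy
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions
def make_actor():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, ACTION_DIM), nn.Tanh())
def make_critic():
    return nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 1))

mu, q1, q2 = make_actor(), make_critic(), make_critic()
mu_targ, q1_targ, q2_targ = (copy.deepcopy(m) for m in (mu, q1, q2))  # theta^-, w1^-, w2^-
actor_opt  = torch.optim.Adam(mu.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=1e-3)
GAMMA_EMA, NOISE_STD, NOISE_CLIP, K = 0.005, 0.2, 0.5, 2   # assumed hyperparameters

def qv(net, s, a):
    return net(torch.cat([s, a], dim=-1)).squeeze(-1)

def soft_update(target, online, rate):
    """EMA target update: theta^- <- rate * theta + (1 - rate) * theta^-."""
    for p_t, p in zip(target.parameters(), online.parameters()):
        p_t.data.mul_(1 - rate).add_(rate * p.data)

def td3_step(step, batch):
    s, a, r, s1 = batch
    with torch.no_grad():
        # target policy smoothing: add clipped noise to the target action
        eps = (torch.randn_like(a) * NOISE_STD).clamp(-NOISE_CLIP, NOISE_CLIP)
        a1 = (mu_targ(s1) + eps).clamp(-1.0, 1.0)
        # clipped double-Q: take the smaller of the two target critics (no discount, as in the text)
        y = r + torch.min(qv(q1_targ, s1, a1), qv(q2_targ, s1, a1))
    critic_loss = ((qv(q1, s, a) - y) ** 2).mean() + ((qv(q2, s, a) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    if step % K == 0:                              # delayed policy and target updates
        actor_loss = -qv(q1, s, mu(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for targ, online in ((mu_targ, mu), (q1_targ, q1), (q2_targ, q2)):
            soft_update(targ, online, GAMMA_EMA)

# one dummy update
batch = (torch.randn(32, STATE_DIM), torch.rand(32, ACTION_DIM) * 2 - 1,
         torch.zeros(32), torch.randn(32, STATE_DIM))
td3_step(step=0, batch=batch)
```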
Stochastic Gaussian policy
Besides DPG, continuous control problems can also be solved with a stochastic Gaussian policy, which assumes the policy function follows a Gaussian distribution:
$$\pi(a\mid s;\theta,\delta)=\prod_{i=1}^{d}\frac{1}{\sqrt{2\pi}\,\delta_i}\exp\Big(-\frac{(a_i-\mu_i)^2}{2\delta_i^2}\Big)$$
Two neural networks, $\mu(s;\theta)$ and $\rho(s;\theta)$, are used to fit the mean $\mu$ of the Gaussian distribution and its log-variance $\ln\delta^2$, respectively. The structure of these mean and variance networks (also called the auxiliary network) is shown in the figure below.
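A minimal sketch of the auxiliary network, assuming a small fully connected trunk (the original uses a convolutional trunk for image states) with two heads for the mean and the log-variance:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # assumed dimensions

class GaussianPolicy(nn.Module):
    """Auxiliary network: shared trunk with two heads, mu(s; theta) for the mean
    and rho(s; theta) for the log-variance of each action dimension."""
    def __init__(self):
        super().__init__()
        # a small fully connected trunk is assumed here for simplicity
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, ACTION_DIM)      # mu(s; theta)
        self.log_var_head = nn.Linear(64, ACTION_DIM)   # rho(s; theta) = ln(variance)

    def forward(self, s):
        h = self.trunk(s)
        return self.mean_head(h), self.log_var_head(h)

policy = GaussianPolicy()
s = torch.randn(1, STATE_DIM)
mean, log_var = policy(s)
std = (0.5 * log_var).exp()                        # standard deviation from ln(variance)
dist = torch.distributions.Normal(mean, std)       # independent Gaussian per action dim
a = dist.sample()                                  # action drawn from pi(.|s; theta)
log_prob = dist.log_prob(a).sum(dim=-1)            # ln pi(a|s; theta)
```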
The training procedure of the stochastic Gaussian policy is as follows (a code sketch is given after the list):
- Observe the current state $s_t$, compute the mean $\mu(s_t;\theta)$ and the variance $\exp(\rho(s_t;\theta))$, and sample an action $a$ from the resulting Gaussian distribution.
- Compute (an estimate of) the action-value function $Q_\pi(s,a)$.
- Use backpropagation to compute the gradient of $\ln\pi(a\mid s;\theta)$ with respect to the auxiliary network's parameters $\theta$, i.e. $\nabla_\theta\ln\pi(a\mid s;\theta)$.
- Compute the stochastic policy gradient $Q_\pi(s,a)\,\nabla_\theta\ln\pi(a\mid s;\theta)$.
- Update the parameters of the auxiliary network by gradient ascent: $\theta_{new}=\theta_{now}+\beta\,Q_\pi(s,a)\,\nabla_\theta\ln\pi(a\mid s;\theta)$
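A hedged sketch of one such update, using the log-probability of the sampled action so that autograd computes $\nabla_\theta\ln\pi(a\mid s;\theta)$; the networks, learning rate, and the constant $Q_\pi(s,a)$ value are assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, BETA = 8, 2, 1e-3            # assumed dimensions and learning rate

trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
mean_head = nn.Linear(64, ACTION_DIM)               # mu(s; theta)
log_var_head = nn.Linear(64, ACTION_DIM)            # rho(s; theta)
params = (list(trunk.parameters()) + list(mean_head.parameters())
          + list(log_var_head.parameters()))
optimizer = torch.optim.SGD(params, lr=BETA)

def gaussian_policy_step(s_t, q_sa):
    """One update: theta <- theta + beta * Q_pi(s,a) * grad_theta ln pi(a|s; theta)."""
    h = trunk(s_t)
    mean, log_var = mean_head(h), log_var_head(h)
    dist = torch.distributions.Normal(mean, (0.5 * log_var).exp())
    a = dist.sample()                               # sample the executed action (no gradient path)
    log_prob = dist.log_prob(a).sum(dim=-1)         # ln pi(a|s; theta)
    # minimizing -Q_pi(s,a) * ln pi(a|s) performs gradient ascent on Q_pi(s,a) * ln pi(a|s)
    loss = -(q_sa * log_prob).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return a

# usage with a dummy state and a Q_pi(s,a) estimate (e.g. from REINFORCE returns or a critic)
a = gaussian_policy_step(torch.randn(1, STATE_DIM), q_sa=torch.tensor(1.7))
```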
In the procedure above, $Q_\pi(s,a)$ can be approximated with the REINFORCE method or the Actor-Critic method; see the post Reinforcement learning — Strategy learning.
At test time, the state $s$ is fed into a convolutional neural network, whose output passes through two parallel fully connected heads to produce the mean and log-variance of the Gaussian distribution, $\mu(s;\theta)$ and $\rho(s;\theta)$; the action to execute is then sampled from that Gaussian distribution.