Actor-Critic combines value learning with policy learning. The actor is a policy network that controls the agent's actions; think of it as an athlete. The critic is a value network that scores those actions, like a referee.
4. Actor-Critic
4.1 Building the value network and the policy network
a. Principle introduction
State value function:
$ V_\pi(s)=\sum_{a}\pi(a|s)\cdot Q_\pi(s,a)$ (discrete case; if the actions are continuous, the sum becomes an integral)
\(V_\pi\) is the expectation of the action value function \(Q_\pi\) over the actions: the policy function \(\pi(a|s)\) controls how the agent acts, and \(Q_\pi(s,a)\) evaluates how good each action is. We know neither of these two functions, but we can approximate each with a neural network and then use the Actor-Critic method to learn both networks at the same time.
Policy network (actor): use a network \(\pi(a|s;\theta)\) to approximate \(\pi(a|s)\), where \(\theta\) denotes the network parameters.
Value network (critic): use a network \(q(s,a;w)\) to approximate \(Q_\pi(s,a)\), where \(w\) denotes the network parameters.
The actor is like a gymnast: it can already perform actions and wants to perform better, but it does not know how to improve. That is why a referee is needed to score it; once the athlete knows which actions earn high scores and which earn low scores, it can improve itself so that its scores keep rising.
Thus: $ V_\pi(s)=\sum_{a}\pi(a|s)\cdot Q_\pi(s,a)\approx\sum_a\pi(a|s;\theta)\cdot q(s,a;w)$
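For concreteness, a small hypothetical example with three actions (the numbers are made up purely for illustration): if \(\pi(\cdot|s;\theta)=(0.2,\,0.5,\,0.3)\) and \(q(s,\cdot\,;w)=(1.0,\,2.0,\,0.5)\), then

$ V(s;\theta,w)\approx 0.2\times 1.0+0.5\times 2.0+0.3\times 0.5=1.35 $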
b. Building the actor
The policy network:
- Input: the state \(s\)
- Output: a probability distribution over the possible actions
- \(\mathcal{A}\) is the action set, e.g. \(\mathcal{A}\) = { "left","right","up"}
- \(\sum_{a\in\mathcal{A}}\pi(a|s;\theta)=1\)
A convolution layer (Conv) turns the state into a feature vector; one or more fully connected layers (Dense) map the feature vector to a vector with one entry per action, and after softmax normalization we obtain the probability of each action.
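A minimal sketch of such a policy network in PyTorch, assuming an image-like state of shape (4, 84, 84) and the three-action set above; the class name and layer sizes are illustrative, not from the original course code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, n_actions: int = 3):
        super().__init__()
        # Conv layers turn the state into a feature vector.
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Dense layers map the feature vector to one score per action.
        self.fc = nn.Sequential(
            nn.Linear(32 * 9 * 9, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        logits = self.fc(self.conv(s))
        # Softmax normalization: outputs are non-negative and sum to 1.
        return F.softmax(logits, dim=-1)
```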
c. Building the critic
Input: two things, the state \(s\) and the action \(a\)
Output: an approximation of the action value (a scalar)
If the action is discrete, it can be represented with one-hot encoding, e.g. left is [1,0,0] and right is [0,1,0]. Features are extracted from the two inputs separately with convolution layers and fully connected layers, producing two feature vectors. The two vectors are then concatenated into a longer feature vector, and finally a fully connected layer outputs a single real number: the score the referee gives the athlete. This score indicates how good it is to take action \(a\) in state \(s\). The value network may share its convolution-layer parameters with the policy network, or be completely independent of it.
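A matching sketch of the value network, assuming the same image-like state and a one-hot encoded action; again, all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, n_actions: int = 3):
        super().__init__()
        # Feature extractor for the state (could share parameters with the actor).
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Feature extractor for the one-hot action.
        self.act_fc = nn.Sequential(nn.Linear(n_actions, 32), nn.ReLU())
        # Concatenate the two feature vectors, then output a single scalar score.
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9 + 32, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, s: torch.Tensor, a_onehot: torch.Tensor) -> torch.Tensor:
        feat = torch.cat([self.conv(s), self.act_fc(a_onehot)], dim=-1)
        return self.head(feat).squeeze(-1)   # q(s, a; w), one real number per sample
```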
4.2 Actor-Critic Method
Training the policy network and the value network at the same time is called the Actor-Critic method.
Definition: use neural networks to approximate the two functions above.
Training: update the parameters \(\theta\) and \(w\) (a minimal setup sketch follows this list).
- Updating the policy network \(\pi(a|s;\theta)\) aims to increase the value of \(V(s;\theta,w)\)
  - The supervision signal is provided only by the value network
  - The athlete (actor) keeps improving its level according to the scores given by the referee (critic)
- Updating the value network \(q(s,a;w)\) aims to make its scoring more accurate
  - The supervision signal comes only from the environment's rewards
  - At first the referee scores essentially at random, but it improves its scoring level using the rewards given by the environment, bringing its scores closer to the true values
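As a concrete picture of the two parameter sets \(\theta\) and \(w\) being trained together, here is a minimal setup sketch. It assumes the PolicyNet and ValueNet sketches from section 4.1; the plain SGD optimizers and learning rates are illustrative choices matching the \(\alpha\), \(\beta\) update rules below, not code from the original course:

```python
import torch

# Actor (policy network, parameters theta) and critic (value network, parameters w),
# assumed defined as in the sketches of section 4.1.
actor = PolicyNet(n_actions=3)
critic = ValueNet(n_actions=3)

# One optimizer per parameter set; plain SGD matches the update formulas below.
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-4)    # learning rate beta
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-3)  # learning rate alpha
```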
Step summary:
- Observe the state \(s_t\)
- With \(s_t\) as input, randomly sample an action \(a_t\) from the policy network \(\pi(\cdot|s_t;\theta_t)\)
- Execute the action \(a_t\), then observe the new state \(s_{t+1}\) and the reward \(r_t\)
- Use the reward \(r_t\) and the TD algorithm to update \(w\) in the value network, making the referee more accurate
- Use the policy gradient algorithm to update \(\theta\) in the policy network, making the athlete better
a. Updating the value network with TD
Use the value network to score the actions \(a_t\) and \(a_{t+1}\), i.e. compute \(q(s_t, a_t; w_t)\) and \(q(s_{t+1}, a_{t+1}; w_t)\).
Compute the TD target: \(y_t = r_t + \gamma \cdot q(s_{t+1},a_{t+1};w_t)\).
Compared with the policy learning in the previous note, where Monte Carlo was needed to approximate \(q(s_t,a_t;w_t)\), approximating it with a value network is more practical.
The loss function measures the gap between the predicted value and a target that incorporates the real reward.
Loss: \(L(w)=\frac{1}{2}[q(s_t,a_t;w)-y_t]^2\)
Gradient descent: \(w_{t+1} = w_t -\alpha \cdot \frac{\partial L(w)}{\partial w}\big|_{w=w_t}\)
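A sketch of this TD update in PyTorch, assuming the ValueNet sketch above, one-hot action tensors, and that the next action was sampled from the current policy; the function and variable names are illustrative:

```python
import torch

def td_update(critic, critic_opt, s, a_onehot, r, s_next, a_next_onehot, gamma=0.99):
    # TD target y_t = r_t + gamma * q(s_{t+1}, a_{t+1}; w), treated as a fixed label.
    with torch.no_grad():
        y = r + gamma * critic(s_next, a_next_onehot)
    q = critic(s, a_onehot)
    loss = 0.5 * (q - y).pow(2).mean()   # L(w) = 1/2 [q(s_t, a_t; w) - y_t]^2
    critic_opt.zero_grad()
    loss.backward()                      # autograd computes dL/dw
    critic_opt.step()                    # w <- w - alpha * dL/dw
    return loss.item()
```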
b. Updating the policy network with the policy gradient
The state value function \(V\) is the athlete's average score over all actions:
\(V(s;\theta,w)=\sum_a\pi(a|s;\theta)\cdot q(s,a;w)\)
Policy gradient: the derivative of the function \(V(s;\theta,w)\) with respect to the parameters \(\theta\).
- \(g(a,\theta)=\frac{\partial \log \pi(a|s;\theta)}{\partial \theta} \cdot q(s,a;w)\), where \(q\) plays the role of the referee's score
- \(\frac{\partial V(s;\theta,w)}{\partial \theta} = \mathbb{E}_{A}[g(A,\theta)]\): the policy gradient is the expectation of the function \(g\) over the random action \(A\). Computing this expectation exactly is hard, so we approximate it by Monte Carlo sampling.
Algorithm:
Randomly sample an action from the policy network: \(a \sim \pi(\cdot|s_t;\theta_t)\).
Because \(a\) is sampled from \(\pi\), \(g(a,\theta)\) is an unbiased estimate of the policy gradient.
With the stochastic gradient \(g\), perform gradient ascent: \(\theta_{t+1} = \theta_t + \beta \cdot g(a,\theta_t)\), where \(\beta\) is the learning rate.
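A corresponding sketch of one policy-gradient update of the actor (again PyTorch; names are illustrative, and the critic's score is treated as a constant when differentiating, so autograd only touches \(\theta\)):

```python
import torch

def policy_gradient_update(actor, actor_opt, critic, s, a_index, a_onehot):
    probs = actor(s)                                  # pi(.|s_t; theta), shape (batch, n_actions)
    log_prob = torch.log(probs.gather(1, a_index))    # log pi(a_t|s_t; theta)
    with torch.no_grad():
        q = critic(s, a_onehot)                       # referee's score q(s_t, a_t; w)
    # Gradient ascent on V == gradient descent on -log pi(a|s) * q.
    loss = -(log_prob.squeeze(1) * q).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()                                  # theta <- theta + beta * g(a, theta)
    return loss.item()
```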
c. Walking through the process
Now let's walk through the process with the athlete-and-referee analogy:
First, the athlete (the policy network) observes the current state \(s\) and controls the agent to make an action \(a\). The athlete wants to improve, but it does not know how to get better (there is no criterion yet), so a referee is introduced to evaluate it:
After the athlete makes the action, the referee (the value network) gives the athlete (the policy network) a score \(q\) based on \(a\) and \(s\); the athlete then uses \(q\) to improve itself:
The athlete's "technique" refers to the parameters of the policy network; as argued before, the better the parameters, the better the performance. In this model the athlete uses \(s\), \(q\) and \(a\) to compute the policy gradient and updates its parameters by gradient ascent. As the technique improves, the athlete's average score (i.e. the value \(q\)) keeps getting higher.
But note: so far the athlete is always judged by the referee, and the standard \(q\) is whatever the referee says. A higher average score only means a higher level if the referee judges well, so we also need to improve the referee.
A personal take: the Actor-Critic idea has two parts. First, make the criterion itself closer to the ground truth; second, under the given criterion, make the execution earn higher scores.
For the value network (the referee), the initial scoring is essentially random. The referee relies on the reward \(r\) to improve its scoring; the reward plays the role of the ground-truth judgment. The value network produces scores \(q\) from \(a\) and \(s\), and uses two consecutive scores \(q_t\), \(q_{t+1}\) together with \(r_t\) to update its parameters via the TD algorithm, improving its accuracy.
d. Algorithm summary
Let's give a slightly more formal summary:
1. Observe the old state \(s_t\) and randomly sample an action \(a_t\) from the policy network \(\pi(\cdot|s_t;\theta_t)\).
2. The agent executes the action \(a_t\); the environment returns the new state \(s_{t+1}\) and the reward \(r_t\).
3. With the new state \(s_{t+1}\) as input, use the policy network \(\pi\) to compute new probabilities and randomly sample a new action \(\tilde{a}_{t+1} \sim \pi(\cdot|s_{t+1};\theta_t)\). This is only an imagined action: the agent does not execute it; it is used only to compute a Q value.
4. Compute the two outputs of the value network: \(q_t=q(s_t,a_t;w_t)\) and \(q_{t+1}=q(s_{t+1},\tilde{a}_{t+1};w_t)\). After this, \(\tilde{a}_{t+1}\) is discarded and never actually used.
5. Compute the TD error: \(\delta_t = q_t-\underbrace{(r_t+\gamma \cdot q_{t+1})}_{\text{TD target}}\).
6. Differentiate the value network: \(d_{w,t} = \frac{\partial q(s_t,a_t;w)}{\partial w}\big|_{w=w_t}\). (PyTorch and TensorFlow can do this differentiation automatically.)
7. Update the value network with the TD algorithm, making the referee score more accurately: \(w_{t+1}=w_t - \alpha \cdot \delta_t\cdot d_{w,t}\).
8. Differentiate the policy network \(\pi\): \(d_{\theta,t}=\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}\big|_{\theta=\theta_t}\). (Likewise, this can be differentiated automatically.)
9. Update the policy network with gradient ascent, raising the athlete's average score: \(\theta_{t+1} = \theta_t + \beta \cdot q_t \cdot d_{\theta,t}\), where \(q_t \cdot d_{\theta,t}\) is the Monte Carlo approximation of the policy gradient.
In each iteration we perform the nine steps above: make one action, observe one reward, and update the neural network parameters once. A per-iteration sketch in code follows below.
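A sketch of one full iteration covering the nine steps, assuming the PolicyNet/ValueNet sketches from section 4.1 and a simplified environment whose `step(action)` returns `(next_state, reward, done)`; all helper names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, actor_opt, critic_opt, env, s_t,
                      gamma=0.99, n_actions=3):
    # Steps 1-2: sample a_t ~ pi(.|s_t; theta_t) and let the agent execute it.
    probs_t = actor(s_t)                                   # shape (1, n_actions)
    a_t = torch.multinomial(probs_t, num_samples=1)        # shape (1, 1)
    s_next, r_t, done = env.step(a_t.item())               # s_{t+1}, r_t from the environment

    # Step 3: sample the imagined action a~_{t+1}; it is never executed.
    with torch.no_grad():
        a_tilde = torch.multinomial(actor(s_next), num_samples=1)
        # Step 4 (second half): q_{t+1} = q(s_{t+1}, a~_{t+1}; w_t)
        q_next = critic(s_next, F.one_hot(a_tilde.squeeze(1), n_actions).float())

    # Step 4 (first half) and step 5: q_t and the TD error delta_t.
    q_t = critic(s_t, F.one_hot(a_t.squeeze(1), n_actions).float())
    delta = q_t - (r_t + gamma * q_next)                   # delta_t = q_t - TD target

    # Steps 6-7: autograd gives dq/dw; the update is w <- w - alpha * delta_t * dq/dw.
    critic_loss = 0.5 * delta.pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Steps 8-9: dlogpi/dtheta via autograd, then theta <- theta + beta * q_t * dlogpi/dtheta.
    log_prob = torch.log(probs_t.gather(1, a_t)).squeeze(1)
    actor_loss = -(log_prob * q_t.detach()).mean()   # use delta.detach() instead of q_t for the baseline variant
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return s_next, done
```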
In step 9 the policy gradient uses \(q_t\), the score given by the referee. Books and papers usually replace \(q_t\) with \(\delta_t\): using \(q_t\) is the standard algorithm, while using \(\delta_t\) is Policy Gradient with Baseline (which works better). Both are correct, and the expectations they compute are equal.
What is a baseline? Any quantity close to \(q_t\) can serve as the baseline, as long as it is not a function of \(a_t\).
Why is the baseline better? It reduces the variance of the gradient estimate, so training converges faster.
The equivalence will be discussed later; for now, just keep the idea in mind.
4.3 Summary
Our goal: make the state value function $ V_\pi(s)=\sum_{a}\pi(a|s)\cdot Q_\pi(s,a)$ as large as possible.
- Learning the function \(\pi\) directly is hard, so we approximate it with a neural network: the policy network \(\pi(a|s;\theta)\).
- Computing the policy gradient is difficult because we do not know the action value function \(Q_\pi\), so we approximate it with another neural network: the value network \(q(s,a;w)\).
During training:
- The agent takes actions given by the policy network (actor): \(a_t \sim \pi(\cdot|s_t;\theta)\)
- The value network \(q\) assists the training of \(\pi\) by providing scores
After training:
- Actions are still given by the policy network: \(a_t \sim \pi(\cdot|s_t;\theta)\)
- The value network \(q\) is no longer used
How to train:
Update the policy network with the policy gradient:
- Maximize the state value: $ V(s;\theta,w)=\sum_a\pi(a|s;\theta)\cdot q(s,a;w)$
- Compute the policy gradient and approximate it with Monte Carlo: \(\frac{\partial V(s;\theta,w)}{\partial \theta} = \mathbb{E}_{A}\left[\frac{\partial \log \pi(A|s;\theta)}{\partial \theta} \cdot q(s,A;w)\right]\)
- Perform gradient ascent.
Update the value network with the TD algorithm:
\(q_t = q(s_t,a_t;w)\) is the value network's estimate of the expected return;
TD target: \(y_t = r_t + \gamma \cdot q(s_{t+1},a_{t+1};w)\). \(y_t\) is also the value network's estimate of the expected return, but it incorporates a real reward, so it is more reliable; we therefore treat it as the target, analogous to a label in supervised learning.
Take half the squared difference between \(q_t\) and \(y_t\) as the loss and compute its gradient:
\(\frac{\partial}{\partial w}\frac{1}{2}(q_t-y_t)^2 = (q_t-y_t)\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}\)
Then perform gradient descent to shrink the gap between \(q_t\) and \(y_t\).
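Putting it together, a sketch of the outer training loop that repeatedly applies the per-iteration step sketched in section 4.2; it assumes the same simplified environment interface, and the step count is an arbitrary illustrative value:

```python
# Outer loop: one action, one reward, one parameter update per iteration.
s = env.reset()
for t in range(100_000):          # number of interaction steps, illustrative
    s, done = actor_critic_step(actor, critic, actor_opt, critic_opt, env, s)
    if done:                      # start a new episode when the current one ends
        s = env.reset()
```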
x. Reference tutorial
- Video course: Deep Reinforcement Learning (complete), on Bilibili
- Video original address :https://www.youtube.com/user/wsszju
- Courseware address :https://github.com/wangshusen/DeepLearning
- Note reference :