Reinforcement learning - learning notes 2 | value learning

2022-07-04 20:40:00 climerecho

Value-Based Reinforcement Learning: Value Learning

2. Value learning

2.1 Deep Q-Network (DQN)

In essence, value learning approximates the \(Q^*\) function with a neural network.

The agent's goal is to win the game. In the language of reinforcement learning, that means making the sum of rewards obtained by the end of the game as large as possible.

a. Q-star Function

Problem: suppose you knew the function \(Q^*(s,a)\); which action would be best?

Obviously, the best action is \(a^* = \mathop{argmax}\limits_{a}Q^*(s,a)\).

\(Q^*(s,a)\) scores every action. Like a prophet, it can tell you the average return of each action, and you simply choose the action with the highest average return.

But in reality no one can predict the future, and we do not know \(Q^*(s,a)\). Value learning is about learning a function that approximates \(Q^*(s,a)\) and using it to make decisions.

  • Solution: the Deep Q-Network (DQN), a neural network \(Q(s,a;w)\) that approximates the function \(Q^*(s,a)\).
  • The network's parameters are \(w\); its input is the state \(s\); its output is a score for every possible action, one score per action.
  • The network is trained from rewards; its action scores gradually improve and become more and more accurate.
  • Play Super Mario millions of times and you can train such a "prophet".

b. Example

The DQN architecture differs from task to task.

Take Super Mario as an example:

  • The screen image is the input.
  • Convolutional layers turn the image into a feature vector.
  • Several fully connected layers then map the features to an output vector.
  • The output vector contains the action scores: each element corresponds to one action, and the agent acts in the direction with the highest score. A sketch of such a network is given below.
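
A minimal PyTorch sketch of such a network follows. The 4-frame 84x84 input, the layer sizes, and the name `n_actions` are illustrative assumptions, not the exact architecture from the notes; only the convolution-then-fully-connected shape mirrors the description above.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of screen frames to one score per action (a sketch; sizes are assumptions)."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        # Convolutional layers turn the image into a feature vector.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Fully connected layers map the features to one score per action.
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 assumes 84x84 input frames
            nn.Linear(512, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(s))                 # shape: (batch, n_actions)

# Usage (shapes are assumptions):
# q = DQN(n_actions=5)(torch.zeros(1, 4, 84, 84))   # -> tensor of shape (1, 5)
# The agent then acts greedily: a = q.argmax(dim=1)
```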

c. Playing the game with a DQN

The concrete process of playing a game with a DQN is as follows:

Step-by-step explanation:

  • \(s_t \rightarrow a_t\): given the currently observed state \(s_t\), feed \(s_t\) into the DQN, score all actions with \(a_t=\mathop{argmax}\limits_{a}Q(s_t,a;w)\), and choose the highest-scoring action \(a_t\).
  • After the agent performs the action \(a_t\), the environment changes state: the new state \(s_{t+1}\) is drawn at random from the state-transition function \(p(\cdot|s_t,a_t)\).
  • The environment also returns the reward \(r_t\) for this step. Rewards are the supervision signal in reinforcement learning; the DQN is trained with them.
  • Given the new state \(s_{t+1}\), the DQN again scores all actions and the agent chooses the highest-scoring action \(a_{t+1}\).
  • After \(a_{t+1}\) is performed, the environment moves to state \(s_{t+2}\) and gives the reward \(r_{t+1}\).
  • This repeats until the game ends. A sketch of such a rollout is given below.
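
Below is a rough sketch of this interaction loop. It assumes the `DQN` network sketched earlier and an older Gym-style environment whose `reset()` returns a state and whose `step(a)` returns `(state, reward, done, info)`; these interface details are assumptions, not part of the notes.

```python
import torch

def play_one_episode(env, dqn):
    """Greedy rollout with a trained DQN (sketch; assumes a Gym-style env returning image states)."""
    s = env.reset()
    done, total_reward = False, 0.0
    while not done:
        with torch.no_grad():
            q_values = dqn(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
        a = int(q_values.argmax(dim=1))   # a_t = argmax_a Q(s_t, a; w)
        s, r, done, _ = env.step(a)       # environment samples s_{t+1} ~ p(.|s_t, a_t), returns r_t
        total_reward += r
    return total_reward
```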

2.2 TD Learning

How do we train the DQN? The most common method is Temporal Difference (TD) learning. The idea of TD learning can be illustrated with the following example:

a. Case analysis

Suppose you drive from New York to Atlanta, and a model \(Q(w)\) predicts the drive will take 1000 minutes. The prediction may be inaccurate; data from more trips is needed to train the model and make its predictions more accurate.

  1. Problem: what kind of data is needed, and how is the model updated?
  • Before departure, let the model make a prediction, denoted \(q\), \(q=Q(w)\), e.g. \(q=1000\). After arriving at the destination, it turns out the trip only took 860 minutes, so the true value is \(y=860\).

  • The true value \(y\) deviates from the prediction \(q\); this deviation is the loss.

  • The loss is defined as half the squared difference between the prediction and the true value: \(L=\frac{1}{2}(q-y)^2\)

  • Differentiate the loss \(L\) with respect to the parameters \(w\) and expand with the chain rule:

    \(\frac{\partial L}{\partial w} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial w} =(q-y)\cdot\frac{\partial Q(w)}{\partial w}\)

  • With the gradient in hand, gradient descent updates the model parameters \(w\): \(w_{t+1} = w_t - \alpha \cdot \frac{\partial L}{\partial w} \vert _{w=w_t}\). A tiny numerical sketch follows this list.
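
As a concrete toy sketch, assume the model is a single parameter with \(Q(w)=w\) (an assumption made purely for illustration); one naive update then looks like this:

```python
# Naive update: one full trip gives one gradient step (assumes the trivial model Q(w) = w).
w = 1000.0          # current prediction for NYC -> Atlanta, in minutes
y = 860.0           # observed total time
alpha = 0.1         # learning rate

grad = (w - y) * 1.0     # dL/dw = (q - y) * dQ/dw, and dQ/dw = 1 here
w = w - alpha * grad     # w moves from 1000 toward 860 (here: 986.0)
print(w)
```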

Drawback: this algorithm is naive, because the entire trip must be completed before the model can be updated even once.

Hence the question: can the model be updated without completing the whole trip?

TD learning addresses exactly this. Suppose we only drive as far as Washington, DC and stop there, never reaching Atlanta. The TD algorithm can still update the model, as follows:

  • Pre-departure prediction: NYC -> Atlanta takes 1000 minutes. This is a predicted value.

    Upon arriving in DC, we find the first leg took 300 minutes. This is a true observation, although only for part of the trip.

  • At this point the model says DC -> Atlanta will take 600 minutes.

  • The model originally predicted \(Q(w) = 1000\); the new estimate made at DC is \(300 + 600 = 900\). This new estimate of 900 is called the TD target.

    Remember this term; it will be used repeatedly. The TD target is the overall prediction obtained with the TD algorithm.

  • Although the TD target \(y=900\) is also an estimate, it is more reliable than the original 1000 minutes, because part of it is fact.

  • Treat the TD target \(y\) as if it were the true value: \(L = \frac{1}{2}(Q(w)-y)^2\), where \(Q(w) - y\) is called the TD error.

  • Differentiate: \(\frac{\partial L}{\partial w} =(1000-900)\cdot\frac{\partial Q(w)}{\partial w}\)

  • Gradient descent updates the model parameters \(w\): \(w_{t+1} = w_t - \alpha \cdot \frac{\partial L}{\partial w} \vert _{w=w_t}\). A numerical sketch of this update follows.
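
Under the same toy assumption \(Q(w)=w\), the TD update performed at DC looks like this:

```python
# TD update: a partial observation is enough for a gradient step (assumes Q(w) = w).
w = 1000.0                     # model's estimate for NYC -> Atlanta
observed_nyc_to_dc = 300.0     # real observation for the first leg
estimated_dc_to_atl = 600.0    # model's estimate for the remaining leg
alpha = 0.1

td_target = observed_nyc_to_dc + estimated_dc_to_atl   # 900: partly fact, partly estimate
td_error = w - td_target                                # 100
w = w - alpha * td_error * 1.0                          # dQ/dw = 1, so w becomes 990.0
print(w)
```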

b. Algorithm principle

Looking at it from another angle, the TD reasoning goes like this:

The model predicts NYC -> Atlanta = 1000 and DC -> Atlanta = 600; the difference implies NYC -> DC = 400. But the first leg actually took only 300 minutes. The gap between the estimated time and the real time is the TD error: \(\delta = 400-300 = 100\).

The goal of the TD algorithm is to drive the TD error as close to 0 as possible.

In other words, we use a partial truth to correct part of the prediction, which brings the overall prediction closer to reality. By repeatedly calibrating against the parts we know to be true, we get closer to the ideal situation.

c. Applying TD to the DQN

(1) Formula introduction

The example above relied on a formula of the following shape: \(T_{NYC\rightarrow ATL} \approx T_{NYC\rightarrow DC} + T_{DC\rightarrow ATL}\)

To use the TD algorithm, you need a formula of exactly this form: one term on the left, two terms on the right, one of which is actually observed.

Deep reinforcement learning happens to have a formula of exactly the same form: \(Q(s_t,a_t;w)\approx r_t+\gamma \cdot Q(s_{t+1},a_{t+1};w)\).

Explanation of the formula:

  1. The left-hand side is the DQN's estimate at time t: the expected sum of future rewards, analogous to the estimated total time from NYC to ATL.
  2. On the right, \(r_t\) is the actually observed reward, analogous to the NYC-to-DC leg.
  3. \(Q(s_{t+1},a_{t+1};w)\) is the DQN's estimate at time t+1, analogous to the estimated time from DC to ATL.
(2) Formula derivation

Why does such a formula hold?

Recall the discounted return: \(U_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\gamma^3 R_{t+3}+\cdots\)

Factoring out \(\gamma\) gives \(U_t = R_t + \gamma(R_{t+1}+ \gamma R_{t+2}+ \gamma^2 R_{t+3}+\cdots)\)

The terms in the parentheses are exactly \(U_{t+1}\).

So we obtain: \(U_t = R_t + \gamma \cdot U_{t+1}\)

Intuitively, this is the mathematical relationship between the discounted returns at two adjacent time steps. A quick numerical check of the recursion follows.
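
A quick check of \(U_t = R_t + \gamma \cdot U_{t+1}\) on made-up rewards with \(\gamma=0.9\) (both chosen only for the sake of the check):

```python
# Check U_t = R_t + gamma * U_{t+1} on a short, made-up reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]   # R_t, R_{t+1}, R_{t+2}, R_{t+3}

def discounted_return(rs, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rs))

u_t  = discounted_return(rewards, gamma)        # 4.807
u_t1 = discounted_return(rewards[1:], gamma)    # 4.23
assert abs(u_t - (rewards[0] + gamma * u_t1)) < 1e-12
print(u_t, rewards[0] + gamma * u_t1)           # both print 4.807
```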

(3) Application process

Now we apply the TD algorithm to the DQN:

  • At time t, the DQN output \(Q(s_t,a_t; w)\) is an estimate of \(\mathbb{E}[U_t]\), analogous to the estimated total time from NYC to ATL.
  • At the next time step, the DQN output \(Q(s_{t+1},a_{t+1}; w)\) is an estimate of \(\mathbb{E}[U_{t+1}]\), analogous to the re-estimated time from DC to ATL.
  • Because \(U_t = R_t + \gamma \cdot U_{t+1}\),
  • we have \(\underbrace{Q(s_t,a_t; w) }_{\approx\mathbb{E}[U_t]}\approx \mathbb{E}[R_t+\gamma \cdot \underbrace{Q(s_{t+1},a_{t+1}; w)}_{\approx\mathbb{E}[U_{t+1}]}]\)
  • \(\underbrace{Q(s_t,a_t;w)}_{prediction}=\underbrace{r_t+\gamma \cdot Q(s_{t+1},a_{t+1};w)}_{TD\ \ target}\)

With the prediction and the TD target, we can update the DQN's model parameters:

  • At time t, the model makes the prediction \(Q(s_t,a_t;w_t)\);

  • at time t+1, we observe the real reward \(r_t\) and the new state \(s_{t+1}\), and then compute the new action \(a_{t+1}\).

  • Now we can compute the TD target, denoted \(y_t\), where \(y_t=r_t+\gamma \cdot Q(s_{t+1},a_{t+1};w)\)

  • How is the action \(a_{t+1}\) at time t+1 computed? The DQN scores every action and takes the highest score, so this is a maximization of the Q function over \(a\): \(y_t = r_t + \gamma \cdot \mathop{max}\limits_{a} Q(s_{t+1},a;w_t)\)

  • We want the prediction \(Q(s_{t},a_{t};w)\) to be as close as possible to the TD target \(y_t\), so we take the difference between the two as the loss:

    \(L_t=\frac{1}{2}[Q(s_{t},a_{t};w)-y_t]^2\)

  • Do gradient descent, \(w_{t+1} = w_t - \alpha \cdot \frac{\partial L_t}{\partial w} \vert _{w=w_t}\), to update the model parameters \(w\) and make the loss smaller. A sketch of one such update follows.
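
A minimal sketch of one such update in PyTorch, for a single transition \((s_t, a_t, r_t, s_{t+1})\). It reuses the `DQN` sketch from section 2.1; the function name, the hyperparameters, and the absence of a replay buffer or target network are simplifying assumptions, not the notes' exact recipe.

```python
import torch

def td_update(dqn, optimizer, s_t, a_t, r_t, s_t1, done, gamma=0.99):
    """One TD step for a single transition (sketch; batching and replay buffers are omitted)."""
    # Prediction: q_t = Q(s_t, a_t; w)
    q_t = dqn(s_t.unsqueeze(0))[0, a_t]

    # TD target: y_t = r_t + gamma * max_a Q(s_{t+1}, a; w); no gradient flows through the target.
    with torch.no_grad():
        max_q_t1 = dqn(s_t1.unsqueeze(0)).max(dim=1).values[0]
        y_t = r_t + gamma * max_q_t1 * (1.0 - float(done))

    # Loss: L_t = 1/2 * (q_t - y_t)^2, then one gradient-descent step on w.
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```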

2.3 Summary

  1. Value learning (here, the DQN) is based on the optimal action-value function Q-star:

    \(Q^*(s_t,a_t) = \mathbb{E}[U_t|S_t=s_t,A_t=a_t]\)

    By taking the expectation of \(U_t\), it can score every action and reflect how good each action is; the agent is controlled with this function.

  2. The DQN uses a neural network \(Q(s,a;w)\) to approximate \(Q^*(s,a)\)

    1. The network's parameters are \(w\); the input is the agent's state \(s\)
    2. The output is the score of every possible action \(a \in A\)
  3. The TD algorithm procedure (a sketch of the full loop follows this list)

    1. Observe the current state \(S_t = s_t\) and the action already performed \(A_t = a_t\)

    2. Run the DQN once: the input is the state \(s_t\), and the output is the score of the action \(a_t\),

      denoted \(q_t\), \(q_t = Q(s_t,a_t;w_t)\)

    3. Backpropagate through the DQN to get the gradient: \(d_t = \frac{\partial Q(s_t,a_t;w)}{\partial w} |_{ w=w_t}\)

    4. Because the action \(a_t\) has been executed, the environment updates to \(s_{t+1}\) and gives the reward \(r_t\).

    5. Compute the TD target: \(y_t = r_t + \gamma \cdot \mathop{max}\limits_{a} Q(s_{t+1},a;w_t)\)

    6. Do one gradient descent step to update the parameters \(w\): \(w_{t+1} = w_t - \alpha \cdot (q_t-y_t) \cdot d_t\)

    7. Repeat the iteration...
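
Steps 1-7 strung together into a bare-bones training loop, reusing the `td_update` sketch above and the same assumed Gym-style environment. Real implementations also add exploration, a replay buffer, and a target network, none of which appear in these notes.

```python
import torch

def train(env, dqn, optimizer, n_episodes=1000, gamma=0.99):
    """Bare-bones TD training loop following steps 1-7 (sketch; greedy actions, no replay buffer)."""
    for _ in range(n_episodes):
        s_t = torch.as_tensor(env.reset(), dtype=torch.float32)
        done = False
        while not done:
            # Steps 1-2: observe s_t, score all actions, act greedily (exploration is omitted here).
            with torch.no_grad():
                a_t = int(dqn(s_t.unsqueeze(0)).argmax(dim=1))
            # Step 4: the environment returns s_{t+1} and r_t.
            s_next, r_t, done, _ = env.step(a_t)
            s_t1 = torch.as_tensor(s_next, dtype=torch.float32)
            # Steps 3, 5, 6: gradient of Q, TD target, and one descent step (td_update from above).
            td_update(dqn, optimizer, s_t, a_t, r_t, s_t1, done, gamma)
            s_t = s_t1
```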
