
Reinforcement Learning - Learning Notes 9 | Multi-Step TD Target

2022-07-07 21:48:00 climerecho

This note still belongs to the TD family of algorithms. The multi-step TD target is an improvement over the basic TD algorithm.

9. Multi-Step-TD-Target

9.1 Review Sarsa & Q-Learning

  1. Sarsa
    • Trains the action-value function \(Q_\pi(s,a)\);
    • TD target: \(y_t = r_t + \gamma\cdot Q_\pi(s_{t+1},a_{t+1})\)
  2. Q-Learning
    • Trains the optimal action-value function \(Q^*\);
    • TD target: \(y_t = r_t +\gamma \cdot \max\limits_{a} Q^*(s_{t+1},a)\)
  3. Note that in both algorithms, the reward part of the TD target contains only a single reward \(r_t\).
  4. If multiple rewards are used, RL tends to perform better; the multi-step TD target is based on this observation (see the sketch after this list).
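
To make the two one-step targets concrete, here is a minimal sketch (not from the original course) of how they could be computed from a single transition; the `q_next` array and the example numbers are purely hypothetical:

```python
import numpy as np

def sarsa_td_target(r_t, q_next, a_next, gamma=0.99):
    """One-step Sarsa target: y_t = r_t + gamma * Q_pi(s_{t+1}, a_{t+1}).

    q_next: estimated action values Q_pi(s_{t+1}, .) for the next state.
    a_next: the action the policy actually takes at s_{t+1} (on-policy).
    """
    return r_t + gamma * q_next[a_next]

def q_learning_td_target(r_t, q_next, gamma=0.99):
    """One-step Q-learning target: y_t = r_t + gamma * max_a Q*(s_{t+1}, a)."""
    return r_t + gamma * np.max(q_next)

# Hypothetical next-state values for 3 actions.
q_next = np.array([1.0, 2.5, 0.3])
print(sarsa_td_target(1.0, q_next, a_next=0))    # 1 + 0.99 * 1.0
print(q_learning_td_target(1.0, q_next))         # 1 + 0.99 * 2.5
```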

In the first chapter on basic concepts of reinforcement learning, we mentioned that the agent observes the following trajectory:

[Figure: the observed trajectory, with a single transition highlighted in a blue box]

Previously we used only one transition to record the action and reward and to update the TD target. One transition includes \((s_t,a_t,s_{t+1},r_t)\) and contains only one reward \(r_t\) (as shown in the blue box above).

The TD target computed this way is the one-step TD target.

In fact, we can also use the rewards from more than one transition at a time; the resulting TD target is the multi-step TD target. In the following figure, two transitions are selected in the blue box; similarly, we could select even more of the following transitions.

[Figure: the same trajectory, with two transitions highlighted in a blue box]

9.2 Multi-Step Discounted Return

Multi-Step Return.

The discounted return satisfies the recursion \(U_t=R_t+\gamma\cdot{U_{t+1}}\).

This formula relates \(U\) at time t to \(U\) at time t+1. To obtain a multi-step discounted return, we apply it recursively:

\(U_t=R_t+\gamma\cdot{U_{t+1}}=R_t+\gamma\cdot(R_{t+1}+\gamma\cdot{U_{t+2}})=R_t+\gamma\cdot{R_{t+1}}+\gamma^2\cdot{U_{t+2}}\)

In this way, we can include two rewards; similarly, we can include three rewards, and so on. Recursing until m rewards are included gives:

\(U_t=\sum_{i=0}^{m-1}\gamma^i\cdot{R_{t+i}}+\gamma^m\cdot{U_{t+m}}\)

That is, the return \(U_t\) equals the weighted sum of m rewards plus \(\gamma^m\cdot{U_{t+m}}\); this expression is called the multi-step return.
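
As a quick numerical sanity check of this identity (my own sketch, with made-up rewards and a stand-in tail value for \(U_{t+m}\)), the recursive definition and the closed form can be compared on a random reward sequence:

```python
import numpy as np

def return_recursive(rewards, tail_value, gamma):
    """Apply U_t = R_t + gamma * U_{t+1} backwards over the reward list,
    starting from U_{t+m} = tail_value."""
    u = tail_value
    for r in reversed(rewards):
        u = r + gamma * u
    return u

def return_closed_form(rewards, tail_value, gamma):
    """U_t = sum_{i=0}^{m-1} gamma^i * R_{t+i} + gamma^m * U_{t+m}."""
    m = len(rewards)
    weighted = sum(gamma**i * r for i, r in enumerate(rewards))
    return weighted + gamma**m * tail_value

rng = np.random.default_rng(0)
rewards = rng.normal(size=5).tolist()   # R_t, ..., R_{t+4}
tail = 3.0                              # stand-in for U_{t+5}
assert np.isclose(return_recursive(rewards, tail, 0.9),
                  return_closed_form(rewards, tail, 0.9))
```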

Now that we have derived the multi-step formula for \(U_t\), we can further derive the multi-step formula for \(y_t\): take expectations on both sides of the equation and substitute observed values for the random variables.

  1. Sarsa's m-step TD target:

    \(y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot{Q_\pi}(s_{t+m},a_{t+m})\)

    Note: when m=1, this reduces to the standard TD target we used before.

    The multi-step TD target usually works better than the single-step one.

  2. Q-Learning's m-step TD target:

    \(y_t = \sum_{i=0}^{m-1}\gamma^i\cdot{r_{t+i}}+\gamma^m\cdot\max\limits_{a} Q^*(s_{t+m},a)\)

    Again, when m=1 this is exactly the earlier TD target. (A code sketch computing both m-step targets follows this list.)
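
Putting the two formulas into code, a minimal sketch with a hypothetical tabular Q estimate might look like this (the state/action sizes and numbers are made up):

```python
import numpy as np

def m_step_sarsa_target(rewards, q_table, s_m, a_m, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * Q_pi(s_{t+m}, a_{t+m}).

    rewards: [r_t, ..., r_{t+m-1}] observed along the trajectory.
    q_table: tabular estimate of Q_pi, indexed as q_table[state, action].
    s_m, a_m: state and on-policy action at time t+m.
    """
    m = len(rewards)
    g = sum(gamma**i * r for i, r in enumerate(rewards))
    return g + gamma**m * q_table[s_m, a_m]

def m_step_q_learning_target(rewards, q_table, s_m, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * max_a Q*(s_{t+m}, a)."""
    m = len(rewards)
    g = sum(gamma**i * r for i, r in enumerate(rewards))
    return g + gamma**m * np.max(q_table[s_m])

# Hypothetical tabular setup: 4 states, 2 actions, m = 2.
q_table = np.zeros((4, 2))
q_table[3] = [0.5, 1.5]
print(m_step_sarsa_target([1.0, 0.0], q_table, s_m=3, a_m=0))   # 1 + 0.99^2 * 0.5
print(m_step_q_learning_target([1.0, 0.0], q_table, s_m=3))     # 1 + 0.99^2 * 1.5
```

With m=1 and a single reward, both functions reduce to the one-step targets from section 9.1.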

9.3 Comparison of Single-Step and Multi-Step TD Targets

  • The single-step TD target uses only one reward \(r_t\);

  • The multi-step TD target uses multiple rewards: \(r_t,r_{t+1},...,r_{t+m-1}\).

    Recall the travel-time example from note 2 on value learning: the larger the fraction of the journey that has actually been completed, the more reliable the estimate of the total travel time (ignoring the extra "cost").

  • m is a hyperparameter that has to be tuned manually; with a well-chosen m, the results improve considerably. (A sketch of slicing a trajectory into m-step windows follows this list.)
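
To show where the hyperparameter m enters in practice, here is a small sketch (an assumed helper, not part of the original notes) that slices an episode into the m-step windows consumed by the targets above:

```python
def m_step_slices(episode, m):
    """Yield (s_t, a_t, [r_t, ..., r_{t+m-1}], s_{t+m}, a_{t+m}) for every t
    where a full window of m transitions is available.

    episode: list of (s, a, r) transitions, with the final bootstrap point
             appended as (s_{t+m}, a_{t+m}, None).
    """
    for t in range(len(episode) - m):
        s_t, a_t, _ = episode[t]
        rewards = [episode[t + i][2] for i in range(m)]
        s_m, a_m, _ = episode[t + m]
        yield s_t, a_t, rewards, s_m, a_m

# Hypothetical episode with integer states/actions; try different values of m.
episode = [(0, 1, 1.0), (1, 0, 0.0), (2, 1, 1.0), (3, 0, None)]
for piece in m_step_slices(episode, m=2):
    print(piece)
```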
