
2. TD Learning

2022-07-08 01:26:00 C--G

Discounted Return

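As a reminder of the standard definitions behind this section (the discount factor γ lies in [0, 1)):

$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$$

$$Q_\pi(s_t, a_t) = \mathbb{E}\left[\,U_t \mid S_t = s_t, A_t = a_t\,\right], \qquad Q^*(s, a) = \max_\pi Q_\pi(s, a)$$

The recursion $U_t = R_t + \gamma U_{t+1}$ is what every TD target below exploits.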

Sarsa

Sarsa is a TD algorithm used to learn the action-value function Q_π.

Sarsa: Tabular Version

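A minimal sketch of the tabular update in Python (assuming `Q` is a NumPy array indexed by state and action; `alpha` and `gamma` are illustrative values):

```python
import numpy as np

# Q-table over (state, action) pairs, e.g. Q = np.zeros((n_states, n_actions))
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular Sarsa step: move Q[s, a] toward the TD target r + gamma * Q[s', a']."""
    td_target = r + gamma * Q[s_next, a_next]   # on-policy: uses the action actually taken next
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

The five quantities it consumes, (s_t, a_t, r_t, s_{t+1}, a_{t+1}), are what give Sarsa its name.
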
Sarsa’s Name
Tabular Sarsa is only practical when the numbers of states and actions are small; as they grow, the table becomes too large to learn.

Sarsa: Neural Network Version

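A sketch of the same TD step with a value network q(s, a; w) in place of the table (PyTorch-style; `q_net(s, a)` returning a scalar estimate and `optimizer` are assumed interfaces, not code from the original post):

```python
import torch

def neural_sarsa_step(q_net, optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One TD step for neural Sarsa: fit q(s, a; w) to the target y = r + gamma * q(s', a'; w)."""
    with torch.no_grad():
        y = r + gamma * q_net(s_next, a_next)   # TD target; no gradient flows through it
    td_error = q_net(s, a) - y
    loss = 0.5 * td_error.pow(2).mean()         # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```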

Q-Learning

Q-Learning is a TD algorithm that learns the optimal action-value function Q*.

Sarsa vs. Q-Learning

Derive TD Target

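In standard form, the derivation starts from the return recursion and the optimality of Q*:

$$U_t = R_t + \gamma U_{t+1} \;\;\Rightarrow\;\; Q^*(s_t, a_t) = \mathbb{E}\left[\,R_t + \gamma \max_{a} Q^*(S_{t+1}, a) \,\middle|\, s_t, a_t\,\right]$$

Replacing the expectation with the single observed transition (s_t, a_t, r_t, s_{t+1}) gives the Q-Learning TD target $y_t = r_t + \gamma \max_a Q^*(s_{t+1}, a)$.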

Q-Learning (Tabular Version)

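A minimal sketch (again assuming a NumPy `Q` table); the only change from Sarsa is the max over next actions:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: the target maximizes over next actions (off-policy)."""
    td_target = r + gamma * Q[s_next].max()     # max over all actions, not the action taken
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```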

Q-Learning (DQN Version)

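A sketch of the corresponding gradient step (PyTorch-style; `q_net(s)` returning Q-values for every action of a batch of states is an assumed interface):

```python
import torch

def dqn_td_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One DQN TD step: fit Q(s, a; w) to y = r + gamma * max_a' Q(s', a'; w)."""
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values      # TD target
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a) for the actions taken (a: LongTensor)
    loss = 0.5 * (q_sa - y).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```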

Multi-Step TD Target

  • Using One Reward
  • Using Multiple Rewards (see the sketch after this list)
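
A sketch of the m-step target (a hypothetical helper: `rewards` holds r_t, …, r_{t+m-1}, and `bootstrap_value` is Q(s_{t+m}, a_{t+m}) for Sarsa or max_a Q(s_{t+m}, a) for Q-Learning):

```python
def multi_step_td_target(rewards, bootstrap_value, gamma=0.99):
    """m-step TD target: m real rewards plus a discounted bootstrap term.
    With m = 1 this reduces to the usual one-step TD target."""
    y = 0.0
    for i, r in enumerate(rewards):
        y += (gamma ** i) * r                       # gamma^i * r_{t+i}
    y += (gamma ** len(rewards)) * bootstrap_value  # gamma^m * value estimate at step t+m
    return y
```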

Experience Replay (Revisiting DQN and TD Learning)

  • Shortcoming 1: Waste of Experience


  • Shortcoming 2: Correlated Updates
  • Experience Replay (a minimal buffer sketch follows this list)


  • History

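Both shortcomings motivate experience replay: store each transition once, reuse it many times, and sample transitions out of order. A minimal buffer sketch (names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer: transitions are stored once and sampled uniformly at random,
    which reuses experience and breaks the correlation between consecutive updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```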

Prioritized Experience Replay

The left figure shows a common Super Mario scene; the right one shows a boss-level scene. Compared with the left, the right scene is much rarer, so it should be given a larger weight: the larger a transition's TD error, the more important that transition is.
The learning rate of stochastic gradient descent should be adjusted according to the sampling importance.
The larger a transition's TD error, the greater its sampling probability and the lower its learning rate.
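
A sketch of these two ideas together, non-uniform sampling plus a matching learning-rate scale (the (n·p_i)^(−β) factor follows the usual prioritized-replay formulation; the helper itself is illustrative):

```python
import numpy as np

def prioritized_sample(td_errors, batch_size, base_lr=1e-3, beta=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to |TD error|,
    and scale each sample's learning rate by (n * p_i)^(-beta) to offset the sampling bias."""
    priorities = np.abs(td_errors) + eps
    probs = priorities / priorities.sum()                      # p_i ∝ |delta_i|
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    lrs = base_lr * (len(td_errors) * probs[idx]) ** (-beta)   # large |delta| -> large p_i -> smaller lr
    return idx, lrs
```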

Overestimation problem

Bootstrapping: literally, pulling yourself up by your own bootstraps. Like trying to lift yourself by stepping on your left foot with your right foot, it is impossible in the physical world, yet it does exist in reinforcement learning: the TD target used to update the estimate is itself built from the current estimate.

Problem of Overestimation


  • Reason 1: Maximization
  • Reason 2: Bootstrapping
  • Why does overestimation happen (a numeric illustration follows this list)


  • Why overestimation is a shortcoming
  • Solutions
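
The maximization part (Reason 1) can be checked with a tiny numeric experiment: even when every true action value is 0 and the estimation noise has zero mean, the max over the noisy estimates is biased upward:

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                                          # 10 actions, all true values are 0
noisy_q = true_q + rng.normal(0.0, 1.0, size=(100_000, 10))    # unbiased noisy estimates
print(noisy_q.max(axis=1).mean())                              # about 1.5, well above the true max of 0
```

Bootstrapping (Reason 2) then feeds this inflated value back into the TD target, so the bias compounds over updates.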

Target Network

TD Learning with Target Network
Update Target Network
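
A sketch of these two steps (PyTorch-style; `q_net` and a structurally identical `target_net` are assumed interfaces):

```python
import torch

def td_step_with_target_net(q_net, target_net, optimizer, s, a, r, s_next, gamma=0.99):
    """TD step where the target is computed by the target network w^- instead of w itself."""
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (q_sa - y).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def update_target(q_net, target_net, tau=1.0):
    """Periodic update of w^-: a hard copy (tau = 1) or a weighted average (0 < tau < 1)."""
    for p, p_t in zip(q_net.parameters(), target_net.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```
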
Comparisons
Although the target network helps somewhat, it still cannot get rid of the overestimation problem.

Double DQN

  • Naive Update

  • Using Target Network

  • Double DQN (the target construction is sketched after this list)

  • Why does Double DQN work better
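
Comparing the three update rules above: the naive update selects and evaluates the next action with w, the target-network update does both with w⁻, and Double DQN splits the two. A sketch of the Double DQN target:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, gamma=0.99):
    """Double DQN TD target: select a* with the online network, evaluate it with the target network."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1)                                    # selection with w
        q_eval = target_net(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)   # evaluation with w^-
        return r + gamma * q_eval
```

Because the action chosen by w is generally not the maximizer of the target network's values, the evaluated value is at most max_a Q(s', a; w⁻), which is why the overestimation is milder.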

Dueling Network

Advantage Function

  • Value Functions

  • Optimal Value Functions
    Properties of Advantage Function
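
In symbols, the definition and the property that the dueling formulation relies on are:

$$A^*(s, a) = Q^*(s, a) - V^*(s), \qquad V^*(s) = \max_a Q^*(s, a)$$

$$\Rightarrow\;\; \max_a A^*(s, a) = 0 \quad\text{and}\quad Q^*(s, a) = V^*(s) + A^*(s, a) - \max_{a'} A^*(s, a')$$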

Dueling Network

Revisiting DQN
Approximating Advantage Function
Approximating State-Value Function
Dueling Network: Formulation
The state value plus the advantage, minus the maximum of the advantage over actions, gives the final output of the Dueling Network.
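
A sketch of that formulation as a network (PyTorch-style; the layer sizes and the two-head split are illustrative):

```python
import torch
import torch.nn as nn

class DuelingNet(nn.Module):
    """Dueling architecture: shared features, separate V and A heads,
    combined as Q = V + A - max_a A (matching the formulation above)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)         # V(s; w_V)
        self.adv_head = nn.Linear(hidden, n_actions)   # A(s, a; w_A)

    def forward(self, s):
        h = self.feature(s)
        v = self.value_head(h)                                  # shape (batch, 1)
        adv = self.adv_head(h)                                  # shape (batch, n_actions)
        return v + adv - adv.max(dim=1, keepdim=True).values    # shape (batch, n_actions)
```
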
Problem of Non-identifiability
Without the max term, the decomposition Q = V + A is not identifiable: adding a constant to V and subtracting it from A leaves Q unchanged, so V and A cannot be learned reliably. Subtracting max_a A(s, a) removes this freedom, since the optimal advantage satisfies max_a A*(s, a) = 0.

Original article: https://yzsam.com/2022/189/202207072320355748.html

Copyright notice: this article was created by [C--G]. Please include a link to the original when reposting.