2. TD Learning

2022-07-08 01:26:00 【C--G】

Discounted Return
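
As a minimal sketch, the discounted return U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ... can be computed for a finite reward sequence as follows. The function name and the example rewards are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    total = 0.0
    # Work backwards using the recursion U_t = R_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical example: three rewards of 1 with discount factor 0.5.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The backward recursion is the same identity that TD learning relies on: the return at step t is the immediate reward plus the discounted return from step t+1.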

Sarsa

Sarsa is a TD algorithm used to learn the action-value function Q_π.

Sarsa: Tabular Version

Sarsa’s Name

Tabular Sarsa is suitable when the numbers of states and actions are small. As the numbers of states and actions grow, the table becomes large and hard to learn.
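
A minimal sketch of one tabular Sarsa update is given here, assuming the table is a dictionary keyed by (state, action) pairs. The update uses the five-tuple (s, a, r, s', a'), which is where the name Sarsa comes from. The function name, the step size alpha, and the discount factor gamma are illustrative assumptions.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one Sarsa step to the table q and return the TD error."""
    # TD target: observed reward plus discounted value of the next (state, action).
    td_target = r + gamma * q[(s_next, a_next)]
    # TD error: current estimate minus TD target.
    td_error = q[(s, a)] - td_target
    # Move the table entry towards the TD target.
    q[(s, a)] -= alpha * td_error
    return td_error


# Hypothetical usage: every entry of the table starts at 0.
q_table = defaultdict(float)
sarsa_update(q_table, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

The table holds one entry per (state, action) pair, which is why this version stops being practical once the state or action space grows.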

Sarsa: Neural Network Version

Q-Learning

Q-learning is a TD algorithm used to learn the optimal action-value function Q*.

Sarsa and Q-Learning

Derive TD Target

Q-Learning (Tabular Version)
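
A minimal sketch of one tabular Q-learning update, assuming a dictionary keyed by (state, action) pairs and a known list of actions. Unlike Sarsa, the TD target takes the maximum over next actions instead of using the action actually taken. All names and default values are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to the table q and return the TD error."""
    # TD target: reward plus discounted best value available in the next state.
    td_target = r + gamma * max(q[(s_next, b)] for b in actions)
    td_error = q[(s, a)] - td_target
    q[(s, a)] -= alpha * td_error
    return td_error


# Hypothetical usage with two actions and a zero-initialised table.
q_star = defaultdict(float)
q_star[(1, 0)] = 2.0
q_star[(1, 1)] = 5.0
q_learning_update(q_star, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
```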

Q-Learning (DQN Version)

Multi-Step TD Target

Using One Reward
Using Multiple Rewards
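
The difference between the two cases can be sketched with one helper, assuming the m-step TD target is the discounted sum of m observed rewards plus the discounted bootstrap value at step t+m. With a single reward it reduces to the usual one-step TD target. The function name and the numbers are illustrative assumptions; the bootstrap value would be Q(s_{t+m}, a_{t+m}) for Sarsa or the maximum over actions of Q(s_{t+m}, a) for Q-learning.

```python
def multi_step_td_target(rewards, bootstrap_value, gamma):
    """m-step TD target, where m = len(rewards)."""
    m = len(rewards)
    # Discounted sum of the m observed rewards.
    reward_part = sum(gamma**i * r for i, r in enumerate(rewards))
    # Bootstrapped estimate of everything after step t + m.
    return reward_part + gamma**m * bootstrap_value


# Hypothetical numbers: the same bootstrap value with one or two rewards.
one_step = multi_step_td_target([1.0], bootstrap_value=10.0, gamma=0.5)
two_step = multi_step_td_target([1.0, 2.0], bootstrap_value=10.0, gamma=0.5)
```

Using more observed rewards shifts weight from the estimate towards real data, since the bootstrap term is multiplied by γ^m.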

Experience Replay (Revisiting DQN and TD Learning)

Shortcoming 1: Waste of Experience

Shortcoming 2: Correlated Updates
Experience Replay
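
A minimal sketch of a replay buffer, assuming transitions are stored as (s, a, r, s') tuples, the oldest transition is dropped once the buffer is full, and mini-batches are drawn uniformly at random. The class and method names are illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: store five transitions in a buffer of capacity three.
replay = ReplayBuffer(capacity=3)
for step in range(5):
    replay.add(s=step, a=0, r=1.0, s_next=step + 1)
batch = replay.sample(batch_size=2)
```

Storing transitions lets each one be reused many times, and random sampling means consecutive updates no longer come from consecutive time steps, which addresses both shortcomings listed above.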

History

Prioritized Experience Replay

[Figure: two Super Mario scenes, an ordinary level (left) and a boss level (right)]

On the left is an ordinary Mario scene; on the right is a boss-level scene. Boss scenes are much rarer than ordinary ones, so the scene on the right should be given more weight. The larger a transition's TD error, the more important that transition is.

With non-uniform sampling, the learning rate of stochastic gradient descent should be adjusted according to how heavily each transition is sampled.

The larger a transition's TD error, the higher its sampling probability and the lower its learning rate.
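
The two rules can be sketched as follows. This assumes one common choice, not necessarily the one in the original slides: sampling probability proportional to |TD error| + eps, and a learning rate scaled by (n · p)^(-beta), where n is the number of stored transitions. The function names, eps, and beta are illustrative assumptions.

```python
import random


def sampling_probabilities(td_errors, eps=1e-3):
    """Probabilities proportional to |TD error| + eps."""
    # eps keeps transitions with zero TD error from never being sampled.
    priorities = [abs(d) + eps for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def scaled_learning_rate(alpha, prob, n, beta=1.0):
    """Shrink the learning rate for transitions that are sampled more often."""
    # Under uniform sampling prob == 1/n, so the learning rate is unchanged.
    return alpha * (n * prob) ** (-beta)


# Hypothetical TD errors: two ordinary transitions and one rare, surprising one.
td_errors = [0.1, 0.1, 2.0]
probs = sampling_probabilities(td_errors)
index = random.choices(range(len(td_errors)), weights=probs, k=1)[0]
lr = scaled_learning_rate(0.01, probs[index], n=len(td_errors))
```

Scaling the learning rate down for frequently sampled transitions compensates for the bias that non-uniform sampling would otherwise introduce into the gradient estimate.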

Overestimation problem

Bootstrapping: the word comes from the phrase "to lift oneself up by one's own bootstraps".
It is like trying to rise into the air by stepping on your right foot with your left foot. This is impossible in reality, but it does happen in reinforcement learning, where an estimate is updated using another estimate.

Problem of Overestimation

Reason 1: Maximization
Reason 2: Bootstrapping
Why does overestimation happen
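
The effect of maximization can be illustrated with a small simulation. It assumes every action has a true value of 0 and each estimate is the true value plus zero-mean Gaussian noise, so each individual estimate is unbiased. The function name and all parameters are illustrative assumptions.

```python
import random


def mean_of_max(num_actions=5, noise_std=1.0, trials=10_000, seed=0):
    """Average, over many trials, of the max of noisy zero-mean estimates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is unbiased: true value 0 plus zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


# With several actions the average max is clearly above the true max (0).
bias_many_actions = mean_of_max(num_actions=5)
# With a single action there is nothing to maximise over, so it stays near 0.
bias_one_action = mean_of_max(num_actions=1)
```

Even though no single estimate is biased, taking the max picks out whichever estimate happened to be too high, so the result is biased upwards. Bootstrapping then feeds this overestimate back into later TD targets.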

Why overestimation is a shortcoming
Solutions

Target Network

TD Learning with Target Network

Update Target Network
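
Two common ways to update a target network are sketched here, assuming the parameters are held in plain Python lists of floats. The original slides may use a different scheme. The first copies the online parameters periodically; the second takes a weighted average controlled by a hypothetical coefficient tau.

```python
def hard_update(target_params, online_params):
    """Overwrite the target parameters with a copy of the online parameters."""
    # Slice assignment copies the values in place.
    target_params[:] = online_params


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards the online one."""
    for i, (w_target, w_online) in enumerate(zip(target_params, online_params)):
        target_params[i] = tau * w_online + (1.0 - tau) * w_target


# Hypothetical parameters for illustration.
online = [1.0, 2.0]
target = [0.0, 0.0]
soft_update(target, online, tau=0.5)
```

In both variants the target parameters lag behind the online parameters, so the TD target is computed from a network that differs from the one being updated.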
Comparisons

A target network helps somewhat, but it still does not eliminate the overestimation problem.