2. TD Learning
2022-07-08 01:26:00 【C--G】
Discounted Return
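As a minimal sketch, the discounted return U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ... can be accumulated backward over a finite episode. The function name `discounted_return` and the reward values are illustrative assumptions, not taken from the original post.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    u = 0.0
    # Work backward so each step applies U_t = r_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u


# Hypothetical three-step episode with reward 1 at every step.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Working backward avoids computing powers of γ explicitly and mirrors the recursive identity U_t = R_t + γ·U_{t+1} that TD learning builds on.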
Sarsa
Sarsa is a TD algorithm used to learn the action-value function Q_π.
Sarsa:Tabular Version
Sarsa’s Name
Tabular Sarsa is suitable when there are few states and actions. As the number of states and actions grows, the table becomes too large to learn.
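A minimal sketch of one tabular Sarsa update, assuming the table is a dictionary keyed by (state, action) pairs. The function name `sarsa_update`, the state and action labels, and the step size and discount values are illustrative assumptions.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Sarsa update to q in place and return the TD error."""
    # The TD target uses a_next, the action actually chosen by the current policy.
    td_target = r + gamma * q[(s_next, a_next)]
    td_error = q[(s, a)] - td_target
    q[(s, a)] -= alpha * td_error
    return td_error


# Hypothetical transition (s0, left, reward 1, s1, right).
q_table = defaultdict(float)
q_table[("s1", "right")] = 2.0
delta = sarsa_update(q_table, "s0", "left", r=1.0,
                     s_next="s1", a_next="right", alpha=0.5, gamma=0.9)
```

The update consumes the quintuple (s, a, r, s', a'), which is where the name Sarsa comes from. The table needs one entry per state-action pair, which is why it stops scaling as the state and action spaces grow.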
Sarsa:Neural Network Version
Q-Learning
Q-learning is a TD algorithm used to learn the optimal action-value function Q*.
Sarsa vs. Q-Learning
Derive TD Target
Q-Learning(tabular version)
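A minimal sketch of one tabular Q-learning update, assuming a plain dictionary keyed by (state, action) with missing entries treated as 0. The function name `q_learning_update` and the example values are illustrative assumptions.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update to q in place and return the TD error."""
    # The TD target maximizes over all next actions, regardless of the behavior policy.
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    td_error = q.get((s, a), 0.0) - td_target
    q[(s, a)] = q.get((s, a), 0.0) - alpha * td_error
    return td_error


# Hypothetical table with two actions in state s1.
q_table = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
delta = q_learning_update(q_table, "s0", "left", r=1.0, s_next="s1",
                          actions=["left", "right"], alpha=0.5, gamma=0.9)
```

The only difference from the Sarsa update is the TD target: Sarsa plugs in the next action that was actually taken, while Q-learning takes the maximum over next actions.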
Q-Learning(DQN Version)
Multi-Step TD Target
- Using One Reward
- Using Multiple Rewards
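The two cases above can be sketched with one helper: an m-step TD target sums m observed rewards and then bootstraps from a value estimate, so the one-reward target is simply the case m = 1. The function name `multi_step_td_target` and the numbers are illustrative assumptions. For Sarsa the bootstrap value would be Q(s_{t+m}, a_{t+m}); for Q-learning it would be the maximum over actions of Q(s_{t+m}, a).

```python
def multi_step_td_target(rewards, bootstrap_value, gamma):
    """Return sum_i gamma^i * rewards[i] + gamma^m * bootstrap_value, with m = len(rewards)."""
    m = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** m) * bootstrap_value


# One observed reward (the standard TD target).
one_step = multi_step_td_target([1.0], bootstrap_value=4.0, gamma=0.5)
# Three observed rewards before bootstrapping.
three_step = multi_step_td_target([1.0, 2.0, 4.0], bootstrap_value=8.0, gamma=0.5)
```

With more observed rewards, the bootstrapped estimate is multiplied by a smaller factor γ^m, so the target relies more on real rewards and less on the current estimate.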
Experience Replay (Revisiting DQN and TD Learning)
- Shortcoming 1:Waste of Experience
- Shortcoming 2:Correlated Updates
- Experience Replay
- History
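A minimal sketch of a replay buffer that addresses both shortcomings listed above: transitions are stored for reuse instead of being discarded after one update, and minibatches are drawn uniformly at random so that consecutive, correlated transitions are not used together. The class name `ReplayBuffer`, its methods, and the capacity are illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: five transitions pushed into a buffer of capacity 3.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0, 1.0, t + 1)
batch = buffer.sample(2, rng=random.Random(0))
```

Each sampled transition is then used for one TD update, and the same transition can be sampled again in later updates.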
Prioritized Experience Replay
On the left is a common Mario scene; on the right is a boss-level scene. The boss scene is much rarer than the common one, so it should be given more weight. The larger the TD error, the more important the scene.
The learning rate of stochastic gradient descent should be adjusted according to the sampling importance.
The larger a sample's TD error, the higher its sampling probability and the lower its learning rate.
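A simplified sketch of this idea, assuming sampling probabilities proportional to |TD error| plus a small constant, and learning rates scaled by (n·p)^(-β), where n is the number of stored transitions. The function names, the constant `eps`, and the values of `alpha` and `beta` are illustrative assumptions, and the priority formula is one common choice rather than necessarily the one used in the original figures.

```python
import random


def sampling_probabilities(td_errors, eps=1e-6):
    """Make the sampling probability proportional to |TD error| + eps."""
    priorities = [abs(d) + eps for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def scaled_learning_rates(probs, alpha, beta):
    """Shrink the learning rate of frequently sampled transitions: alpha * (n*p)^(-beta)."""
    n = len(probs)
    return [alpha * (n * p) ** (-beta) for p in probs]


# Hypothetical buffer: two common transitions and one rare, high-error transition.
td_errors = [0.1, 0.1, 2.0]
probs = sampling_probabilities(td_errors)
rates = scaled_learning_rates(probs, alpha=0.01, beta=1.0)
# Draw one transition index according to the non-uniform probabilities.
index = random.Random(0).choices(range(len(probs)), weights=probs, k=1)[0]
```

Scaling the learning rate down for high-probability samples compensates for the bias that non-uniform sampling introduces. With β = 0 every transition uses the same learning rate `alpha`.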
Overestimation problem
Bootstrapping: lifting yourself up by pulling on your own bootstraps.
It is like stepping on your right foot with your left foot to climb upward. This is impossible in reality, but it does exist in reinforcement learning, where an estimate is updated using another estimate.
Problem of Overestimation
- Reason 1:Maximization
- Reason 2:Bootstrapping
- Why does overestimation happen
- Why overestimation is a shortcoming
- Solutions
Target Network
TD Learning with Target Network
Update Target Network
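A minimal sketch of two common ways to update the target network parameters, with parameters represented as flat lists of floats: copy the online parameters outright, or blend them with a weighted average. The function names `hard_update` and `soft_update` and the value of `tau` are illustrative assumptions.

```python
def hard_update(online_params):
    """Option 1: overwrite the target parameters with a copy of the online parameters."""
    return list(online_params)


def soft_update(online_params, target_params, tau):
    """Option 2: weighted average, tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]


# Hypothetical two-parameter networks.
online = [1.0, 2.0]
target = [0.0, 0.0]
copied = hard_update(online)
blended = soft_update(online, target, tau=0.25)
```

In both variants the target network changes only periodically or slowly, so the TD target depends less directly on the parameters being trained.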
Comparisons
A target network helps somewhat, but it still cannot eliminate the overestimation problem.
Double DQN
Naive Update
Using Target Network
Double DQN
Why does Double DQN work better
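A minimal sketch comparing the three TD targets. It assumes the next-state action values from the online network and from the target network are given as plain lists. The function names and the numbers are illustrative assumptions.

```python
def naive_target(r, gamma, q_online_next):
    """Select and evaluate the next action with the online network."""
    return r + gamma * max(q_online_next)


def target_network_target(r, gamma, q_target_next):
    """Select and evaluate the next action with the target network."""
    return r + gamma * max(q_target_next)


def double_dqn_target(r, gamma, q_online_next, q_target_next):
    """Select the next action with the online network, evaluate it with the target network."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[best_action]


# Hypothetical next-state action values from the two networks.
q_online_next = [1.0, 3.0, 2.0]
q_target_next = [2.5, 1.5, 2.0]
y_naive = naive_target(0.0, 1.0, q_online_next)
y_target = target_network_target(0.0, 1.0, q_target_next)
y_double = double_dqn_target(0.0, 1.0, q_online_next, q_target_next)
```

Because the target network's value at the selected action can never exceed the target network's own maximum, the Double DQN target is never larger than the target-network target. Separating selection from evaluation is what reduces the overestimation caused by maximization.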
Dueling Network
Advantage Function
Value Functions
Optimal Value Functions
Properties of Advantage Function
Dueling Network
Revisiting DQN
Approximating Advantage Function
Approximating State-Value Function
Dueling Network:Formulation
Add the blue output to the red output, then subtract the maximum of the red output. The purple result is the final output of the dueling network.
Problem of Non-identifiability