2. TD Learning
2022-07-08 01:26:00 【C--G】
Discounted Return
Sarsa
Sarsa is a TD algorithm used to learn the action-value function Q_π.
Sarsa:Tabular Version
Sarsa’s Name
Tabular Sarsa is suitable only when the numbers of states and actions are small. As the states and actions grow in number, the table grows with them and becomes difficult to learn.
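A minimal sketch of one tabular Sarsa update, assuming the Q-table is a list of lists indexed as `q_table[state][action]`. The function name `sarsa_update` and the values of `alpha` (learning rate) and `gamma` (discount factor) are illustrative assumptions, not taken from the original post.

```python
def sarsa_update(q_table, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Sarsa update in place and return the TD error."""
    # The TD target uses a_next, the action actually chosen in s_next.
    td_target = r + gamma * q_table[s_next][a_next]
    td_error = q_table[s][a] - td_target
    # Move the table entry a small step toward the TD target.
    q_table[s][a] -= alpha * td_error
    return td_error


# Toy table: 3 states, 2 actions, all entries initialised to zero.
q_table = [[0.0, 0.0] for _ in range(3)]
td_error = sarsa_update(q_table, s=0, a=1, r=1.0, s_next=2, a_next=0)
```

The update needs the whole tuple (s, a, r, s', a'), and the table has one entry per state-action pair, which is why this version only scales to small problems.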
Sarsa:Neural Network Version
Q-Learning
Q-learning is a TD algorithm used to learn the optimal action-value function Q*.
Sarsa vs. Q-Learning
Derive TD Target
Q-Learning(tabular version)
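A minimal sketch of one tabular Q-learning update, under the same assumptions as any small Q-table stored as a list of lists. The name `q_learning_update` and the default `alpha` and `gamma` are illustrative. The only difference from Sarsa is that the TD target takes the maximum over actions in the next state instead of the action actually taken.

```python
def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # The TD target bootstraps from the best action in s_next.
    td_target = r + gamma * max(q_table[s_next])
    td_error = q_table[s][a] - td_target
    q_table[s][a] -= alpha * td_error
    return td_error


# Toy table: 2 states, 2 actions.
q_table = [[0.0, 0.0], [0.5, 2.0]]
td_error = q_learning_update(q_table, s=0, a=0, r=1.0, s_next=1)
```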
Q-Learning(DQN Version)
Multi-Step TD Target
- Using One Reward
- Using Multiple Rewards
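The two cases above can be sketched with a single function: an m-step TD target sums m discounted rewards and then bootstraps from a value estimate at step t+m. The function name `multi_step_td_target` and the parameter `bootstrap_value` are hypothetical; for Sarsa it would be Q(s_{t+m}, a_{t+m}), for Q-learning the maximum of Q over actions at s_{t+m}.

```python
def multi_step_td_target(rewards, bootstrap_value, gamma=0.9):
    """Return the m-step TD target.

    rewards         -- [r_t, ..., r_{t+m-1}], the m observed rewards
    bootstrap_value -- value estimate at step t+m (an assumption of this sketch)
    """
    m = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** m) * bootstrap_value


# One reward (m = 1) is the ordinary TD target.
one_step = multi_step_td_target([1.0], bootstrap_value=2.0)
# Three rewards (m = 3) rely more on observations and less on the estimate.
three_step = multi_step_td_target([1.0, 0.0, 1.0], bootstrap_value=2.0)
```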
Experience Replay (Revisiting DQN and TD Learning)
- Shortcoming 1: Waste of Experience
- Shortcoming 2: Correlated Updates
- Experience replay
- History
Prioritized Experience Replay
On the left is a common scene from Mario; on the right is a boss-level scene. Boss-level scenes are much rarer than common ones, so their weight should be increased. The TD error measures importance: the larger the TD error, the more important the scene.
With non-uniform sampling, the learning rate of stochastic gradient descent (SGD) should be adjusted according to each transition's sampling probability.
The larger a transition's TD error, the higher its sampling probability and the lower its learning rate.
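A minimal sketch of this idea, assuming one common scheme: the sampling probability is proportional to |TD error| plus a small constant, and the learning rate is scaled by (n · p)^(-beta). The function names and the values of `eps`, `alpha` and `beta` are illustrative assumptions, not taken from the original post.

```python
import random


def sampling_probabilities(td_errors, eps=1e-3):
    """Probability of each transition, proportional to |TD error| + eps."""
    priorities = [abs(d) + eps for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def scaled_learning_rates(probs, alpha=0.01, beta=1.0):
    """Shrink the learning rate of transitions that are sampled more often."""
    n = len(probs)
    return [alpha * (n * p) ** (-beta) for p in probs]


# Two common transitions and one rare, surprising one (large TD error).
td_errors = [0.1, 0.1, 2.0]
probs = sampling_probabilities(td_errors)
rates = scaled_learning_rates(probs)

# Draw a mini-batch of transition indices according to the probabilities.
rng = random.Random(0)
batch = rng.choices(range(len(td_errors)), weights=probs, k=4)
```

With uniform probabilities (every p equal to 1/n) the scaling factor is 1, so the scheme reduces to ordinary experience replay with learning rate `alpha`.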
Overestimation problem
Bootstrapping: literally, to pull yourself up by your own bootstraps.
It is like lifting yourself off the ground by stepping on your right foot with your left foot. This is impossible in reality, but it does exist in reinforcement learning, where an estimate is updated using another estimate.
Problem of Overestimation
- Reason 1:Maximization
- Reason 2:Bootstrapping
- Why does overestimation happen
- Why overestimation is a shortcoming
- Solutions
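The maximization reason listed above can be checked numerically. The sketch below assumes five actions whose true values are all 0 and adds zero-mean Gaussian noise to each estimate; the setup, seed and variable names are illustrative. Each noisy estimate is unbiased on its own, yet the maximum over them is biased upward.

```python
import random

rng = random.Random(0)
true_values = [0.0] * 5      # every action is truly worth 0
n_trials = 2000

sum_of_max = 0.0
sum_of_first = 0.0
for _ in range(n_trials):
    # Unbiased but noisy estimates of the true action values.
    noisy = [q + rng.gauss(0.0, 1.0) for q in true_values]
    sum_of_max += max(noisy)
    sum_of_first += noisy[0]

avg_max = sum_of_max / n_trials        # estimates E[max_a of noisy Q]
avg_single = sum_of_first / n_trials   # estimates E[noisy Q] for one action
```

Here `avg_single` stays close to the true value 0, while `avg_max` is clearly positive. Bootstrapping then feeds this inflated maximum back into the TD target, so the overestimate propagates.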
Target Network
TD Learning with Target Network
Update Target Network
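Two common ways to update a target network are copying the online parameters outright and taking a weighted average of the two parameter sets. The sketch below assumes parameters are flat lists of floats; the function names and the value of `tau` are hypothetical.

```python
def hard_update(online_params):
    """Option 1: copy the online parameters into the target network."""
    return list(online_params)


def soft_update(online_params, target_params, tau=0.01):
    """Option 2: weighted average, target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]


online = [1.0, -2.0]
target = [0.0, 0.0]
copied = hard_update(online)
averaged = soft_update(online, target, tau=0.1)
```

Either way the target network changes more slowly than the online network, which is what weakens the feedback loop created by bootstrapping.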
Comparisons
A target network helps somewhat, but it still cannot eliminate the overestimation problem.
Double DQN
Naive Update
Using Target Network
Double DQN
Why does Double DQN work better
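A minimal sketch comparing the three TD targets, assuming the Q-values at the next state are already available as plain lists: `q_online_next` from the online (DQN) network and `q_target_next` from the target network. All names and numbers are illustrative. The three differ only in which network selects the action and which network evaluates it.

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def naive_target(r, q_online_next, gamma=0.9):
    # Selection and evaluation both use the online network.
    return r + gamma * max(q_online_next)


def target_network_target(r, q_target_next, gamma=0.9):
    # Selection and evaluation both use the target network.
    return r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma=0.9):
    # Selection uses the online network, evaluation uses the target network.
    a_star = argmax(q_online_next)
    return r + gamma * q_target_next[a_star]


q_online_next = [1.0, 3.0, 2.0]
q_target_next = [2.5, 1.5, 2.0]
y_naive = naive_target(1.0, q_online_next)
y_target = target_network_target(1.0, q_target_next)
y_double = double_dqn_target(1.0, q_online_next, q_target_next)
```

Because the value of the selected action can never exceed the maximum over all actions of the same network, the Double DQN target is never larger than the target-network target, which is why it reduces overestimation.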
Dueling Network
Advantage Function
Value Functions
Optimal Value Functions
Properties of Advantage Function
Dueling Network
Revisiting DQN
Approximating Advantage Function
Approximating State-Value Function
Dueling Network:Formulation
Add the blue output (the state value) to the red output (the advantages), then subtract the maximum of the red output. The result, shown in purple, is the final output of the dueling network.
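A minimal sketch of this aggregation step, Q(s, a) = V(s) + A(s, a) - max_a A(s, a), assuming the two heads of the network have already produced a scalar state value and a list of advantages. The function name `dueling_q_values` and the numbers are illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Combine the state-value head and the advantage head into Q-values."""
    max_advantage = max(advantages)
    return [state_value + a - max_advantage for a in advantages]


# One state with three actions.
q_values = dueling_q_values(2.0, [0.5, -1.0, 1.5])
# q_values is [1.0, -0.5, 2.0]
```

After subtracting the maximum advantage, the largest Q-value equals the state value, here 2.0.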
Problem of Non-identifiability