2. TD Learning
2022-07-07 23:21:00 【C--G】
Discounted Return
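As a minimal sketch, the discounted return U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ... can be computed by folding the rewards from the last step backwards. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Return U_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ...

    Uses the recursion U_t = R_t + gamma * U_{t+1}, evaluated backwards.
    """
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u


# Hypothetical episode with three rewards of 1.0 and gamma = 0.5.
u = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(u)  # 1 + 0.5 * 1 + 0.25 * 1 = 1.75
```

The same recursion, U_t = R_t + γ·U_{t+1}, is what the TD targets in the following sections are built on.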
Sarsa
Sarsa is a TD algorithm for learning the action-value function Q_π.
Sarsa:Tabular Version
Sarsa’s Name
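Tabular Sarsa is suitable when the numbers of states and actions are small. As they grow, the table grows with them and becomes hard to learn.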
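A single tabular Sarsa update can be sketched as follows. It uses the quintuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}), where a_{t+1} is the action actually chosen by the current policy. The table layout (a dictionary keyed by state-action pairs), the state and action names, and the values of `alpha` and `gamma` are assumptions made for illustration.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one Sarsa step to the table q and return the TD error."""
    td_target = r + gamma * q[(s_next, a_next)]  # uses the action actually taken next
    td_error = q[(s, a)] - td_target
    q[(s, a)] -= alpha * td_error
    return td_error


# Hypothetical two-state example; unseen entries default to 0.0.
q_table = defaultdict(float)
q_table[("s1", "right")] = 1.0
delta = sarsa_update(q_table, "s0", "right", 0.5, "s1", "right")
print(q_table[("s0", "right")], delta)  # about 0.14 and -1.4
```

The table needs one entry per state-action pair, which is why this version stops scaling once the state or action space becomes large.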
Sarsa:Neural Network Version
Q-Learning
Q-learning is a TD algorithm for learning the optimal action-value function Q*.
Sarsa vs. Q-Learning
Derive TD Target
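The derivation can be sketched as below. The notation (capital letters for random variables, lowercase for observed values, y_t for the TD target) is an assumption, since the original formulas were in figures that did not survive.

```latex
\begin{align*}
U_t &= R_t + \gamma\, U_{t+1} \\
Q_\pi(s_t, a_t) &= \mathbb{E}\big[ R_t + \gamma\, Q_\pi(S_{t+1}, A_{t+1}) \big] \\
Q^\star(s_t, a_t) &= \mathbb{E}\big[ R_t + \gamma\, Q^\star(S_{t+1}, A_{t+1}) \big],
  \qquad A_{t+1} = \operatorname*{argmax}_a Q^\star(S_{t+1}, a) \\
Q^\star(s_t, a_t) &= \mathbb{E}\big[ R_t + \gamma \max_a Q^\star(S_{t+1}, a) \big] \\
y_t &= r_t + \gamma \max_a Q^\star(s_{t+1}, a)
  \quad \text{(Monte Carlo approximation from one observed transition)}
\end{align*}
```

The only difference from the Sarsa target is the last step: Sarsa plugs in the sampled action a_{t+1}, while Q-learning takes the maximum over all actions.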
Q-Learning(tabular version)
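A minimal sketch of one tabular Q-learning update follows. Unlike Sarsa, the TD target takes the maximum over all actions in the next state, so the next action does not need to be sampled. The dictionary-based table, the action names, and the `alpha` and `gamma` values are illustrative assumptions.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to the table q and return the TD error."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    td_error = q.get((s, a), 0.0) - td_target
    q[(s, a)] = q.get((s, a), 0.0) - alpha * td_error
    return td_error


# Hypothetical table: in state "s1", "right" currently looks best.
q_star = {("s1", "left"): 0.2, ("s1", "right"): 1.0}
td_err = q_learning_update(q_star, "s0", "left", 0.5, "s1", ["left", "right"])
print(q_star[("s0", "left")], td_err)  # about 0.14 and -1.4
```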
Q-Learning(DQN Version)
Multi-Step TD Target
- Using One Reward
- Using Multiple Rewards
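The two cases above can be sketched with a single helper: with one reward it reduces to the ordinary one-step TD target, and with m rewards it gives the m-step target. The argument `bootstrap_value` stands for the value estimate at the last state (max over actions for Q-learning, or the value of the sampled action for Sarsa). The function name and the example numbers are assumptions.

```python
def multi_step_td_target(rewards, gamma, bootstrap_value):
    """m-step TD target: sum_i gamma^i * r_{t+i} + gamma^m * bootstrap_value."""
    m = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** m) * bootstrap_value


# Hypothetical numbers: gamma = 0.5, bootstrapped value estimate = 8.0.
one_step = multi_step_td_target([1.0], 0.5, 8.0)       # 1 + 0.5 * 8 = 5.0
two_step = multi_step_td_target([1.0, 2.0], 0.5, 8.0)  # 1 + 0.5 * 2 + 0.25 * 8 = 4.0
print(one_step, two_step)
```

With more observed rewards, the bootstrapped estimate is multiplied by a smaller factor γ^m, so the target relies more on real rewards and less on the current estimate.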
Experience Replay (Revisiting DQN and TD Learning)
- Shortcoming 1:Waste of Experience
- Shortcoming 2:Correlated Updates
- Experience Replay
- History
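A replay buffer addresses both shortcomings listed above: transitions are stored and reused instead of being discarded after one update, and uniform random sampling breaks the correlation between consecutive transitions. The class below is a minimal sketch; the class name, the fixed capacity, and the tuple layout (s, a, r, s_next) are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of transitions; the oldest one is dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: capacity 3, five transitions added, so two are evicted.
replay = ReplayBuffer(capacity=3)
for i in range(5):
    replay.add(i, "a", float(i), i + 1)
batch = replay.sample(2, rng=random.Random(0))
print(len(replay), batch)
```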
Prioritized Experience Replay
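In the figure, the left side is a common Super Mario scene and the right side is a boss-level scene. The boss scene is much rarer than the common one, so its weight should be increased. The larger the TD error, the more important the scene.
The learning rate of stochastic gradient descent should be adjusted according to the sampling importance of each sample.
The larger a sample's TD error, the larger its sampling probability and the smaller its learning rate.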
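The relationship between TD error, sampling probability, and learning rate can be sketched as follows. The specific formulas are assumptions, chosen as one common way to realize the idea: the sampling probability is proportional to |TD error| + ε, and the learning rate is scaled by (n·p)^(-β). The names `epsilon`, `alpha`, and `beta` and their values are illustrative, not taken from the original notes.

```python
import random


def sampling_probabilities(td_errors, epsilon=1e-3):
    """Probability proportional to |TD error| + epsilon (epsilon keeps it nonzero)."""
    priorities = [abs(d) + epsilon for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def scaled_learning_rates(probs, alpha=0.1, beta=0.5):
    """Shrink the learning rate of frequently sampled transitions: alpha * (n * p)^(-beta)."""
    n = len(probs)
    return [alpha * (n * p) ** (-beta) for p in probs]


# Hypothetical TD errors for three stored transitions.
td_errors = [0.1, 2.0, 0.5]
probs = sampling_probabilities(td_errors)
rates = scaled_learning_rates(probs)
picked = random.Random(0).choices(range(len(td_errors)), weights=probs, k=2)
print(probs, rates, picked)
```

The transition with the largest TD error is drawn most often, and the smaller learning rate offsets the bias that non-uniform sampling would otherwise introduce.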
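The Overestimation Problem
Bootstrapping: pulling yourself up by your own bootstraps.
It is like stepping on your own left foot with your right foot to rise into the sky. This is impossible in reality, but it does exist in reinforcement learning.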
Problem of Overestimation
- Reason 1:Maximization
- Reason 2:Bootstrapping
- Why does overestimation happen
- Why overestimation is a shortcoming
- Solutions
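The maximization effect can be checked numerically with a small simulation. Assume every true action value is 0 and each estimate is the true value plus zero-mean noise: the estimates are unbiased individually, yet their maximum is biased upwards. The noise distribution (uniform on [-1, 1]), the number of actions, and the trial count are assumptions chosen for illustration.

```python
import random


def average_max_and_mean(num_actions=4, trials=10000, seed=0):
    """Average, over many trials, of the max and of the mean of noisy estimates."""
    rng = random.Random(seed)
    total_max = 0.0
    total_mean = 0.0
    for _ in range(trials):
        # True action values are all 0; estimates carry zero-mean noise.
        estimates = [rng.uniform(-1.0, 1.0) for _ in range(num_actions)]
        total_max += max(estimates)
        total_mean += sum(estimates) / num_actions
    return total_max / trials, total_mean / trials


avg_max, avg_mean = average_max_and_mean()
print(avg_max, avg_mean)  # the max is clearly positive, the mean stays near 0
```

Because the TD target itself contains such a maximum, the overestimate is fed back into the estimates through bootstrapping, which is how the two reasons listed above reinforce each other.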
Target Network
TD Learning with Target Network
Update Target Network
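Two common ways of updating the target network parameters can be sketched as below: copying the online parameters outright, or moving the target parameters a small step towards them. Parameters are represented as plain lists of floats, and the names `hard_update`, `soft_update`, and `tau` are assumptions made for this sketch.

```python
def hard_update(online_params):
    """Copy the online parameters into the target network: w_target <- w."""
    return list(online_params)


def soft_update(online_params, target_params, tau=0.01):
    """Weighted average: w_target <- tau * w + (1 - tau) * w_target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


# Hypothetical parameter vectors.
w_online = [1.0, 2.0]
w_target = [0.0, 0.0]
copied = hard_update(w_online)
blended = soft_update(w_online, w_target, tau=0.5)
print(copied, blended)  # [1.0, 2.0] and [0.5, 1.0]
```

Either way the target network is periodically tied back to the online network, so it is not independent of it.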
Comparisons
A target network helps somewhat, but it still cannot eliminate the overestimation problem.
Double DQN
Naive Update
Using Target Network
Double DQN
Why does Double DQN work better
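The three TD targets compared in this section can be sketched side by side. The naive target selects and evaluates the next action with the online network, the target-network version does both with the target network, and Double DQN selects with the online network but evaluates with the target network. The lists of Q-values and all function names are assumptions for illustration.

```python
def naive_target(r, gamma, q_online_next):
    """Selection and evaluation both use the online network."""
    return r + gamma * max(q_online_next)


def target_network_target(r, gamma, q_target_next):
    """Selection and evaluation both use the target network."""
    return r + gamma * max(q_target_next)


def double_dqn_target(r, gamma, q_online_next, q_target_next):
    """Select the action with the online network, evaluate it with the target network."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]


# Hypothetical Q-values for the next state under the two networks.
q_online_next = [1.0, 3.0, 2.0]
q_target_next = [2.5, 2.0, 4.0]
y_naive = naive_target(1.0, 0.5, q_online_next)                      # 2.5
y_target = target_network_target(1.0, 0.5, q_target_next)            # 3.0
y_double = double_dqn_target(1.0, 0.5, q_online_next, q_target_next) # 2.0
print(y_naive, y_target, y_double)
```

Since the target network's value at the selected action can never exceed the target network's own maximum, the Double DQN target is never larger than the target-network target, which is why it alleviates overestimation.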
Dueling Network
Advantage Function
Value Functions
Optimal Value Functions
Properties of Advantage Function
Dueling Network
Revisiting DQN
Approximating Advantage Function
Approximating State-Value Function
Dueling Network:Formulation
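Adding the blue part (the state value V) to the red part (the advantages A) and then subtracting the maximum of the red part gives the purple part, which is the final output of the Dueling Network.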
Problem of Non-identifiability
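The role of the subtracted maximum can be sketched numerically. Without it, Q = V + A is unchanged when a constant is added to V and subtracted from every A, so V and A cannot be identified from Q alone. With the max term, the same shift changes the output. The function names and the numbers are assumptions for illustration.

```python
def naive_q(v, advantages):
    """Q = V + A, which suffers from non-identifiability."""
    return [v + a for a in advantages]


def dueling_q(v, advantages):
    """Q = V + A - max(A), the dueling formulation."""
    max_a = max(advantages)
    return [v + a - max_a for a in advantages]


# Hypothetical head outputs, and a shifted copy (V + 10, A - 10).
v, adv = 2.0, [-1.0, 0.0, -3.0]
v_shift, adv_shift = v + 10.0, [a - 10.0 for a in adv]

same_naive = naive_q(v, adv) == naive_q(v_shift, adv_shift)        # True
same_dueling = dueling_q(v, adv) == dueling_q(v_shift, adv_shift)  # False
print(dueling_q(v, adv), same_naive, same_dueling)
```

In the naive form the two heads can drift in opposite directions without affecting the output. The max term removes that freedom, so the state-value head is pinned to the largest Q-value.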