Model-Free Prediction
2022-07-07 00:26:00 【Evergreen AAS】
The previous article covered the general approach for the Model-based case — dynamic programming. This article deals with the Prediction problem in the Model-Free setting, i.e. "estimate the value function of an unknown MDP".
- Model-based: the MDP is known, i.e. the transition matrix and reward function are given
- Model-Free: the MDP is unknown
Monte-Carlo Learning
The Monte-Carlo method (Monte-Carlo Methods, MC for short), also called Monte-Carlo simulation, uses random numbers (more commonly pseudo-random numbers) to solve computational problems: by acting as randomly as possible it generates samples, and the target quantity is then characterized from those samples.
In the Model-Free setting, MC is used in reinforcement learning to estimate value functions. Its characteristics are:
- MC learns from complete episodes (no bootstrapping)
- MC estimates the value as the mean return, i.e. value = mean(return)
- MC only applies to episodic MDPs (all episodes must terminate)
First-Visit Monte-Carlo Policy Evaluation
To evaluate the value function v_π(s) of a state s under a given policy π: within each episode, at the first time step t at which state s is visited, increment the counter N(s) ← N(s) + 1 and the cumulative return S(s) ← S(s) + G_t. Once all episodes have been processed, the value of state s is estimated as V(s) = S(s) / N(s). By the Law of Large Numbers, V(s) → v_π(s) as N(s) → ∞.
Every-Visit Monte-Carlo Policy Evaluation
The procedure is identical except that the counter and cumulative return are updated at every time step t at which state s is visited: N(s) ← N(s) + 1, S(s) ← S(s) + G_t, and again V(s) = S(s) / N(s), with V(s) → v_π(s) as N(s) → ∞.
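As an illustration, here is a minimal tabular sketch of Monte-Carlo policy evaluation; `sample_episode(policy)` is a hypothetical helper assumed to return one terminated episode as a list of (state, reward) pairs, and the `first_visit` flag switches between the two variants above.

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, policy, num_episodes, gamma=1.0, first_visit=True):
    """Tabular Monte-Carlo prediction: V(s) = mean of observed returns G_t."""
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # cumulative return S(s)
    V = defaultdict(float)    # value estimate V(s)

    for _ in range(num_episodes):
        episode = sample_episode(policy)          # [(S_0, R_1), (S_1, R_2), ...]
        states = [s for s, _ in episode]

        # Compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        for t, (s, _) in enumerate(episode):
            # First-visit: only count the first occurrence of s in this episode
            if first_visit and s in states[:t]:
                continue
            N[s] += 1
            S[s] += returns[t]
            V[s] = S[s] / N[s]

    return V
```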
Incremental Monte-Carlo
Incremental mean: the running means μ_1, μ_2, … of a sequence x_1, x_2, … can be computed incrementally:
\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \mu_{k-1} + \frac{1}{k}\bigl(x_k - \mu_{k-1}\bigr)
From this formula we obtain the incremental MC update: after each episode ends, V is updated incrementally; for every state S_t in the episode, with corresponding return G_t:
N(S_t) ← N(S_t) + 1
V(S_t) ← V(S_t) + \frac{1}{N(S_t)}\bigl(G_t - V(S_t)\bigr)
In non-stationary problems, it is useful to track a running mean that forgets old episodes, so the update takes the constant step-size form:
V(S_t) ← V(S_t) + \alpha (G_t - V(S_t))
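The two incremental update rules above can be sketched in a few lines; this is only an illustrative snippet, where passing `alpha=None` is an assumed convention meaning "use the running-mean step size 1/N(s)".

```python
def incremental_update(V, N, state, G, alpha=None):
    """Move V[state] towards the observed return G.

    alpha=None -> step size 1/N(state): exact running mean (stationary case)
    alpha=0.1  -> constant step size: recent returns weighted more (non-stationary case)
    """
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]
    v = V.get(state, 0.0)
    V[state] = v + step * (G - v)
    return V[state]
```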
Temporal-Difference Learning
Characteristics of temporal-difference methods (Temporal-Difference Methods, TD for short):
- TD can learn from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
TD(λ)
The figure below illustrates the TD target for different values of n:
As the figure shows, when n extends all the way to termination the target covers a whole episode, and the corresponding method is exactly MC; from this point of view, MC is a special case of TD.
n-step Return
The n-step return can be written as follows:
n = 1 (TD): G_{t}^{(1)} = R_{t+1} + \gamma V(S_{t+1})
n = 2: G_{t}^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
…
n = ∞ (MC): G_{t}^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_{T}
In general, the n-step return is G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
The n-step TD update formula:
V(S_t) ← V(S_t) + \alpha (G_t^{(n)} - V(S_t))
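A tabular sketch of the n-step return and the corresponding n-step TD update; `trajectory` is assumed to be the episode stored as a list of (S_k, R_{k+1}) pairs and `V` a dict of value estimates (hypothetical names used for illustration).

```python
def n_step_return(trajectory, V, t, n, gamma):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(S_{t+n})."""
    T = len(trajectory)               # trajectory[k] = (S_k, R_{k+1})
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        _, reward = trajectory[k]
        G += discount * reward
        discount *= gamma
    if t + n < T:                     # bootstrap only if the episode did not terminate first
        G += discount * V.get(trajectory[t + n][0], 0.0)
    return G

def n_step_td_update(V, trajectory, t, n, gamma, alpha):
    """V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))."""
    s_t = trajectory[t][0]
    G = n_step_return(trajectory, V, t, n, gamma)
    V[s_t] = V.get(s_t, 0.0) + alpha * (G - V.get(s_t, 0.0))
```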
Forward View of TD(λ)
Can we combine all of the n-step returns? Yes: the combined return is called the λ-return, where λ is a weighting factor introduced to mix the different n-step returns.
λ-return:
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}
Forward-view TD(λ):
V(S_t) ← V(S_t) + \alpha\Bigl(G_t^{\lambda} - V(S_t)\Bigr)
The weight assigned to the n-step return in TD(λ) is (1−λ)λ^{n−1}; its distribution over n is shown in the figure below:
Characteristics of forward-view TD(λ):
- Update value function towards the λ-return
- Forward-view looks into the future to compute G_t^λ
- Like MC, can only be computed from complete episodes
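To make the mixing concrete, here is a sketch of the λ-return for a finite episode, reusing the hypothetical `n_step_return` helper from the n-step sketch above; for an episodic task the final n-step return reaches the terminal state, so it absorbs the remaining weight λ^(T−t−1) and the weights sum to 1.

```python
def lambda_return(trajectory, V, t, lam, gamma):
    """G_t^lambda = (1 - lam) * sum_n lam^(n-1) * G_t^(n), truncated at the episode end."""
    T = len(trajectory)
    G_lambda, weight = 0.0, (1.0 - lam)
    for n in range(1, T - t):
        G_lambda += weight * n_step_return(trajectory, V, t, n, gamma)
        weight *= lam
    # Remaining mass lam^(T-t-1) goes to the full (Monte-Carlo) return.
    G_lambda += (lam ** (T - t - 1)) * n_step_return(trajectory, V, t, T - t, gamma)
    return G_lambda
```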
Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
TD(λ) with eligibility traces:
E_0(s) = 0
E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
V(s) ← V(s) + \alpha \delta_t E_t(s)
where δ_t is the TD error and E_t(s) is the eligibility trace.
Eligibility Traces
The idea of an eligibility trace is to assign more credit (weight) to states that are visited frequently and to states that were visited recently.
The figure below illustrates how an eligibility trace evolves:
A key result about TD(λ):
The sum of offline updates is identical for forward-view and backward-view TD(λ).
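The backward view can be run as an online, per-step algorithm. Below is a minimal tabular sketch with accumulating traces, assuming `V` is a `collections.defaultdict(float)` and `step(state, policy)` is a hypothetical environment helper returning `(next_state, reward, done)`.

```python
from collections import defaultdict

def td_lambda_episode(V, policy, step, start_state, alpha, gamma, lam):
    """One episode of backward-view TD(lambda) with accumulating eligibility traces."""
    E = defaultdict(float)            # eligibility traces E(s), reset each episode
    state, done = start_state, False
    while not done:
        next_state, reward, done = step(state, policy)
        # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        target = reward + (0.0 if done else gamma * V[next_state])
        delta = target - V[state]
        # Accumulating trace: E(S_t) += 1, then all traces decay by gamma * lambda
        E[state] += 1.0
        for s in list(E):
            V[s] += alpha * delta * E[s]
            E[s] *= gamma * lam
        state = next_state
    return V
```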
This post will not go deeper into these results; interested readers can consult Sutton's book and David Silver's lecture notes.
Monte-Carlo Learning vs. Temporal-Difference Learning
Similarities and differences between MC and TD
In common: both learn the value function v_π of a given policy π directly from experience.
Differences:
- Incremental every-visit Monte-Carlo: update V(S_t) towards the actual return G_t
V(S_t) ← V(S_t) + \alpha (G_t - V(S_t))
- Simplest temporal-difference learning algorithm: TD(0) (see the sketch after this list)
  - update V(S_t) towards the estimated return R_{t+1} + \gamma V(S_{t+1}):
V(S_t) ← V(S_t) + \alpha \bigl(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr)
  - R_{t+1} + \gamma V(S_{t+1}) is called the TD target
  - R_{t+1} + \gamma V(S_{t+1}) - V(S_t) is called the TD error
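For comparison with the Monte-Carlo sketches above, here is a minimal tabular TD(0) sketch that updates after every single step; it assumes the same hypothetical `step(state, policy)` helper and a `defaultdict(float)` for `V`.

```python
def td0_episode(V, policy, step, start_state, alpha, gamma):
    """One episode of TD(0): V(S_t) += alpha * (R_{t+1} + gamma*V(S_{t+1}) - V(S_t))."""
    state, done = start_state, False
    while not done:
        next_state, reward, done = step(state, policy)
        td_target = reward + (0.0 if done else gamma * V[next_state])
        td_error = td_target - V[state]
        V[state] += alpha * td_error      # update immediately, no need to wait for G_t
        state = next_state
    return V
```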
The Driving Home example in the figure below illustrates the difference between the two: with MC, the prediction of the arrival time can only be corrected after actually getting home, whereas TD adjusts the prediction at every step along the way based on what actually happens.
Advantages and disadvantages of MC and TD
Learning style
- TD can learn before knowing the final outcome (as in the example above)
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn without a final outcome at all (e.g. infinite / continuing MDPs)
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
Variance and bias
- MC has high variance, zero bias
- Good convergence properties
- Not very sensitive to initial value
- Very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to v_π(s)
- More sensitive to initial value
Why MC and TD differ in variance and bias:
- The MC update is based on the actual return G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_{T}, which is an unbiased estimate of v_π(S_t).
- The true TD target R_{t+1} + \gamma v_{\pi}(S_{t+1}) is also an unbiased estimate of v_π(S_t), but the TD target actually used in the update, R_{t+1} + \gamma V(S_{t+1}), is a biased estimate of v_π(S_t).
- The TD target has much lower variance:
  - the return depends on many random actions, transitions and rewards over the rest of the episode
  - the TD target depends on only one random action, transition and reward
Markov property
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
DP, MC and TD(0)
First, the backup trees give an intuitive picture of how the three differ:
- DP backup tree: full-width one-step backup
- MC backup tree: a complete sampled episode
- TD(0) backup tree: a single sampled step
Bootstrapping vs. Sampling
Bootstrapping: the update involves an estimated value
- DP bootstraps
- MC does not bootstrap
- TD bootstraps
Sampling: the update samples an expectation
- DP does not sample (model-based methods do not need samples)
- MC samples (model-free methods learn from samples)
- TD samples (model-free methods learn from samples)
The figure below summarizes the differences between these basic RL methods: