Model-Free Prediction
The previous article introduced the general approach for the model-based setting: dynamic programming. This article covers the prediction problem in the model-free setting, i.e. "Estimate the value function of an unknown MDP".
- Model-based: the MDP is known, i.e. the transition matrix and reward function are given
- Model-free: the MDP is unknown
Monte Carlo Learning
Monte Carlo methods (MC), also called Monte Carlo simulation, use random numbers (or, more commonly, pseudo-random numbers) to solve computational problems. The essence is to behave as randomly as possible in order to generate an empirical (posterior) distribution, and then characterize the target system with that distribution.
In the model-free setting, MC is used in reinforcement learning to estimate the value function. Its characteristics are:
- MC learns from complete episodes (no bootstrapping)
- MC estimates values by averaging returns, i.e. value = mean(return)
- MC only applies to episodic MDPs (all episodes must terminate)
First-Visit Monte-Carlo Policy Evaluation
To evaluate the value function v_{\pi}(s) of state s under a given policy \pi: within an episode, at the first time step t at which state s is visited, increment the counter N(s) ← N(s) + 1 and accumulate the return S(s) ← S(s) + G_t. When all episodes have been processed, the value of state s is estimated as V(s) = \frac{S(s)}{N(s)}. By the law of large numbers, V(s) → v_{\pi}(s) \text{ as } N(s) → \infty.
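As a concrete illustration, here is a minimal first-visit MC sketch in Python. The `sample_episode(policy)` helper, which is assumed to return one terminated episode as a list of `(state, reward)` pairs, is hypothetical and not part of the original article.

```python
from collections import defaultdict

def first_visit_mc(sample_episode, policy, num_episodes, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation (a sketch).

    `sample_episode(policy)` is assumed to return one terminated episode
    as a list of (state, reward) pairs, where reward is R_{t+1} received
    after leaving `state`.
    """
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # accumulated returns S(s)

    for _ in range(num_episodes):
        episode = sample_episode(policy)          # complete episode required
        states = [s for s, _ in episode]
        returns = [0.0] * len(episode)
        G = 0.0
        # compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # only the first visit of each state contributes in this episode
        seen = set()
        for t, s in enumerate(states):
            if s in seen:
                continue
            seen.add(s)
            N[s] += 1
            S[s] += returns[t]

    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)
```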
Every-Visit Monte-Carlo Policy Evaluation
To evaluate the value function v_{\pi}(s) of state s under a given policy \pi: within an episode, at every time step t at which state s is visited, increment the counter N(s) ← N(s) + 1 and accumulate the return S(s) ← S(s) + G_t. When all episodes have been processed, the value of state s is V(s) = S(s)/N(s). By the law of large numbers, V(s) → v_{\pi}(s) as N(s) → \infty.
Incremental Monte-Carlo
Incremental mean: the means \mu_1, \mu_2, \ldots of a sequence x_1, x_2, \ldots can be computed incrementally:

\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})
Applying this formula gives the incremental MC update: after each episode ends, V(s) is updated incrementally; for each state S_t with return G_t:
N(S_t) ← N(S_t) + 1
V(S_t) ← V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))
In non-stationary problems, the update can instead use a constant step size \alpha:
V(S_t) ← V(S_t) + \alpha (G_t - V(S_t))
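A minimal sketch of the incremental update, assuming the per-episode returns `(state, G_t)` have already been computed (for example as in the first-visit sketch above); passing a fixed `alpha` gives the constant step-size form for non-stationary problems.

```python
def incremental_mc_update(V, N, episode_returns, alpha=None):
    """Incremental every-visit MC update after one finished episode (a sketch).

    `episode_returns` is a list of (state, G_t) pairs for that episode.
    With alpha=None the running-mean step 1/N(s) is used; a fixed alpha
    gives the constant step-size update for non-stationary problems.
    """
    for s, G in episode_returns:
        N[s] = N.get(s, 0) + 1
        step = alpha if alpha is not None else 1.0 / N[s]
        v = V.get(s, 0.0)
        V[s] = v + step * (G - v)      # V(S_t) <- V(S_t) + step * (G_t - V(S_t))
    return V, N
```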
Temporal-Difference Learning
Temporal-difference methods (TD) have the following characteristics:
- TD can learn from incomplete episodes via bootstrapping
- TD updates a guess towards a guess
TD(λ)
The figure below shows the TD target for different values of n:
As the figure shows, when n reaches termination the target spans a whole episode and the corresponding method is MC; from this perspective, MC is a special case of TD.
n-step Return
The n-step return can be written as follows:

n = 1: G_{t}^{(1)} = R_{t+1} + \gamma V(S_{t+1})
n = 2: G_{t}^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
…
n = ∞ (MC): G_{t}^{(\infty)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-1} R_T

In general, the n-step return is

G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
The n-step TD update is:
V(S_t) ← V(S_t) + \alpha (G_t^{(n)} - V(S_t))
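A sketch of the n-step TD update applied after an episode has finished, assuming the trajectory is given as parallel lists `states = [S_0, ..., S_T]` and `rewards = [R_1, ..., R_T]`; this interface is illustrative only.

```python
def n_step_td_episode(states, rewards, V, n, alpha, gamma=1.0):
    """Apply n-step TD updates over one finished episode (a sketch).

    `states`  = [S_0, S_1, ..., S_T]
    `rewards` = [R_1, R_2, ..., R_T]  (reward received after leaving S_t)
    The value of the terminal state S_T is taken to be 0.
    """
    T = len(rewards)
    for t in range(T):
        # n-step return: G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
        horizon = min(t + n, T)
        G = 0.0
        for k in range(t, horizon):
            G += (gamma ** (k - t)) * rewards[k]
        if horizon < T:                       # bootstrap only if we stopped before the terminal state
            G += (gamma ** (horizon - t)) * V.get(states[horizon], 0.0)
        v = V.get(states[t], 0.0)
        V[states[t]] = v + alpha * (G - v)    # V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))
    return V
```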
Forward View of TD(λ)
Can all the n-step returns be combined? Yes: the combined return is called the λ-return, where λ is the weighting factor introduced to combine the different n-step returns.
λ-return:
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}
Forward-view TD(λ):
V(S_t) ← V(S_t) + \alpha\Bigl(G_t^{\lambda} - V(S_t)\Bigr)
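The following sketch computes G_t^{\lambda} for one time step of a finished episode by explicitly summing the weighted n-step returns; since the episode is finite, all weight beyond the terminal step is assigned to the complete (MC) return, which is the usual episodic form of the formula above. The trajectory representation matches the n-step sketch and is an assumption, not the article's API.

```python
def lambda_return(states, rewards, V, t, lam, gamma=1.0):
    """Forward-view lambda-return G_t^lambda for time step t (a sketch).

    Episodic form: G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n)
                               + lam^(T-t-1) * G_t  (full MC return).
    """
    T = len(rewards)

    def n_step_return(n):
        horizon = min(t + n, T)
        G = sum((gamma ** (k - t)) * rewards[k] for k in range(t, horizon))
        if horizon < T:
            G += (gamma ** (horizon - t)) * V.get(states[horizon], 0.0)
        return G

    G_lambda = 0.0
    # finite n-step returns, n = 1 .. T-t-1
    for n in range(1, T - t):
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(n)
    # remaining weight lam^(T-t-1) goes to the complete (MC) return
    G_lambda += (lam ** (T - t - 1)) * n_step_return(T - t)
    return G_lambda
```

The forward-view update then moves V(S_t) towards this target: V(S_t) ← V(S_t) + \alpha (G_t^{\lambda} - V(S_t)).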
The weight that TD(λ) assigns to the n-step return is (1-\lambda)\lambda^{n-1}; the weight distribution is shown in the figure below:
Characteristics of forward-view TD(λ):
- Update value function towards the λ-return
- Forward-view looks into the future to compute G_t^{\lambda}
- Like MC, can only be computed from complete episodes
Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
TD(λ) with eligibility traces:

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)
V(s) ← V(s) + \alpha \delta_t E_t(s)

where \delta_t is the TD error and E_t(s) is the eligibility trace.
Eligibility Traces
The idea of eligibility traces is to assign more credit (weight) to states that are visited frequently and to states that were visited recently.
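A minimal sketch of online backward-view TD(λ) with accumulating eligibility traces over one episode, using the same hypothetical `(states, rewards)` trajectory representation as above.

```python
from collections import defaultdict

def backward_td_lambda_episode(states, rewards, V, alpha, lam, gamma=1.0):
    """One episode of backward-view TD(lambda) with accumulating traces (a sketch).

    `states`  = [S_0, ..., S_T], `rewards` = [R_1, ..., R_T]; the terminal
    state's value is taken to be 0.  Every state with a nonzero trace is
    updated at every step, weighted by its eligibility trace E_t(s).
    """
    E = defaultdict(float)                           # eligibility traces E(s)
    T = len(rewards)
    for t in range(T):
        s, s_next, r = states[t], states[t + 1], rewards[t]
        v_next = 0.0 if t + 1 == T else V.get(s_next, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)   # TD error delta_t
        E[s] += 1.0                                  # accumulating trace: E(S_t) += 1
        for x in list(E):
            V[x] = V.get(x, 0.0) + alpha * delta * E[x]   # V(x) += alpha * delta_t * E_t(x)
            E[x] *= gamma * lam                           # decay: E(x) <- gamma * lam * E(x)
    return V
```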
The figure below illustrates how an eligibility trace evolves:
An important result about TD(λ):
The sum of offline updates is identical for forward-view and backward-view TD(λ).
This result is not covered in depth here; interested readers can refer to Sutton's book and David Silver's lectures.
Monte Carlo Learning vs. Temporal-Difference Learning
Similarities and differences between MC and TD
Similarity: both learn the value function v_{\pi} of a given policy \pi online from experience.

Differences:
- Incremental every-visit Monte-Carlo: update V(S_t) towards the actual return G_t
V(S_t) ← V(S_t) + \alpha (\textcolor{Red}{G_t} - V(S_t))
- Simplest temporal-difference learning algorithm: TD(0) (see the sketch after this list)
  - update V(S_t) towards the estimated return \color{Red}{R_{t+1} + \gamma V(S_{t+1})}:
    V(S_t) ← V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))
  - \color{Red}{R_{t+1} + \gamma V(S_{t+1})} is called the TD target
  - \delta_t = \color{Red}{R_{t+1} + \gamma V(S_{t+1})} - V(S_t) is called the TD error
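Below is a minimal tabular TD(0) sketch. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions made for illustration, not an existing API.

```python
def td0_prediction(env, policy, V, alpha, gamma=1.0, num_episodes=100):
    """Tabular TD(0) prediction (a sketch).

    Assumes a hypothetical environment where env.reset() returns a state and
    env.step(action) returns (next_state, reward, done).  policy(state)
    returns an action; V is a dict of state values.
    """
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V.get(s_next, 0.0))  # TD target
            td_error = target - V.get(s, 0.0)                           # TD error delta_t
            V[s] = V.get(s, 0.0) + alpha * td_error                     # TD(0) update
            s = s_next
    return V
```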
The Driving Home example in the figure below illustrates the difference: with MC, the prediction of the arrival time can only be revised after getting home, whereas TD adjusts its prediction at every step based on what actually happens.
Advantages and disadvantages of MC and TD
Learning style
- TD can learn before knowing the final outcome (as in the example above)
  - TD can learn online after every step
  - MC must wait until the end of the episode, when the return is known
- TD can learn without the final outcome (e.g. infinite / continuing MDPs)
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Variance and bias
- MC has high variance, zero bias
  - Good convergence properties
  - Not very sensitive to initial values
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to v_{\pi}(s)
  - More sensitive to initial values
On the variance and bias of MC and TD:

- The MC update is based on the actual return G_t = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-1}R_{T}, which is an unbiased estimate of v_{\pi}(S_t).
- The true TD target R_{t+1} + \gamma v_{\pi}(S_{t+1}) is also an unbiased estimate of v_{\pi}(S_t), but the TD target used in the actual update, R_{t+1} + \gamma V(S_{t+1}), is a biased estimate of v_{\pi}(S_t).
- The TD target has much lower variance:
  - the return depends on many random actions, transitions, and rewards
  - the TD target depends on only one random action, transition, and reward
Markov property
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
DP, MC, and TD(0)
First, let's look at the backup trees to get an intuitive sense of the differences among the three methods.
- DP backup tree: a full-width one-step backup
- MC backup tree: a sampled complete episode
- TD(0) backup tree: a sampled single step
Bootstrapping vs. Sampling
Bootstrapping: the update is based on an estimated value
- DP bootstraps
- MC does not bootstrap
- TD bootstraps
Sampling: the update samples an expectation
- DP does not sample (model-based methods do not need to sample)
- MC samples (model-free methods need to sample)
- TD samples (model-free methods need to sample)
The figure below shows the differences among these basic RL methods: