Model-Free Prediction
2022-07-07 00:26:00 【Evergreen AAS】
The previous article introduced the general approach for the Model-based setting: dynamic programming. This article covers the Prediction problem in the Model-Free setting, i.e. "Estimate the value function of an unknown MDP".
- Model-based: the MDP is known, i.e. the transition matrix and reward function are given
- Model-Free: the MDP is unknown
Monte Carlo learning
The Monte Carlo method (Monte-Carlo Methods, abbreviated MC), also known as Monte Carlo simulation, uses random numbers (or, more commonly, pseudo-random numbers) to solve computational problems. In essence, it generates a posterior sample by acting as randomly as possible, and then characterizes the target system through that posterior.
In the Model-Free setting, MC is applied in reinforcement learning to estimate the value function. Its characteristics are:
- MC learns from complete episodes (no bootstrapping)
- MC estimates value by the empirical mean, i.e. value = mean(return)
- MC only applies to episodic MDPs (finite, terminating episodes)
First-Visit Monte-Carlo Policy Evaluation
To evaluate the value function v_π(s) of state s under a given policy π: in each episode, the first time state s is visited at time step t, increment the counter N(s) ← N(s) + 1 and accumulate the return S(s) ← S(s) + G_t. When all episodes are finished, the value of state s is estimated as V(s) = \frac{S(s)}{N(s)}. By the law of large numbers (Law of Large Numbers): V(s) → v_{\pi}(s) as N(s) → ∞
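A minimal sketch of first-visit MC policy evaluation in Python follows. The environment interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`), the `policy` callable, and the use of `defaultdict` containers are illustrative assumptions, not part of the original text:

```python
from collections import defaultdict

def first_visit_mc_evaluation(env, policy, num_episodes, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation (sketch)."""
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # cumulative return S(s)
    V = defaultdict(float)  # value estimate V(s)

    for _ in range(num_episodes):
        # Generate one complete episode under the given policy
        episode = []              # list of (state, reward R_{t+1}) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards to compute returns G_t; because we iterate in reverse,
        # the last value written for each state is the return of its FIRST visit
        G = 0.0
        first_visit_return = {}
        for t in reversed(range(len(episode))):
            s_t, r_t1 = episode[t]
            G = r_t1 + gamma * G
            first_visit_return[s_t] = G
        for s, G_first in first_visit_return.items():
            N[s] += 1
            S[s] += G_first
            V[s] = S[s] / N[s]
    return V
```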
Every-Visit Monte-Carlo Policy Evaluation
To evaluate the value function v_π(s) of state s under a given policy π: in each episode, every time state s is visited at time step t, increment the counter N(s) ← N(s) + 1 and accumulate the return S(s) ← S(s) + G_t. When all episodes are finished, the value of state s is estimated as V(s) = S(s)/N(s). By the law of large numbers (Law of Large Numbers): V(s) → v_π(s) as N(s) → ∞
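For comparison, a sketch of the every-visit variant: the only change is that the counters are updated at every occurrence of a state within the episode, not only the first. The `episode` format and the `defaultdict` containers are the same assumptions as in the sketch above:

```python
def every_visit_mc_update(V, N, S, episode, gamma=1.0):
    """Every-visit MC update for one finished episode of (state, reward) pairs."""
    G = 0.0
    for t in reversed(range(len(episode))):
        s_t, r_t1 = episode[t]
        G = r_t1 + gamma * G     # return G_t from time t
        N[s_t] += 1              # count EVERY visit, not just the first
        S[s_t] += G
        V[s_t] = S[s_t] / N[s_t]
    return V
```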
Incremental Monte-Carlo
Incremental averaging: the mean μ_1, μ_2, … of a sequence x_1, x_2, … can be computed incrementally:
\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})
From this formula we obtain the incremental MC update: after each episode ends, V(s) is updated incrementally. For each state S_t with return G_t:
N(S_t) ← N(S_t) + 1
V(S_t) ← V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))
For non-stationary problems, the update can instead use a constant step size α:
V(S_t) ← V(S_t) + \alpha (G_t - V(S_t))
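A sketch of the incremental form, again assuming an `episode` given as a list of `(state, reward)` pairs and `defaultdict` containers as above. When `alpha` is supplied, the constant-step-size form for non-stationary problems is used instead of 1/N(S_t):

```python
def incremental_mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Incremental every-visit MC update applied after one episode (sketch)."""
    G = 0.0
    for t in reversed(range(len(episode))):
        s_t, r_t1 = episode[t]
        G = r_t1 + gamma * G
        N[s_t] += 1
        step = alpha if alpha is not None else 1.0 / N[s_t]
        # V(S_t) <- V(S_t) + step * (G_t - V(S_t))
        V[s_t] += step * (G - V[s_t])
    return V
```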
Temporal-Difference Learning
The temporal-difference method (Temporal-Difference Methods, abbreviated TD) has the following characteristics:
- TD can learn from incomplete episodes by bootstrapping
- TD updates a guess towards a guess
TD(λ)
The figure below illustrates the TD target for different values of n:
As the figure shows, when n reaches the termination of the episode, the target becomes the full episode return and the method is exactly MC. From this point of view, MC is a special case of TD.
n-step Return
The n-step returns can be written as follows:
n = 1: G_{t}^{(1)} = R_{t+1} + \gamma V(S_{t+1})
n = 2: G_{t}^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
…
n = ∞: G_{t}^{(\infty)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-1} R_T
Therefore, the n-step return is G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
The n-step TD update formula is:
V(S_t) ← V(S_t) + \alpha (G_t^{(n)} - V(S_t))
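A sketch of one episode of n-step TD prediction, following the update above. The `env`/`policy` interfaces and `V` as a `defaultdict(float)` are the same assumptions as in the earlier MC sketches:

```python
def n_step_td_episode(env, policy, V, n=3, alpha=0.1, gamma=1.0):
    """One episode of n-step TD prediction (sketch)."""
    states = [env.reset()]   # states[t]  = S_t
    rewards = [0.0]          # rewards[t] = R_t (index 0 is a placeholder)
    T = float('inf')
    t = 0
    while True:
        if t < T:
            action = policy(states[t])
            next_state, reward, done = env.step(action)
            states.append(next_state)
            rewards.append(reward)
            if done:
                T = t + 1
        tau = t - n + 1                      # time whose value estimate is updated
        if tau >= 0:
            # G_tau^(n) = R_{tau+1} + ... + gamma^{n-1} R_{tau+n} + gamma^n V(S_{tau+n})
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * V[states[tau + n]]
            V[states[tau]] += alpha * (G - V[states[tau]])
        if tau == T - 1:
            break
        t += 1
    return V
```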
Forward View of TD(λ)
Can we combine all the n-step returns? Yes. The combined return is called the λ-return, where λ is the weighting factor introduced to combine the different n-step returns.
λ-return:
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}
Forward-view TD(λ):
V(S_t) ← V(S_t) + \alpha\Bigl(G_t^{\lambda} - V(S_t)\Bigr)
The weight that TD(λ) assigns to the n-step return is (1−λ)λ^{n−1}; the weight distribution is shown below:
Characteristics of forward-view TD(λ):
- Update value function towards the λ-return
- Forward-view looks into the future to compute G_t^{\lambda}
- Like MC, can only be computed from complete episodes
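A sketch of computing G_t^λ and applying the forward-view update offline, once an episode has finished. It assumes `rewards[k] = R_{k+1}`, `states[k] = S_k`, and `V` as a `defaultdict(float)`, as in the earlier sketches; these conventions are illustrative assumptions:

```python
def lambda_return(rewards, states, V, t, lam=0.9, gamma=1.0):
    """G_t^lambda for a terminated episode of length T = len(rewards) (sketch)."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t + 1):
        # G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} (+ gamma^n V(S_{t+n}) if not terminal)
        G_n = sum(gamma ** k * rewards[t + k] for k in range(n))
        if t + n < T:
            G_n += gamma ** n * V[states[t + n]]
            G_lambda += (1 - lam) * lam ** (n - 1) * G_n
        else:
            G_lambda += lam ** (n - 1) * G_n   # the terminal return carries the tail weight
    return G_lambda

def forward_view_td_lambda(rewards, states, V, lam=0.9, alpha=0.1, gamma=1.0):
    """Offline forward-view TD(lambda) update for every state of one episode."""
    for t in range(len(rewards)):
        G_l = lambda_return(rewards, states, V, t, lam, gamma)
        V[states[t]] += alpha * (G_l - V[states[t]])
    return V
```

As the characteristics above state, this forward view can only be applied after the complete episode is known.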
Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
TD(λ) with eligibility traces:
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
V(s) ← V(s) + \alpha \delta_t E_t(s)
where δ_t is the TD error and E_t(s) is the eligibility trace.
Eligibility Traces
The idea of the eligibility trace is to give higher credit (weight) to states that are visited frequently and visited recently:
E_0(s) = 0
E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)
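A sketch of one episode of backward-view TD(λ) with accumulating eligibility traces, combining the trace and update rules above; the `env`/`policy` interfaces and the `defaultdict(float)` value table are the same assumptions as in the earlier sketches:

```python
from collections import defaultdict

def backward_td_lambda_episode(env, policy, V, lam=0.9, alpha=0.1, gamma=1.0):
    """One episode of backward-view TD(lambda) prediction (sketch)."""
    E = defaultdict(float)              # eligibility traces E_t(s), reset each episode
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        target = reward + (0.0 if done else gamma * V[next_state])
        delta = target - V[state]
        # decay all traces, then bump the trace of the current state:
        # E_t(s) = gamma * lambda * E_{t-1}(s) + 1(S_t = s)
        for s in list(E):
            E[s] *= gamma * lam
        E[state] += 1.0
        # every state is updated in proportion to its trace: V(s) += alpha * delta * E_t(s)
        for s, e in E.items():
            V[s] += alpha * delta * e
        state = next_state
    return V
```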
The figure below illustrates the eligibility trace:
There is an important result about TD(λ):
The sum of offline updates is identical for forward-view and backward-view TD(λ).
This article does not go into more depth here; interested readers can refer to Sutton's book and David Silver's tutorial.
Monte Carlo Learning vs. Temporal-Difference Learning
Similarities and differences between MC and TD
Similarity: both learn the value function v_π of a given policy π online from experience.
Differences:
- Incremental every-visit Monte-Carlo: updates V(S_t) towards the actual return G_t
V(S_t) ← V(S_t) + \alpha (G_t - V(S_t))
- Simplest temporal-difference learning algorithm: TD(0)
  - Updates V(S_t) towards the estimated return R_{t+1} + \gamma V(S_{t+1}):
    V(S_t) ← V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))
  - R_{t+1} + \gamma V(S_{t+1}) is called the TD target
  - R_{t+1} + \gamma V(S_{t+1}) - V(S_t) is called the TD error
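A sketch of one episode of TD(0), contrasted with the MC sketches above: the update is applied at every step instead of waiting for the episode to finish (same assumed `env`/`policy` interfaces and `defaultdict(float)` value table):

```python
def td0_episode(env, policy, V, alpha=0.1, gamma=1.0):
    """One episode of TD(0) prediction (sketch)."""
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        td_target = reward + (0.0 if done else gamma * V[next_state])  # TD target
        td_error = td_target - V[state]                                # TD error
        V[state] += alpha * td_error
        state = next_state
    return V
```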
The Drive Home example in the figure below illustrates the difference between the two: with MC, the prediction of the time to get home can only be corrected after actually arriving home, whereas TD adjusts the prediction at every step according to what actually happens.
Advantages and disadvantages of MC and TD
Learning style
- TD can learn before knowing the final outcome (as in the example above)
  - TD can learn online after every step
  - MC must wait until end of episode before return is known
- TD can learn without the final outcome (e.g. infinite / continuing MDPs)
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Variance and bias
- MC has high variance, zero bias
  - Good convergence properties
  - Not very sensitive to initial values
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to v_π(s)
  - More sensitive to initial values
Explanation of the variance and bias of MC and TD:
- The MC update is based on the actual return G_t = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-1}R_{T}, which is an unbiased estimate of v_π(S_t).
- The true TD target R_{t+1} + \gamma v_{\pi}(S_{t+1}) is also an unbiased estimate of v_π(S_t), but the TD target actually used in the update, R_{t+1} + \gamma V(S_{t+1}), is a biased estimate of v_π(S_t).
- The TD target has much lower variance:
  - the return depends on many random actions, transitions and rewards over a whole episode
  - the TD target depends on only one random action, transition and reward
Markov property
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
DP, MC and TD(0)
First, the differences among the three can be understood intuitively from their backup trees:
- DP backup tree: full-width (complete) one-step backup
- MC backup tree: a complete episode
- TD(0) backup tree: a single sampled step
Bootstrapping vs. Sampling
Bootstrapping: the update is based on an estimated (predicted) value
- DP bootstraps
- MC does not bootstrap
- TD bootstraps
Sampling: the update samples an expectation
- DP does not sample (model-based methods do not need to sample)
- MC samples (model-free methods need to sample)
- TD samples (model-free methods need to sample)
The figure below shows the relationship and differences among these basic RL methods: