Model-Free Prediction
2022-07-07 00:26:00 【Evergreen AAS】
The previous article introduced the general Model-based approach: dynamic programming. This article covers the Prediction problem in the Model-Free setting, i.e., "Estimate the value function of an unknown MDP".
- Model-based: the MDP is known, i.e., the transition matrix and reward function are given
- Model-Free: the MDP is unknown
Monte Carlo learning
The Monte Carlo method (Monte-Carlo Methods, MC for short), also called Monte Carlo simulation, uses random numbers (or, more commonly, pseudo-random numbers) to solve computational problems. In essence, it acts as randomly as possible to produce a posterior, and then characterizes the target system with that posterior.
In the Model-Free setting, MC is applied in reinforcement learning to estimate the value function. Its characteristics are:
- MC learns from complete episodes (no bootstrapping)
- MC estimates the value as the mean return, i.e., value = mean(return)
- MC only applies to episodic MDPs (MDPs that terminate)
First-Visit Monte Carlo Policy Evaluation
To evaluate the value function v_{\pi}(s) of state s under a given policy π: within an episode, the first time state s is visited at time step t,
- increment the counter: N(s) ← N(s) + 1
- accumulate the return: S(s) ← S(s) + G_t
When all episodes have been processed, the value of state s is V(s) = \frac{S(s)}{N(s)}.
By the Law of Large Numbers: V(s) → v_{\pi}(s) \text{ as } N(s) → \infty
Every-Visit Monte Carlo Policy Evaluation
To evaluate v_{\pi}(s): within an episode, every time state s is visited at time step t,
- increment the counter: N(s) ← N(s) + 1
- accumulate the return: S(s) ← S(s) + G_t
When all episodes have been processed, V(s) = \frac{S(s)}{N(s)}.
By the Law of Large Numbers: V(s) → v_{\pi}(s) \text{ as } N(s) → \infty
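To make the two variants concrete, here is a minimal tabular sketch in Python. The function name and the episode format (a `sample_episode()` callable returning `(state, reward)` pairs) are illustrative assumptions, not from the original post; flipping `first_visit` switches between first-visit and every-visit MC.

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, num_episodes, gamma=1.0, first_visit=True):
    """Tabular MC prediction under a fixed policy pi.

    `sample_episode()` is assumed to return one episode as a list of
    (state, reward) pairs, where reward is R_{t+1}, received after leaving that state.
    """
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # accumulated return S(s)

    for _ in range(num_episodes):
        episode = sample_episode()
        states = [s for s, _ in episode]

        # Backwards pass: G_t = R_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G

        for t, (s, _) in enumerate(episode):
            # First-visit MC: skip s if it already appeared earlier in this episode.
            if first_visit and s in states[:t]:
                continue
            N[s] += 1
            S[s] += returns[t]

    # V(s) = S(s) / N(s)
    return {s: S[s] / N[s] for s in N}
```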
Incremental Monte-Carlo
Incremental mean: the means μ_1, μ_2, … of a sequence x_1, x_2, … can be computed incrementally:
\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})
Applying this formula, we obtain the incremental MC update: after each episode ends, V(s) is updated incrementally. For each state S_t with return G_t:
N(S_t) ← N(S_t) + 1 \\ V(S_t) ← V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))
For non-stationary problems, the update can instead use a constant step size α:
V(S_t) ← V(S_t) + \alpha (G_t - V(S_t))
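A minimal sketch of the incremental update, using the same hypothetical `(state, reward)` episode format as above. Passing `alpha=None` uses the exact running-mean step size 1/N(S_t); passing a constant `alpha` gives the non-stationary variant.

```python
from collections import defaultdict

def incremental_mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Incremental every-visit MC update, applied once an episode has finished.

    V: dict mapping state -> current value estimate
    N: dict mapping state -> visit count
    episode: list of (state, reward) pairs, reward being R_{t+1}
    """
    # Compute G_t for every step with a backwards pass.
    G = 0.0
    returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        G = episode[t][1] + gamma * G
        returns[t] = G

    for t, (s, _) in enumerate(episode):
        N[s] += 1
        step = alpha if alpha is not None else 1.0 / N[s]   # 1/N(S_t) or constant alpha
        V[s] += step * (returns[t] - V[s])                  # V(S_t) += step * (G_t - V(S_t))
    return V


# Usage sketch (sample_episode and num_episodes are hypothetical):
# V, N = defaultdict(float), defaultdict(int)
# for _ in range(num_episodes):
#     incremental_mc_update(V, N, sample_episode(), gamma=0.9)
```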
Temporal-Difference Learning
Characteristics of the Temporal-Difference method (Temporal-Difference Methods, TD for short):
- TD can learn from incomplete episodes by bootstrapping
- TD updates a guess towards a guess
TD(λ)
The figure below illustrates the TD target for different values of n:
As the figure shows, when n extends all the way to termination, the target spans a whole episode and the corresponding method is MC; from this point of view, MC is a special case of TD.
n-step Return
The n-step return can be written as follows:
n = 1: G_{t}^{(1)} = R_{t+1} + \gamma V(S_{t+1})
n = 2: G_{t}^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
…
n = ∞: G_{t}^{(\infty)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-t-1} R_T
In general, the n-step return is G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
The n-step TD update is:
V(S_t) ← V(S_t) + \alpha (G_t^{(n)} - V(S_t))
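A sketch of the n-step return and the corresponding update over a recorded episode. The trajectory layout (`states[t]` is S_t, `rewards[t]` is R_{t+1}), the function names, and the assumption that `V` is something like a `collections.defaultdict(float)` are all illustrative; the bootstrap term is dropped when t + n reaches the end of the episode.

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    rewards[k] is R_{k+1}, states[k] is S_k; if t + n reaches the terminal
    step T, the bootstrap term is omitted (terminal value is 0).
    """
    T = len(rewards)
    G = 0.0
    for k in range(t, min(t + n, T)):
        G += (gamma ** (k - t)) * rewards[k]
    if t + n < T:
        G += (gamma ** n) * V[states[t + n]]
    return G


def n_step_td_update(V, states, rewards, n, alpha=0.1, gamma=1.0):
    """Offline n-step TD: V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t)) for each t."""
    for t in range(len(rewards)):
        G = n_step_return(rewards, states, V, t, n, gamma)
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```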
Forward View of TD(λ)
Can we combine all the n-step returns? Yes: the combined return is called the λ-return, where λ is the weighting factor introduced to mix the different n-step returns.
λ-return:
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}
Forward-view TD(λ):
V(S_t) ← V(S_t) + \alpha\Bigl(G_t^{\lambda} - V(S_t)\Bigr)
TD(λ) weights the n-step return G_t^{(n)} by (1-\lambda)\lambda^{n-1}; the weight distribution is shown in the figure below:
Characteristics of forward-view TD(λ):
- Update value function towards the λ-return
- Forward-view looks into the future to compute G_t^{\lambda}
- Like MC, can only be computed from complete episodes
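A forward-view sketch that reuses the hypothetical `n_step_return` helper defined above. Since the episode is finite, the weights of all n-step returns that would run past termination collapse onto the full (MC) return with weight λ^{T−t−1}, so the weights still sum to one.

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lam) * sum_{n>=1} lam^(n-1) * G_t^(n), truncated at termination."""
    T = len(rewards)
    G_lambda, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):                    # n-step returns that still bootstrap
        G_lambda += weight * n_step_return(rewards, states, V, t, n, gamma)
        weight *= lam
    full_return = n_step_return(rewards, states, V, t, T - t, gamma)  # plain MC return
    G_lambda += (lam ** (T - t - 1)) * full_return
    return G_lambda


def forward_td_lambda_update(V, states, rewards, lam=0.9, alpha=0.1, gamma=1.0):
    """Forward-view TD(lambda): V(S_t) <- V(S_t) + alpha * (G_t^lambda - V(S_t))."""
    for t in range(len(rewards)):
        G = lambda_return(rewards, states, V, t, lam, gamma)
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```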
Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
TD(λ) with eligibility traces:
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
V(s) ← V(s) + \alpha \delta_t E_t(s)
where δ_t is the TD error and E_t(s) is the eligibility trace.
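A minimal online sketch of backward-view TD(λ) with accumulating eligibility traces. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` callable are illustrative assumptions, not from the original post.

```python
from collections import defaultdict

def backward_td_lambda(env, policy, num_episodes, lam=0.9, alpha=0.1, gamma=1.0):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                  # eligibility traces, reset each episode
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]
            E[state] += 1.0                     # accumulate trace for the visited state
            for s in list(E):
                V[s] += alpha * delta * E[s]    # V(s) <- V(s) + alpha * delta_t * E_t(s)
                E[s] *= gamma * lam             # decay: E(s) <- gamma * lam * E(s)
            state = next_state
    return V
```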
Eligibility Traces
The essence of an eligibility trace is to give higher credit/weight to frequently and recently visited states; a common (accumulating) form is E_0(s) = 0, E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s).
The figure below illustrates eligibility traces:
There is a known result about TD(λ):
The sum of offline updates is identical for forward-view and backward-view TD(λ).
This article will not go deeper into this; interested readers can refer to Sutton's book and David Silver's tutorial.
Monte Carlo Learning vs. Temporal-Difference Learning
Similarities and differences between MC and TD
Similarity: both learn the value function v_{\pi} of a given policy π online from experience.
Differences:
- Incremental every-visit Monte-Carlo: update V(S_t) toward the actual return G_t
V(S_t) ← V(S_t) + \alpha (\textcolor{Red}{G_t} - V(S_t))
- Simplest temporal-difference learning algorithm: TD(0) (a minimal sketch follows this list)
- Update V(S_t) toward the estimated return \color{Red}{R_{t+1} + \gamma V(S_{t+1})}
- \color{Red}{R_{t+1} + \gamma V(S_{t+1})} is called the TD target
- \color{Red}{R_{t+1} + \gamma V(S_{t+1})} - V(S_t) is called the TD error
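For comparison, a minimal TD(0) sketch using the same hypothetical environment interface as the backward-view sketch above; note that, unlike MC, it updates after every step using the TD target rather than waiting for the full return G_t.

```python
from collections import defaultdict

def td_zero(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0): move V(S_t) towards the TD target R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_target = reward + (0.0 if done else gamma * V[next_state])  # TD target
            td_error = td_target - V[state]                                # TD error
            V[state] += alpha * td_error
            state = next_state
    return V
```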
The Driving Home example in the figure below illustrates the difference between the two: MC can only revise its prediction of the arrival time after actually getting home, whereas TD adjusts its prediction at every step according to what actually happens along the way.
Advantages and disadvantages of MC and TD
Learning style
- TD can learn before knowing the final outcome (as in the example above)
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn without a final outcome (e.g., infinite / continuing MDPs)
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
Variance and bias
- MC has high variance, zero bias
- Good convergence properties
- Not very sensitive to initial value
- Very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to v_{\pi}(s)
- More sensitive to initial value
Explanation of the variance and bias of MC and TD:
- The MC update is based on the actual return G_t = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-t-1}R_{T}, which is an unbiased estimate of v_{\pi}(S_t).
- The true TD target R_{t+1} + \gamma v_{\pi}(S_{t+1}) is also an unbiased estimate of v_{\pi}(S_t), but the TD target actually used in the update, R_{t+1} + \gamma V(S_{t+1}), is a biased estimate of v_{\pi}(S_t).
- The TD target has much lower variance:
  - the return depends on many random actions, transitions, and rewards
  - the TD target depends on only one random action, transition, and reward
Markov property
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
DP, MC, and TD(0)
First, let's build an intuition for the differences among the three from their backup trees.
- DP backup tree: full-width one-step backup
- MC backup tree: a complete episode
- TD(0) backup tree: a single sampled step
Bootstrapping vs. Sampling
Bootstrapping: the update is based on an estimated value
- DP bootstraps
- MC does not bootstrap
- TD bootstraps
Sampling: the update samples an expectation
- DP does not sample (model-based methods do not need samples)
- MC samples (model-free methods need samples)
- TD samples (model-free methods need samples)
The figure below shows the differences among several basic RL methods: