Model-Free Control
The previous post summarized Model-Free Prediction problems and methods. This post covers Model-Free Control, i.e. how to "optimise the value function of an unknown MDP".
Note that Model-Free Prediction/Control is not only useful when no model is available; it also applies to problems where the MDP is known:
- MDP model is unknown, but experience can be sampled.
- MDP model is known, but is too big to use, except by samples.
Before formally introducing Model-Free Control methods, let's first distinguish On-policy Learning from Off-policy Learning.
On-policy Learning vs. Off-policy Learning
On-policy Learning:
- “Learn on the job”
- Learn about policy π from experience sampled from π (i.e., the policy used to sample experience is the same as the policy being learned)
Off-policy Learning:
- “Look over someone’s shoulder”
- Learn about policy π from experience sampled from μ (i.e., the policy used to sample experience differs from the policy being learned)
On-Policy Monte-Carlo Learning
Generalized Policy Iteration
On to concrete control methods. The Dynamic Programming post introduced the Model-based Generalized Policy Iteration (GPI) framework; does it also apply in the Model-Free case? The figure below shows the Model-based GPI framework, which has two main parts: policy evaluation, and greedy policy improvement.
Model-Free Policy Evaluation
In the Model-Free Predict post we introduced two Model-Free policy evaluation methods: MC and TD. Here we take MC as the example for Model-Free policy evaluation. As shown in the GPI framework above:
- Greedy policy improvement based on V(s) requires the MDP to be known:
\pi'(s) = \arg\max_{a\in A}\Bigl(R_{s}^{a}+\sum_{s'\in S}P_{ss'}^{a}V(s')\Bigr)
- Greedy policy improvement based on Q(s,a) does not require a known MDP, i.e. it is Model-Free:
\pi'(s) = \arg\max_{a\in A}Q(s, a)
Therefore, in the Model-Free setting, policy evaluation should be performed on Q(s,a), and the whole GPI policy iteration should likewise be based on Q(s,a).
Model-Free Policy Improvement
Having fixed the object of policy evaluation, the next question is how to improve the policy based on the evaluated Q(s,a). Because Model-Free policy evaluation is based on sampled experience, the estimated q(s,a) is noisy and only reflects the actions that were actually tried. A purely greedy policy could therefore drive the whole policy iteration into a local optimum, so instead we use the exploring ϵ-greedy algorithm:
\pi(a|s) = \begin{cases} \frac{\epsilon}{m} + 1 - \epsilon, &\text{if } a = a^*=\arg\max_{a'\in A}Q(s, a')\\ \frac{\epsilon}{m}, &\text{otherwise} \end{cases}
where m = |A| is the number of actions.
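As a concrete illustration, here is a minimal sketch of ϵ-greedy action selection over a tabular Q; the function name and the use of numpy are my own choices, not from the original post:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """ϵ-greedy over a tabular Q of shape (n_states, n_actions):
    explore uniformly with probability ϵ (so each action gets ϵ/m),
    otherwise act greedily (the greedy action gets 1 - ϵ + ϵ/m in total)."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

# Usage: epsilon_greedy(Q, state, 0.1, np.random.default_rng(0))
```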
Putting these together, we obtain Monte-Carlo Control in the Model-Free setting: MC policy evaluation of Q(s,a) plus ϵ-greedy policy improvement.
GLIE
First, quoting David Silver's slides: GLIE (Greedy in the Limit with Infinite Exploration) requires that every state-action pair is explored infinitely many times and that the policy converges to a greedy policy.
For the ϵ-greedy algorithm, if ϵ decays to zero as the iteration count k grows, then ϵ-greedy is GLIE, e.g. with the schedule:
\epsilon_{k} = \frac{1}{k}
GLIE Monte-Carlo Control
- For each state S_t and action A_t in the episode:
N(S_t, A_t) ← N(S_t, A_t) + 1 \\ Q(S_t, A_t) ← Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))
- Improve the policy based on the new action-value function:
\epsilon ← \frac{1}{k}\\ \pi ← \epsilon\text{-greedy}(Q)
Theorem: GLIE Monte-Carlo Control converges to the optimal action-value function, i.e.:
Q(s, a) → q_*(s, a)
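Putting the pieces together, here is a minimal sketch of GLIE Monte-Carlo control (every-visit, incremental updates). The environment interface — `reset() -> state`, `step(a) -> (next_state, reward, done)`, and an `n_actions` attribute — is a hypothetical stand-in, not something defined in the original post:

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control sketch: MC evaluation of Q plus
    ϵ-greedy improvement with the schedule ϵ_k = 1/k."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))
    N = defaultdict(lambda: np.zeros(env.n_actions))

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                       # GLIE schedule: ϵ_k = 1/k
        # Generate one episode following the current ϵ-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(env.n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Incremental every-visit MC update: Q <- Q + (G - Q) / N.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```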
On-Policy Temporal-Difference Learning
Sarsa
We previously summarized the advantages of TD over MC:
- Lower variance
- Online updates (no need to wait for the end of an episode)
- Can learn from incomplete sequences
A natural idea, then, is to use TD in place of MC in the control loop:
- Use TD to estimate Q(S,A)
- Still use ϵ-greedy policy improvement
- Update at every time step
With these changes, the On-Policy Monte-Carlo method becomes the well-known Sarsa algorithm.
- Update the action-value function: Q(S,A) ← Q(S,A) + \alpha\bigl(R + \gamma Q(S',A') - Q(S,A)\bigr)
- Control
The pseudocode of the Sarsa algorithm is as follows:
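A minimal Python sketch of tabular Sarsa, in the spirit of that pseudocode (not a reproduction of it), assuming the same hypothetical environment interface (`reset`, `step`, `n_actions`) as above:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy TD control (Sarsa) sketch over a hypothetical tabular env."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))

    def pick(state):                            # ϵ-greedy: behaviour = target policy
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = pick(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = pick(next_state)
            # Sarsa target uses the action actually taken next: R + γ Q(S', A').
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```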
Sarsa(λ)
The n-step Sarsa returns can be written as follows.
When n=1 (Sarsa): q_{t}^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})
When n=2: q_{t}^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})
…
When n=∞ (MC): q_{t}^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_{T}
In general, the n-step return is: q_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})
The n-step Sarsa update formula is:
Q(S_t,A_t) ← Q(S_t,A_t) + \alpha (q_t^{(n)} - Q(S_t,A_t))
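A small helper showing how such an n-step return could be assembled from a recorded trajectory. The data layout (`rewards[k]` holding R_{k+1}, parallel `states`/`actions` lists, `Q` mapping a state to an array of action values) is my own assumption for the sketch:

```python
def n_step_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n}).
    If t + n runs past the end of the episode, the return falls back to the
    Monte-Carlo return (no bootstrap term)."""
    T = len(rewards)                             # episode length (number of rewards)
    steps = min(n, T - t)
    G = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:                                # bootstrap from Q(S_{t+n}, A_{t+n})
        G += gamma ** n * Q[states[t + n]][actions[t + n]]
    return G
```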
The pseudocode of the Sarsa(λ) algorithm is as follows:
where E(s,a) is the eligibility trace.
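A minimal backward-view Sarsa(λ) sketch with accumulating eligibility traces E(s,a), again under the hypothetical `reset`/`step`/`n_actions` environment interface rather than any API from the original post:

```python
import numpy as np
from collections import defaultdict

def sarsa_lambda(env, n_episodes, lam=0.9, alpha=0.1, gamma=1.0,
                 epsilon=0.1, seed=0):
    """Backward-view Sarsa(λ) sketch with accumulating traces E(s, a)."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))

    def pick(state):
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        E = defaultdict(lambda: np.zeros(env.n_actions))   # traces reset per episode
        state, done = env.reset(), False
        action = pick(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = pick(next_state)
            delta = (reward + (0.0 if done else gamma * Q[next_state][next_action])
                     - Q[state][action])
            E[state][action] += 1.0                        # accumulating trace
            for s in list(E.keys()):                       # update all traced pairs
                Q[s] += alpha * delta * E[s]
                E[s] *= gamma * lam                        # decay traces
            state, action = next_state, next_action
    return Q
```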
The figure below shows Sarsa(λ) applied to the Gridworld example:
Off-Policy Learning
The defining characteristic of Off-Policy Learning is that we evaluate a target policy π(a|s) to compute v_π(s) or q_π(s,a), while the experience {S_1, A_1, R_2, …, S_T} is generated by following a behaviour policy μ(a|s).
Why does Off-Policy Learning matter?
- Learn from observing humans or other agents
- Re-use experience generated from old policies π_1, π_2, …, π_{t−1}
- Learn about optimal policy while following exploratory policy
- Learn about multiple policies while following one policy
Importance sampling
The purpose of importance sampling is to estimate the expectation of a function under one distribution using samples drawn from a different distribution:
\begin{align} E_{X\sim P}[f(X)] &= \sum_X P(X)f(X)\\ &= \sum_X Q(X)\frac{P(X)}{Q(X)}f(X)\\ &= E_{X\sim Q}\Bigl[\frac{P(X)}{Q(X)}f(X)\Bigr] \end{align}
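A tiny numeric illustration of this identity: estimate E_{X∼P}[f(X)] from samples drawn from Q by reweighting each sample with P(X)/Q(X). The distributions and f below are invented purely for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two distributions over {0, 1, 2} (made up for illustration) and a test function f.
P = np.array([0.7, 0.2, 0.1])   # target distribution
Q = np.array([1/3, 1/3, 1/3])   # sampling (behaviour) distribution
f = np.array([1.0, 5.0, 10.0])  # f(x) for x = 0, 1, 2

x = rng.choice(3, size=100_000, p=Q)          # sample X ~ Q
estimate = np.mean((P[x] / Q[x]) * f[x])      # E_Q[(P/Q) f] ≈ E_P[f]
exact = np.sum(P * f)                         # 0.7*1 + 0.2*5 + 0.1*10 = 2.7
print(estimate, exact)
```

Both printed numbers should be close to 2.7; the reweighting corrects for the samples having been drawn from Q rather than P.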
Off-Policy MC Importance sampling
Use returns generated by the behaviour policy μ to evaluate the target policy π, weighting the return G_t by the importance-sampling ratios along the whole episode:
G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}...\frac{\pi(A_T|S_T)}{\mu(A_T|S_T)}G_t
Update the value towards the corrected return:
V(S_t) ← V(S_t) + \alpha\Bigl(\textcolor{Red}{G_t^{\pi/\mu}}-V(S_t)\Bigr)
Two things to note:
- Cannot be used if μ is zero where π is non-zero
- Importance sampling can dramatically increase the variance
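A sketch of how the weighted return G_t^{π/μ} could be computed for one recorded episode; representing π and μ as (n_states, n_actions) arrays of action probabilities is my own choice for the illustration:

```python
import numpy as np

def weighted_return(states, actions, rewards, pi, mu, gamma, t=0):
    """G_t^{π/μ}: the return from time t, multiplied by the product of the
    per-step probability ratios π(A_k|S_k) / μ(A_k|S_k) for k = t, ..., T-1.
    pi and mu have shape (n_states, n_actions); rewards[k] holds R_{k+1}."""
    T = len(rewards)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    ratio = np.prod([pi[states[k], actions[k]] / mu[states[k], actions[k]]
                     for k in range(t, T)])
    return ratio * G
```

The product of per-step ratios is exactly what makes the two caveats above bite: it blows up where μ is zero but π is not, and it can inflate or shrink the return by many orders of magnitude.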
Off-Policy TD Importance sampling
TD bootstraps after a single step, so only one importance-sampling correction is needed: use the TD target generated by μ, weighted by the ratio π/μ, to evaluate π:
V(S_t) ← V(S_t) + \alpha\Bigl(\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1}+\gamma V(S_{t+1}))-V(S_t)\Bigr)
- The variance is much lower than that of Monte-Carlo importance sampling
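A one-line sketch of this update, with V as a 1-D array and π, μ as (n_states, n_actions) probability tables (my own representation):

```python
def off_policy_td_update(V, pi, mu, s, a, r, s_next, alpha, gamma):
    """One off-policy TD(0) update: a single ratio π(a|s)/μ(a|s)
    corrects the TD target R + γ V(S')."""
    rho = pi[s, a] / mu[s, a]
    V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])
    return V
```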
Q-Learning
So far we have discussed off-policy learning of the state-value function V(s); now consider off-policy learning of the action-value function Q(s,a):
- No importance sampling is required
- The next action is chosen using the behaviour policy: A_{t+1}\sim\mu(\cdot|S_{t+1})
- But we also consider an alternative successor action: A'\sim\pi(\cdot|S_{t+1})
- And update Q(S_t, A_t) towards the value of the alternative action:
Q(S_t, A_t) ← Q(S_t, A_t) + \alpha\Bigl(R_{t+1}+\gamma Q(S_{t+1}, A')-Q(S_t, A_t)\Bigr)
Having covered learning the action-value function, let's see how Q-Learning performs Control:
- Both the behaviour policy and the target policy are improved
- The target policy π is improved greedily with respect to Q:
\pi(S_{t+1}) = \arg\max_{a'}Q(S_{t+1}, a')
- The behaviour policy μ is improved as ϵ-greedy with respect to Q
- The Q-Learning target then simplifies to:
\begin{align} &R_{t+1}+\gamma Q(S_{t+1}, A')\\ =&R_{t+1}+\gamma Q\Bigl(S_{t+1}, \arg\max_{a'}Q(S_{t+1}, a')\Bigr)\\ =&R_{t+1}+\gamma\max_{a'} Q(S_{t+1}, a') \end{align}
The backup diagram of Q-Learning is shown below:
Conclusion about Q-Learning:
Q-learning control converges to the optimal action-value function: Q(s,a) → q_*(s,a)
The pseudocode of the Q-Learning algorithm is as follows:
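In the same spirit (not a reproduction of that pseudocode), here is a minimal tabular Q-Learning sketch under the hypothetical `reset`/`step`/`n_actions` environment interface used above, behaving ϵ-greedily while bootstrapping from the greedy action:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Off-policy TD control (Q-Learning) sketch: behave ϵ-greedily,
    but bootstrap from the greedy action max_a' Q(S', a')."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy μ: ϵ-greedy w.r.t. the current Q.
            if rng.random() < epsilon:
                action = int(rng.integers(env.n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Target policy π: greedy -> target R + γ max_a' Q(S', a').
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```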
Comparing Sarsa and Q-Learning, two key differences stand out:
- Their TD targets are computed differently
- In Q-Learning, the action actually executed next comes from the behaviour policy, while the action used in the TD target comes from the greedy target policy; in Sarsa, the same ϵ-greedy action serves both roles
DP vs. TD
The differences between the two are summarized in the table below: