Markov decision process
2022-07-07 00:26:00 【Evergreen AAS】
Episodic Tasks vs. Continuing Tasks
- Episodic tasks (Episodic Tasks): the whole task decomposes into a sequence of episodes and can be viewed as a finite-step task.
- Continuing tasks (Continuing Tasks): the task cannot be decomposed into episodes and can be viewed as an infinite-step task.
The Markov Property
The Markov property: given the present state and all past states of a stochastic process, the conditional probability distribution of its future states depends only on the current state.
A Markov process is a stochastic process that satisfies the Markov property: the conditional distribution of its next state depends only on the current state of the system, not on its past history.
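To make the property concrete, here is a minimal Python sketch of sampling a Markov chain, where the next state is drawn from a distribution that depends only on the current state; the two-state transition matrix is invented for illustration:

```python
import numpy as np

# Hypothetical two-state transition matrix, P[i][j] = P(next = j | current = i).
P = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # The Markov property: only `state` enters the sampling step;
    # the earlier history stored in `trajectory` is irrelevant.
    state = rng.choice(2, p=P[state])
    trajectory.append(state)
print(trajectory)
```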
Markov reward process
A Markov reward process (Markov Reward Process, MRP) is a Markov process augmented with reward values. It can be represented by a four-tuple \langle S, P, R, \gamma \rangle:
- S is a finite set of states;
- P is the state transition matrix, P_{ss'} = P[S_{t+1} = s'|S_t = s];
- R is the reward function;
- \gamma is the discount factor (discount factor), where \gamma \in [0,1].
Return
The return G_t at time t:
G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}
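As a quick worked example, the sketch below computes the return for a short, made-up reward sequence with \gamma = 0.9 (a finite sequence, so the infinite sum truncates):

```python
# Made-up reward sequence R_{t+1}, R_{t+2}, ... and discount factor.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

# G_t = sum_k gamma^k * R_{t+k+1}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```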
Why Discount
Why the return G_t includes the discount factor \gamma:
- Mathematical convenience
- Avoids infinite returns in cyclic Markov processes
- Long-term rewards carry uncertainty
- In finance, an immediate reward earns more interest than a delayed one
- Matches the human tendency to prefer immediate rewards
Value function
The long-term value of state s is expressed by the value function:
v(s) = E[G_t|S_t = s]
Bellman Equation for MRPs
\begin{align} v(s) &= E[G_t|S_t=s]\\ &= E[R_{t+1} + \gamma R_{t+2} + ... | S_t = s]\\ &= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) | S_t = s]\\ &= E[R_{t+1} + \gamma G_{t+1} | S_t = s]\\ &= E[R_{t+1} + \gamma v(S_{t+1}) | S_t = s] \end{align}
The figure below shows the backup tree for an MRP.
Note: in a backup tree, white circles represent states and black dots represent actions.
From the backup tree we can further obtain:
v(s) = R_s + \gamma \sum_{s' \in S}P_{ss'}v(s')
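Because this Bellman equation is linear in v, a finite MRP can be solved directly: v = (I - \gamma P)^{-1} R. A minimal sketch, using an invented three-state MRP:

```python
import numpy as np

# Hypothetical three-state MRP: transition matrix P, reward vector R_s.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing terminal state
])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Solve (I - gamma * P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```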
Markov decision process
A Markov decision process (Markov Decision Process, MDP) is an MRP with decisions. It can be represented by a five-tuple \langle S, A, P, R, \gamma \rangle:
- S is a finite set of states;
- A is a finite set of actions;
- P is the state transition matrix, P_{ss'}^{a} = P[S_{t+1} = s'|S_t = s, A_t = a];
- R is the reward function;
- \gamma is the discount factor (discount factor), where \gamma \in [0,1].
Unless stated otherwise, MDP here refers to a finite (discrete) Markov decision process.
Policy
A policy (Policy) is a probability distribution over actions given a state:
\pi(a|s) = P[A_t = a|S_t = s]
State-Value Function & Optimal State-Value Function
The state-value function (State-Value Function) v_{\pi}(s) of state s under a given policy \pi:
v_{\pi}(s) = E_{\pi}[G_t|S_t = s]
The optimal state-value function (The Optimal State-Value Function) v_{*}(s) of state s:
v_{*}(s) = \max_{\pi}v_{\pi}(s)
Action-Value Function & Optimal Action-Value Function
Given policy \pi, the action-value function (Action-Value Function) q_{\pi}(s,a) for taking action a in state s:
q_{\pi}(s, a) = E_{\pi}[G_t|S_t = s, A_t = a]
The optimal action-value function (The Optimal Action-Value Function) q_{*}(s,a) for taking action a in state s:
q_{*}(s, a) = \max_{\pi}q_{\pi}(s, a)
The Optimal Policy
A policy \pi is better than (or as good as) a policy \pi' if:
\pi \ge \pi^{'} \text{ if } v_{\pi}(s) \ge v_{\pi^{'}}(s), \forall{s}
An optimal policy \pi_{*} satisfies:
- \pi_{*} \ge \pi, \forall \pi
- v_{\pi_{*}}(s) = v_{*}(s)
- q_{\pi_{*}}(s,a) = q_{*}(s,a)
How to find the optimal policy?
An optimal policy can be found by maximizing q_{*}(s,a):
\pi_{*}(a|s) = \begin{cases} 1 & \text{if } a=\arg\max_{a \in A}q_{*}(s,a)\\ 0 & \text{otherwise} \end{cases}
For an MDP there always exists a deterministic optimal policy, and once we know q_{*}(s,a) we can read the optimal policy off immediately, as the sketch below shows.
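A minimal sketch of this read-off, using a made-up q_{*} table with three states and two actions:

```python
import numpy as np

# Hypothetical optimal action-value table, q_star[s, a] = q_*(s, a).
q_star = np.array([
    [1.0, 2.0],
    [0.5, 0.3],
    [0.0, 0.0],
])

# The deterministic optimal policy puts probability 1 on an
# action maximizing q_*(s, a) in each state.
pi_star = q_star.argmax(axis=1)
print(pi_star)  # e.g. [1 0 0]
```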
Bellman Expectation Equation for MDPs
First consider the state-value function v_{\pi}(s).
The backup tree for state s is shown below.
From the figure:
v_{\pi}(s) = \sum_{a \in A}\pi(a|s)q_{\pi}(s, a) \qquad (1)
Next consider the action-value function q_{\pi}(s,a).
The backup tree for state s and action a is shown below.
From the figure we get:
q_{\pi}(s,a)=R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{\pi}(s') \qquad (2)
Expanding the backup tree one level further gives the mutual expressions of v_{\pi}(s) and q_{\pi}(s,a).
The expanded backup tree for state s is shown below.
Substituting equation (2) into equation (1) gives the Bellman expectation equation for v_{\pi}(s):
v_{\pi}(s) = \sum_{a \in A} \pi(a | s) \Bigl( R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{\pi}(s') \Bigr) \qquad (3)
The expanded backup tree for state s and action a is shown below.
Substituting equation (1) into equation (2) gives the Bellman expectation equation for q_{\pi}(s,a):
q_{\pi}(s,a)=R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a \Bigl(\sum_{a' \in A}\pi(a'|s')q_{\pi}(s', a') \Bigr) \qquad (4)
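Equations (2) and (3) suggest a simple fixed-point algorithm, iterative policy evaluation: start from an arbitrary v and repeatedly apply the Bellman expectation equation until it converges to v_{\pi}. A minimal sketch, with an invented two-state, two-action MDP and a uniform random policy:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = P_{ss'}^a, R[s, a] = R_s^a.
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],
              [2.0, 0.5]])
pi = np.full((n_s, n_a), 0.5)   # uniform random policy pi(a|s)
gamma, theta = 0.9, 1e-8

v = np.zeros(n_s)
while True:
    q = R + gamma * P @ v            # equation (2): q_pi(s, a)
    v_new = (pi * q).sum(axis=1)     # equations (1)/(3): v_pi(s)
    if np.abs(v_new - v).max() < theta:
        break
    v = v_new
print(v)
```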
Bellman Optimality Equation for MDPs
First consider v_{*}(s):
The corresponding equation is:
v_{*}(s) = \max_{a}q_{*}(s, a) \qquad (5)
Next consider q_{*}(s,a):
The corresponding equation is:
q_{*}(s, a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{*}(s') \qquad (6)
Following the same procedure with the backup tree for v_{*}(s), we obtain its Bellman optimality equation:
v_{*}(s) = \max_{a} \Bigl( R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{*}(s') \Bigr) \qquad (7)
Similarly, the backup tree for q_{*}(s,a) gives its Bellman optimality equation:
q_{*}(s, a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a\max_{a'}q_{*}(s', a') \qquad (8)
Characteristics of the Bellman optimality equation
- Nonlinear (non-linear)
- In general no analytical solution (no closed-form solution)
Solving the Bellman optimality equation
- Value Iteration (a minimal sketch follows this list)
- Policy Iteration
- Sarsa
- Q-Learning
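As one concrete instance, here is a minimal sketch of Value Iteration, which repeatedly applies the Bellman optimality equation (7) and then reads off a greedy policy; it reuses the invented two-state MDP from the policy-evaluation sketch above:

```python
import numpy as np

# Same hypothetical MDP as in the policy-evaluation sketch.
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],
              [2.0, 0.5]])
gamma, theta = 0.9, 1e-8

v = np.zeros(n_s)
while True:
    q = R + gamma * P @ v        # equation (6): q_*(s, a) given current v
    v_new = q.max(axis=1)        # equation (7): v_*(s) = max_a q_*(s, a)
    if np.abs(v_new - v).max() < theta:
        break
    v = v_new

pi_star = q.argmax(axis=1)       # greedy (optimal) policy from q_*
print(v, pi_star)
```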
Extensions of MDPs
- Infinite / continuous MDPs
- Partially observable MDPs (POMDPs)
- Undiscounted MDPs / average-reward MDPs