2022-07-07 00:27:00 【Evergreen AAS】
Dynamic programming
Dynamic programming (Dynamic Programming, abbreviated DP) is a method for solving complex problems by decomposing the original problem into relatively simpler subproblems.
Dynamic programming is typically applied to problems with the following two properties:
- Optimal substructure (Optimal substructure)
  - The principle of optimality applies
  - The optimal solution can be decomposed into subproblems
- Overlapping subproblems (Overlapping subproblems)
  - Subproblems recur many times
  - Solutions can be cached and reused
Dynamic programming usually takes far less time than a naive solution.
A Markov decision process (MDP) satisfies both of these properties:
- The Bellman equation provides a recursive decomposition;
- The value function stores and reuses the results of that recursion.
Using dynamic programming to solve an MDP/MRP
Dynamic programming requires full knowledge of the MDP (model-based).
- For prediction:
  - Input: an MDP and a policy π, or an MRP
  - Output: the value function v_π
- For control:
  - Input: an MDP
  - Output: the optimal value function v_* or the optimal policy π_*
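To make the examples that follow concrete, here is one possible tabular representation of a small MDP in Python. The arrays `P`, `R`, `gamma`, and `pi`, and the two-state, two-action numbers, are purely illustrative assumptions, not anything from the original text; the later sketches reuse this layout.

```python
import numpy as np

# A tiny hypothetical tabular MDP (all numbers are made up for illustration).
# P[a, s, s'] = transition probability P_{ss'}^a
# R[a, s]     = expected immediate reward R_s^a
S, A = 2, 2
P = np.array([
    [[0.9, 0.1],   # action 0, from state 0
     [0.2, 0.8]],  # action 0, from state 1
    [[0.5, 0.5],   # action 1, from state 0
     [0.0, 1.0]],  # action 1, from state 1
])
R = np.array([
    [1.0, 0.0],    # rewards for action 0 in states 0, 1
    [2.0, -1.0],   # rewards for action 1 in states 0, 1
])
gamma = 0.9

# A stochastic policy: pi[s, a] = pi(a|s); here uniform over both actions.
pi = np.full((S, A), 0.5)
```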
Policy evaluation
Policy evaluation (Policy Evaluation) means computing the value function of a given policy; it answers the question "How to evaluate a policy".
The idea of policy evaluation: iteratively apply the Bellman expectation equation (for its MDP form, see 《Markov decision process》).
Each iteration backs up every state as follows:
v_{k+1}(s) = \sum_{a\in A}\pi(a|s) \Bigl( R_{s}^a + \gamma\sum_{s'\in S}P_{ss'}^a v_{k}(s') \Bigr)
In vector form:
\mathbf{v}_{k+1} = \mathbf{R}^{\pi} + \gamma \mathbf{P}^{\pi}\mathbf{v}_{k}
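A minimal sketch of iterative policy evaluation under the tabular layout assumed above (the array shapes and the tolerance-based stopping rule are my assumptions, not from the text):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8, max_iter=10_000):
    """Iterate the Bellman expectation backup until v stops changing.

    P  : (A, S, S) transition probabilities P_{ss'}^a
    R  : (A, S)    expected rewards R_s^a
    pi : (S, A)    policy pi(a|s)
    """
    v = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # q[a, s] = R_s^a + gamma * sum_{s'} P_{ss'}^a v(s')
        q = R + gamma * P @ v                  # shape (A, S)
        v_new = np.einsum('sa,as->s', pi, q)   # average q over pi(a|s)
        if np.max(np.abs(v_new - v)) < tol:    # converged within tolerance
            return v_new
        v = v_new
    return v
```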
Policy iteration
Policy iteration (Policy Iteration, abbreviated PI) answers the question "How to improve a policy".
Given a policy π:
- Evaluate the policy π:
v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]
- Improve the policy by acting greedily with respect to v_π: \pi' = \text{greedy}(v_{\pi})
It can be proved that policy iteration always converges to the optimal policy, i.e. π' = π_*.
Policy iteration can be summarized as alternating policy evaluation and greedy improvement until the policy is stable, as in the sketch below:
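A sketch of policy iteration built on the `policy_evaluation` helper above (the uniform starting policy and the stability test are illustrative assumptions):

```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    """Alternate full policy evaluation with greedy policy improvement."""
    A, S, _ = P.shape
    pi = np.full((S, A), 1.0 / A)            # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma, tol)
        q = R + gamma * P @ v                # action values, shape (A, S)
        pi_new = np.eye(A)[q.argmax(axis=0)] # deterministic greedy policy, (S, A)
        if np.array_equal(pi_new, pi):       # policy stable: pi' = pi_*
            return pi, v
        pi = pi_new
```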
Generalized policy iteration
From the discussion above it is not hard to see that policy evaluation is itself an iterative process:
v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]
So a question arises: does policy evaluation really need to converge to v_π? Could we introduce a stopping rule, or simply stop policy evaluation after k iterations? Going further, why not improve the policy within every iteration of policy evaluation (equivalent to stopping evaluation after 1 iteration)? Note: this last variant is exactly the value iteration introduced later.
Therefore, we can generalize the policy iteration process above into the generalized policy iteration (Generalised Policy Iteration, abbreviated GPI) framework; a truncated variant is sketched below:
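To make the question above concrete, here is a sketch of GPI in which evaluation is truncated after k sweeps; the parameter names and the fixed iteration count are my assumptions. With k = 1 this reduces to value iteration, and as k grows it approaches full policy iteration:

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma, k=3, iters=1_000):
    """GPI sketch: k evaluation sweeps, then one greedy improvement."""
    A, S, _ = P.shape
    pi = np.full((S, A), 1.0 / A)
    v = np.zeros(S)
    for _ in range(iters):
        for _ in range(k):                    # truncated policy evaluation
            q = R + gamma * P @ v
            v = np.einsum('sa,as->s', pi, q)
        q = R + gamma * P @ v                 # greedy policy improvement
        pi = np.eye(A)[q.argmax(axis=0)]
    return pi, v
```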
Value iteration
Before introducing value iteration, let's first introduce the principle of optimality.
The principle of optimality
The principle of optimality (Principle of Optimality) is defined as follows:
An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with respect to the state resulting from the first decision.
Described in slightly more mathematical language, the principle of optimality says: starting from state s, a policy π(a|s) achieves the optimal value v_{\pi}(s) = v_*(s) if and only if:
- for every state s' reachable from s,
- starting from s', the policy π achieves the optimal value: v_{\pi}(s') = v_*(s')
According to the principle of optimality, if we already have the solution v_*(s') to the subproblems, then the optimal value v_*(s) with state s as the starting point can be obtained by a one-step lookahead:
v_*(s) \leftarrow \max_{a\in A}\Bigl(R_s^a + \gamma \sum_{s'\in S}P_{ss'}^{a}v_*(s') \Bigr)
In other words, we can work backwards from the end to obtain the optimal solution; value iteration simply applies this update iteratively, as the small helper below illustrates.
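The one-step lookahead is just a max over backed-up action values; a minimal sketch under the assumed tabular layout:

```python
import numpy as np

def one_step_lookahead(v, P, R, gamma):
    """Back up once: v(s) <- max_a (R_s^a + gamma * sum_{s'} P_{ss'}^a v(s'))."""
    q = R + gamma * P @ v    # action values, shape (A, S)
    return q.max(axis=0)     # best backed-up value per state, shape (S,)
```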
MDP Value iteration
Value iteration (Value Iteration, abbreviated VI) also solves the problem of finding the optimal policy π_*. But unlike policy iteration, which uses the Bellman expectation equation, value iteration iterates with the Bellman optimality equation.
The differences between value iteration and policy iteration are:
- It uses the Bellman optimality equation rather than the Bellman expectation equation
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
The update at each iteration is:
v_{k+1}(s) = \max_{a\in A}\Bigl(R_s^a + \gamma\sum_{s'\in S}P_{ss'}^a v_k(s') \Bigr)
In vector form:
\mathbf{v}_{k+1} = \max_{a\in A}\Bigl(\mathbf{R}^a + \gamma \mathbf{P}^a\mathbf{v}_k\Bigr)
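A sketch of value iteration, which is just the one-step lookahead applied repeatedly (the stopping tolerance and the final greedy policy extraction are my assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate v <- max_a (R^a + gamma P^a v), then read off a greedy policy."""
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        q = R + gamma * P @ v        # action values, shape (A, S)
        v_new = q.max(axis=0)        # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    pi = np.eye(A)[q.argmax(axis=0)] # greedy policy w.r.t. the final values
    return pi, v
```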
To summarize the three methods:
- Prediction: Bellman expectation equation → iterative policy evaluation
- Control: Bellman expectation equation + greedy policy improvement → policy iteration
- Control: Bellman optimality equation → value iteration
Dynamic programming extensions
Asynchronous dynamic programming (Asynchronous Dynamic Programming) comes in several variants; a sketch of the in-place flavor follows this list:
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
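A minimal sketch of the in-place variant: a single value array is updated state by state, so later backups in a sweep already see the values updated earlier in the same sweep (the sweep count and array layout are illustrative assumptions):

```python
import numpy as np

def in_place_value_iteration(P, R, gamma, sweeps=100):
    """Asynchronous (in-place) DP: one value array, updated as we sweep."""
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(sweeps):
        for s in range(S):
            # This backup sees states already updated during the current sweep.
            v[s] = max(R[a, s] + gamma * P[a, s] @ v for a in range(A))
    return v
```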
Full-Width Backups vs. Sample Backups
Full-Width Backups
- DP uses full-width backups (DP is model-based)
  - Every successor state and action is considered
  - Uses full knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems, DP suffers from Bellman's curse of dimensionality
  - Curse of dimensionality: the number of states n = |S| grows exponentially with the number of state variables
  - Even a single backup can be too expensive
Sample Backups
The temporal-difference methods discussed later instead use sample backups:
- Using a sampled reward and a sampled transition ⟨S, A, R, S'⟩
- Instead of the reward function R and the transition dynamics P
- Advantages:
- Model-free: no advance knowledge of MDP required
- Breaks the curse of dimensionality through sampling
- Cost of backup is constant, independent of n=|S|
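For contrast with a full-width backup, here is a minimal sample-backup sketch in the style of TD(0), which is covered later; the step size alpha and the update form are standard but my choice, not from this text:

```python
def td0_sample_backup(v, s, r, s_next, gamma, alpha=0.1):
    """One sample backup: O(1) cost, independent of the number of states |S|.

    Nudges v(s) toward the sampled target r + gamma * v(s'), instead of
    summing over every successor state as a full-width backup does.
    """
    v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return v
```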