2022-07-07 00:27:00 【Evergreen AAS】
Dynamic programming
Dynamic programming (DP) is a method for solving complex problems by decomposing the original problem into simpler subproblems.
Dynamic programming is typically applied to problems with the following two properties:
- Optimal substructure
- Principle of optimality applies
- Optimal solution can be decomposed into subproblems
- Overlapping subproblems
- Subproblems recur many times
- Solutions can be cached and reused
Dynamic programming usually takes far less time than a naive solution.
A Markov decision process (MDP) satisfies both of these properties:
- the Bellman equation provides the recursive decomposition;
- the value function stores and reuses the results of that recursion.
Using dynamic programming to solve an MDP/MRP
Dynamic programming requires the MDP to be fully known (model-based).
- For prediction:
  - Input: an MDP and a policy π, or an MRP
  - Output: the value function v_π
- For control:
  - Input: an MDP
  - Output: the optimal value function v_* or the optimal policy π_*
Policy evaluation
Policy evaluation refers to computing the value function of a given policy; it answers the question "How to evaluate a policy".
The idea behind policy evaluation: iteratively apply the Bellman expectation equation (for its form in an MDP, see 《Markov Decision Process》).
Each sweep of policy evaluation applies the following update to every state:
v_{k+1}(s) = \sum_{a\in A}\pi(a|s) \Bigl( R_{s}^a + \gamma\sum_{s'\in S}P_{ss'}^a v_{k}(s') \Bigr)
In vector form:
\mathbf{v}_{k+1} = \mathbf{R}^{\pi} + \gamma \mathbf{P}^{\pi}\mathbf{v}_{k}
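To make the update concrete, here is a minimal sketch of iterative policy evaluation in NumPy, directly following the vector form above. The array layout (P[a, s, s'] for transitions, R[s, a] for rewards, policy[s, a] for action probabilities) is an assumption made for this example, not something fixed by the text.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: v_{k+1} = R^pi + gamma * P^pi v_k.

    P[a, s, s2] : probability of moving from s to s2 when taking action a
    R[s, a]     : expected immediate reward for taking action a in state s
    policy[s, a]: probability that the policy picks action a in state s
    """
    r_pi = np.einsum('sa,sa->s', policy, R)    # R^pi: expected reward under pi
    p_pi = np.einsum('sa,ast->st', policy, P)  # P^pi: state-to-state transitions under pi
    v = np.zeros(R.shape[0])
    while True:
        v_new = r_pi + gamma * p_pi @ v        # one Bellman expectation backup
        if np.max(np.abs(v_new - v)) < theta:  # stop when a sweep barely changes v
            return v_new
        v = v_new
```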
Policy iteration
Policy iteration (PI) answers the question "How to improve a policy".
Given a policy π:
- Evaluate the policy π:
v_{\pi}(s) = E[R_{t+1} + \gamma R_{t+2} + ...| S_t = s]
- Improve the policy by acting greedily with respect to v_π: \pi' = \text{greedy}(v_{\pi})
It can be shown that policy iteration always converges to the optimal policy, i.e. eventually π' = π_*.
Policy iteration is thus a cycle that alternates between policy evaluation and greedy policy improvement until the policy no longer changes.
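Below is a minimal sketch of this evaluate-then-improve loop, reusing the policy_evaluation routine sketched earlier. The uniform starting policy and the one-hot encoding of the greedy policy are choices made for this illustration, not prescribed by the text.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate full policy evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform policy
    while True:
        v = policy_evaluation(P, R, policy, gamma)            # evaluate pi
        # one-step lookahead: q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] v(s')
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        greedy = np.eye(n_actions)[np.argmax(q, axis=1)]      # pi' = greedy(v_pi), one-hot rows
        if np.array_equal(greedy, policy):                    # policy stable => optimal
            return greedy, v
        policy = greedy
```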
Generalized policy iteration
From the discussion above it is clear that policy evaluation is itself an iterative process:
v_{\pi}(s) = E[R_{t+1} + \gamma R_{t+2} + ...| S_t = s]
This raises some questions: does policy evaluation really need to converge to v_π? Could we introduce a stopping rule, or simply stop policy evaluation after k iterations? Going further, why not improve the policy after every single sweep of policy evaluation (i.e. stop evaluation after one iteration)? Note: that choice is exactly the value iteration introduced later.
We can therefore generalize the policy iteration process above into the generalized policy iteration (GPI) framework, in which policy evaluation and policy improvement interact at whatever granularity we choose:
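As a rough sketch of the GPI idea under the same assumed array layout, the loop below runs only k evaluation sweeps before each greedy improvement; with k = 1 it behaves essentially like value iteration, and for large k it approaches standard policy iteration. The fixed number of rounds is an arbitrary choice for the example.

```python
import numpy as np

def generalized_policy_iteration(P, R, gamma=0.9, k=3, n_rounds=100):
    """Truncated policy evaluation (k sweeps) interleaved with greedy improvement."""
    n_actions, n_states, _ = P.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    v = np.zeros(n_states)
    for _ in range(n_rounds):
        r_pi = np.einsum('sa,sa->s', policy, R)
        p_pi = np.einsum('sa,ast->st', policy, P)
        for _ in range(k):                                # evaluation stopped after k sweeps
            v = r_pi + gamma * p_pi @ v
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        policy = np.eye(n_actions)[np.argmax(q, axis=1)]  # greedy improvement
    return policy, v
```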
Value iteration
Before introducing value iteration, let us first introduce the principle of optimality.
The principle of optimality
The principle of optimality is stated as follows:
An optimal policy has the property that, whatever the initial state and the initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Stated in slightly more mathematical terms:
A policy π(a|s) achieves the optimal value from state s, v_{\pi}(s) = v_*(s), if and only if:
- for every state s' reachable from s,
- π achieves the optimal value starting from s', i.e. v_{\pi}(s') = v_*(s').
By the principle of optimality, once the solutions v_*(s') of the subproblems are known, the optimal value v_*(s) at state s can be obtained by a one-step lookahead:
v_*(s) ← \max_{a\in A}\Bigl(R_s^a + \gamma \sum_{s'\in S}P_{ss'}^{a}v_*(s') \Bigr)
In other words, we can work backwards from the end to obtain the optimal solution; value iteration applies this update iteratively based on that idea.
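A tiny worked example with invented numbers: suppose that from state s action a_1 yields reward 1 and leads deterministically to a state with optimal value 10, action a_2 yields reward 5 and leads to a state with optimal value 2, and γ = 0.9. The one-step lookahead then gives:

v_*(s) = \max\bigl(1 + 0.9 \times 10,\ 5 + 0.9 \times 2\bigr) = \max(10,\ 6.8) = 10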
Value iteration for MDPs
Value iteration (VI) also solves the control problem of finding the optimal policy π_*. But unlike policy iteration, which iterates the Bellman expectation equation, value iteration improves the value function by iterating the Bellman optimality equation.
The difference between value iteration and policy iteration is :
- Uses the Bellman optimality equation rather than the Bellman expectation equation
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
The update applied at each iteration is:
v_{k+1}(s) = \max_{a\in A}\Bigl(R_s^a + \gamma\sum_{s'\in S}P_{ss'}^a v_k(s') \Bigr)
In vector form:
\mathbf{v}_{k+1} = \max_{a\in A}\Bigl(\mathbf{R}^a + \gamma \mathbf{P}^a\mathbf{v}_{k}\Bigr)
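A minimal value iteration sketch under the same assumed array layout as the earlier snippets; the greedy policy extracted at the end is for convenience and is not part of the update itself.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration via the Bellman optimality backup.

    P[a, s, s2]: transition probabilities, R[s, a]: expected rewards.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] v(s')
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v_new = np.max(q, axis=1)              # Bellman optimality backup: max over actions
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    policy = np.argmax(q, axis=1)              # greedy action index for each state
    return v_new, policy
```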
To summarize the three methods:
- Prediction: iterative policy evaluation, based on the Bellman expectation equation.
- Control: policy iteration, based on the Bellman expectation equation plus greedy policy improvement.
- Control: value iteration, based on the Bellman optimality equation.
Dynamic programming extensions
Asynchronous dynamic programming
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
Full-Width Backups vs. Sample Backups
Full-Width Backups
- DP uses full-width backups(DP is model-based)
- Every successor state and action is considered
- Using knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems, DP suffers from Bellman's curse of dimensionality
  - Curse of dimensionality: the number of states n = |S| grows exponentially with the number of state variables
- Even one backup can be too expensive
Sample Backups
These are the basis of the temporal-difference methods to be discussed later:
- Using sample rewards and sample transitions *S,A,R,S’*
- Instead of reward function R and transition dynamics P
- Advantages:
- Model-free: no advance knowledge of MDP required
- Breaks the curse of dimensionality through sampling
- Cost of backup is constant, independent of n=|S|