
Dynamic Programming

2022-07-07 00:27:00 Evergreen AAS


Dynamic programming (DP) is a method for solving complex problems by decomposing the original problem into relatively simple subproblems.

Dynamic programming is typically applied to problems with the following two properties:

  • Optimal substructure
    • The principle of optimality applies
    • The optimal solution can be decomposed into subproblems
  • Overlapping subproblems
    • Subproblems recur many times
    • Solutions can be cached and reused

Dynamic programming usually takes far less time than a naive solution.

Markov decision processes (MDPs) satisfy both of these properties:

  • The Bellman equation provides the recursive decomposition;
  • The value function stores and reuses the results of the recursion.

Solving MDPs/MRPs with Dynamic Programming

Dynamic programming requires that the MDP is fully known, i.e. it is model-based.

  • For prediction:
    • Input: an MDP and a policy π, or an MRP
    • Output: the value function vπ
  • For control:
    • Input: an MDP
    • Output: the optimal value function v∗ or the optimal policy π∗

Policy Evaluation

Policy evaluation refers to computing the value function of a given policy; it answers the question "How to evaluate a policy".

The idea of policy evaluation is to iterate with the Bellman expectation equation (for its form in an MDP, see the earlier post "Markov Decision Process").

The iterative update of policy evaluation is:

v_{k+1}(s) = \sum_{a\in A}\pi(a|s) \Bigl( R_{s}^a + \gamma\sum_{s'\in S}P_{ss'}^a v_{k}(s') \Bigr)

In vector form:

\mathbf{v}^{k+1} = \mathbf{R}^{\pi} + \gamma \mathbf{P}^{\pi}\mathbf{v}^{k}
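Below is a minimal sketch of iterative policy evaluation in Python/NumPy, assuming the MDP is supplied as arrays `P[a, s, s']` (transition probabilities) and `R[a, s]` (expected immediate rewards) and the policy as `pi[s, a]`; these names and layouts are illustrative assumptions, not something defined in this post:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation equation until the value function converges.

    P  : (A, S, S) array, P[a, s, s'] = transition probability
    R  : (A, S)    array, R[a, s]     = expected immediate reward
    pi : (S, A)    array, pi[s, a]    = probability of taking action a in state s
    """
    v = np.zeros(P.shape[1])
    while True:
        # q[a, s] = R_s^a + gamma * sum_{s'} P_{ss'}^a v_k(s')
        q = R + gamma * P @ v
        # v_{k+1}(s) = sum_a pi(a|s) * q[a, s]
        v_new = np.einsum('sa,as->s', pi, q)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```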

Policy Iteration

Policy iteration (PI) answers the question "How to improve a policy".

Given a policy π:

  • Evaluate the policy π:

v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]

  • Improve the policy by acting greedily with respect to it: \pi' = \text{greedy}(v_{\pi})

It can be proven that policy iteration always converges to the optimal policy, i.e. π′ = π∗.

Policy iteration can be described formally as alternating between policy evaluation and greedy policy improvement.
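A sketch of policy iteration under the same assumed array layout, reusing the `policy_evaluation` sketch above; again, this is an illustrative implementation rather than the post's own code:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate full policy evaluation with greedy policy improvement."""
    n_actions, n_states, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)              # evaluate the current policy
        q = R + gamma * P @ v                               # one-step lookahead action values
        greedy = np.eye(n_actions)[np.argmax(q, axis=0)]    # deterministic greedy policy, (S, A)
        if np.allclose(greedy, pi):                         # policy stable => optimal
            return greedy, v
        pi = greedy
```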

Generalized Policy Iteration

From the discussion above, it is easy to see that policy evaluation is itself an iterative process:

v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]

This raises a question: does policy evaluation really need to converge to vπ? Could we introduce a stopping rule, or simply stop policy evaluation after k iterations? Going further, why not improve the policy within every sweep of policy evaluation (equivalent to stopping evaluation after a single iteration)? Note: this is equivalent to the value iteration introduced later.

We can therefore generalize the policy iteration process above into the Generalized Policy Iteration (GPI) framework: any policy evaluation procedure interleaved with any policy improvement procedure.
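As an illustration of GPI, the sketch below cuts policy evaluation off after `k` sweeps instead of running it to convergence; with `k = 1` it reduces to value iteration. Array conventions are the same assumptions as in the earlier sketches:

```python
import numpy as np

def generalized_policy_iteration(P, R, gamma=0.9, k=3, n_rounds=500):
    """Interleave k sweeps of truncated policy evaluation with greedy improvement."""
    n_actions, n_states, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    v = np.zeros(n_states)
    for _ in range(n_rounds):
        for _ in range(k):                                  # truncated policy evaluation
            v = np.einsum('sa,as->s', pi, R + gamma * P @ v)
        q = R + gamma * P @ v                               # greedy policy improvement
        pi = np.eye(n_actions)[np.argmax(q, axis=0)]
    return pi, v
```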

Value Iteration

Before introducing value iteration, let us first present the principle of optimality.

The Principle of Optimality

The principle of optimality is defined as follows:

An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Stated a little more mathematically:

A policy π(a|s) achieves the optimal value from state s, i.e. v_{\pi}(s) = v_*(s), if and only if:

  • for every state s′ reachable from s,
  • π achieves the optimal value starting from s′: v_{\pi}(s') = v_*(s').

According to the principle of optimality, once we have the solutions of the subproblems v∗(s′), the optimal value v∗(s) with state s as the starting point can be obtained by a one-step lookahead:

v_*(s) \leftarrow \max_{a\in A}\Bigl(R_s^a + \gamma \sum_{s'\in S}P_{ss'}^{a}v_*(s') \Bigr)

In other words, we can work backwards from the end to obtain the optimal solution; value iteration simply applies this update iteratively.
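As a tiny sketch, the one-step lookahead backup for a single state could look like this (using the same assumed array layout as the earlier sketches):

```python
def one_step_lookahead(s, v, P, R, gamma=0.9):
    """Back up v_*(s) from the successor values via the Bellman optimality backup."""
    return max(R[a, s] + gamma * P[a, s] @ v for a in range(P.shape[0]))
```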

Value Iteration for MDPs

Value iteration (VI) also solves the problem "find the optimal policy π". But unlike policy iteration, which uses the Bellman expectation equation, value iteration iterates with the Bellman optimality equation.

The differences between value iteration and policy iteration are:

  • It uses the Bellman optimality equation rather than the Bellman expectation equation
  • Unlike policy iteration, there is no explicit policy
  • Intermediate value functions may not correspond to any policy

The update is:

v_{k+1}(s) = \max_{a\in A}\Bigl(R_s^a + \gamma\sum_{s'\in S}P_{ss'}^a v_k(s') \Bigr)

In vector form:

\mathbf{v}_{k+1} = \max_{a\in A}\bigl(\mathbf{R}^a + \gamma \mathbf{P}^a\mathbf{v}_k\bigr)
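A minimal sketch of value iteration under the same assumed array layout; the greedy policy is extracted only once, from the converged value function:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality equation, then read off the greedy policy."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = max_a (R_s^a + gamma * sum_{s'} P_{ss'}^a v_k(s'))
        v_new = (R + gamma * P @ v).max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    q = R + gamma * P @ v_new
    pi = np.eye(n_actions)[np.argmax(q, axis=0)]            # greedy policy w.r.t. v_*
    return pi, v_new
```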

To summarize the three methods:

  • Prediction: Bellman expectation equation → iterative policy evaluation
  • Control: Bellman expectation equation + greedy policy improvement → policy iteration
  • Control: Bellman optimality equation → value iteration

Extensions of Dynamic Programming

Asynchronous Dynamic Programming

  • In-place dynamic programming
  • Prioritised sweeping
  • Real-time dynamic programming

Full-Width Backups vs. Sample Backups

Full-Width Backups

  • DP uses full-width backups (DP is model-based)
    • Every successor state and action is considered
    • Using knowledge of the MDP transitions and reward function
  • DP is effective for medium-sized problems (millions of states)
  • For large problems, DP suffers from Bellman's curse of dimensionality

Curse of dimensionality: the number of states n = |S| grows exponentially with the number of state variables.

  • Even one backup can be too expensive

Sample Backups

The temporal-difference (TD) methods discussed later use sample backups (a sketch follows this list):

  • Using sample rewards and sample transitions *S, A, R, S'*
  • Instead of reward function R and transition dynamics P
  • Advantages:
    • Model-free: no advance knowledge of MDP required
    • Breaks the curse of dimensionality through sampling
    • Cost of backup is constant, independent of n=|S|
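
For contrast with full-width backups, here is a sketch of a single sample backup in the style of TD(0), which is covered later; the transition sample (s, r, s_next) and the value table `v` are illustrative assumptions, not an API from this post:

```python
def td0_sample_backup(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """Update the state-value estimate v[s] from one sampled transition (s, r, s_next).

    Each backup costs O(1), independent of |S|, and needs no knowledge of P or R.
    """
    td_target = r + gamma * v[s_next]
    v[s] += alpha * (td_target - v[s])
    return v
```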

Copyright notice
This article was written by [Evergreen AAS]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202130959332756.html