
Markov decision process


Episodic tasks vs. continuing tasks

  • Episodic tasks can be broken down into a series of episodes, so they can be viewed as finite-step tasks.
  • Continuing tasks cannot be decomposed into episodes and can be viewed as infinite-step tasks.

Markov property

Markov property: given the present state and all past states of a stochastic process, the conditional probability distribution of its future states depends only on the present state.

A Markov process is a stochastic process that satisfies the Markov property: the conditional probability of the next state depends only on the current state of the system and is independent of its past history.
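
In symbols, the Markov property states that:

P[S_{t+1} | S_t, S_{t-1}, \dots, S_1] = P[S_{t+1} | S_t]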

Markov reward process

A Markov Reward Process (MRP) is a Markov process augmented with reward values. It can be represented by a four-tuple (S, P, R, \gamma):

  • S is a finite set of states;
  • P is the state transition matrix, P_{ss'} = P[S_{t+1} = s'|S_t = s];
  • R is the reward function;
  • \gamma is the discount factor (discount factor), where \gamma \in [0,1].

Return

At time t, the return G_t is defined as:

G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}
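
As a small illustration, the return of a finite reward sequence can be computed directly from this definition; the reward values and \gamma below are arbitrary examples:

```python
# Minimal sketch: computing the return G_t for a finite reward sequence
# R_{t+1}, R_{t+2}, ... (the reward values and gamma are arbitrary examples).
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * R_{t+k+1} over the given reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))
# 1.0 + 0.9^2 * 2.0 + 0.9^3 * 1.0 = 3.349
```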

Why Discount

Why the return is computed with a discount factor \gamma:

  • Mathematical convenience;
  • It avoids infinite returns in cyclic or infinite-horizon processes;
  • Long-term rewards carry more uncertainty;
  • In finance, immediate rewards are worth more than delayed rewards;
  • It matches the human tendency to value immediate rewards more.

Value function

The long-term value of state s is given by the value function:

v(s) = E[G_t|S_t = s]

Bellman Equation for MRPs

\begin{align} v(s) &= E[G_t|S_t=s]\\ &= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]\\ &= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) | S_t = s]\\ &= E[R_{t+1} + \gamma G_{t+1} | S_t = s]\\ &= E[R_{t+1} + \gamma v(S_{t+1}) | S_t = s] \end{align}

The figure below shows the backup tree for an MRP:

Note: in a backup tree, white circles represent states and black dots represent actions.

From the backup tree above we can further derive:

v(s) = R_s + \gamma \sum_{s' \in S}P_{ss'}v(s')
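
Because this equation is linear in v, the value function of a small MRP can be computed exactly by solving v = R + \gamma P v, i.e. v = (I - \gamma P)^{-1} R. Below is a minimal numpy sketch; the three-state transition matrix and rewards are made-up example values:

```python
import numpy as np

# Minimal sketch: the MRP Bellman equation is linear, v = R + gamma * P v,
# so v can be computed as (I - gamma * P)^{-1} R.
# The 3-state transition matrix and rewards below are made-up example values.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.0, 1.0]])      # each row sums to 1; state 2 is absorbing
R = np.array([1.0, -0.5, 0.0])       # expected immediate reward per state
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)                             # v(s) for s = 0, 1, 2
```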

Markov decision process

A Markov Decision Process (MDP) is an MRP with decisions. It can be represented by a five-tuple (S, A, P, R, \gamma):

  • S is a finite set of states;
  • A is a finite set of actions;
  • P is the state transition matrix, P_{ss'}^{a} = P[S_{t+1} = s'|S_t = s, A_t = a];
  • R is the reward function;
  • \gamma is the discount factor (discount factor), where \gamma \in [0,1].

When we talk about MDPs we usually mean finite (discrete) Markov decision processes.

Policy

A policy (Policy) is a probability distribution over actions given a state:

\pi(a|s) = P[A_t = a|S_t = s]
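
In code, a stochastic policy over a finite MDP can be stored as a |S| × |A| matrix whose rows are probability distributions over actions; the sketch below builds a uniform random policy for made-up sizes:

```python
import numpy as np

# Minimal sketch: a policy pi(a|s) stored as a |S| x |A| matrix of probabilities,
# one row per state, each row summing to 1 (the sizes are arbitrary examples).
n_states, n_actions = 3, 2
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy
assert np.allclose(pi.sum(axis=1), 1.0)
```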

State-value function & optimal state-value function

The state-value function (State-Value Function) v_{\pi}(s) of state s under a given policy \pi:

v_{\pi}(s) = E_{\pi}[G_t|S_t = s]

The optimal state-value function (The Optimal State-Value Function) v_{*}(s) of state s:

v_{*}(s) = \max_{\pi}v_{\pi}(s)

Action-value function & optimal action-value function

The action-value function (Action-Value Function) q_{\pi}(s, a) for taking action a in state s under policy \pi:

q_{\pi}(s, a) = E_{\pi}[G_t|S_t = s, A_t = a]

The optimal action-value function (The Optimal Action-Value Function) q_{*}(s, a) for state s and action a:

q_{*}(s, a) = \max_{\pi}q_{\pi}(s, a)

The optimal policy

Policy \pi is better than (or as good as) policy \pi' if:

\pi \ge \pi^{'} \text{ if } v_{\pi}(s) \ge v_{\pi^{'}}(s), \forall{s}

The optimal policy \pi_{*} satisfies:

  • \pi_{*} \ge \pi, \forall \pi
  • v_{\pi_{*}}(s) = v_{*}(s)
  • q_{\pi_{*}}(s, a) = q_{*}(s, a)

How to find the optimal policy?

The optimal policy can be obtained by maximizing over q_{*}(s, a):

\pi_{*}(a|s) = \begin{cases} 1 & \text{if } a=\arg\max_{a \in A}q_{*}(s,a)\\ 0 & \text{otherwise} \end{cases}

For an MDP there always exists a deterministic optimal policy, and once we know q_{*}(s, a) we can read off the optimal policy immediately.
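
As a sketch of this greedy extraction, the deterministic optimal policy simply puts probability 1 on \arg\max_{a} q_{*}(s, a) in every state; the q-values below are arbitrary example numbers:

```python
import numpy as np

# Minimal sketch: extracting a deterministic optimal policy from q_*(s, a).
# q_star is a |S| x |A| array; the numbers are arbitrary examples.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [0.0, 0.4]])

best_actions = q_star.argmax(axis=1)                 # arg max_a q_*(s, a) per state
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), best_actions] = 1.0  # pi_*(a|s) = 1 for the greedy action
print(pi_star)
```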

Bellman Expectation Equation for MDPs

Let's first look at the state-value function v_{\pi}(s).

The backup tree for state s is shown in the figure below:

From this backup tree we get:

v_{\pi}(s) = \sum_{a \in A}\pi(a|s)q_{\pi}(s, a) \qquad (1)

Now look at the action-value function q_{\pi}(s, a).

The backup tree for state s and action a is shown in the figure below:

From this backup tree we get:

q_{\pi}(s,a)=R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{\pi}(s') \qquad (2)

Subdividing the backup trees one level further gives the full expressions for v_{\pi}(s) and q_{\pi}(s, a).

The subdivided backup tree for state s is shown in the figure below:

Substituting equation (2) into equation (1) gives the Bellman expectation equation for v_{\pi}(s):

v_{\pi}(s) = \sum_{a \in A} \pi(a | s) \Bigl( R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{\pi}(s') \Bigr) \qquad (3)

The subdivided backup tree for state s and action a is shown in the figure below:

Substituting equation (1) into equation (2) gives the Bellman expectation equation for q_{\pi}(s, a):

q_{\pi}(s,a)=R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a \Bigl(\sum_{a' \in A}\pi(a'|s')q_{\pi}(s', a') \Bigr) \qquad (4)
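
Equations (1)-(4) suggest a simple fixed-point iteration for evaluating a fixed policy. The sketch below applies the Bellman expectation backup to a made-up two-state, two-action MDP (the transition tensor P[a, s, s'], rewards R[s, a], and policy are arbitrary example values):

```python
import numpy as np

# Minimal sketch: iterative policy evaluation with the Bellman expectation
# equations (1)-(2). The MDP below is a made-up example:
#   P[a, s, s'] = P(s' | s, a),  R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2],     # action 0
               [0.3, 0.7]],
              [[0.1, 0.9],     # action 1
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
pi = np.array([[0.5, 0.5],     # pi(a|s): uniform random policy
               [0.5, 0.5]])

v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # equation (2): q_pi(s, a)
    v_new = (pi * q).sum(axis=1)                   # equation (1): v_pi(s)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)
```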

Bellman Optimality Equation for MDPs

Again, let's first look at v_{*}(s).

The corresponding equation is:

v_{*}(s) = \max_{a}q_{*}(s, a) \qquad (5)

Now look at q_{*}(s, a).

The corresponding equation is:

q_{*}(s, a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{*}(s') \qquad (6)

Following the same routine, we obtain the backup tree for v_{*}(s) and its Bellman optimality equation.

Substituting equation (6) into equation (5), the Bellman optimality equation is:

v_{*}(s) = \max_{a} \Bigl( R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{*}(s') \Bigr) \qquad (7)

Similarly, we obtain the backup tree for q_{*}(s, a) and its Bellman optimality equation.

Substituting equation (5) into equation (6), the corresponding Bellman optimality equation is:

q_{*}(s, a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a \max_{a'}q_{*}(s', a') \qquad (8)

Properties of the Bellman optimality equation

  • It is non-linear (non-linear);
  • In general it has no analytical solution (no closed form solution).

Solving the Bellman optimality equation

  • Value Iteration (see the sketch below)
  • Policy Iteration
  • Sarsa
  • Q-Learning
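
As a minimal sketch of the first of these methods, value iteration repeatedly applies the Bellman optimality backup of equation (7) until the values converge; the small two-state MDP below (transition tensor P[a, s, s'] and rewards R[s, a]) is a made-up example:

```python
import numpy as np

# Minimal sketch: value iteration, i.e. repeatedly applying the Bellman
# optimality backup of equation (7) until v converges.
# The two-state MDP below is a made-up example:
#   P[a, s, s'] = P(s' | s, a),  R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2],     # action 0
               [0.3, 0.7]],
              [[0.1, 0.9],     # action 1
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # equation (6): q(s, a) backup
    v_new = q.max(axis=1)                          # equation (7): v(s) = max_a q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

pi_star = q.argmax(axis=1)     # greedy (optimal) action per state
print(v, pi_star)
```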

Extensions of MDPs

  • Infinite / continuous MDPs
  • Partially observable MDPs
  • Undiscounted MDPs / average-reward MDPs
