Markov decision process
2022-07-07 00:26:00 【Evergreen AAS】
Episodic Tasks vs. Continuing Tasks
- Episodic tasks (Episodic Tasks): the whole task decomposes into a sequence of episodes and can be viewed as a finite-step task.
- Continuing tasks (Continuing Tasks): the task cannot be decomposed into episodes and can be viewed as an infinite-step task.
The Markov Property
The Markov property: given the present state and all past states of a stochastic process, the conditional probability distribution of its future states depends only on the current state.
A Markov process is a stochastic process that satisfies the Markov property: the conditional distribution of its next state depends only on the current state of the system, not on its past history.
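To make the property concrete, here is a minimal Python sketch of sampling a Markov chain, where the next state is drawn from a distribution that depends only on the current state; the two-state transition matrix is invented for illustration:

```python
import numpy as np

# Hypothetical two-state transition matrix, P[i][j] = P(next = j | current = i).
P = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # The Markov property: only `state` enters the sampling step;
    # the earlier history stored in `trajectory` is irrelevant.
    state = rng.choice(2, p=P[state])
    trajectory.append(state)
print(trajectory)
```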
Markov reward process
A Markov reward process (Markov Reward Process, MRP) is a Markov process augmented with reward values. It can be represented by a four-tuple \langle S, P, R, \gamma \rangle:
- S is a finite set of states;
- P is the state transition matrix, P_{ss'} = P[S_{t+1} = s'|S_t = s];
- R is the reward function;
- \gamma is the discount factor (discount factor), where \gamma \in [0,1].
Return
The return G_t at time t:
G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}
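As a quick worked example, the sketch below computes the return for a short, made-up reward sequence with \gamma = 0.9 (a finite sequence, so the infinite sum truncates):

```python
# Made-up reward sequence R_{t+1}, R_{t+2}, ... and discount factor.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

# G_t = sum_k gamma^k * R_{t+k+1}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```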
Why Discount
Why the return G_t includes the discount factor \gamma:
- Mathematical convenience
- Avoids infinite returns in cyclic Markov processes
- Long-term rewards carry uncertainty
- In finance, an immediate reward earns more interest than a delayed one
- Matches the human tendency to prefer immediate rewards
Value function
The long-term value of state s is expressed by the value function:
v(s) = E[G_t|S_t = s]
Bellman Equation for MRPs
\begin{align} v(s) &= E[G_t|S_t=s]\\ &= E[R_{t+1} + \gamma R_{t+2} + ... | S_t = s]\\ &= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) | S_t = s]\\ &= E[R_{t+1} + \gamma G_{t+1} | S_t = s]\\ &= E[R_{t+1} + \gamma v(S_{t+1}) | S_t = s] \end{align}
The figure below shows the backup tree for an MRP.
Note: in a backup tree, white circles represent states and black dots represent actions.
From the backup tree we can further obtain:
v(s) = R_s + \gamma \sum_{s' \in S}P_{ss'}v(s')
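Because this Bellman equation is linear in v, a finite MRP can be solved directly: v = (I - \gamma P)^{-1} R. A minimal sketch, using an invented three-state MRP:

```python
import numpy as np

# Hypothetical three-state MRP: transition matrix P, reward vector R_s.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing terminal state
])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Solve (I - gamma * P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```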
Markov decision process
A Markov decision process (Markov Decision Process, MDP) is an MRP with decisions. It can be represented by a five-tuple \langle S, A, P, R, \gamma \rangle:
- S is a finite set of states;
- A is a finite set of actions;
- P is the state transition matrix, P_{ss'}^{a} = P[S_{t+1} = s'|S_t = s, A_t = a];
- R is the reward function;
- \gamma is the discount factor (discount factor), where \gamma \in [0,1].
Unless stated otherwise, MDP here refers to a finite (discrete) Markov decision process.
Policy
A policy (Policy) is a probability distribution over actions given a state:
\pi(a|s) = P[A_t = a|S_t = s]
State-Value Function & Optimal State-Value Function
The state-value function (State-Value Function) v_{\pi}(s) of state s under a given policy \pi:
v_{\pi}(s) = E_{\pi}[G_t|S_t = s]
The optimal state-value function (The Optimal State-Value Function) v_{*}(s) of state s:
v_{*}(s) = \max_{\pi}v_{\pi}(s)
Action-Value Function & Optimal Action-Value Function
Given policy \pi, the action-value function (Action-Value Function) q_{\pi}(s,a) for taking action a in state s:
q_{\pi}(s, a) = E_{\pi}[G_t|S_t = s, A_t = a]
The optimal action-value function (The Optimal Action-Value Function) q_{*}(s,a) for taking action a in state s:
q_{*}(s, a) = \max_{\pi}q_{\pi}(s, a)
The Optimal Policy
A policy \pi is better than (or as good as) a policy \pi' if:
\pi \ge \pi^{'} \text{ if } v_{\pi}(s) \ge v_{\pi^{'}}(s), \forall{s}
An optimal policy \pi_{*} satisfies:
- \pi_{*} \ge \pi, \forall \pi
- v_{\pi_{*}}(s) = v_{*}(s)
- q_{\pi_{*}}(s,a) = q_{*}(s,a)
How to find the optimal policy?
An optimal policy can be found by maximizing q_{*}(s,a):
\pi_{*}(a|s) = \begin{cases} 1 & \text{if } a=\arg\max_{a \in A}q_{*}(s,a)\\ 0 & \text{otherwise} \end{cases}
For an MDP there always exists a deterministic optimal policy, and once we know q_{*}(s,a) we can read the optimal policy off immediately, as the sketch below shows.
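A minimal sketch of this read-off, using a made-up q_{*} table with three states and two actions:

```python
import numpy as np

# Hypothetical optimal action-value table, q_star[s, a] = q_*(s, a).
q_star = np.array([
    [1.0, 2.0],
    [0.5, 0.3],
    [0.0, 0.0],
])

# The deterministic optimal policy puts probability 1 on an
# action maximizing q_*(s, a) in each state.
pi_star = q_star.argmax(axis=1)
print(pi_star)  # e.g. [1 0 0]
```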
Bellman Expectation Equation for MDPs
First consider the state-value function v_{\pi}(s).
The backup tree for state s is shown below.
From the figure:
v_{\pi}(s) = \sum_{a \in A}\pi(a|s)q_{\pi}(s, a) \qquad (1)
Next consider the action-value function q_{\pi}(s,a).
The backup tree for state s and action a is shown below.
From the figure we get:
q_{\pi}(s,a)=R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{\pi}(s') \qquad (2)
Expanding the backup tree one level further gives the mutual expressions of v_{\pi}(s) and q_{\pi}(s,a).
The expanded backup tree for state s is shown below.
Substituting equation (2) into equation (1) gives the Bellman expectation equation for v_{\pi}(s):
v_{\pi}(s) = \sum_{a \in A} \pi(a | s) \Bigl( R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{\pi}(s') \Bigr) \qquad (3)
The expanded backup tree for state s and action a is shown below.
Substituting equation (1) into equation (2) gives the Bellman expectation equation for q_{\pi}(s,a):
q_{\pi}(s,a)=R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a \Bigl(\sum_{a' \in A}\pi(a'|s')q_{\pi}(s', a') \Bigr) \qquad (4)
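Equations (2) and (3) suggest a simple fixed-point algorithm, iterative policy evaluation: start from an arbitrary v and repeatedly apply the Bellman expectation equation until it converges to v_{\pi}. A minimal sketch, with an invented two-state, two-action MDP and a uniform random policy:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = P_{ss'}^a, R[s, a] = R_s^a.
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],
              [2.0, 0.5]])
pi = np.full((n_s, n_a), 0.5)   # uniform random policy pi(a|s)
gamma, theta = 0.9, 1e-8

v = np.zeros(n_s)
while True:
    q = R + gamma * P @ v            # equation (2): q_pi(s, a)
    v_new = (pi * q).sum(axis=1)     # equations (1)/(3): v_pi(s)
    if np.abs(v_new - v).max() < theta:
        break
    v = v_new
print(v)
```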
Bellman Optimality Equation for MDPs
First consider v_{*}(s):
The corresponding equation is:
v_{*}(s) = \max_{a}q_{*}(s, a) \qquad (5)
Next consider q_{*}(s,a):
The corresponding equation is:
q_{*}(s, a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{*}(s') \qquad (6)
Following the same procedure with the backup tree for v_{*}(s), we obtain its Bellman optimality equation:
v_{*}(s) = \max_{a} \Bigl( R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a v_{*}(s') \Bigr) \qquad (7)
Similarly, the backup tree for q_{*}(s,a) gives its Bellman optimality equation:
q_{*}(s, a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a\max_{a'}q_{*}(s', a') \qquad (8)
Characteristics of the Bellman optimality equation
- Nonlinear (non-linear)
- In general no analytical solution (no closed-form solution)
Solving the Bellman optimality equation
- Value Iteration (a minimal sketch follows this list)
- Policy Iteration
- Sarsa
- Q-Learning
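As one concrete instance, here is a minimal sketch of Value Iteration, which repeatedly applies the Bellman optimality equation (7) and then reads off a greedy policy; it reuses the invented two-state MDP from the policy-evaluation sketch above:

```python
import numpy as np

# Same hypothetical MDP as in the policy-evaluation sketch.
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],
              [2.0, 0.5]])
gamma, theta = 0.9, 1e-8

v = np.zeros(n_s)
while True:
    q = R + gamma * P @ v        # equation (6): q_*(s, a) given current v
    v_new = q.max(axis=1)        # equation (7): v_*(s) = max_a q_*(s, a)
    if np.abs(v_new - v).max() < theta:
        break
    v = v_new

pi_star = q.argmax(axis=1)       # greedy (optimal) policy from q_*
print(v, pi_star)
```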
Extensions of MDPs
- Infinite / continuous MDPs
- Partially observable MDPs (POMDPs)
- Undiscounted MDPs / average-reward MDPs