1. Finite Markov Decision Process
Catalog
Finite Markov Decision Process
RL: An Introduction link:
Sutton & Barto Book: Reinforcement Learning: An Introduction
Finite Markov Decision Process
Definition
The agent observes the situation of the environment, for example through data streams or sensors (cameras, lidars, or others); in formal terms this observation is called the state. We must emphasize that we presume the agent always knows enough information about the situation to make its decisions. So we have the state $S_t \in \mathcal{S}$ at time step $t$.
After the agent knows the current state, it has a finite set of actions to choose from ($A_t \in \mathcal{A}(s)$). After taking an action, the agent obtains the reward for this step ($R_{t+1} \in \mathcal{R}$) and moves into the next state ($S_{t+1}$); in this process, the agent can know the environment's dynamics ($p(s', r \mid s, a)$). The agent then continues dealing with state after state until the scenario ends.
The environment's dynamics are not decided by people, but which action to take depends on the agent's judgement. In every state $s$, we want to choose actions that give us more reward in total, not just in the short run but also in the long run. Therefore, the policy of choosing actions in each state is the core of reinforcement learning. We use $\pi(a \mid s)$ to describe the probability of each action being taken in the current state.
Therefore, a finite Markov decision process is a process in which the agent knows the current state, the actions it can choose from, and even the probability of each $s'$ and $r$ for every action ($p(s', r \mid s, a)$), and it obtains the expected return under different policies.
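To make the definition concrete, below is a minimal Python sketch of how such a finite MDP might be represented. The two-state problem, its actions, rewards, and transition probabilities are invented purely for illustration; they are not taken from any particular example.

```python
# A minimal sketch of a finite MDP, using a hypothetical two-state toy problem.
# The dynamics p(s', r | s, a) are stored as a dictionary mapping (s, a) to a list
# of (next_state, reward, probability) triples; the policy pi(a | s) maps each
# state to a probability distribution over actions.

states = ["s0", "s1"]
actions = {"s0": ["left", "right"], "s1": ["left", "right"]}

# Environment dynamics: p[(s, a)] = [(s_next, reward, prob), ...]
p = {
    ("s0", "left"):  [("s0", 0.0, 1.0)],
    ("s0", "right"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

# A (non-optimal) stochastic policy: pi[s][a] = probability of taking a in s.
pi = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}
```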

Formula (Bellman equation)
Mathematically, we can calculate the value function $v_\pi(s)$:
$$v_{\pi}(s) = E[\, G_t \mid S_t = s \,] = E[\, R_{t+1} + \gamma G_{t+1} \mid S_t = s \,] = \sum_{a}\pi(a \mid s)\sum_{s'}\sum_{r} p(s', r \mid s, a)\,\big[\, r + \gamma v_\pi(s') \,\big]$$
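One way to use this equation is as an update rule: start from an arbitrary guess and repeatedly replace $v(s)$ by the right-hand side until it stops changing (iterative policy evaluation). Below is a small sketch, reusing the toy `states`, `p`, and `pi` from above and assuming a discount factor $\gamma = 0.9$:

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation as an
# update rule until the value function stops changing.
# Reuses `states`, `p`, and `pi` from the toy MDP sketch above.

gamma = 0.9                      # discount factor (assumed for illustration)
v = {s: 0.0 for s in states}     # initial guess v(s) = 0

for _ in range(1000):
    delta = 0.0
    for s in states:
        new_v = 0.0
        for a, prob_a in pi[s].items():
            for s_next, r, prob in p[(s, a)]:
                new_v += prob_a * prob * (r + gamma * v[s_next])
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v)  # approximate v_pi(s) for each state
```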
Consideration
For every scenario, we know the dynamics of the environment $p(s', r \mid s, a)$, the state set $\mathcal{S}$, and the corresponding action sets $\mathcal{A}(s)$. For every policy we set, we know $\pi(a \mid s)$. So we can obtain $N$ equations (one per state) for the $N$ unknown values $v_\pi(s)$.
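Since the Bellman equation is linear in the unknowns $v_\pi(s)$, those $N$ equations can also be solved directly. Below is a sketch with NumPy, again assuming the toy MDP and policy defined above and $\gamma = 0.9$:

```python
import numpy as np

# Direct solution of the N Bellman equations (N = number of states).
# Under a fixed policy pi the Bellman equation is linear:
#     v = r_pi + gamma * P_pi @ v   =>   (I - gamma * P_pi) v = r_pi
idx = {s: i for i, s in enumerate(states)}
n = len(states)
P_pi = np.zeros((n, n))   # P_pi[i, j] = Pr(S_{t+1} = j | S_t = i) under pi
r_pi = np.zeros(n)        # r_pi[i]   = expected immediate reward in state i

for s in states:
    for a, prob_a in pi[s].items():
        for s_next, r, prob in p[(s, a)]:
            P_pi[idx[s], idx[s_next]] += prob_a * prob
            r_pi[idx[s]] += prob_a * prob * r

gamma = 0.9
v_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(dict(zip(states, v_exact)))  # exact v_pi(s) for each state
```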
Limitation
- In many cases, we cannot know the dynamics of the environment.
- In many cases, such as backgammon, there are too many states, so we have no capacity to solve the problem in this way (by solving the equations).
- This approach requires the problem to have the Markov property, which means $S_{t+1}$ and $R_{t+1}$ depend only on the current state $s$ and action $a$. In other words, $s$ and $a$ determine all the possible $s'$ and $r$.
Optimal policies
Definition
For a policy $\pi$ and a policy $\pi'$, if for each state $s$ the inequality

$$v_\pi(s) \ge v_{\pi'}(s)$$

is fulfilled, then we can say $\pi$ is better than (or equal to) $\pi'$.
Therefore, there must be at least one policy that is better than or equal to all the others; that is the optimal policy $\pi_*$.

Meanwhile, every state's value under the policy $\pi_*$ also satisfies the Bellman equation:
$$v_{\pi_*}(s) = \sum_{a}\pi_*(a \mid s)\sum_{s'}\sum_{r} p(s', r \mid s, a)\,\big[\, r + \gamma v_{\pi_*}(s') \,\big]$$
Optimal Bellman equation
For $\pi_*$, the best policy in the total policy set:
$$v_{\pi_*}(s) = \max_{\pi}\sum_{a}\pi(a \mid s)\, q_{\pi_*}(s, a) = \max_{a}\sum_{s'}\sum_{r} p(s', r \mid s, a)\,\big[\, r + \gamma v_{\pi_*}(s') \,\big]$$
For a specific case, the environment's dynamics are constant; we can only change how $\pi(a \mid s)$ apportions probability among the actions.

To maximize $v(s)$, we should apportion probability 1 to the action with the maximum $q(s, a)$.

Therefore, the optimal policy is actually a greedy policy without any exploration.
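As a sketch of this idea, value iteration repeatedly applies the optimal Bellman equation as an update and then reads off the greedy policy that puts probability 1 on the best action in each state. The example below reuses the toy `states`, `actions`, and `p` from above, again with the assumed $\gamma = 0.9$:

```python
# Value iteration on the toy MDP, followed by extracting the greedy policy.
gamma = 0.9
v_star = {s: 0.0 for s in states}

for _ in range(1000):
    delta = 0.0
    for s in states:
        # q(s, a) for every action, using the current estimate of v_star
        q = {a: sum(prob * (r + gamma * v_star[s_next])
                    for s_next, r, prob in p[(s, a)])
             for a in actions[s]}
        best = max(q.values())
        delta = max(delta, abs(best - v_star[s]))
        v_star[s] = best
    if delta < 1e-10:
        break

# The optimal policy puts probability 1 on the action with the largest q(s, a).
pi_star = {s: max(actions[s],
                  key=lambda a: sum(prob * (r + gamma * v_star[s_next])
                                    for s_next, r, prob in p[(s, a)]))
           for s in states}
print(v_star, pi_star)
```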