1. Finite Markov Decision Process
2022-07-03 10:09:00 【Most appropriate commitment】
RL: An Introduction link:
Sutton & Barto Book: Reinforcement Learning: An Introduction
Finite Markov Decision Process
Definition
The agent monitors situations from the environment, for example through data streams or sensors (cameras, lidars, or others); in formal terms, such a situation is called a state. We must highlight that we presume the agent always knows enough information about the situation to make its decisions. So at time step $t$ we have the state $S_t \in \mathcal{S}$.
After the agent observes the current state, it has a finite set of actions to choose from, $A_t \in \mathcal{A}(s)$. After taking an action, the agent obtains a reward for this step, $R_{t+1} \in \mathcal{R}$, and moves into the next state, $S_{t+1}$. This process is described by the environment's dynamics, $p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$. The agent then continues to deal with each new state until the episode ends.
The environment's dynamics are not decided by people, but the policy of which action to take depends on the agent's judgement. In every state $s$, we want to choose the actions that give us more reward in total, not just in the short run but also in the long run. Therefore, the policy for choosing actions in each state is the core of reinforcement learning. We use $\pi(a \mid s)$ to describe the probability of each action being taken in the current state.
Therefore, a Finite Markov Decision Process is a process in which the agent knows the current state $S_t$, the actions available to choose from, and even the probabilities of $s'$ and $r$ for each action, $p(s', r \mid s, a)$, and obtains the expected return under different policies.
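The agent-environment interaction described above can be sketched as a simple loop. Everything below is illustrative, not from the text: the two states "A"/"B", the `dynamics` table, and the uniform random policy are invented assumptions.

```python
import random

# A toy finite MDP. dynamics[(state, action)] lists the possible
# (next_state, reward, probability) outcomes, i.e. p(s', r | s, a).
dynamics = {
    ("A", "left"):  [("A", 0.0, 1.0)],
    ("A", "right"): [("B", 1.0, 0.8), ("A", 0.0, 0.2)],
    ("B", "left"):  [("A", 0.0, 1.0)],
    ("B", "right"): [("B", 2.0, 1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from the environment's dynamics."""
    outcomes = dynamics[(state, action)]
    u, cum = random.random(), 0.0
    for next_state, reward, prob in outcomes:
        cum += prob
        if u <= cum:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][1]

# The interaction loop: observe S_t, pick A_t, receive R_{t+1}, S_{t+1}.
state = "A"
total_reward = 0.0
for t in range(10):
    action = random.choice(["left", "right"])  # a uniform random policy
    state, reward = step(state, action)
    total_reward += reward
```

The agent only needs the current state to act; the dictionary of dynamics belongs to the environment and, as discussed below, is often unknown in practice.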
Formula (Bellman equation)
Mathematically, we can calculate the state-value function $v_\pi(s)$ with the Bellman equation:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]$$
Consideration
For every scenario, we know the dynamics of the environment $p(s', r \mid s, a)$, the state set $\mathcal{S}$, and the corresponding action sets $\mathcal{A}(s)$. For every policy we set, we know $\pi(a \mid s)$. So we obtain $N = |\mathcal{S}|$ linear equations, one per state, which we can solve for $v_\pi(s)$.
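Since the Bellman equation is linear in $v_\pi$, the $N$ equations can be solved directly. A minimal sketch with NumPy, where the two-state transition matrix `P` (the dynamics already averaged over the policy) and expected reward vector `r` are made-up numbers for illustration:

```python
import numpy as np

gamma = 0.9                      # discount factor
P = np.array([[0.5, 0.5],        # P[s, s'] = Pr{S_{t+1}=s' | S_t=s} under the policy
              [0.0, 1.0]])
r = np.array([1.0, 2.0])         # expected one-step reward in each state

# Bellman equation in matrix form: v = r + gamma * P v
# => (I - gamma * P) v = r, a system of N linear equations.
v = np.linalg.solve(np.eye(2) - gamma * P, r)

# Sanity check: v satisfies the Bellman equation.
assert np.allclose(v, r + gamma * P @ v)
```

Solving the system costs $O(N^3)$, which already hints at the limitations below when the state set is large.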
Limitations
- Many times, we cannot know the dynamics of the environment.
- Many times, as in backgammon, there are far too many states, so we have no capacity to compute the solution this way (by solving the system of equations).
- This approach requires the problem to have the Markov property, which means $S_{t+1}$ and $R_{t+1}$ depend only on the current state $S_t$ and action $A_t$; in other words, $s$ and $a$ determine the distribution over all possible $s'$ and $r$.
Optimal policies
Definition
For a policy $\pi$ and a policy $\pi'$, if for each state $s$ the inequality $v_\pi(s) \ge v_{\pi'}(s)$ is fulfilled, then we can say $\pi$ is better than (or as good as) $\pi'$.
Therefore, there must be at least one policy that is better than or equal to all the others; this is the optimal policy $\pi_*$ (there may be more than one).
At the same time, the value of every state under $\pi_*$ will also satisfy the Bellman equation.
Optimal Bellman equation
For $\pi_*$ over the whole policy set:
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_*(s')\big]$$
For a specific case, the environment's dynamics are fixed; all we can change is how $\pi(a \mid s)$ apportions probability among the actions. To maximize $v(s)$, we should apportion probability 1 to the action with the maximal $q(s, a)$.
Therefore, the optimal policy is actually a greedy policy without any exploration.
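As a sketch, the optimal Bellman equation can be turned into an update rule (value iteration), after which reading off the argmax action gives the greedy optimal policy. The tiny two-state, two-action `dynamics` table here is an invented example, not from the text:

```python
import numpy as np

gamma = 0.9
# dynamics[s][a] = list of (prob, next_state, reward), i.e. p(s', r | s, a)
dynamics = [
    [[(1.0, 0, 0.0)],            # state 0, action 0: stay, no reward
     [(1.0, 1, 1.0)]],           # state 0, action 1: go to state 1, reward 1
    [[(1.0, 0, 0.0)],            # state 1, action 0: back to state 0
     [(1.0, 1, 2.0)]],           # state 1, action 1: stay, reward 2
]

def q_values(v, s):
    """q(s, a) = sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    return [sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
            for outcomes in dynamics[s]]

# Use the optimal Bellman equation as an update: v(s) <- max_a q(s, a).
v = np.zeros(len(dynamics))
for _ in range(500):
    v = np.array([max(q_values(v, s)) for s in range(len(dynamics))])

# The optimal policy apportions probability 1 to the argmax action: greedy.
policy = [int(np.argmax(q_values(v, s))) for s in range(len(dynamics))]
```

Note the greedy read-off is only optimal because $v$ has already converged to $v_*$; during learning with unknown dynamics, some exploration is still needed.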