Reinforcement Learning: Chapter 2 Exercises

2022-07-23 06:41:00 【Infinite power】

Markov property: the next state depends only on the current state, not on the states that came before it.
Markov chain: a stochastic process that has the Markov property and is defined over a discrete index set and a discrete state space.
State transition matrix: each row gives the probabilities of moving from one state to every state, so each row sums to 1.
Markov reward process (MRP): a Markov chain with a reward function added.
Horizon: the length of one episode (one whole trajectory), given as a finite number of steps.
Return: the sum of the rewards collected along a trajectory, with later rewards discounted.
Bellman equation: relates the value of the current state to the values of the states that can follow it.
Monte Carlo algorithm: estimates the value function by sampling trajectories and averaging their returns.
Iterative algorithm (dynamic programming): repeatedly applies the Bellman equation until the values converge; updating stops when the newly updated values differ little from the previous ones.
Q function: the action-value function.
The Bellman equation in matrix form is hard to solve directly when the state space is large.
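
For reference, the return, the Bellman equation for an MRP, and its matrix form can be written as below. The notation is the usual textbook one and is an assumption here, since the notes do not define symbols: G_t is the return, V the value function, R the reward function, P the state transition matrix, and \gamma the discount factor.

```latex
\begin{align}
G_t  &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
V(s) &= \mathbb{E}\left[ G_t \mid s_t = s \right]
      = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s') \\
V    &= R + \gamma P V
      \quad\Longrightarrow\quad
      V = (I - \gamma P)^{-1} R
\end{align}
```

The last line needs the inverse of an N x N matrix, where N is the number of states. That costs roughly O(N^3) operations, so the closed form is practical only for small MRPs.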
Ways to solve the Bellman equation:
Monte Carlo: given an MRP, pick a starting state and let the process run, like putting a paper boat in a stream and letting it drift. This generates a trajectory. Each trajectory collects rewards, and summing the discounted rewards gives its return. Accumulate the returns over many trajectories, then divide by the number of trajectories to get the value of the starting state.
Dynamic programming: iterate the Bellman equation until the values converge.
A combination of the two (temporal-difference learning)
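
A minimal sketch of the Monte Carlo and dynamic programming approaches, using only the Python standard library. The 3-state MRP (`P`, `R`, `GAMMA`) and the function names are made up for illustration and are not from the original notes.

```python
import random

# Hypothetical 3-state MRP, for illustration only.
# P[s][t] is the probability of moving from state s to state t; each row sums to 1.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
]
R = [1.0, 2.0, 0.0]   # reward received in each state
GAMMA = 0.9           # discount factor


def monte_carlo_value(start, num_episodes=5000, horizon=50, seed=0):
    """Estimate V(start) as the average discounted return over sampled trajectories."""
    rng = random.Random(seed)
    states = range(len(P))
    total = 0.0
    for _ in range(num_episodes):
        state, discount, g = start, 1.0, 0.0
        for _ in range(horizon):
            g += discount * R[state]      # add the discounted reward
            discount *= GAMMA
            state = rng.choices(states, weights=P[state])[0]  # drift to the next state
        total += g
    return total / num_episodes           # divide by the number of trajectories


def iterative_value(tol=1e-8):
    """Iterate V(s) = R(s) + GAMMA * sum_t P[s][t] * V(t) until V barely changes."""
    n = len(P)
    v = [0.0] * n
    while True:
        new_v = [R[s] + GAMMA * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
```

Calling `iterative_value()` and `monte_carlo_value(0)` should give close values for state 0. The Monte Carlo estimate is noisy and is cut off at `horizon` steps, so it only approximates the value.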
Difference between a Markov reward process (MRP) and a Markov decision process (MDP):
MDP: adds a decision, that is, an action. The state transition is also conditioned on that action.
The two are related: MDP + policy = MRP. Once a policy is fixed, averaging over its action choices turns the MDP into an MRP.
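
The relation MDP + policy = MRP can be sketched as follows. The 2-state, 2-action MDP, the policy, and the helper name `mdp_to_mrp` are assumptions made up for illustration.

```python
# Hypothetical 2-state, 2-action MDP, for illustration only.
# P[s][a][t]: probability of landing in state t after taking action a in state s.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.0, 1.0]],
]
# R[s][a]: reward for taking action a in state s.
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]
# policy[s][a]: probability that the policy picks action a in state s.
policy = [
    [0.5, 0.5],
    [1.0, 0.0],
]


def mdp_to_mrp(P, R, policy):
    """Average transitions and rewards over the policy's action probabilities."""
    n_states, n_actions = len(P), len(P[0])
    P_pi = [
        [sum(policy[s][a] * P[s][a][t] for a in range(n_actions)) for t in range(n_states)]
        for s in range(n_states)
    ]
    R_pi = [sum(policy[s][a] * R[s][a] for a in range(n_actions)) for s in range(n_states)]
    return P_pi, R_pi


# P_pi and R_pi no longer mention actions: together they define an MRP.
P_pi, R_pi = mdp_to_mrp(P, R, policy)
```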
How to find the optimal policy:
Obtain the optimal value function -> maximize the Q function over actions -> get the optimal policy.
That is, once the Q function is known, pick in each state the action with the largest Q value. This gives the optimal policy directly.
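
As a small sketch of this step, assume the optimal Q function is already available as a table. The numbers in `Q` and the function names are made up for illustration.

```python
# Hypothetical optimal Q table: Q[s][a] is the value of taking action a in state s.
Q = [
    [1.0, 3.5],
    [2.0, 0.5],
    [4.0, 4.2],
]


def greedy_policy(Q):
    """For each state, return the index of the action with the largest Q value."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]


def state_values(Q):
    """The optimal state value is the largest Q value in each state."""
    return [max(row) for row in Q]
```

With this table, `greedy_policy(Q)` returns `[1, 0, 1]` and `state_values(Q)` returns `[3.5, 2.0, 4.2]`.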

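Methods:
Exhaustive search (generally not used)
Policy iteration: alternate between evaluating the current policy (computing its value function) and improving it (computing the Q function from that value function and taking the maximizing action in each state).
Value iteration: keep iterating the Bellman optimality equation until the value function converges, then extract the optimal policy.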
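
Both iterative methods can be sketched on a toy problem. The 2-state, 2-action MDP, `GAMMA`, and all function names below are assumptions made up for illustration, not taken from the original notes.

```python
GAMMA = 0.9  # discount factor (assumed)

# Hypothetical MDP. P[s][a][t]: transition probability, R[s][a]: reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.0, 1.0]],
]
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]
N_STATES, N_ACTIONS = 2, 2


def q_from_v(v):
    """Q(s, a) = R(s, a) + GAMMA * sum_t P(t | s, a) * V(t)."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_STATES))
         for a in range(N_ACTIONS)]
        for s in range(N_STATES)
    ]


def greedy(q):
    """Pick the action with the largest Q value in each state."""
    return [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES)]


def value_iteration(tol=1e-10):
    """Iterate the Bellman optimality equation V(s) = max_a Q(s, a) until convergence."""
    v = [0.0] * N_STATES
    while True:
        new_v = [max(row) for row in q_from_v(v)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v, greedy(q_from_v(new_v))
        v = new_v


def evaluate_policy(policy, tol=1e-10):
    """Iterate the Bellman equation for a fixed deterministic policy."""
    v = [0.0] * N_STATES
    while True:
        q = q_from_v(v)
        new_v = [q[s][policy[s]] for s in range(N_STATES)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def policy_iteration():
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = [0] * N_STATES
    while True:
        v = evaluate_policy(policy)
        new_policy = greedy(q_from_v(v))
        if new_policy == policy:
            return v, policy
        policy = new_policy
```

On this toy MDP, both `value_iteration()` and `policy_iteration()` should end with the policy `[1, 0]` and nearly identical value functions.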

Copyright notice
This article was written by [Infinite power]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/204/202207221846584482.html
