当前位置:网站首页>2.2 DP: Value Iteration & Gambler‘s Problem
2.2 DP: Value Iteration & Gambler‘s Problem
2022-07-03 10:09:00 【Most appropriate commitment】
Catalog
Value Iteration
Background
Policy iteration's process is that after value function (policy evaluation) converges, the policy then improved. In fact, there is no need that value function converges because even no the last many sweeps of policy evaluation, the policy can converge to the same result. Therefore, actually we do not need to make policy evaluation converged.
Definition
We can go on several sweeps of policy evaluation, then turn into policy improvement and turn back to the policy evaluation until the policy converges to the optimal policy.
In extreme situations, we can make policy improvement just after one update of one state.
Pesudo code
while judge == True:
judge = False;
for
:

![v(s)\ =\ max_a \sum_{s^{'},r^{'}} \ p(s^{'},r^{'}|s,a)[r+\gamma v(s^{'})]](http://img.inotgo.com/imagesLocal/202202/15/202202150539010434_2.gif)
if
:
judge = True;
# optimal policy
![\pi(s)\ =\ argmax_a \ sum_{s^{'},a^{'}} \ p(s^{'},a^{'}|s,a)[r+\gamma v(s^{'})]](http://img.inotgo.com/imagesLocal/202202/15/202202150539010434_5.gif)
Gambler's Problem
Case

Case analysis
S: the money we have : 0-100, v(0)=0,v(100)=1 (constant)
A: the money we stake: 0-min( s,100-s )
R: no imtermediate reward, expect v(100)=1
: S+A or S-A, which depends on the probability:
(win)
Code
### settings
import math
import numpy
import random
# visualization
import matplotlib
import matplotlib.pyplot as plt
# global settings
MAX_MONEY = 100 ;
MIN_MONEY = 0 ;
P_H = 0.4 ;
gamma = 1 ; # episodic
# accuracy of value function
error = 0.001;
### functions
# max value function
def max_v_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return max(v_a)
# max value function index
def argmax_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return ( numpy.argmax(v_a) )
# visualization
def visualization(v_set,policy):
fig,axes = plt.subplots(2,1)
for i in range(0,len(v_set)):
axes[0].plot(v_set[i],linewidth=3)
# plt.pause(0.5)
axes[0].set_title('value function')
axes[1].plot(range(1,len(policy)+1),policy)
axes[1].set_title('policy')
plt.show()
### main programming
# policy
policy = numpy.zeros(MAX_MONEY-1);
# value function
v = numpy.zeros(MAX_MONEY+1,dtype = numpy.float);
#every_sweep_of value function
v_set = [] ;
# initialization
v[MAX_MONEY] = 1 ;
v[MIN_MONEY] = 0 ;
judge = True;
# value iteration
while(judge):
judge = False
for s in range(MIN_MONEY+1,MAX_MONEY):
v_old = v[s];
v[s] = max_v_a(v,s);
if math.fabs( v[s] - v_old ) > error:
judge = True;
v_set.append(v.copy())
# optimal policy
for s in range(MIN_MONEY+1,MAX_MONEY):
policy[s-1] = argmax_a(v,s)
# visualization
visualization(v_set,policy)
Result
for
=0.4

边栏推荐
- Leetcode - 1670 conception de la file d'attente avant, moyenne et arrière (conception - deux files d'attente à double extrémité)
- 2021-01-03
- is_ power_ of_ 2 judge whether it is a multiple of 2
- A lottery like scissors, stone and cloth (C language)
- LeetCode - 460 LFU 缓存(设计 - 哈希表+双向链表 哈希表+平衡二叉树(TreeSet))*
- 01仿B站项目业务架构
- ADS simulation design of class AB RF power amplifier
- Exception handling of arm
- Working mode of 80C51 Serial Port
- MySQL root user needs sudo login
猜你喜欢

Connect Alibaba cloud servers in the form of key pairs

2021-10-27

JS基础-原型原型链和宏任务/微任务/事件机制

LeetCode - 895 最大频率栈(设计- 哈希表+优先队列 哈希表 + 栈) *
![[Li Kou brush question notes (II)] special skills, module breakthroughs, classification and summary of 45 classic questions, and refinement in continuous consolidation](/img/06/7fd985faf8806ceface3864d4b3180.png)
[Li Kou brush question notes (II)] special skills, module breakthroughs, classification and summary of 45 classic questions, and refinement in continuous consolidation

JS foundation - prototype prototype chain and macro task / micro task / event mechanism

Pycharm cannot import custom package

Octave instructions

Swing transformer details-1

Assignment to '*' form incompatible pointer type 'linkstack' {aka '*'} problem solving
随机推荐
Dictionary tree prefix tree trie
LeetCode - 673. Number of longest increasing subsequences
LeetCode - 933 最近的请求次数
Gpiof6, 7, 8 configuration
LeetCode 面试题 17.20. 连续中值(大顶堆+小顶堆)
SCM is now overwhelming, a wide variety, so that developers are overwhelmed
Blue Bridge Cup for migrant workers majoring in electronic information engineering
Opencv gray histogram, histogram specification
Openeuler kernel technology sharing - Issue 1 - kdump basic principle, use and case introduction
LeetCode - 706 设计哈希映射(设计) *
Assignment to '*' form incompatible pointer type 'linkstack' {aka '*'} problem solving
LeetCode - 1172 餐盘栈 (设计 - List + 小顶堆 + 栈))
el-table X轴方向(横向)滚动条默认滑到右边
YOLO_ V1 summary
Google browser plug-in recommendation
LeetCode - 715. Range 模块(TreeSet) *****
2021-10-27
Opencv note 21 frequency domain filtering
Pymssql controls SQL for Chinese queries
Tensorflow2.0 save model