当前位置:网站首页>2.2 DP: Value Iteration & Gambler‘s Problem
2.2 DP: Value Iteration & Gambler‘s Problem
2022-07-03 10:09:00 【Most appropriate commitment】
Catalog
Value Iteration
Background
Policy iteration's process is that after value function (policy evaluation) converges, the policy then improved. In fact, there is no need that value function converges because even no the last many sweeps of policy evaluation, the policy can converge to the same result. Therefore, actually we do not need to make policy evaluation converged.
Definition
We can go on several sweeps of policy evaluation, then turn into policy improvement and turn back to the policy evaluation until the policy converges to the optimal policy.
In extreme situations, we can make policy improvement just after one update of one state.
Pesudo code
while judge == True:
judge = False;
for :
if :
judge = True;
# optimal policy
Gambler's Problem
Case
Case analysis
S: the money we have : 0-100, v(0)=0,v(100)=1 (constant)
A: the money we stake: 0-min( s,100-s )
R: no imtermediate reward, expect v(100)=1
: S+A or S-A, which depends on the probability:
(win)
Code
### settings
import math
import numpy
import random
# visualization
import matplotlib
import matplotlib.pyplot as plt
# global settings
MAX_MONEY = 100 ;
MIN_MONEY = 0 ;
P_H = 0.4 ;
gamma = 1 ; # episodic
# accuracy of value function
error = 0.001;
### functions
# max value function
def max_v_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return max(v_a)
# max value function index
def argmax_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return ( numpy.argmax(v_a) )
# visualization
def visualization(v_set,policy):
fig,axes = plt.subplots(2,1)
for i in range(0,len(v_set)):
axes[0].plot(v_set[i],linewidth=3)
# plt.pause(0.5)
axes[0].set_title('value function')
axes[1].plot(range(1,len(policy)+1),policy)
axes[1].set_title('policy')
plt.show()
### main programming
# policy
policy = numpy.zeros(MAX_MONEY-1);
# value function
v = numpy.zeros(MAX_MONEY+1,dtype = numpy.float);
#every_sweep_of value function
v_set = [] ;
# initialization
v[MAX_MONEY] = 1 ;
v[MIN_MONEY] = 0 ;
judge = True;
# value iteration
while(judge):
judge = False
for s in range(MIN_MONEY+1,MAX_MONEY):
v_old = v[s];
v[s] = max_v_a(v,s);
if math.fabs( v[s] - v_old ) > error:
judge = True;
v_set.append(v.copy())
# optimal policy
for s in range(MIN_MONEY+1,MAX_MONEY):
policy[s-1] = argmax_a(v,s)
# visualization
visualization(v_set,policy)
Result
for =0.4
边栏推荐
- Serial communication based on 51 single chip microcomputer
- 使用密钥对的形式连接阿里云服务器
- Opencv note 21 frequency domain filtering
- Positive and negative sample division and architecture understanding in image classification and target detection
- CV learning notes - scale invariant feature transformation (SIFT)
- Pymssql controls SQL for Chinese queries
- Leetcode 300 最长上升子序列
- 01 business structure of imitation station B project
- 自動裝箱與拆箱了解嗎?原理是什麼?
- Adaptiveavgpool1d internal implementation
猜你喜欢
Adaptiveavgpool1d internal implementation
openEuler kernel 技术分享 - 第1期 - kdump 基本原理、使用及案例介绍
Open Euler Kernel Technology Sharing - Issue 1 - kdump Basic Principles, use and Case Introduction
Leetcode 300 longest ascending subsequence
Leetcode bit operation
LeetCode - 5 最长回文子串
pycharm 无法引入自定义包
Yocto technology sharing phase IV: customize and add software package support
Opencv notes 17 template matching
2312. Selling wood blocks | things about the interviewer and crazy Zhang San (leetcode, with mind map + all solutions)
随机推荐
CV learning notes - camera model (Euclidean transformation and affine transformation)
Opencv Harris corner detection
Screen display of charging pile design -- led driver ta6932
[combinatorics] Introduction to Combinatorics (combinatorial idea 3: upper and lower bound approximation | upper and lower bound approximation example Remsey number)
QT setting suspension button
Circular queue related design and implementation reference 1
For new students, if you have no contact with single-chip microcomputer, it is recommended to get started with 51 single-chip microcomputer
STM32 general timer 1s delay to realize LED flashing
51 MCU tmod and timer configuration
LeetCode - 508. Sum of subtree elements with the most occurrences (traversal of binary tree)
Replace the files under the folder with sed
Application of 51 single chip microcomputer timer
2021-01-03
Google browser plug-in recommendation
LeetCode - 1670 設計前中後隊列(設計 - 兩個雙端隊列)
JS基础-原型原型链和宏任务/微任务/事件机制
STM32 running lantern experiment - library function version
2021-10-27
The underlying principle of vector
Drive and control program of Dianchuan charging board for charging pile design