当前位置:网站首页>2.2 DP: Value Iteration & Gambler‘s Problem
2.2 DP: Value Iteration & Gambler‘s Problem
2022-07-03 10:09:00 【Most appropriate commitment】
Catalog
Value Iteration
Background
Policy iteration's process is that after value function (policy evaluation) converges, the policy then improved. In fact, there is no need that value function converges because even no the last many sweeps of policy evaluation, the policy can converge to the same result. Therefore, actually we do not need to make policy evaluation converged.
Definition
We can go on several sweeps of policy evaluation, then turn into policy improvement and turn back to the policy evaluation until the policy converges to the optimal policy.
In extreme situations, we can make policy improvement just after one update of one state.
Pesudo code
while judge == True:
judge = False;
for
:

![v(s)\ =\ max_a \sum_{s^{'},r^{'}} \ p(s^{'},r^{'}|s,a)[r+\gamma v(s^{'})]](http://img.inotgo.com/imagesLocal/202202/15/202202150539010434_2.gif)
if
:
judge = True;
# optimal policy
![\pi(s)\ =\ argmax_a \ sum_{s^{'},a^{'}} \ p(s^{'},a^{'}|s,a)[r+\gamma v(s^{'})]](http://img.inotgo.com/imagesLocal/202202/15/202202150539010434_5.gif)
Gambler's Problem
Case

Case analysis
S: the money we have : 0-100, v(0)=0,v(100)=1 (constant)
A: the money we stake: 0-min( s,100-s )
R: no imtermediate reward, expect v(100)=1
: S+A or S-A, which depends on the probability:
(win)
Code
### settings
import math
import numpy
import random
# visualization
import matplotlib
import matplotlib.pyplot as plt
# global settings
MAX_MONEY = 100 ;
MIN_MONEY = 0 ;
P_H = 0.4 ;
gamma = 1 ; # episodic
# accuracy of value function
error = 0.001;
### functions
# max value function
def max_v_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return max(v_a)
# max value function index
def argmax_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return ( numpy.argmax(v_a) )
# visualization
def visualization(v_set,policy):
fig,axes = plt.subplots(2,1)
for i in range(0,len(v_set)):
axes[0].plot(v_set[i],linewidth=3)
# plt.pause(0.5)
axes[0].set_title('value function')
axes[1].plot(range(1,len(policy)+1),policy)
axes[1].set_title('policy')
plt.show()
### main programming
# policy
policy = numpy.zeros(MAX_MONEY-1);
# value function
v = numpy.zeros(MAX_MONEY+1,dtype = numpy.float);
#every_sweep_of value function
v_set = [] ;
# initialization
v[MAX_MONEY] = 1 ;
v[MIN_MONEY] = 0 ;
judge = True;
# value iteration
while(judge):
judge = False
for s in range(MIN_MONEY+1,MAX_MONEY):
v_old = v[s];
v[s] = max_v_a(v,s);
if math.fabs( v[s] - v_old ) > error:
judge = True;
v_set.append(v.copy())
# optimal policy
for s in range(MIN_MONEY+1,MAX_MONEY):
policy[s-1] = argmax_a(v,s)
# visualization
visualization(v_set,policy)
Result
for
=0.4

边栏推荐
- Timer and counter of 51 single chip microcomputer
- 4G module at command communication package interface designed by charging pile
- Yocto technology sharing phase IV: customize and add software package support
- CV learning notes - scale invariant feature transformation (SIFT)
- Leetcode 300 最长上升子序列
- LeetCode - 460 LFU 缓存(设计 - 哈希表+双向链表 哈希表+平衡二叉树(TreeSet))*
- Pycharm cannot import custom package
- Emballage automatique et déballage compris? Quel est le principe?
- (1) 什么是Lambda表达式
- 4G module board level control interface designed by charging pile
猜你喜欢

el-table X轴方向(横向)滚动条默认滑到右边

使用密钥对的形式连接阿里云服务器

Basic use and actual combat sharing of crash tool

Dictionary tree prefix tree trie

LeetCode - 900. RLE 迭代器

Leetcode interview question 17.20 Continuous median (large top pile + small top pile)

CV learning notes - clustering

Vgg16 migration learning source code

LeetCode - 5 最长回文子串

Pymssql controls SQL for Chinese queries
随机推荐
01仿B站项目业务架构
自動裝箱與拆箱了解嗎?原理是什麼?
[untitled] proteus simulation of traffic lights based on 89C51 Single Chip Microcomputer
2. Elment UI date selector formatting problem
LeetCode - 705 设计哈希集合(设计)
[Li Kou brush question notes (II)] special skills, module breakthroughs, classification and summary of 45 classic questions, and refinement in continuous consolidation
Retinaface: single stage dense face localization in the wild
LeetCode - 706 设计哈希映射(设计) *
4G module IMEI of charging pile design
Dynamic layout management
byte alignment
Design of charging pile mqtt transplantation based on 4G EC20 module
LeetCode 面试题 17.20. 连续中值(大顶堆+小顶堆)
Dictionary tree prefix tree trie
Circular queue related design and implementation reference 1
openEuler kernel 技术分享 - 第1期 - kdump 基本原理、使用及案例介绍
使用sed替换文件夹下文件
Crash工具基本使用及实战分享
Serial port programming
JS foundation - prototype prototype chain and macro task / micro task / event mechanism