当前位置:网站首页>2.2 DP: Value Iteration & Gambler‘s Problem
2.2 DP: Value Iteration & Gambler‘s Problem
2022-07-03 10:09:00 【Most appropriate commitment】
Catalog
Value Iteration
Background
Policy iteration's process is that after value function (policy evaluation) converges, the policy then improved. In fact, there is no need that value function converges because even no the last many sweeps of policy evaluation, the policy can converge to the same result. Therefore, actually we do not need to make policy evaluation converged.
Definition
We can go on several sweeps of policy evaluation, then turn into policy improvement and turn back to the policy evaluation until the policy converges to the optimal policy.
In extreme situations, we can make policy improvement just after one update of one state.
Pesudo code
while judge == True:
judge = False;
for
:

![v(s)\ =\ max_a \sum_{s^{'},r^{'}} \ p(s^{'},r^{'}|s,a)[r+\gamma v(s^{'})]](http://img.inotgo.com/imagesLocal/202202/15/202202150539010434_2.gif)
if
:
judge = True;
# optimal policy
![\pi(s)\ =\ argmax_a \ sum_{s^{'},a^{'}} \ p(s^{'},a^{'}|s,a)[r+\gamma v(s^{'})]](http://img.inotgo.com/imagesLocal/202202/15/202202150539010434_5.gif)
Gambler's Problem
Case

Case analysis
S: the money we have : 0-100, v(0)=0,v(100)=1 (constant)
A: the money we stake: 0-min( s,100-s )
R: no imtermediate reward, expect v(100)=1
: S+A or S-A, which depends on the probability:
(win)
Code
### settings
import math
import numpy
import random
# visualization
import matplotlib
import matplotlib.pyplot as plt
# global settings
MAX_MONEY = 100 ;
MIN_MONEY = 0 ;
P_H = 0.4 ;
gamma = 1 ; # episodic
# accuracy of value function
error = 0.001;
### functions
# max value function
def max_v_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return max(v_a)
# max value function index
def argmax_a(v,s):
MIN_ACTION = 0 ;
MAX_ACTION = min( s, MAX_MONEY-s ) ;
v_a = numpy.zeros(MAX_ACTION+1 - MIN_ACTION ,dtype = numpy.float);
for a in range(MIN_ACTION, MAX_ACTION+1):
v_a [a-MIN_ACTION] = ( P_H*( 0 + gamma*v[s+a] ) + \
(1- P_H)*( 0 + gamma*v[s-a] ) );
return ( numpy.argmax(v_a) )
# visualization
def visualization(v_set,policy):
fig,axes = plt.subplots(2,1)
for i in range(0,len(v_set)):
axes[0].plot(v_set[i],linewidth=3)
# plt.pause(0.5)
axes[0].set_title('value function')
axes[1].plot(range(1,len(policy)+1),policy)
axes[1].set_title('policy')
plt.show()
### main programming
# policy
policy = numpy.zeros(MAX_MONEY-1);
# value function
v = numpy.zeros(MAX_MONEY+1,dtype = numpy.float);
#every_sweep_of value function
v_set = [] ;
# initialization
v[MAX_MONEY] = 1 ;
v[MIN_MONEY] = 0 ;
judge = True;
# value iteration
while(judge):
judge = False
for s in range(MIN_MONEY+1,MAX_MONEY):
v_old = v[s];
v[s] = max_v_a(v,s);
if math.fabs( v[s] - v_old ) > error:
judge = True;
v_set.append(v.copy())
# optimal policy
for s in range(MIN_MONEY+1,MAX_MONEY):
policy[s-1] = argmax_a(v,s)
# visualization
visualization(v_set,policy)
Result
for
=0.4

边栏推荐
- Gpiof6, 7, 8 configuration
- Not many people can finally bring their interests to college graduation
- YOLO_ V1 summary
- Emballage automatique et déballage compris? Quel est le principe?
- yocto 技术分享第四期:自定义增加软件包支持
- QT self drawing button with bubbles
- Markdown latex full quantifier and existential quantifier (for all, existential)
- Swing transformer details-2
- QT setting suspension button
- 03 FastJson 解决循环引用
猜你喜欢

Leetcode 300 longest ascending subsequence

openEuler kernel 技术分享 - 第1期 - kdump 基本原理、使用及案例介绍

CV learning notes - clustering

LeetCode - 508. 出现次数最多的子树元素和 (二叉树的遍历)

Blue Bridge Cup for migrant workers majoring in electronic information engineering

LeetCode - 900. RLE 迭代器

SCM is now overwhelming, a wide variety, so that developers are overwhelmed

For new students, if you have no contact with single-chip microcomputer, it is recommended to get started with 51 single-chip microcomputer

Gpiof6, 7, 8 configuration

Pycharm cannot import custom package
随机推荐
JS foundation - prototype prototype chain and macro task / micro task / event mechanism
Markdown latex full quantifier and existential quantifier (for all, existential)
Assignment to '*' form incompatible pointer type 'linkstack' {aka '*'} problem solving
QT detection card reader analog keyboard input
QT setting suspension button
The new series of MCU also continues the two advantages of STM32 product family: low voltage and energy saving
使用sed替换文件夹下文件
LeetCode - 460 LFU 缓存(设计 - 哈希表+双向链表 哈希表+平衡二叉树(TreeSet))*
Google browser plug-in recommendation
(1) 什么是Lambda表达式
Crash工具基本使用及实战分享
Dictionary tree prefix tree trie
A lottery like scissors, stone and cloth (C language)
Yocto Technology Sharing Phase 4: Custom add package support
Of course, the most widely used 8-bit single chip microcomputer is also the single chip microcomputer that beginners are most easy to learn
Notes on C language learning of migrant workers majoring in electronic information engineering
(1) What is a lambda expression
Opencv notes 20 PCA
03 fastjason solves circular references
2020-08-23