当前位置:网站首页>3.1 Monte Carlo Methods & case study: Blackjack of on-Policy Evaluation
3.1 Monte Carlo Methods & case study: Blackjack of on-Policy Evaluation
2022-07-03 10:09:00 【Most appropriate commitment】
Catalog
Monte Carlo
Definition
When we do not know the environemnt, we could not use Dynamic Programming to get the value functions. But I f we could get lots of samples, we could get exact enough value functions by averging these data.
Monte Carlo Prediction
Definition
When we get the samples, we could know the sequence of
, then we could get returns by equation:
.
So, we could average the samples of returns by averagng them in every state.
First-visit:
only average returns in state s when visiting s for the first time in this sample.
Every-visit:
average returns in state s when visiting s everytime in this sample.
Pesudocode
Initialization:
Returns = [ ]
Loop for N times:
generate an episide following the spesific policy
G = 0
loop for each step of episode, t = T-1, T-2,......,0

find the first visit of every s that appears in the episode and update
( first - visit monte carlo prediction )
update
for every visit of every s that appears in the episode.
Blackjack
case

case analysis
S: 1. current sum number, Ace will be considered as 11, unless it will go bust.
2. rival's shown card number.
3. if Ace can be considered as 1 from 11.
A: hit or stick
R: no immediate reward, the terminal reward is 1, 0, -1 for win, draw, lose
Next S: sum number is below 21
our policy: hits unless 20, 21
rival's policy: hits unless 17 or more
PS: there is no consideration about the finite deck. We think it's a infinite deck ( with replacement )
Code
### settings
import math
import numpy
import random
# visualization
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
UP_BOUNDARY = 21 ;
LIMITE_RIVAL = 17;
LIMITE_AGENT = 20 ;
MIN_S = 4;
MAX_S = 21;
MIN_SHOWN_CARD = 1
MAX_SHOWN_CARD = 10
USEABLE_ACE = 1 ;
NON_USEABLE_ACE =0;
LOOP = 100000 ;
gamma = 1;
### functions and classes
# environment
def get_one_more_card():
card = random.randint(1,14);
card = min(10, card)
return card;
# the class of agent and rival
class Agent_rival_class( ):
def __init__(self):
self.card = 0 ;
self.ace_usable = 0 ;
self.shown_card = 0 ;
self.state = True ; # not go bust: True ; go bust: False
self.cards=[];
for i in range(0,2):
card = get_one_more_card();
self.cards.append(card);
# set the shown card
if i==0:
self.shown_card = card ;
# set the ace_usable state
if card == 1 and self.ace_usable == 0:
card = 11
self.ace_usable = 1
self.card += card
### method
def hit(self):
""" get one more card """
card = get_one_more_card();
self.cards.append(card);
if self.ace_usable == 0:
if card ==1 and self.card + 11 <= 21:
self.ace_usable = 1;
card = 11;
self.card += card ;
else:
if self.card + card <=21:
self.card += card;
else:
self.card -= 10;
self.card += card;
self.ace_usable=0;
# check state
if self.card <= 21:
self.state = True ;
else:
self.state = False;
def show_cards(self):
print("sum of my cards is "+str(self.card));
print("the cards I get are "+str(self.cards));
print("the state is "+ str(self.state));
### main programming
# initalization
v = numpy.zeros((MAX_S-MIN_S+1, MAX_SHOWN_CARD-MIN_SHOWN_CARD+1, 2))
v_update_num = numpy.zeros((MAX_S-MIN_S+1, MAX_SHOWN_CARD-MIN_SHOWN_CARD+1, 2))
for loop in range(0,LOOP):
S=[];
R=[];
agent = Agent_rival_class();
rival = Agent_rival_class();
if agent.card == 21 or rival.card ==21:
continue;
S.append((agent.card-MIN_S, rival.shown_card-MIN_SHOWN_CARD, agent.ace_usable))
### get the sample
while (agent.card < LIMITE_AGENT) :
agent.hit();
if agent.state == True:
R.append(0);
S.append((agent.card-MIN_S, rival.shown_card-MIN_SHOWN_CARD, agent.ace_usable));
else:
R.append(-1);
if agent.state == True:
while (rival.card < LIMITE_RIVAL ):
rival.hit();
if agent.state == True and rival.state == False :
R.append(1);
if agent.state == True and rival.state == True :
if agent.card > rival.card:
R.append(1);
elif agent.card == rival.card:
R.append(0);
else:
R.append(-1);
### sample ends
### update the returns
for j in range(1,len(S)+1):
i=-j;
if i == -1:
#print(S[i])
v[S[i]] = ( v_update_num[S[i]]*v[S[i]] + R[i] )/( v_update_num[S[i]]+1);
v_update_num[S[i]] += 1;
else:
#print(S[i])
v[S[i]] = ( v_update_num[S[i]]*v[S[i]] + R[i] + gamma*v[S[i+1]])/( v_update_num[S[i]]+1);
v_update_num[S[i]] += 1;
### visualization
v_0 = v[:,:,0]
v_1 = v[:,:,1]
fig, axes = plt.subplots(1,2)
xlabel=[]
ylabel=[]
for i in range(4,21+1):
ylabel.append(str(i))
for j in range(1,10+1):
xlabel.append(str(j))
axes[0].set_xticks(range(0,10,1))
axes[0].set_xticklabels(xlabel)
axes[0].set_yticks(range(0,18,1) )
axes[0].set_yticklabels(ylabel)
axes[0].set_title('when no usable Ace',fontsize=10)
im1 = axes[0].imshow(v_0,cmap=plt.cm.cool,vmin=-1, vmax=1)
axes[1].set_xticks(range(0,10,1))
axes[1].set_xticklabels(xlabel)
axes[1].set_yticks(range(0,18,1) )
axes[1].set_yticklabels(ylabel)
axes[1].set_title('when having usable Ace',fontsize=10)
im2 = axes[1].imshow(v_1,cmap=plt.cm.cool,vmin=-1, vmax=1)
fig = axes[1].figure
fig.suptitle('value function',fontsize=15)
fig.colorbar(im1,ax=axes.ravel().tolist())
plt.show()
Result

边栏推荐
- 2021-11-11 standard thread library
- Pymssql controls SQL for Chinese queries
- Leetcode - 895 maximum frequency stack (Design - hash table + priority queue hash table + stack)*
- Stm32f04 clock configuration
- Working mode of 80C51 Serial Port
- 2021-10-27
- Tensorflow built-in evaluation
- MySQL root user needs sudo login
- [combinatorics] Introduction to Combinatorics (combinatorial idea 3: upper and lower bound approximation | upper and lower bound approximation example Remsey number)
- (2) New methods in the interface
猜你喜欢

openEuler kernel 技術分享 - 第1期 - kdump 基本原理、使用及案例介紹

Gpiof6, 7, 8 configuration

LeetCode - 673. 最长递增子序列的个数

LeetCode - 895 最大频率栈(设计- 哈希表+优先队列 哈希表 + 栈) *

Development of intelligent charging pile (I): overview of the overall design of the system

CV learning notes - camera model (Euclidean transformation and affine transformation)

2312. Selling wood blocks | things about the interviewer and crazy Zhang San (leetcode, with mind map + all solutions)

Opencv histogram equalization

There is no shortcut to learning and development, and there is almost no situation that you can learn faster by leading the way

Assignment to '*' form incompatible pointer type 'linkstack' {aka '*'} problem solving
随机推荐
Retinaface: single stage dense face localization in the wild
It is difficult to quantify the extent to which a single-chip computer can find a job
2021-10-28
LeetCode - 673. Number of longest increasing subsequences
Opencv feature extraction sift
Cases of OpenCV image enhancement
使用sed替换文件夹下文件
4G module board level control interface designed by charging pile
I think all friends should know that the basic law of learning is: from easy to difficult
Vscode markdown export PDF error
Connect Alibaba cloud servers in the form of key pairs
Opencv notes 20 PCA
Stm32 NVIC interrupt priority management
RESNET code details
LeetCode - 703 数据流中的第 K 大元素(设计 - 优先队列)
Simulate mouse click
STM32 general timer 1s delay to realize LED flashing
Liquid crystal display
[combinatorics] Introduction to Combinatorics (combinatorial idea 3: upper and lower bound approximation | upper and lower bound approximation example Remsey number)
Application of 51 single chip microcomputer timer