3.1 Monte Carlo Methods & case study: on-policy evaluation in Blackjack
Monte Carlo
Definition
When we do not know the environment, we cannot use Dynamic Programming to compute the value functions. But if we can collect enough samples, we can still estimate the value functions accurately enough by averaging these data.
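A minimal sketch of this idea (my own hypothetical example, not from the original post): estimate the expected value of a die roll purely by averaging samples drawn from the environment, without ever using the underlying distribution.
import random
samples = [random.randint(1, 6) for _ in range(100000)]   # draw samples from the "environment"
estimate = sum(samples) / len(samples)                     # Monte Carlo estimate of the expectation
print(estimate)                                            # approaches 3.5 as the number of samples grows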
Monte Carlo Prediction
Definition
When we get a sample episode, we know the whole sequence $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T$, so we can compute the return from each step by
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T.$$
We then estimate the value of each state by averaging the sampled returns observed in that state.
First-visit:
average only the return that follows the first visit to state s in each sample episode.
Every-visit:
average the returns that follow every visit to state s in each sample episode.
Pseudocode
Initialization:
    V(s) arbitrary, Returns(s) = [ ] for every state s
Loop for N times:
    generate an episode following the specific policy: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
    G = 0
    loop for each step of the episode, t = T-1, T-2, ..., 0:
        G = gamma * G + R_{t+1}
        first-visit Monte Carlo prediction: if S_t does not appear earlier in the episode, append G to Returns(S_t) and update V(S_t) = average( Returns(S_t) )
        every-visit Monte Carlo prediction: append G to Returns(S_t) and update V(S_t) for every visit of S_t in the episode
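The pseudocode above can be turned into a short Python sketch. This is my own illustration rather than the code of this post; generate_episode is a hypothetical helper that plays one episode under the given policy and returns a list of (state, reward) pairs (S_t, R_{t+1}).
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)    # sum of first-visit returns per state
    returns_count = defaultdict(int)    # number of first visits per state
    V = defaultdict(float)              # value estimates
    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(S_0, R_1), ..., (S_{T-1}, R_T)]
        states = [s for s, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # t = T-1, T-2, ..., 0
            s, r = episode[t]
            G = gamma * G + r                       # return that follows time t
            if s not in states[:t]:                 # first-visit check; drop it for every-visit MC
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V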
Blackjack
Case

Case analysis
S: 1. the current sum of our cards; an Ace is counted as 11 unless that would make us go bust.
2. the rival's shown card.
3. whether we hold a usable Ace, i.e. an Ace currently counted as 11 that can still be downgraded to 1.
A: hit or stick
R: no immediate reward; the terminal reward is +1, 0, -1 for a win, a draw, a loss respectively
Next S: the state after the action, as long as the sum does not exceed 21; going above 21 ends the episode as a bust
our policy: hit until the sum reaches 20 or 21
rival's policy: hit until the sum reaches 17 or more
PS: the finite deck is not modelled; we treat it as an infinite deck (cards drawn with replacement). A small worked example of the state encoding is given right after this list.
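As a worked example of the state encoding (my own illustration, consistent with the constants used in the code below): the state is the triple (current sum, rival's shown card, usable-Ace flag), and it is mapped to an index of the value array by subtracting the minimum sum and the minimum shown card.
MIN_S, MIN_SHOWN_CARD = 4, 1           # same constants as in the code below
state = (13, 10, 0)                    # our sum is 13, the rival shows a 10, no usable Ace
index = (state[0] - MIN_S, state[1] - MIN_SHOWN_CARD, state[2])
print(index)                           # (9, 9, 0): the slot of this state in the value array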
Code
### settings
import random
import numpy
# visualization
import matplotlib.pyplot as plt

UP_BOUNDARY = 21
LIMITE_RIVAL = 17        # the rival hits until reaching 17
LIMITE_AGENT = 20        # we hit until reaching 20
MIN_S = 4                # smallest possible starting sum
MAX_S = 21
MIN_SHOWN_CARD = 1
MAX_SHOWN_CARD = 10
USEABLE_ACE = 1
NON_USEABLE_ACE = 0
LOOP = 100000            # number of sample episodes
gamma = 1                # no discounting
### functions and classes
# environment: draw one card from an infinite deck (with replacement);
# ranks 10, J, Q, K all count as 10
def get_one_more_card():
    card = random.randint(1, 13)
    return min(10, card)

# the class shared by the agent and the rival
class Agent_rival_class():
    def __init__(self):
        self.card = 0          # current sum of the cards
        self.ace_usable = 0    # 1 if an Ace is currently counted as 11
        self.shown_card = 0    # the first card, which the opponent can see
        self.state = True      # not gone bust: True; gone bust: False
        self.cards = []
        for i in range(0, 2):
            card = get_one_more_card()
            self.cards.append(card)
            # set the shown card
            if i == 0:
                self.shown_card = card
            # count the first Ace as 11 and mark it as usable
            if card == 1 and self.ace_usable == 0:
                card = 11
                self.ace_usable = 1
            self.card += card

    ### methods
    def hit(self):
        """ get one more card """
        card = get_one_more_card()
        self.cards.append(card)
        if self.ace_usable == 0:
            # a new Ace counts as 11 only if that does not cause a bust
            if card == 1 and self.card + 11 <= 21:
                self.ace_usable = 1
                card = 11
            self.card += card
        else:
            if self.card + card <= 21:
                self.card += card
            else:
                # downgrade the usable Ace from 11 to 1 before adding the new card
                self.card -= 10
                self.card += card
                self.ace_usable = 0
        # check whether we have gone bust
        if self.card <= 21:
            self.state = True
        else:
            self.state = False

    def show_cards(self):
        print("sum of my cards is " + str(self.card))
        print("the cards I get are " + str(self.cards))
        print("the state is " + str(self.state))
### main programming
# initialization of the value table and the per-state update counters
v = numpy.zeros((MAX_S - MIN_S + 1, MAX_SHOWN_CARD - MIN_SHOWN_CARD + 1, 2))
v_update_num = numpy.zeros((MAX_S - MIN_S + 1, MAX_SHOWN_CARD - MIN_SHOWN_CARD + 1, 2))
for loop in range(0, LOOP):
    S = []
    R = []
    agent = Agent_rival_class()
    rival = Agent_rival_class()
    # skip episodes where either side starts with a natural 21
    if agent.card == 21 or rival.card == 21:
        continue
    S.append((agent.card - MIN_S, rival.shown_card - MIN_SHOWN_CARD, agent.ace_usable))
    ### generate one sample episode
    # our policy: hit until the sum reaches 20 or 21
    while agent.card < LIMITE_AGENT:
        agent.hit()
        if agent.state == True:
            R.append(0)
            S.append((agent.card - MIN_S, rival.shown_card - MIN_SHOWN_CARD, agent.ace_usable))
        else:
            R.append(-1)
    # the rival only plays if we did not go bust; it hits until 17 or more
    if agent.state == True:
        while rival.card < LIMITE_RIVAL:
            rival.hit()
        if rival.state == False:
            R.append(1)
        elif agent.card > rival.card:
            R.append(1)
        elif agent.card == rival.card:
            R.append(0)
        else:
            R.append(-1)
    ### sample ends
    ### update the value estimates with the sampled return (every-visit Monte Carlo)
    G = 0
    for t in range(len(S) - 1, -1, -1):
        G = R[t] + gamma * G                      # return that follows state S[t]
        n = v_update_num[S[t]]
        v[S[t]] = (n * v[S[t]] + G) / (n + 1)     # incremental average of the returns
        v_update_num[S[t]] += 1
### visualization
v_0 = v[:, :, 0]   # no usable Ace
v_1 = v[:, :, 1]   # usable Ace
fig, axes = plt.subplots(1, 2)
xlabel = [str(j) for j in range(MIN_SHOWN_CARD, MAX_SHOWN_CARD + 1)]
ylabel = [str(i) for i in range(MIN_S, MAX_S + 1)]
axes[0].set_xticks(range(0, 10, 1))
axes[0].set_xticklabels(xlabel)
axes[0].set_yticks(range(0, 18, 1))
axes[0].set_yticklabels(ylabel)
axes[0].set_title('when no usable Ace', fontsize=10)
im1 = axes[0].imshow(v_0, cmap=plt.cm.cool, vmin=-1, vmax=1)
axes[1].set_xticks(range(0, 10, 1))
axes[1].set_xticklabels(xlabel)
axes[1].set_yticks(range(0, 18, 1))
axes[1].set_yticklabels(ylabel)
axes[1].set_title('when having usable Ace', fontsize=10)
im2 = axes[1].imshow(v_1, cmap=plt.cm.cool, vmin=-1, vmax=1)
fig.suptitle('value function', fontsize=15)
fig.colorbar(im1, ax=axes.ravel().tolist())
plt.show()
Result
(figure: heat maps of the estimated state-value function produced by the code above; left panel without a usable Ace, right panel with a usable Ace)