3.1 Monte Carlo Methods & Case Study: On-Policy Evaluation of Blackjack
Monte Carlo
Definition
When we do not know the environment, we cannot use Dynamic Programming to compute the value functions. But if we can collect a large number of sample episodes, we can estimate the value functions accurately enough by averaging over these samples.
Monte Carlo Prediction
Definition
When we get a sample episode, we know the sequence $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T$, so we can compute the return of every step by
$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}$.
So we can estimate the value function $v_\pi(s)$ by averaging the sampled returns observed in every state $s$.
First-visit:
only average the returns observed in state s at the first visit to s in each episode.
Every-visit:
average the returns observed in state s at every visit to s in each episode.
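To see the difference, here is a minimal sketch on a toy episode (the helper mc_returns and the episode itself are made up for illustration and are not part of the blackjack code below). It collects the returns of one episode under both rules:

def mc_returns(episode, gamma=1.0, first_visit=True):
    # episode: list of (state, reward) pairs, where reward is received after leaving that state
    # returns a dict {state: [returns observed in this episode]}
    first = {}
    for t, (s, _) in enumerate(episode):
        first.setdefault(s, t)          # index of the first visit to each state
    returns = {}
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):
        s, r = episode[t]
        G = r + gamma * G               # return from step t
        if (not first_visit) or first[s] == t:
            returns.setdefault(s, []).append(G)
    return returns

episode = [('a', 0), ('b', 1), ('a', 0), ('c', 1)]   # state 'a' is visited twice
print(mc_returns(episode, first_visit=True))    # {'c': [1.0], 'b': [2.0], 'a': [2.0]}
print(mc_returns(episode, first_visit=False))   # {'c': [1.0], 'a': [1.0, 2.0], 'b': [2.0]}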
Pseudocode
Initialization:
    Returns(s) = [ ] for every state s
Loop for N episodes:
    generate an episode following the specific policy
    G = 0
    loop for each step of the episode, t = T-1, T-2, ..., 0:
        G = R_{t+1} + gamma * G
        first-visit: if S_t does not appear earlier in the episode, append G to Returns(S_t) and update V(S_t) = average(Returns(S_t))
        every-visit: append G to Returns(S_t) and update V(S_t) = average(Returns(S_t)) at every step
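A minimal sketch of this pseudocode follows (the names generate_episode and first_visit_mc_prediction are assumptions made here for illustration, not part of the blackjack code below), using an incremental average instead of storing every return:

from collections import defaultdict

def first_visit_mc_prediction(generate_episode, n_episodes, gamma=1.0):
    # generate_episode() is assumed to return one episode as a list of (state, reward) pairs
    V = defaultdict(float)      # running average of returns per state
    n = defaultdict(int)        # number of returns averaged into V[s]
    for _ in range(n_episodes):
        episode = generate_episode()
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G
            if first[s] == t:                   # first-visit check
                n[s] += 1
                V[s] += (G - V[s]) / n[s]       # incremental average of the returns
    return V

The blackjack code below follows the same pattern, with the value table stored as a numpy array instead of a dictionary.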
Blackjack
Case
Case analysis
S: 1. the current sum of the agent's cards; an Ace counts as 11 unless that would cause a bust.
2. the number of the rival's shown card.
3. whether the agent holds a usable Ace (an Ace counted as 11 that can still be demoted to 1).
A: hit or stick
R: no immediate reward; the terminal reward is +1, 0, or -1 for a win, draw, or loss.
Next S: any state whose sum does not exceed 21 (going above 21 ends the episode as a bust).
Our policy: hit until the sum reaches 20 or 21, then stick.
Rival's policy: hit until the sum reaches 17 or more, then stick.
PS: the finite deck is not modeled; we assume an infinite deck, i.e. cards are drawn with replacement.
Code
### settings
import math
import numpy
import random
# visualization
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
UP_BOUNDARY = 21        # going above 21 is a bust
LIMITE_RIVAL = 17       # the rival sticks once its sum reaches 17 or more
LIMITE_AGENT = 20       # the agent sticks once its sum reaches 20 or 21
MIN_S = 4               # minimum possible initial sum (2 + 2)
MAX_S = 21
MIN_SHOWN_CARD = 1
MAX_SHOWN_CARD = 10
USEABLE_ACE = 1
NON_USEABLE_ACE = 0
LOOP = 100000           # number of sampled episodes
gamma = 1
### functions and classes
# environment
def get_one_more_card():
    # draw a card with replacement: 1 (Ace) to 13 (King); face cards count as 10
    card = random.randint(1, 13)
    card = min(10, card)
    return card
# the class of agent and rival
class Agent_rival_class():
    def __init__(self):
        self.card = 0            # current sum of cards
        self.ace_usable = 0      # 1 if an Ace is currently counted as 11
        self.shown_card = 0      # the first card, which is shown to the opponent
        self.state = True        # not go bust: True ; go bust: False
        self.cards = []
        for i in range(0, 2):
            card = get_one_more_card()
            self.cards.append(card)
            # set the shown card
            if i == 0:
                self.shown_card = card
            # count the first Ace as 11 and mark it usable
            if card == 1 and self.ace_usable == 0:
                card = 11
                self.ace_usable = 1
            self.card += card
    ### method
    def hit(self):
        """ get one more card """
        card = get_one_more_card()
        self.cards.append(card)
        if self.ace_usable == 0:
            # a new Ace counts as 11 only if it does not cause a bust
            if card == 1 and self.card + 11 <= 21:
                self.ace_usable = 1
                card = 11
            self.card += card
        else:
            if self.card + card <= 21:
                self.card += card
            else:
                # demote the usable Ace from 11 to 1, then add the new card
                self.card -= 10
                self.card += card
                self.ace_usable = 0
        # check state
        if self.card <= 21:
            self.state = True
        else:
            self.state = False
    def show_cards(self):
        print("sum of my cards is " + str(self.card))
        print("the cards I get are " + str(self.cards))
        print("the state is " + str(self.state))
### main program
# initialization
v = numpy.zeros((MAX_S-MIN_S+1, MAX_SHOWN_CARD-MIN_SHOWN_CARD+1, 2))
v_update_num = numpy.zeros((MAX_S-MIN_S+1, MAX_SHOWN_CARD-MIN_SHOWN_CARD+1, 2))
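# note: the value table is indexed as
#   v[agent_sum - MIN_S, rival_shown_card - MIN_SHOWN_CARD, ace_usable],
# e.g. the state (sum 13, rival shows 6, usable Ace) is stored at v[9, 5, 1];
# v_update_num counts how many returns have been averaged into each entry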
for loop in range(0, LOOP):
    S = []
    R = []
    agent = Agent_rival_class()
    rival = Agent_rival_class()
    # skip episodes in which either side is dealt a natural 21
    if agent.card == 21 or rival.card == 21:
        continue
    S.append((agent.card - MIN_S, rival.shown_card - MIN_SHOWN_CARD, agent.ace_usable))
    ### get the sample
    # the agent hits until its sum reaches 20 or 21, or until it goes bust
    while agent.card < LIMITE_AGENT:
        agent.hit()
        if agent.state == True:
            R.append(0)
            S.append((agent.card - MIN_S, rival.shown_card - MIN_SHOWN_CARD, agent.ace_usable))
        else:
            R.append(-1)
    # the rival plays only if the agent did not go bust
    if agent.state == True:
        while rival.card < LIMITE_RIVAL:
            rival.hit()
    # terminal reward: +1 win, 0 draw, -1 lose
    if agent.state == True and rival.state == False:
        R.append(1)
    if agent.state == True and rival.state == True:
        if agent.card > rival.card:
            R.append(1)
        elif agent.card == rival.card:
            R.append(0)
        else:
            R.append(-1)
    ### sample ends
    ### update the value estimates with the sampled returns
    # walk backwards through the episode, accumulating the return G = R + gamma * G,
    # and keep an incremental average of the returns observed in each state
    # (every-visit; in blackjack a state cannot repeat within one episode,
    # so this is identical to first-visit)
    G = 0
    for j in range(1, len(S) + 1):
        i = -j
        G = R[i] + gamma * G
        v[S[i]] = (v_update_num[S[i]] * v[S[i]] + G) / (v_update_num[S[i]] + 1)
        v_update_num[S[i]] += 1
### visualization
v_0 = v[:,:,0]
v_1 = v[:,:,1]
fig, axes = plt.subplots(1,2)
xlabel=[]
ylabel=[]
for i in range(4, 21 + 1):
    ylabel.append(str(i))
for j in range(1, 10 + 1):
    xlabel.append(str(j))
axes[0].set_xticks(range(0,10,1))
axes[0].set_xticklabels(xlabel)
axes[0].set_yticks(range(0,18,1) )
axes[0].set_yticklabels(ylabel)
axes[0].set_title('when no usable Ace',fontsize=10)
im1 = axes[0].imshow(v_0,cmap=plt.cm.cool,vmin=-1, vmax=1)
axes[1].set_xticks(range(0,10,1))
axes[1].set_xticklabels(xlabel)
axes[1].set_yticks(range(0,18,1) )
axes[1].set_yticklabels(ylabel)
axes[1].set_title('when having usable Ace',fontsize=10)
im2 = axes[1].imshow(v_1,cmap=plt.cm.cool,vmin=-1, vmax=1)
fig = axes[1].figure
fig.suptitle('value function',fontsize=15)
fig.colorbar(im1,ax=axes.ravel().tolist())
plt.show()
Result
The script displays the estimated value function as two heat maps: one over states without a usable Ace and one over states with a usable Ace, with the agent's sum on the y-axis and the rival's shown card on the x-axis.