
Connect Four with an Upper-Confidence-Bound Monte Carlo Search Tree (UCT)

2022-07-26 22:39:00 biyezuopin


Algorithm

This experiment is implemented with a Monte Carlo search tree using upper confidence bounds (UCT). The algorithm is briefly described as follows:

Each node in the search tree represents a position. For a node x, its i-th child x_i represents the position reached after playing the i-th move from the position of x. The search takes the current position as the root. For each node, all simulated games passing through the position represented by that node or its descendants are counted, recording the total number of games n and the total number of wins m; the node's win probability is then estimated as m/n. For this estimate to be accurate, enough simulated games are needed. Each simulation starts from the position represented by the root and applies the following policy:

If the current position is already represented by a node x in the search tree, and every position reachable from it in one move is already represented by a child of x, play the move corresponding to the child with the largest upper confidence bound (UCB). If node x records n trials and m wins in total, and its i-th child x_i records n_i trials and m_i wins, that child's upper confidence bound is

UCB_i = m_i / n_i + c · sqrt(2 · ln(n) / n_i)

where c is a scale factor; this experiment uses c = 1 (the choice is discussed later). "Winning" here means a win for the player represented by the current node;
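The selection rule above can be sketched as follows. The `NodeStats` structure and function names are illustrative, not taken from the project's code; unvisited children never reach this function, because step 2 below expands them first.

```cpp
#include <cmath>
#include <vector>

// Per-child statistics; field and function names here are illustrative,
// not taken from the project's actual code.
struct NodeStats {
    int wins;    // m_i: wins recorded through this child
    int visits;  // n_i: simulations played through this child
};

// UCB_i = m_i / n_i + c * sqrt(2 * ln(n) / n_i)
double ucb(const NodeStats& child, int parentVisits, double c) {
    double exploit = (double)child.wins / child.visits;
    double explore = c * std::sqrt(2.0 * std::log((double)parentVisits) / child.visits);
    return exploit + explore;
}

// Index of the child with the largest confidence upper bound. All children
// are assumed to have been visited at least once (step 2 expands new ones).
int selectChild(const std::vector<NodeStats>& children, int parentVisits, double c = 1.0) {
    int best = 0;
    double bestVal = ucb(children[0], parentVisits, c);
    for (int i = 1; i < (int)children.size(); ++i) {
        double v = ucb(children[i], parentVisits, c);
        if (v > bestVal) { bestVal = v; best = i; }
    }
    return best;
}
```

Note how a rarely visited child gets a large exploration term, so it is eventually tried even if its observed win rate is poor.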

If the current position is represented by a node x in the search tree, but some position reachable from it in one move is not yet represented by any child of x, play that move and expand a new node on the tree to record the new position;

If the current position is not represented by any node in the search tree, play a random move, except that ① if some move wins immediately, or ② if failing to play some move would lose immediately, then play that move.

After each simulation, the statistics recorded on the nodes along the path are updated. After enough simulations, the estimated win rates m_i/n_i of all children of the root converge to the true win rates, and the child with the largest estimate is chosen as the formal decision.
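The final decision step can be sketched like this; field names are hypothetical, and note that only the plain win-rate estimate m_i/n_i is compared here, with no exploration bonus.

```cpp
#include <vector>

// Hypothetical per-child statistics at the root after the search finishes.
struct ChildStat {
    int wins;    // m_i
    int visits;  // n_i
};

// Index of the root child with the highest estimated win rate, -1 if none.
int bestMove(const std::vector<ChildStat>& children) {
    int best = -1;
    double bestRate = -1.0;
    for (int i = 0; i < (int)children.size(); ++i) {
        if (children[i].visits == 0) continue;        // never simulated: skip
        double rate = (double)children[i].wins / children[i].visits;
        if (rate > bestRate) { bestRate = rate; best = i; }
    }
    return best;
}
```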

The UCT algorithm balances exploration and exploitation: when a move has been tried only a few times, its confidence upper bound is high, so the algorithm keeps trying unfamiliar decisions (exploration); when a move's estimated win rate is high, its confidence upper bound is also high, so the algorithm concentrates on those moves and avoids wasting too much computation on positions with low win rates (exploitation).

The algorithm above is not the naive UCT algorithm; two improvements are worth noting:

  • When a node x does not yet have all of its children, they are not all expanded at once but one at a time (see step 2 above), which avoids spending too much computation in one go when a node is expanded.
  • Step 3 is not a purely random policy: it perceives positions where the next move is bound to win or lose. Deciding this costs little computation, yet it effectively avoids the blindness of a purely random policy.
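The second improvement, checking for a forced move before falling back to a random drop, might look like the sketch below. The board representation (0 empty, players 1 and 2, row 0 at the bottom) is an assumption for illustration; the project's own Board class is not shown in the article.

```cpp
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Lowest empty row in a column (row 0 is the bottom here), or -1 if full.
int dropRow(const Grid& b, int col) {
    for (int r = 0; r < (int)b.size(); ++r)
        if (b[r][col] == 0) return r;
    return -1;
}

// Would dropping `player` into `col` make four in a row?
bool isWinningMove(Grid b, int col, int player) {    // board copied on purpose
    int row = dropRow(b, col);
    if (row < 0) return false;
    b[row][col] = player;                            // simulate the drop
    const int dirs[4][2] = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};
    int R = (int)b.size(), C = (int)b[0].size();
    for (const auto& d : dirs) {
        int count = 1;
        for (int s : {-1, 1})                        // walk both directions
            for (int k = 1; k < 4; ++k) {
                int r = row + s * k * d[0], c = col + s * k * d[1];
                if (r < 0 || r >= R || c < 0 || c >= C || b[r][c] != player)
                    break;
                ++count;
            }
        if (count >= 4) return true;
    }
    return false;
}

// Rule ① / ② of step 3: a column that wins at once, else one that must be
// blocked because the opponent would win there next; -1 if no forced move.
int forcedMove(const Grid& b, int player) {
    int C = (int)b[0].size(), opp = 3 - player;
    for (int c = 0; c < C; ++c)
        if (isWinningMove(b, c, player)) return c;
    for (int c = 0; c < C; ++c)
        if (isWinningMove(b, c, opp)) return c;
    return -1;
}
```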

Measurement and evaluation

To evaluate the algorithm, the following measurements were made. Although the random noise is large, they are still informative.

The Monte Carlo algorithm needs a certain number of simulations for the estimated win rate m/n to converge to the true win rate. To analyze whether the number of simulations performed by the algorithm is sufficient, several reference AIs were chosen as opponents. Each data point covers 20 games, half playing first and half playing second. The relationship between program running time (proportional to the number of simulations) and win rate is as follows:

| Elapsed time | 10.dll | 20.dll | 30.dll | 40.dll | 50.dll | 60.dll | 70.dll | 80.dll | 90.dll | 100.dll |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5s | 1 | 1 | N/A | 1 | 0.95 | 0.95 | 0.95 | 0.85 | 0.95 | 0.85 |
| 1.0s | 1 | 1 | N/A | 1 | 0.95 | 1 | 1 | 1 | 0.95 | 0.65 |
| 1.5s | 1 | 1 | N/A | 0.9 | 1 | 1 | 0.95 | 1 | 1 | 0.6 |
| 2.0s | 1 | 0.95 | N/A | 1 | 1 | 0.95 | 0.85 | 1 | 1 | 0.75 |
| 2.5s | 1 | 0.95 | N/A | 1 | 1 | 0.95 | 1 | 0.95 | 1 | 0.75 |

It can be seen that within the range 0.5s to 2.5s, there is no statistically significant difference in win rate across running times, so the number of simulations can be considered sufficient.

The upper confidence bound contains a scale factor c. To determine the best value of c, the following measurements were made, again with 20 games per data point, half playing first and half playing second.

| c | 70.dll | 80.dll | 90.dll | 100.dll |
|---|---|---|---|---|
| 0.1 | 0 | 0.1 | 0.05 | 0 |
| 0.4 | 0.9 | 1 | 0.85 | 0.55 |
| 0.7 | 0.95 | 0.95 | 1 | 0.75 |
| 1.0 | 1 | 0.85 | 0.95 | 0.8 |
| 1.3 | 1 | 1 | 1 | 0.7 |
| 1.6 | 1 | 0.95 | 0.95 | 0.5 |
| 1.9 | 0.95 | 0.9 | 0.75 | 0.6 |

(Figure: win rate plotted against c.)

It can be seen that the win rate drops significantly for c ≤ 0.5 or c ≥ 1.5, while within 0.5 < c < 1.5 there is no significant difference. This experiment uses c = 1 as the final parameter.

Other attempted improvements

Besides the two improvements described in the "Algorithm" section, I also tried the following improvement, but experiments showed that it did not work well, so it was not adopted in the final version.

One drawback of the UCT algorithm is that a better decision must be tried enough times before its value is reflected in the estimated win rate m/n, so a lot of computation is wasted. This is because a node's win rate is the weighted average of the win rates of all of its children, rather than the maximum win rate among its children.

I tried treating a node's win probability as a random variable, assumed to follow a normal distribution N(μ, σ), and used μ + σ in place of the confidence upper bound in the Monte Carlo search. Each node's win probability is defined as the maximum of the win probabilities of its children. Although the distribution of the maximum of several normal variables has no analytical form, it can be fitted with a simple function; in this way the win probability of every node in the tree can be estimated, and the decision with the highest expected win probability is chosen at the end.
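A minimal sketch of the rejected scoring rule, assuming μ is the observed win rate and σ the binomial standard error of that estimate; the article does not specify the exact estimator, so the choice of σ here is an assumption.

```cpp
#include <cmath>

// Score a child by mu + sigma: mu is the observed win rate, sigma a rough
// standard deviation of that estimate (binomial standard error -- this
// estimator is assumed, the article only states N(mu, sigma) and mu + sigma).
double muPlusSigma(int wins, int visits) {
    double mu = (double)wins / visits;
    double sigma = std::sqrt(mu * (1.0 - mu) / visits);
    return mu + sigma;
}
```

With this rule, the fewer the visits, the larger σ, so an uncertain child still gets a bonus, analogous to the exploration term in UCB.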

Experiments showed that this method does not perform as well as the naive algorithm. I think the likely reasons are:

  • The analysis in the "Measurement and evaluation" section shows that the number of simulations in the naive algorithm is already sufficient, so the computation wasted by UCT did not have much negative effect;
  • The maximum of several normally distributed variables no longer follows a normal distribution, so estimates made under this assumption are biased.

Description of code implementation

In the code, the Board class manages the current position, maintaining the piece placements and the top (next free cell per column) information; the Search class executes the UCT search; Strategy.cpp interacts with the framework.
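A hypothetical skeleton consistent with that description; the article does not show the real interface, so everything beyond the Board name and the "top" information is a guess.

```cpp
#include <vector>

// Sketch of a Board class as described above. Method names and the exact
// meaning of top_ are assumptions, not the project's actual interface.
class Board {
public:
    Board(int rows, int cols)
        : rows_(rows), top_(cols, 0), grid_(rows, std::vector<int>(cols, 0)) {}

    // Drop a piece for `player` into `col`; false if the column is full.
    bool drop(int col, int player) {
        if (top_[col] >= rows_) return false;
        grid_[top_[col]++][col] = player;   // top_ = next free row per column
        return true;
    }

    int height(int col) const { return top_[col]; }  // pieces in the column

private:
    int rows_;
    std::vector<int> top_;                  // the "top" info from the article
    std::vector<std::vector<int>> grid_;    // 0 empty, otherwise player id
};
```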

Original article: https://yzsam.com/2022/207/202207262149455411.html

Copyright notice: this article was written by biyezuopin; please include the original link when reposting.