MCTS Parameters

MCTS Parameters

TAG currently supports the following variants/parameterisations of MCTS within MCTSParams (as of March 2022). Given the options, it is recommended that an MCTS agent be defined by a JSON file, in which values for each of these can be specified. For more details and examples of specifying an MCTS agent in JSON, see this documentation.

K

The weighting to give the the exploration term in the UCB equation. This should be scaled to the game score - if the result of a game is 1 for a win, and 0 or -1 for a draw then a value of about 1.0 or sqrt(2) is the standard choice. This scaling to the score will take place automatically if the normaliseRewards parameter is true (which is the default). In this case the largest and smallest final scores seen on all rollouts is tracked during each MCTS iteration, and used to scale all future rewards to the range [0, 1].

rolloutLength

The maximum number of actions to take during the rollout phase of MCTS, i.e. once we leave the tree.

rolloutTermination

This has four possible values. If rolloutLength is greater than zero, then the rollout will run for this number of actions, and then terminate once the specified condition is met:

DEFAULT : Terminate immediately the rolloutLength is reached.

END_TURN : Terminate only at the end of our turn (‘our’ refers to the root player).

START_TURN : Terminate only at the start of our turn (before we have taken any actions).

END_ROUND : Terminate only at the end of a full game round.

maxTreeDepth

The maximum depth to grow the tree. If an MCTS iteration reaches this depth, then it will not expand any nodes beyond this level, and will move to the rollout phase.

information

This has three possible values:

  1. Open_Loop A perfect information game is assumed, i.e. that the copy of the game state received by the agent is the actual state and is known by all players. In this case the set of actions taken from the root defines the current state from the perspective of MCTS, but allows for the underlying state to be different due, for example, to a stochastic environment or actions of other players that are not tracked in the tree. On every iteration of MCTS the forward model is applied to the game state to advance it from the root through the tree.

  2. Closed_Loop As Open_Loop, but further assumes that the environment is deterministic, and that the same sequence of actions will always lead to exactly the same underlying state (as might be true in Chess, Go, or TicTacToe). Hence the forward model is not applied as we descend the tree. When a node is expanded, the state stored on its parent node is copied, and the forward model applied once to generate the new state for the expanded node, which is then stored. This approach significantly reduces the number of forward model calls made compared to Open Loop - but will generally perform less well despite more iterations if the game is stochastic.

  3. Information_Set A simple form of Information Set MCTS is used. At the start of each iteration the root state is ‘redeterminised’ before the tree phase starts. Redeterminisation shuffles all the unknown information in the state to be a valid full state - or, if you prefer, it samples a full state that is in the current Information Set (i.e. consistent with everything that we can see). Otherwise this is as for Open_Loop, with the root through the tree defined by the sequence of actions taken. This is implemented using the AbstractGameState.copy(playerId) method, which should shuffle any decks we can’t see. See the reference below for more details.

    Whitehouse, Daniel, Edward J. Powley, and Peter I. Cowling. 2011. ‘Determinization and Information Set Monte Carlo Tree Search for the Card Game Dou Di Zhu’. In 2011 IEEE Conference on Computational Intelligence and Games (CIG’11), 87–94. Seoul, Korea (South): IEEE. link.

selectionPolicy

This determines how the final action is selected after MCTS is complete.

ROBUST will pick the action from the root node that has been visited most often.

SIMPLE will pick the action from the root node that has the highest expected score (ignoring any exploration term).

TREE will pick the action from the root node using the same policy as selection during search, but without exploration. This is designed for EXP3 and Regret Matching treePolicy settings.

treePolicy

This determines how MCTS makes decisions as it descends the tree. There are five supported options:

  1. UCB The default Upper Confidence Bound for Trees algorithm.:

$\qquad V(a)=\bar{X} + K\sqrt{\frac{\log(N)}{n_a}}$

Where K is the exploration constant, N is the total number of iterations through the node, and na is the number of times action a has been picked from this node. X(a) is the mean value of taking the action a from the node.

  1. UCB_Tuned This modifies the UCB formula to consider the observed variance of the rewards from rollouts. If this variance is low, then exploration will be reduced compared to UCB. The second reference below found this to be the best from several tested.

    $\qquad V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}\min\left(\frac{1}{4} + \sigma^2\sqrt{\frac{2\log(N)}{n_a}}\right)}$

    Auer, Peter, Cesa-Bianchi, Nicolo and Fischer, Paul, 2002. ‘Finite-time analysis of the multiarmed bandit problem’. Machine Learning 47 (2-3): 235-256. link

    Perick, Pierre, David L. St-Pierre, Francis Maes and Damian Ernst, 2012. ‘Comparison of different selection strategies in Monte-Carlo Tree Search for the game of Tron’. 2012 IEEE Conference on Computational Intelligence and Games (CIG), pp 242-249

  2. AlphaGo Uses a slight tweak on the exploration part of the UCB formula as used by AlphaZero.

    $\qquad V(a)=\bar{X}_a + K\sqrt{\frac{N}{1 + n_a}}$

  3. EXP3 Uses the EXP3 bandit algorithm. This constructs a SoftMax distribution over all possible actions using:

    $\qquad V(a)=\frac{e^{-\bar{X}_a}}{Z}$

Where Z is the required normalisation constant.

  1. RegretMatching Uses Regret Matching as the bandit algorithm. For details on this (plus the EXP3 algorithm) see:

    Lisy, Viliam. 2014. ‘Alternative Selection Functions for Information Set Monte Carlo Tree Search’. Acta Polytechnica 54 (5): 333–340. link

opponentTreePolicy

This has five possible values:

  1. Paranoid This assumes that all other players are acting to minimise our score (so with two players this is Minimax). At opponent nodes in the tree decisions are made using the negative game score.

  2. SelfOnly This ignores other players completely, and only constructs the MCTS tree using our actions. Hence on each node transition there are implied, but unrecorded, opponent actions taken using the specified opponent model (which by default is a random policy). This has the advantage of exploring much deeper forward in the game for a given budget, especially if the branching factor of opponent actions is high. The downside comes in not being able to account for opponent actions that could interfere with the detailed forward plan being constructed.

  3. MaxN This maintains a vector of rewards/node values during back-propagation so that each node records the value of a decision to each player independently. Hence a decision at a node is made just using the value to the deciding player. This is a much less conservative, and potentially more realistic, approach than Paranoid. (Especially in non-zero sum games, or with 3+ players).

  4. MultiTree This constructs a ‘SelfOnly’ tree for each player to improve opponent modelling during MCTS search. Instead of using an explicit opponent model to take actions for the other players, these make decisions using their own tree. Each iteration will traverse all trees in parallel, and add one node to each. This retains the benefit of deeper search with an improved implicit opponent model compared to ‘SelfOnly’, but is cannot learn any action conditionalities.

  5. MultiTreeParanoid This is as ‘MultiTree’, except that the Paranoid assumption is used, with all other players modelled as trying to minimise our score.

  6. OMA ‘Opponent Move Asbtraction’. This constructs a full tree as for MaxN, but also maintains statistics at each node as if SelfOnly were being used (i.e. it abstracts out the opponent moves). In the selection phase a policy is used that wieghts the two sets of statistics. A further parameter of MCTSParameters.omaVisits controls the effective number of visits to use for the Self-Only-like statistics.

  7. OMA_All As for OMA, but this also applies the abstraction to the other players in the tree. OMA only uses the abstraction for the root player’s moves.

For detail on the MultiTree algorithms, see

Goodman, James, Diego Perez-Liebana, and Simon Lucas. “MultiTree MCTS in Tabletop Games.” 2022 IEEE Conference on Games (CoG). IEEE, 2022. link

For Opponent Move Abstraction, see

Baier, Hendrik, and Michael Kaisers. “Guiding multiplayer mcts by focusing on yourself.” 2020 IEEE Conference on Games (CoG). IEEE, 2020. link

exploreEpsilon

This is only used for EXP3 and RegretMatching Tree Policies. It is used to define an ε-greedy exploration policy. I.e. with probability of exploreEpsilon a random decision will be taken at each node instead of using the EXP3 or Regret Matching formulae.

rolloutType

The rollout policy to be used by the root player (for opponent policies use opponentModelType). This is defined with different values of rolloutType, which can then be parameterised further:

  1. RANDOM No further parameterisation possible. This uses the default random rollout policy.

  2. CLASS This requires a class that supports the AbstractPlayer interface, and which will be used as the rollout policy. This class is then specified by the rolloutClass parameter. This must be the fully specified class name. This can either use a no-argument constructor, or a constructor with a single String argument. In the latter case the argument is separated by a ‘|’ from the class name, for a full format of rolloutClass=package.className|constructorArg.

  3. PARAMS This adds more flexibility than CLASS, and allows much more detailed specification of the rollout policy. A further parameter of rolloutPolicyParams must be specified as a full JSON-fragment defining the policy (which must be a class that implements both AbstractPlayer and ITunableParameters. An example makes this clearest:

         "rolloutType" : "PARAMS",
         "rolloutPolicyParams" : {
         	"class" : "players.simple.OSLAHeuristic",
         	"heuristic" : {
         		"class" : "games.loveletter.LoveLetterHeuristic",
         		"COUNTESS_PLAY_THRESHOLD" : 1.0,
         		"FACTOR_CARDS" : 0.5,
         		"FACTOR_AFFECTION" : 1.0
         	},
         },
    
  4. MAST This will use Move Average Sampling to define the rollout policy. This maintains an average of the reward obtained for every action played on this and previous iterations of MCTS. This can then be used to define a softmax policy using these action value estimates. For details of the MAST parameters see later on this page.

opponentModelType

This has exactly the same parameterisation options as for rolloutType. This is used when opponent decision is needed in rollout, or if the Opponent Tree Policy is ‘SelfOnly’.

expansionPolicy

This defines the order in which available actions are expanded during the MCTS Expansion phase. Possible values are:

  1. RANDOM Expand in a random order. This will actually be overridden if an Advantage Function is defined. This uses the advantageFunction parameter, which has the same format as the rolloutClass parameter described under rolloutType above.

  2. MAST This will use the MAST values of actions to determine the order of expansion. See MAST section below for details.

heuristic

This is used to define the heuristic function that should evaluate the leaf state evaluated at the end of each iteration (if this is not a game end situation), and that is back-propagated up the tree. This needs to be a JSON-fragment that defines a class that implements IStateHeuristic and optionally ITunableParameters (if further parameters need to be specified). Two examples are shown below:

```json
    	"heuristic" : {
    		"class" : "games.loveletter.LoveLetterHeuristic",
    		"COUNTESS_PLAY_THRESHOLD" : 1.0,
    		"FACTOR_CARDS" : 0.5,
    		"FACTOR_AFFECTION" : 1.0
    	},
        
        "heuristic" : {
            "class" : "players.heuristics.ScoreHeuristic"
        },
        
```

The default heuristic used is the game-specific getHeuristicScore() method.

MAST

The Move Average Sampling Technique (MAST) was introduced by:

Bjornsson, Yngvi and Hilmar Finnsson, 2009. ‘CadiaPlayer: A Simulation-Based General Game Player’. IEEE Transactions on Computational Intelligence and AI in Games 1 (1): 4-15.

On each MCTS iteration this tracks all the actions taken, and gives them equal credit for the final result of that iteration. This maintains for each action the average result obtained across all iterations that used that action. This can then be used as a game-independent heuristic to guide opponent models, rollout or expansion policies.

There are three MAST parameters that can be specified:

  1. MAST. Values of Rollout, Tree or Both are valid. Defaults to Rollout. This defines which actions are used to maintain the MAST-estimate for an action; just those used in the Rollout phase, Tree phase, or a combination of both.
  2. MASTGamma. Defaults to 0.5. This is the weight used for the MAST data after the previous MCTS iteration; i.e. the effective number of visits to each action is decayed by this factor. Setting this to 0.0 will only use data from the current iteration in MAST.
  3. MASTBoltzmann. Defaults to 0.1. This is the temperature value to be used in any softmax policy over actions defined by the MAST values (as rollout policy or opponent model).

progressiveWidening

Progressive Widening defines which actions are available at a node, and requires an ordering over actions from ‘best’ to ‘worst’. If we have N visits to the node so far, then the number of actions, A, that will be considered in the Expansion/Tree phases is given by:

$\qquad A=\lfloor{W}N^{B}\rfloor$

The two constants, W, and B, are specified by progressiveWideningConstant and progressiveWideningExponent respectively. These both default to 0.0, with no progressive widening. The original algorithm (reference below) also has a kinit parameter for the minimum for A. This is always 0 here; so that in practice W defines the minimum for A, and means we do not ‘unprune’.

As well as these parameters, then either:

  • An advantageFunction must be defined that is the name of a Class that implements ToDoubleBiFunction(AbstractAction, AbstractGameState). A String constructor argument can optionally be provided as with CLASS in rolloutType.
  • Or expansionPolicy must be set to MAST.

Chaslot, Guillaume, Mark H. M. Winands, H. Jaap van den Herik, Jos Uiterwijk and Bruno Bouzy, 2008. Progressive strategies for Monte-Carlo Tree Search. New Mathematics and Natural Computation 4 (03) : 343-357.

progressiveBias

Progressive Bias adds an additional term to the selection equation that uses an external heuristic function (introduced in the same reference as Progressive Widening above).

$\qquad V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}} + \frac{h(s,a)}{1 + n_a}$

The implementation in TAG requires a heuristic function (specified as for Progressive Widening), but deviates from this canonical form.

A parameter biasVisits specifies how much weight to apply to the the heuristic estimate of the action. The larger this is, the longer the effect of the heuristic will last. When the action has been taken a number of times equal to biasVisits, then the weighting between the heuristic and the empirically observed results will be 50:50.

$\qquad V(a)=(1-\beta)\bar{X}_a + \beta{h(s,a)} + K\sqrt{\frac{\log(N)}{n_a}}$

$\qquad \beta=\sqrt{\frac{\text{biasVisits}}{\text{biasVisits} + 3n_a}}$

Computational Budget

MCTS (and RMHC) support three types of computational budget, with the following parameters:

budgetType

  • BUDGET_TIME means we have N milliseconds of elapsed wall time to make each decision.
  • BUDGET_FM_CALLS means we have N forward model calls to make each decision.
  • BUDGET_ITERATIONS means we have N iterations of MCTS to make each decision.

The former two are the most common for comparison of different algorithms. For cross-game comparison it may be better to use a forward model budget as different games can have widely varying forward model efficiencies.

budget

Sets the N used in budgetType above.

breakMS

When using BUDGET_TIME, this will break off processing this number of milliseconds before the timer actually elapses. This is to cater for games that Disqualify an entrant if they take longer than the budgeted time to make a decision. If this is not implemented, then just set this to zero.