MCTS Parameters
TAG currently supports the following variants/parameterisations of MCTS within MCTSParams
(as of March 2022).
Given the options, it is recommended that an MCTS agent be defined by a JSON file, in which values for each of these can be specified. For more details and examples of specifying an MCTS agent in JSON, see this documentation.
 UCB exploration constant, K
 The maximum length of each rollout rolloutLength
 When to terminate each rollout rolloutTermination
 The maximum depth to grow the tree maxTreeDepth
 Information Set MCTS, Open or Closed Loop variants, information
 The final action selection policy, selectionPolicy
 The tree policy to use, treeType
 The policy that other players should use in the tree opponentTreePolicy
 The exploration constant to use for EXP3 and RegretMatching tree policies, exploreEpsilon
 The rollout policy to use, rolloutType
 Any explicit opponent model to use, opponentModel
 The expansion policy to use, expansionPolicy
 The heuristic to use when valuing a leaf state, heuristic
 Move Average Sampling Technique (MAST), MAST
 Progressive Widening, progressiveWidening
 Progressive Bias, progressiveBias
K
The weighting to give the exploration term in the UCB equation. This should be scaled to the game score: if the result of a game is 1 for a win and 0 for a loss, then a value of about 1.0 or sqrt(2) is the standard choice.
This scaling to the score will take place automatically if the normaliseRewards
parameter is true
(which is the default). In this case the largest and smallest final scores seen on all rollouts are tracked during each MCTS iteration, and used to scale all future rewards to the range [0, 1].
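The normalisation described above amounts to min-max scaling against the observed score range. A minimal Python sketch (illustrative only, not the TAG implementation; function and variable names are assumptions):

```python
# Illustrative sketch of reward normalisation: scale a raw rollout reward
# into [0, 1] using the smallest and largest final scores seen so far.
def normalise(reward, low, high):
    if high == low:  # no spread observed yet; avoid division by zero
        return 0.5
    return (reward - low) / (high - low)
```

For example, with observed scores spanning [-20, 80], a raw score of 30 maps to 0.5, so an exploration constant of around 1.0 remains well scaled regardless of the game's native score range.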
rolloutLength
The maximum number of actions to take during the rollout phase of MCTS, i.e. once we leave the tree.
rolloutTermination
This has four possible values. If rolloutLength is greater than zero, the rollout will run for that number of actions, and then terminate once the specified condition is met:
DEFAULT
: Terminate as soon as rolloutLength is reached.
END_TURN
: Terminate only at the end of our turn (‘our’ refers to the root player).
START_TURN
: Terminate only at the start of our turn (before we have taken any actions).
END_ROUND
: Terminate only at the end of a full game round.
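The four termination conditions can be sketched as follows. This is an illustrative simplification (not the TAG code): the turn/round status is modelled with plain flags, and all names are assumptions.

```python
# Illustrative rollout-termination check. The rollout always runs for at
# least rollout_length actions, then stops once the condition holds.
def rollout_should_stop(policy, actions_taken, rollout_length,
                        current_player, turn_ended, round_ended, root_player):
    if actions_taken < rollout_length:
        return False                 # minimum rollout length not yet reached
    if policy == "DEFAULT":
        return True                  # stop immediately
    if policy == "END_TURN":         # stop at the end of the root player's turn
        return turn_ended and current_player == root_player
    if policy == "START_TURN":       # stop when the root player is about to act
        return current_player == root_player and not turn_ended
    if policy == "END_ROUND":        # stop at the end of a full game round
        return round_ended
    raise ValueError(f"unknown policy: {policy}")
```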
maxTreeDepth
The maximum depth to grow the tree. If an MCTS iteration reaches this depth, then it will not expand any nodes beyond this level, and will move to the rollout phase.
information
This has three possible values:

Open_Loop A perfect information game is assumed, i.e. that the copy of the game state received by the agent is the actual state and is known by all players. In this case the set of actions taken from the root defines the current state from the perspective of MCTS, but allows for the underlying state to be different due, for example, to a stochastic environment or actions of other players that are not tracked in the tree. On every iteration of MCTS the forward model is applied to the game state to advance it from the root through the tree.

Closed_Loop As Open_Loop, but further assumes that the environment is deterministic, and that the same sequence of actions will always lead to exactly the same underlying state (as might be true in Chess, Go, or TicTacToe). Hence the forward model is not applied as we descend the tree. When a node is expanded, the state stored on its parent node is copied, and the forward model applied once to generate the new state for the expanded node, which is then stored. This approach significantly reduces the number of forward model calls made compared to Open Loop, but will generally perform less well, despite more iterations, if the game is stochastic.

Information_Set A simple form of Information Set MCTS is used. At the start of each iteration the root state is ‘redeterminised’ before the tree phase starts. Redeterminisation shuffles all the unknown information in the state to give a valid full state; or, if you prefer, it samples a full state from the current Information Set (i.e. one consistent with everything that we can see). Otherwise this is as for Open_Loop, with the path from the root through the tree defined by the sequence of actions taken. This is implemented using the
AbstractGameState.copy(playerId)
method, which should shuffle any decks we can’t see. See the reference below for more details.
Whitehouse, Daniel, Edward J. Powley, and Peter I. Cowling. 2011. ‘Determinization and Information Set Monte Carlo Tree Search for the Card Game Dou Di Zhu’. In 2011 IEEE Conference on Computational Intelligence and Games (CIG’11), 87–94. Seoul, Korea (South): IEEE. link.
selectionPolicy
This determines how the final action is selected after MCTS is complete.
ROBUST
will pick the action from the root node that has been visited most often.
SIMPLE
will pick the action from the root node that has the highest expected score (ignoring any exploration term).
TREE
will pick the action from the root node using the same policy as selection during search, but without exploration. This is designed for EXP3 and Regret Matching treePolicy
settings.
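The ROBUST and SIMPLE choices can be sketched directly (illustrative Python, not the TAG code; TREE is omitted here since it reuses whichever tree policy is configured). Root statistics are assumed to be a map from action to (visit count, mean value):

```python
# Illustrative final-action selection from root-node statistics.
def select_final_action(stats, policy):
    if policy == "ROBUST":
        # most-visited action, regardless of its mean value
        return max(stats, key=lambda a: stats[a][0])
    if policy == "SIMPLE":
        # highest expected score, ignoring any exploration term
        return max(stats, key=lambda a: stats[a][1])
    raise ValueError(f"unsupported policy: {policy}")
```

ROBUST and SIMPLE can disagree: an action with few visits but a high mean is chosen by SIMPLE and rejected by ROBUST, which is exactly why ROBUST is usually considered the safer default.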
treePolicy
This determines how MCTS makes decisions as it descends the tree. There are five supported options:
 UCB The default Upper Confidence Bound for Trees algorithm:
$\qquad V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}}$
Where K is the exploration constant, N is the total number of iterations through the node, and $n_a$ is the number of times action a has been picked from this node. $\bar{X}_a$ is the mean value of taking action a from the node.
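The UCB value is a one-liner; this Python sketch of the formula (not TAG's Java implementation) makes the behaviour of the exploration bonus concrete:

```python
import math

# Illustrative UCB value: V(a) = mean reward + K * sqrt(log(N) / n_a)
def ucb_value(mean_reward, k, total_visits, action_visits):
    return mean_reward + k * math.sqrt(math.log(total_visits) / action_visits)
```

The bonus shrinks as an action accumulates visits, so rarely tried actions are periodically revisited even when their mean reward is mediocre.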

UCB_Tuned This modifies the UCB formula to consider the observed variance of the rewards from rollouts. If this variance is low, then exploration will be reduced compared to UCB. The second reference below found this to be the best from several tested.
$\qquad V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}\min\left(\frac{1}{4},\;\sigma_a^2+\sqrt{\frac{2\log(N)}{n_a}}\right)}$
Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer, 2002. ‘Finite-time Analysis of the Multiarmed Bandit Problem’. Machine Learning 47 (2–3): 235–256. link
Perick, Pierre, David L. St-Pierre, Francis Maes, and Damien Ernst, 2012. ‘Comparison of Different Selection Strategies in Monte-Carlo Tree Search for the Game of Tron’. 2012 IEEE Conference on Computational Intelligence and Games (CIG), pp. 242–249.
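A sketch of the UCB-Tuned value above (illustrative Python, not the TAG code; the per-arm factor is capped at 1/4, the maximum variance of a [0, 1]-bounded reward):

```python
import math

# Illustrative UCB-Tuned: the exploration factor is the smaller of 1/4 and
# the empirical variance plus a confidence margin.
def ucb_tuned_value(mean_reward, variance, k, total_visits, action_visits):
    v = min(0.25, variance + math.sqrt(2 * math.log(total_visits) / action_visits))
    return mean_reward + k * math.sqrt(math.log(total_visits) / action_visits * v)
```

When the observed variance is small the factor drops below 1/4, cutting the exploration bonus relative to plain UCB, which is the effect described above.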

AlphaGo Uses a slight tweak on the exploration part of the UCB formula as used by AlphaZero.
$\qquad V(a)=\bar{X}_a + K\sqrt{\frac{N}{1 + n_a}}$

EXP3 Uses the EXP3 bandit algorithm. This constructs a SoftMax distribution over all possible actions using:
$\qquad V(a)=\frac{e^{\bar{X}_a}}{Z}$
Where Z is the required normalisation constant.
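The SoftMax distribution above is straightforward to sketch (illustrative Python, not the TAG implementation):

```python
import math

# Illustrative EXP3-style distribution: P(a) = exp(mean reward of a) / Z,
# where Z normalises the weights to sum to 1.
def exp3_distribution(mean_rewards):
    weights = [math.exp(x) for x in mean_rewards]
    z = sum(weights)  # the normalisation constant Z
    return [w / z for w in weights]
```

Every action retains non-zero probability, so the tree policy remains stochastic rather than collapsing onto the current best action.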

RegretMatching Uses Regret Matching as the bandit algorithm. For details on this (plus the EXP3 algorithm) see:
Lisy, Viliam. 2014. ‘Alternative Selection Functions for Information Set Monte Carlo Tree Search’. Acta Polytechnica 54 (5): 333–340. link
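For orientation, the core of Regret Matching can be sketched as follows. This is a generic illustration of the idea (see the Lisy reference for the exact variant used in Information Set MCTS): play each action with probability proportional to its positive cumulative regret, or uniformly if no regret is positive.

```python
# Illustrative Regret Matching distribution over actions.
def regret_matching_distribution(cumulative_regrets):
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    n = len(cumulative_regrets)
    if total == 0:
        return [1.0 / n] * n         # no positive regret: play uniformly
    return [p / total for p in positive]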
opponentTreePolicy
This has five possible values:

Paranoid This assumes that all other players are acting to minimise our score (so with two players this is Minimax). At opponent nodes in the tree decisions are made using the negative game score.

SelfOnly This ignores other players completely, and only constructs the MCTS tree using our actions. Hence on each node transition there are implied, but unrecorded, opponent actions taken using the specified opponent model (which by default is a random policy). This has the advantage of exploring much deeper forward in the game for a given budget, especially if the branching factor of opponent actions is high. The downside comes in not being able to account for opponent actions that could interfere with the detailed forward plan being constructed.

MaxN This maintains a vector of rewards/node values during backpropagation, so that each node records the value of a decision to each player independently. Hence a decision at a node is made using just the value to the deciding player. This is a much less conservative, and potentially more realistic, approach than Paranoid, especially in non-zero-sum games or with 3+ players.

MultiTree This constructs a ‘SelfOnly’ tree for each player to improve opponent modelling during MCTS search. Instead of using an explicit opponent model to take actions for the other players, these make decisions using their own tree. Each iteration will traverse all trees in parallel, and add one node to each. This retains the benefit of deeper search with an improved implicit opponent model compared to ‘SelfOnly’, but it cannot learn any action conditionalities.

MultiTreeParanoid This is as ‘MultiTree’, except that the Paranoid assumption is used, with all other players modelled as trying to minimise our score.

OMA ‘Opponent Move Abstraction’. This constructs a full tree as for MaxN, but also maintains statistics at each node as if SelfOnly were being used (i.e. it abstracts out the opponent moves). In the selection phase a policy is used that weights the two sets of statistics. A further parameter of
MCTSParameters.omaVisits
controls the effective number of visits to use for the SelfOnly-like statistics. 
OMA_All As for OMA, but this also applies the abstraction to the other players in the tree. OMA only uses the abstraction for the root player’s moves.
For detail on the MultiTree algorithms, see
Goodman, James, Diego Perez-Liebana, and Simon Lucas. ‘MultiTree MCTS in Tabletop Games’. 2022 IEEE Conference on Games (CoG). IEEE, 2022. link
For Opponent Move Abstraction, see
Baier, Hendrik, and Michael Kaisers. ‘Guiding Multiplayer MCTS by Focusing on Yourself’. 2020 IEEE Conference on Games (CoG). IEEE, 2020. link
exploreEpsilon
This is only used for the EXP3 and RegretMatching tree policies. It defines an ε-greedy exploration policy: with probability exploreEpsilon, a random decision is taken at each node instead of using the EXP3 or Regret Matching formulae.
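The ε-greedy wrapper can be sketched in a few lines (illustrative Python, not the TAG code; function names are assumptions):

```python
import random

# Illustrative epsilon-greedy wrapper: with probability explore_epsilon pick
# uniformly at random, otherwise sample from the bandit-policy distribution
# (e.g. the EXP3 or Regret Matching probabilities).
def epsilon_greedy(actions, policy_distribution, explore_epsilon, rng=random):
    if rng.random() < explore_epsilon:
        return rng.choice(actions)
    return rng.choices(actions, weights=policy_distribution, k=1)[0]
```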
rolloutType
The rollout policy to be used by the root player (for opponent policies use opponentModelType). This is defined with different values of rolloutType, which can then be parameterised further:

RANDOM No further parameterisation possible. This uses the default random rollout policy.

CLASS This requires a class that supports the AbstractPlayer interface, and which will be used as the rollout policy. This class is then specified by the rolloutClass parameter, and must be the fully qualified class name. The class can either use a no-argument constructor, or a constructor with a single String argument. In the latter case the argument is separated from the class name by a ‘-’, for a full format of rolloutClass=package.className-constructorArg. 
PARAMS This adds more flexibility than CLASS, and allows much more detailed specification of the rollout policy. A further parameter of rolloutPolicyParams must be specified as a full JSON fragment defining the policy (which must be a class that implements both AbstractPlayer and ITunableParameters). An example makes this clearest:
```json
"rolloutType" : "PARAMS",
"rolloutPolicyParams" : {
    "class" : "players.simple.OSLAHeuristic",
    "heuristic" : {
        "class" : "games.loveletter.LoveLetterHeuristic",
        "COUNTESS_PLAY_THRESHOLD" : 1.0,
        "FACTOR_CARDS" : 0.5,
        "FACTOR_AFFECTION" : 1.0
    }
},
```

MAST This will use Move Average Sampling to define the rollout policy. This maintains an average of the reward obtained for every action played on this and previous iterations of MCTS. This can then be used to define a softmax policy using these action value estimates. For details of the MAST parameters see later on this page.
opponentModelType
This has exactly the same parameterisation options as rolloutType. It is used when an opponent decision is needed during rollout, or if the Opponent Tree Policy is ‘SelfOnly’.
expansionPolicy
This defines the order in which available actions are expanded during the MCTS Expansion phase. Possible values are:

RANDOM Expand in a random order. This will actually be overridden if an Advantage Function is defined. This uses the advantageFunction parameter, which has the same format as the rolloutClass parameter described under rolloutType above. 
MAST This will use the MAST values of actions to determine the order of expansion. See MAST section below for details.
heuristic
This is used to define the heuristic function that evaluates the leaf state reached at the end of each iteration (if this is not a game-end state), with the value backpropagated up the tree. This needs to be a JSON fragment that defines a class implementing IStateHeuristic, and optionally ITunableParameters (if further parameters need to be specified). Two examples are shown below:
```json
"heuristic" : {
"class" : "games.loveletter.LoveLetterHeuristic",
"COUNTESS_PLAY_THRESHOLD" : 1.0,
"FACTOR_CARDS" : 0.5,
"FACTOR_AFFECTION" : 1.0
},
"heuristic" : {
"class" : "players.heuristics.ScoreHeuristic"
},
```
The default heuristic used is the game-specific getHeuristicScore() method.
MAST
The Move Average Sampling Technique (MAST) was introduced by:
Bjornsson, Yngvi, and Hilmar Finnsson, 2009. ‘CadiaPlayer: A Simulation-Based General Game Player’. IEEE Transactions on Computational Intelligence and AI in Games 1 (1): 4–15.
On each MCTS iteration this tracks all the actions taken, and gives them equal credit for the final result of that iteration. This maintains for each action the average result obtained across all iterations that used that action. This can then be used as a gameindependent heuristic to guide opponent models, rollout or expansion policies.
There are three MAST parameters that can be specified:
 MAST. Values of Rollout, Tree or Both are valid. Defaults to Rollout. This defines which actions are used to maintain the MAST estimate for an action: just those used in the Rollout phase, the Tree phase, or both.
 MASTGamma. Defaults to 0.5. This is the weight used for the MAST data after the previous MCTS iteration; i.e. the effective number of visits to each action is decayed by this factor. Setting this to 0.0 will only use data from the current iteration in MAST.
 MASTBoltzmann. Defaults to 0.1. This is the temperature value to be used in any softmax policy over actions defined by the MAST values (as rollout policy or opponent model).
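The interaction of the three parameters can be sketched as follows (illustrative Python, not the TAG code; the class structure and names are assumptions):

```python
import math

# Illustrative MAST bookkeeping plus the Boltzmann (softmax) policy over
# the resulting action-value estimates.
class MASTStats:
    def __init__(self, gamma=0.5, temperature=0.1):
        self.gamma = gamma              # MASTGamma: per-iteration decay
        self.temperature = temperature  # MASTBoltzmann
        self.totals = {}                # action -> (visit count, reward sum)

    def decay(self):
        """Between MCTS iterations: decay effective visits and rewards by gamma."""
        self.totals = {a: (n * self.gamma, r * self.gamma)
                       for a, (n, r) in self.totals.items()}

    def update(self, action, reward):
        """Credit an action with the final result of the iteration it was used in."""
        n, r = self.totals.get(action, (0.0, 0.0))
        self.totals[action] = (n + 1.0, r + reward)

    def softmax_policy(self, actions):
        """Boltzmann distribution over the MAST value estimates."""
        values = []
        for a in actions:
            n, r = self.totals.get(a, (0.0, 0.0))
            values.append((r / n) if n > 0 else 0.0)
        weights = [math.exp(v / self.temperature) for v in values]
        z = sum(weights)
        return [w / z for w in weights]
```

With gamma = 0.5, the statistics from an iteration k steps in the past carry an effective weight of 0.5^k, so stale estimates fade quickly; gamma = 0.0 keeps only the current iteration's data, as noted above.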
progressiveWidening
Progressive Widening defines which actions are available at a node, and requires an ordering over actions from ‘best’ to ‘worst’. If we have N visits to the node so far, then the number of actions, A, that will be considered in the Expansion/Tree phases is given by:
$\qquad A=\lfloor{W}N^{B}\rfloor$
The two constants, W and B, are specified by progressiveWideningConstant and progressiveWideningExponent respectively. These both default to 0.0, giving no progressive widening. The original algorithm (reference below) also has a k_init parameter setting a minimum for A. This is always 0 here, so that in practice W defines the minimum for A, and means we do not ‘unprune’.
As well as these parameters, then either:
 An advantageFunction must be defined that is the name of a Class that implements ToDoubleBiFunction<AbstractAction, AbstractGameState>. A String constructor argument can optionally be provided, as with CLASS in rolloutType.
 Or expansionPolicy must be set to MAST.
Chaslot, Guillaume, Mark H. M. Winands, H. Jaap van den Herik, Jos Uiterwijk, and Bruno Bouzy, 2008. ‘Progressive Strategies for Monte-Carlo Tree Search’. New Mathematics and Natural Computation 4 (3): 343–357.
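The widening count A = ⌊W·N^B⌋ is easy to sketch (illustrative Python, not the TAG code):

```python
import math

# Illustrative progressive widening: number of actions considered after
# `visits` visits to a node, A = floor(W * N^B).
def widened_action_count(visits, w, b):
    return math.floor(w * visits ** b)
```

For example, with W = 2 and B = 0.5 a node considers 2 actions after 1 visit, 4 after 4 visits and 6 after 9 visits, growing with the square root of the visit count; the default W = B = 0.0 gives A = 0, i.e. widening is off.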
progressiveBias
Progressive Bias adds an additional term to the selection equation that uses an external heuristic function (introduced in the same reference as Progressive Widening above).
$\qquad V(a)=\bar{X}_a + K\sqrt{\frac{\log(N)}{n_a}} + \frac{h(s,a)}{1 + n_a}$
The implementation in TAG requires a heuristic function (specified as for Progressive Widening), but deviates from this canonical form.
A parameter biasVisits specifies how much weight to apply to the heuristic estimate of the action. The larger this is, the longer the effect of the heuristic will last. When the action has been taken a number of times equal to biasVisits, the weighting between the heuristic and the empirically observed results will be 50:50.
$\qquad V(a)=(1-\beta)\bar{X}_a + \beta\,h(s,a) + K\sqrt{\frac{\log(N)}{n_a}}$
$\qquad \beta=\sqrt{\frac{\text{biasVisits}}{\text{biasVisits} + 3n_a}}$
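The weighting above can be sketched directly (illustrative Python of the two formulas, not TAG's Java implementation):

```python
import math

# beta = sqrt(biasVisits / (biasVisits + 3 * n_a)): the weight on the heuristic.
def beta(bias_visits, action_visits):
    return math.sqrt(bias_visits / (bias_visits + 3.0 * action_visits))

# V(a) = (1 - beta) * mean + beta * heuristic + K * sqrt(log(N) / n_a)
def biased_value(mean_reward, heuristic, k, total_visits, action_visits, bias_visits):
    b = beta(bias_visits, action_visits)
    return (1.0 - b) * mean_reward + b * heuristic \
        + k * math.sqrt(math.log(total_visits) / action_visits)
```

Note that at n_a = biasVisits, beta = sqrt(1/4) = 0.5, giving the 50:50 split described above, and beta falls towards zero as the empirical visit count grows.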
Computational Budget
MCTS (and RMHC) support three types of computational budget, with the following parameters:
budgetType
BUDGET_TIME means we have N milliseconds of elapsed wall time to make each decision.
BUDGET_FM_CALLS means we have N forward model calls to make each decision.
BUDGET_ITERATIONS means we have N iterations of MCTS to make each decision.
The first two are the most common for comparison of different algorithms. For cross-game comparison it may be better to use a forward model budget, as different games can have widely varying forward model efficiencies.
budget
Sets the N used in budgetType above.
breakMS
When using BUDGET_TIME, this will break off processing this number of milliseconds before the timer actually elapses. This caters for competition settings that disqualify an entrant if they take longer than the budgeted time to make a decision. If no such rule applies, just set this to zero.
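A budget check covering all three types, including the breakMS margin, can be sketched as follows (illustrative Python, not the TAG implementation; all names are assumptions):

```python
import time

# Illustrative check run at the top of each MCTS iteration: returns True
# while the configured budget still has headroom.
def budget_remaining(budget_type, budget, start_time_ms,
                     fm_calls, iterations, break_ms=0):
    if budget_type == "BUDGET_TIME":
        elapsed_ms = time.time() * 1000.0 - start_time_ms
        return elapsed_ms < budget - break_ms  # stop break_ms early
    if budget_type == "BUDGET_FM_CALLS":
        return fm_calls < budget
    if budget_type == "BUDGET_ITERATIONS":
        return iterations < budget
    raise ValueError(f"unknown budget type: {budget_type}")
```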