Analysis of Computer Backgammon and of a Trading Game

Rich Warren


1. Introduction
One of computer science's founding fathers, John von Neumann, first published a book on game theory in 1944. Since then, many scientists have endeavored to understand the complexities and dynamics of a number of games. This understanding is often so comprehensive that computer models can be constructed to learn how to play the game at a human expert level. Backgammon is one example of a game for which a computer model is capable of human expert play. Backgammon presents a particularly interesting and complex model to study because it forces the player (human and computer alike) to rely predominantly on judgment as opposed to brute-force methods. This paper provides a review of several approaches taken to model an expert-level Backgammon player. An open question is put forth concerning the nature of the environment required for learning a Backgammon evaluation function (and indeed for machine learning in general). The closing paragraphs begin to analyze a much less complex and dynamic system than Backgammon. An outline is presented that argues that while this environment is less complex, it retains enough of Backgammon's essential properties to allow a study into the minimal environmental complexity needed to support co-evolutionary learning.

2. Computer Backgammon
Backgammon is a two-player board game with an ancestral heritage that traces back to Ancient Sumeria around 3000 BC [1]. The game became increasingly popular in the United States around the time of the First World War. Backgammon's popularity surged in the 1970s, and by the end of the decade the first computer player had won an official match against a World Champion. As we will see, today's best computer Backgammon player plays only at the Master level. Until a computer can consistently defend a World Championship, the problem is said to be unsolved. The remainder of this paper discusses the computationally interesting properties and strategies of Backgammon, followed by a review of several of the classic and modern approaches taken to solve the game. These include work on Knowledge Based, Supervised Learning, Reinforcement Learning, Hill Climbing, and Monte-Carlo implementations.

It is assumed that the reader has some familiarity with the Backgammon board, the checkers, and the rules that define legal play, including the objectives and strategies needed to win. A number of URLs provide introductory and comprehensive reviews of Backgammon basics [2][3]. Of the on-line Backgammon servers that exist, the two most relevant to this paper are HC-Gammon [11] and FIBS [12]. HC-Gammon is discussed in detail later in this paper, while TD-Gammon, also discussed later, is sometimes featured on FIBS.

2.1 Computationally Interesting Backgammon Properties
Backgammon is a two-player, turn-based board game. Because it is not possible for both players to win, Backgammon is classified as a zero-sum game. In general terms, Backgammon provides a rich artificial intelligence and machine learning research domain wherein both the rules and the criteria for success are well defined [4, 5, 6, 8]. Although the ultimate success criterion is earning a win, a number of other performance metrics are available. How often the computer strategically hits, blockades, or races an opponent are examples of complex expert strategies which yield interesting assessment data. Bearoff and Doubling Cube analysis can also provide interesting performance measurements. Aside from these analytical qualities of Backgammon, it is also fairly straightforward to implement a computer model of the board, the checkers, the dice, the rules, and the stop-game criteria. It is equally straightforward to implement a legal-move preprocessor that is given the current position and produces the set of legal moves from that position. Each of the approaches discussed here relies on such a preprocessor for legal move sets. Perhaps the more interesting properties that make Backgammon a challenging computer science problem, however, are the large search space, the stochastic element, the predominantly judgmental requirement, and the complexity of expert-level play.

Recall that game play follows a roll of the dice. Because there are 21 possible unique rolls and an estimated 20 possible legal moves per roll, a total of around 400 states can be reached from each starting position [4, 5]. This is an exceptionally large branching factor when compared to a factor of about 10 in checkers and 35 in chess. It has also been estimated that there are 10^20 possible board positions [5], of which a player will evaluate approximately 50-60 per game [7].
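
As a rough check of the branching-factor arithmetic, the short sketch below enumerates the 21 unique rolls and multiplies them out; the 20-moves-per-roll figure is taken as an assumption from the estimate above, not computed.

```python
from itertools import combinations_with_replacement

# The 21 unique rolls: 6 doubles plus 15 unordered non-double pairs.
unique_rolls = list(combinations_with_replacement(range(1, 7), 2))
assert len(unique_rolls) == 21

avg_legal_moves_per_roll = 20   # estimate cited above, not computed here
print(len(unique_rolls) * avg_legal_moves_per_roll)   # ~420, i.e. "around 400" successors
```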

Backgammon's stochastic element, introduced by the dice, has the cumulative effect across time of forcing the player to evaluate a much larger part of the game's search space [7]. Recall that there are approximately 10^20 possible board positions. Forcing a player to consider somewhat random regions of the space has the effect, over time, of presenting a larger number of these 10^20 positions for evaluation. The stochastic element additionally leads to instability in outcome prediction. A single dice roll or improbable sequence of rolls can reverse the expected endgame [9]. This property once allowed BKG, a weak intermediate computer player, to beat a world champion human expert by the score of 7-1. BKG's author notes that "the program did get the better of the dice rolls" [4].

As indicated earlier, Backgammon's search-space branching factor is exceptionally large. When it is considered that each move is executed within approximately 10 seconds or less, even the world's top-ranked experts must rely more on positional judgment than on brute-force search [10]. It is estimated that human experts only have the cognitive capacity to search at most 2-ply. Each of the approaches discussed in this paper takes advantage of this property and is highly motivated to reduce the search space and to develop evaluation functions that, in general, operate on a single ply to judge the overall worth of a given set of legal moves produced by a move generator. Note that the latest publicly available version of TD-Gammon (v2.1) is an exception as it searches up to 2-ply [8]. TD-Gammon 3.0 is complete (but not yet released) and features a 3-ply lookahead [15].

Backgammon also has a number of general properties that, when combined with the more complex properties (i.e., stochastic element, large search space, etc.), force the evolution of expert playing strategies. That the rules guarantee one player or the other will win means that Backgammon has a clear and well defined termination point and thus never ends in a draw. Chess is one example of a popular research game that can end in a draw. Backgammon is a forward-moving game. That is, by the rules of fair play, the checkers may only be moved in a forward manner and may not return to an earlier position as they can in chess. Each of the computer players discussed in this paper demonstrates an ability to appropriately exercise fundamental strategies for game play within the complex environment defined by the above mentioned properties. A few of these strategies include hitting, blockading, and safe play. Much is also known about doubling cube and bearoff strategies. While BKG, one of the programs discussed below, integrated expert knowledge of doubling with the evaluation function itself, Neurogammon dedicated a neural network to learning the strategy and TD-Gammon exploited an external algorithm developed exclusively for making doubling cube decisions [4, 6, 10]. Each of the approaches discussed in this paper also demonstrates use of appropriate bearoff strategy. While BKG relied on a database of 54,000 entries to look up the best move, the remaining approaches learned the strategy through self-play. Note that CD-ROM databases with over 150,000,000 bearoff equities are commercially available [13].

2.2 Classic and Modern Approaches Taken to Solve Backgammon
A review of several of the classic and modern approaches taken to solve Backgammon follows. These include work on Knowledge Based, Supervised Learning, Reinforcement Learning, Hill Climbing, and Monte-Carlo implementations.

2.2.1 BKG
BKG is the earliest computer Backgammon program to reach a significant level of play. The program was developed in 1979 by Hans Berliner to address the problem of positional judgment in large search spaces. The general approach taken by Berliner was to generate and evaluate all legal moves from the current position and output the best move. BKG was a heavily knowledge-engineered system that used domain expertise to derive a number of evaluation factors. Some of these factors include the safety of the pieces, the extent to which the pieces are blockaded, and the degree to which either side is ahead in the race [4]. Notice that these features are quantifiable and may be thought of in terms of probability of being hit, mobility, and overall fitness [5].

BKG Evaluation Functions
Berliner developed three successive evaluation functions. The first of these, a linear polynomial, was limited in its game-playing ability, while the second, based on a categorization of moves, made serious tactical blunders. Berliner's third evaluation function, a non-linear function, reached a level of play equal to that of a human intermediate player.

Polynomial in BKG
Berliner's earliest evaluation function was a linear polynomial in which each term represented a feature and its coefficient represented a weighting of the relative importance of that feature. This evaluation function, however, lacks expressiveness, as information specific to a given feature is lost by averaging its value with the values of the neighboring features. The lack of expressiveness becomes a problem when, for example, a move is determined to be poor by the evaluation function but judged to be a fairly good move by an expert who recognizes the importance of a particularly strong single feature.
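
A minimal sketch of such a linear polynomial evaluation follows; the feature names and coefficient values are hypothetical stand-ins, not BKG's actual terms.

```python
# Hypothetical features and weights; BKG's real terms and coefficients are not reproduced here.
def linear_evaluation(features, weights):
    """Score a candidate position as a weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"hit_probability": -3.0, "blockade_strength": 2.0, "race_lead": 1.5}
candidate = {"hit_probability": 0.25, "blockade_strength": 4.0, "race_lead": 6.0}
# One scalar comes out; any sense of which single feature dominated is lost in the sum.
print(linear_evaluation(candidate, weights))
```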

Categorization of Moves in BKG
A second and improved approach was to characterize classes of Backgammon moves that were each evaluated using different functions. Partitioning the search space in such a manner, however, was not without its drawbacks. Berliner discovered that the difference in the relative values given to neighboring positions was too great; one would expect only a small relative difference in the strength of two fairly similar moves. What Berliner classified as rough boundaries could lead to serious tactical mistakes. When BKG compared, for example, a move that leads to a running game with a move that leads to a blocking game, errors were often made.

Smoothness, Nonlinearity and Application Coefficients (SNAC) in BKG
Rather than fragment the search space, Berliner warped the space [4][17]. This had the effect of assigning a higher weight to certain features in one part of the space and smoothly lessening the weight of those features further out in the space. Because the weights change incrementally as the space is traversed, a good deal of context is preserved; recall that the poor expressiveness of the polynomial function was a particular problem.
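
One way to picture an application coefficient is as a weight that varies smoothly with some global property of the position, so a feature's influence fades gradually rather than switching off at a category boundary. The sketch below is illustrative only; the "race progress" measure and the linear fade are assumptions, not Berliner's actual formulation.

```python
def application_coefficient(race_progress):
    """Smoothly scale a feature from full weight (contact game) to none (pure race),
    instead of a hard category switch. race_progress is assumed to run 0.0 -> 1.0."""
    return max(0.0, 1.0 - race_progress)

def blockade_term(blockade_strength, race_progress, base_weight=2.0):
    # The blockade feature matters less and less as the game becomes a race.
    return base_weight * application_coefficient(race_progress) * blockade_strength

print(blockade_term(4.0, 0.1))   # early game: blockade still heavily weighted
print(blockade_term(4.0, 0.9))   # late race: nearly ignored, but never a sudden jump
```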

Berliner claims that his SNAC function is aware of 30 Backgammon strategies and that the program keeps track of the importance of each strategy for a given position [4]. He points out that at any time one strategy might be given the highest priority while others can be completely ignored. This has the desired effect of gradually shifting strategies when appropriate rather than making an immediate change in strategy. Recall that a large difference in the strength of two equally valid moves led Berliner to abandon the use of categories of moves.

BKG Discussion
BKG reached an intermediate level of game play. Because it is a heavily knowledge-engineered system (as opposed to a learning system), the best its creator could hope for is a level of play on par with that of the human expert who provided the knowledge. Such systems are also at a deficit because expert strategies will likely continue to evolve and improve. As this happens, BKG's feature sets would have to be updated to reflect the change in expert strategies.

As BKG was the first computer Backgammon player, Berliner also demonstrated an early ability to knowledge-engineer a doubling cube feature set. Berliner recognized that the decision to double or not can be made by examining the moves that would immediately follow the double. Recall that there are approximately 400 moves from a given position. Berliner placed each of the resulting 400 positions into one of three categories: too strong, not good enough, and just right. Berliner's doubling cube evaluation function is then based on which of the categories contains the majority of possible moves.
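
A hedged sketch of that majority-vote doubling rule follows; only the tallying logic comes from the text, while the equity thresholds used to categorize a position are hypothetical.

```python
from collections import Counter

def categorize(equity):
    """Hypothetical thresholds; BKG's actual categorization was knowledge-engineered."""
    if equity > 0.80:
        return "too strong"        # opponent would simply drop the double
    if equity < 0.55:
        return "not good enough"
    return "just right"

def should_double(follow_up_equities):
    """Double when the majority of the ~400 follow-up positions are 'just right'."""
    counts = Counter(categorize(e) for e in follow_up_equities)
    majority_category, _ = counts.most_common(1)[0]
    return majority_category == "just right"

print(should_double([0.60, 0.62, 0.70, 0.90, 0.40]))   # True: "just right" holds the majority
```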

2.2.2 Neurogammon
Neurogammon is a classic computer Backgammon player because it is the first example of a neural network approach to the game. The program is an existence proof that standard three-layer feed-forward neural network technology can, with supervised learning, generalize on expert knowledge and perform well on tasks which require good judgment. The program has the added distinction of being the first learning program ever to win a Backgammon tournament and reach an intermediate level of play. At the time of Neurogammon's development, the authors believed that supervised learning provided a greater quantity and quality of information (made available by the teacher) [5]. It is then interesting to note that Neurogammon's successor, TD-Gammon, uses reinforcement learning and has reached Master-level play.

Evaluation Function in Neurogammon
Neurogammon's evaluation function is based on a neural network topology consisting of a single network dedicated to the doubling cube and six networks that make the move decision [6]. Which of the six networks makes the decision depends on the phase of the game. While the doubling network is trained on a set of 3000 examples, the move networks are trained on a set of 32000 positions collected from 400 games. The training set is created by placing a heavier weight on the move that the expert would make and a lesser weight on the remaining legal moves; in other words, of the set of possible moves (around 400), exactly one move receives the higher weight. Tesauro and Sejnowski drew their knowledge from a number of sources, including published material on the analysis of Backgammon moves, games the system played against itself, and games played against Tesauro, who is himself an expert player.

The input representation consisted of the current state of the board, the set of legal moves (generated by a preprocessor function), and their respective scores as assigned by a human expert. The majority of the 3200 moves were not evaluated by the expert and so were assigned an initial score of zero. Of those moves that were evaluated, most led to engagement with the opposing force, one thousand were running moves, some were bearoff moves, and often one move per board position was evaluated as poor. To enhance the program's performance on certain classes of moves, a feature set of about 150 board configurations was developed. A typical Neurogammon mistake is made when, for example, a move leading to a holding or blockading strategy is chosen instead of the correct one that results in hitting the challenger's blot. Some of the features used to quantify the relative strength of two opposing strategies include the strength of a blockade and the probability of being hit.
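
A minimal sketch of how such a training set might be assembled, assuming a simple relative-score encoding in which the expert's chosen move receives a high target value and every other legal move a default of zero; the actual encoding and scoring scheme used for Neurogammon is not reproduced here.

```python
def build_training_pairs(move_feature_vectors, expert_index,
                         expert_score=1.0, default_score=0.0):
    """move_feature_vectors: one input vector per legal move's resulting position.
    Hypothetical encoding: the expert's choice (expert_index) is weighted up,
    every other legal move gets the default score."""
    return [(features, expert_score if i == expert_index else default_score)
            for i, features in enumerate(move_feature_vectors)]

# Usage: three candidate moves, the expert picked the second one.
examples = build_training_pairs([[0.1, 0.3], [0.5, 0.2], [0.0, 0.9]], expert_index=1)
print(examples)
```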

Neurogammon Discussion
While the most frequent mistakes made by Neurogammon were in bearing off, many were also made with blots and with escaping a blockade. At least some of these mistakes were obvious "blunders." The authors believe that these mistakes can be corrected, to an extent, through additional training with hand-crafted examples. More importantly, however, the authors feel that the lack of any notion of timing or point coverage is fatal and will in any case prohibit an expert level of play. In the final analysis, Neurogammon won 35-40% of the games it played against human experts and 60% of the games against a benchmark computer Backgammon player. To its credit, however, Neurogammon did seem to generalize on several strategies including running, hitting, and maintaining a prime formation (though not always when appropriate).

2.2.3 TD-Gammon
Gerry Tesauro's TD-Gammon is the first computer Backgammon program to learn the rules and strategies for good play without human intervention. Because BKG is knowledge intensive and Neurogammon is a supervised-learning neural net, both require human expert knowledge. Furthermore, one can expect that the strategies human experts employ in Backgammon will undergo change as cumulative knowledge of the game's dynamics increases [8]. This change in the weightings experts assign to particular strategies would have to be retrofitted into knowledge or expert systems developed prior to TD-Gammon. Using the temporal difference, or TD-lambda, reinforcement learning algorithm originally developed in the 1970s, Tesauro demonstrates that it is possible for a computer to learn an intermediate evaluation function without human intervention [7]. Tesauro also demonstrates that TD-Gammon is able to learn a nearly optimal evaluation function through a combination of reinforcement learning and knowledge of the expert feature sets developed for Neurogammon.

Learning the Evaluation Function Without Knowledge in TD-Gammon
TD-Gammon's state is initialized with a random strategy, the position of the checkers, and their color. The program is never shown human knowledge of the game's rules or strategies. At each time step the network generates a move based on its expected outcome (sometimes referred to as the predicted outcome or equity estimate). After several hundred to several thousand time steps the actual outcome is known and the network weights are adjusted prior to the next game. Notice that because the game is played to its end before the weights are adjusted, a time difference equal to the number of turns expires. What has become known as the temporal credit assignment problem arises in assessing the incremental contribution (good or bad) each of the inputs and outputs made to the final outcome [8]. The TD-lambda learning rule, based on the difference between successive predictions, specifically addresses this assignment problem [14].
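
A minimal sketch of a TD(lambda) update for a linear value estimate, assuming a single weight vector and a scalar reward available only at the end of the game; TD-Gammon applies the same idea through a multilayer network's gradients, which is not reproduced here.

```python
import numpy as np

def td_lambda_episode(states, final_outcome, weights, alpha=0.1, lam=0.7):
    """One game of TD(lambda) updates for a linear value function.
    states: list of feature vectors, one per time step.
    final_outcome: 1.0 for a win, 0.0 for a loss (known only when the game ends)."""
    trace = np.zeros_like(weights)                    # eligibility trace
    for t in range(len(states)):
        v_t = float(weights @ states[t])
        v_next = (float(weights @ states[t + 1])      # the next prediction...
                  if t + 1 < len(states) else final_outcome)   # ...or the actual outcome
        delta = v_next - v_t                          # temporal-difference error
        trace = lam * trace + states[t]               # recent states earn more credit
        weights = weights + alpha * delta * trace     # spread credit back through time
    return weights

# Toy usage: a two-step "game" that ends in a win, starting from zero weights.
w = td_lambda_episode([np.array([1.0, 0.0]), np.array([0.5, 0.5])], 1.0, np.zeros(2))
```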

TD-Gammon's network training typically takes on the order of hundreds of thousands of games for its performance to reach a level equivalent to that of a strong intermediate player, or to that of its predecessor Neurogammon. Although this version was not given knowledge of the existence of a doubling cube, the network learned many elementary strategies such as hitting and building points. The authors also show that as the network topology is expanded and the number of training epochs increased, improved performance (i.e., automatic feature discovery) is achieved. This data suggests the architecture is scalable.

Learning the Evaluation With Knowledge in TD-Gammon
As in the previous learning paradigm, TD-Gammon's state is initialized with a random strategy, the position of the checkers, and their color. Unlike the previous learning method, however, the evaluation function is given expert knowledge of the hand-crafted feature sets that were created for Neurogammon. The expert knowledge has a large impact, driving TD-Gammon's performance to Master-level or World-class play. Doubling cube functionality is added to the system, which enables competition against human players. The cube's function is based on the expected outcome of the game and a set of formulas developed a decade earlier. The most recent public version of TD-Gammon (Version 2.1) is further modified to look ahead 2-ply [8][10]. It has been reported that TD-Gammon 3.0 is complete (but not yet released) and features a 3-ply lookahead that demonstrates improved performance over version 2.1 [15].

TD-Gammon Discussion
The majority of technical errors made by TD-Gammon are found in its endgame play and are classified by Tesauro as practically inconsequential. Since the functionality of the doubling cube is under the control of a mathematical formula, it would be interesting to evaluate game improvement when the endgame is placed under the control of a similar look-up function. The author believes that continued training (of possibly larger nets) and additional human expert knowledge may lead to improved game play. While Tesauro believes that a finer-grained input representation may also improve game play, he is uncertain about the principles of TD-Gammon's success. One contributing factor may prove to be the stochastic element introduced by the dice. Recall that with every roll there are approximately 400 positions to consider. This state-space property establishes a lower bound on the width of the search space and guarantees sufficient space to allow the network to generalize. It may prove, however, that the stochastic element combined with reinforcement learning forces the network to search space that other approaches may ignore; the reinforcement method pushes the network to explore more deeply and try new strategies.

Note that some in cognitive science disagree with the principles of reinforcement learning, regarding it as a well-understood cognitive paradigm that has long been discredited [9]. Additionally, since the reinforcement signal is considerably less detailed than the signals backpropagated through a net under supervised learning, a greater number of training epochs are required. Finally, the underlying principles that explain how TD-Gammon plays at the Master level are poorly understood. The author suggests that contributions toward better understanding the science behind TD-Gammon are necessary. As discussed below, the hill-climbing learning approach is motivated by Tesauro's suggestion.

2.2.4 HC-Gammon
The Hill-Climbing, or coevolved, solution to Backgammon is a recent approach developed within the last couple of years. Authors Pollack et al. describe a neural network system that exploits hill-climbing rather than the generalized delta, reinforcement, or temporal difference alternatives [9]. HC-Gammon does, however, adjust the network weights if and only if the game ends in a win for the challenger. Recall that Tesauro's approach is also based on the outcome of self-play [8]. While the HC-Gammon fitness function won fewer than half of the games that it played against a public domain linear evaluation function released by Tesauro [16], the approach has made the authors' point and raised questions about the contributing factors that lead to TD-Gammon's success.

Evaluation Function in HC-Gammon
The neural network topology consists of 197 input units that represent the number of checkers on each point and their color, the number of checkers on the bar, and whether or not the game is in the endgame phase. The input layer feeds a total of twenty hidden units, which in turn feed a single output unit that represents the evaluation of the current position. This topology is based on Tesauro's TD-Gammon network.
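
A sketch of a feed-forward evaluator with the topology just described (197 inputs, 20 hidden units, a single output), written as plain matrix arithmetic; the sigmoid nonlinearity and the exact input encoding are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class HCEvaluator:
    """197-20-1 feed-forward network; HC-Gammon begins from zero-valued weights."""
    def __init__(self, n_inputs=197, n_hidden=20):
        self.w1 = np.zeros((n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.w2 = np.zeros(n_hidden)
        self.b2 = 0.0

    def evaluate(self, board_vector):
        """board_vector: 197 raw inputs (checkers per point and color, bar, endgame flag)."""
        hidden = sigmoid(self.w1 @ board_vector + self.b1)
        return sigmoid(self.w2 @ hidden + self.b2)   # estimated worth of the position
```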

The network was trained against an evaluation function mutated from the current function. If the mutated function won more than half of the games, it became the new evaluation function, replacing the less competent one. The problem with this method, however, is that playing enough games to thoroughly test a given function is not computationally feasible. Evaluation functions that had good potential were, as a result, sometimes replaced by merely lucky ones.

To correct for this, the authors programmed the system to play pairs of games. The current evaluation function would make the opening move of the first game while the mutated function opened the second game. Additionally, the authors modified the replacement algorithm so that the current evaluation function was not swapped with the challenger; rather, fine adjustments were made to move it in the direction of the mutated function. These adjustments were made if the mutant evaluation function won 3 out of 4 games. After 20,000 generations of mutated evaluation functions, this method of hill-climbing led to a system capable of beating the benchmark move generator close to 35 percent of the time, after which performance began to drift downward. The authors conjectured that the drift was a response to the increased frequency of adjustment as the two evaluation functions became nearer to each other. Recall that this version made incremental adjustments to the current function, as opposed to the simple replacement algorithm used earlier.

In an attempt to solve the drift problem, the authors introduced an annealing schedule and an increased number of generations. Rather than requiring the challenger to win 3 out of 4 games, beyond 10,000 generations it must win 5 out of 6 games, and beyond 70,000 generations, 7 out of 8. This solution had the desired effect of eliminating the drift explained earlier. In addition, the overall performance is reported to have slightly increased, to a level where HC-Gammon wins roughly 40 percent of the matches against Tesauro's public evaluation function. It is also interesting to note that in endgame analysis, Hill-Climbing Backgammon came, on average, within a couple of rolls of matching PUBEVAL's bearoff performance.
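
A hedged sketch of the overall mutate / play-pairs / nudge loop, including the annealed acceptance threshold; the mutation noise, the size of the nudge toward the challenger, and the `play_match` routine standing in for actual paired Backgammon games are all assumptions.

```python
import numpy as np

def win_threshold(generation):
    """Annealed acceptance rule, loosely following the schedule described in the text."""
    if generation > 70000:
        return 7, 8            # must win 7 out of 8
    if generation > 10000:
        return 5, 6            # must win 5 out of 6
    return 3, 4                # must win 3 out of 4

def hill_climb(weights, play_match, generations=100000,
               mutation_scale=0.05, nudge=0.05, seed=0):
    """play_match(champion, challenger, n_games) -> challenger's win count (hypothetical)."""
    rng = np.random.default_rng(seed)
    for g in range(generations):
        challenger = weights + rng.normal(0.0, mutation_scale, size=weights.shape)
        needed, total = win_threshold(g)
        if play_match(weights, challenger, total) >= needed:
            # Nudge the champion toward the challenger instead of replacing it outright.
            weights = weights + nudge * (challenger - weights)
    return weights
```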

Hill-Climbing Discussion
Recall that Tesauro conjectured that the dynamics of reinforcement learning combined with the stochastic nature of Backgammon may account for his system's excellent performance. Consider for the remainder of this discussion that TD-Gammon, without access to Neurogammon's feature set, played equally as well as Neurogammon. This means that that version of TD-Gammon won 35-40 percent of the games played against human experts and 60 percent of the games played against Gammon Tool. While TD-Gammon is initialized with a random set of weights, HC-Gammon is initialized with zero-valued weights. Both implementations are fed only raw state data and both learn evaluation functions in a stochastic environment. Most importantly, however, both systems learn the rules and strategies for good play by playing games against themselves. This introduces the idea that the evaluation function, in the case of both TD-Gammon and HC-Gammon, is itself dynamic and coevolving. That is, the evolving evaluation function learns from a teacher that is itself undergoing evolution.

HC-Gammon reached a level of play where it won 40 percent of the games played against Tesauro's PUBEVAL evaluation function. The only substantial difference between TD-Gammon and HC-Gammon is found in the learning paradigm. That both systems reach levels of performance somewhat close to one another in part answers Tesauro's question concerning the contributing factors to the success of TD-Gammon. Tesauro wrote that it is perhaps the reinforcement learning method coupled with the stochastic nature of Backgammon that leads to TD-Gammon's success. HC-Gammon shows, however, that hill-climbing in a relative fitness environment combined with the stochastic nature of Backgammon can lead to somewhat similar success. The argument is made that the stochastic element and the dynamics of the evaluation function are the more important factors in the success of learning an evaluation function. One HC-Gammon review notes that the hill-climbing algorithm is much like a genetic algorithm using weighted recombination on a population size of two [18].

2.2.5 Monte-Carlo Approach
The Monte-Carlo approach is Tesauro and Galperin's continuing research project aimed at exploiting parallel computation to address the large number of training epochs required in reinforcement learning. Although the project is in its infancy, the authors have published promising preliminary results and expect that, ultimately, this approach will outperform the knowledge-based 3-ply lookahead version of TD-Gammon [10].

Monte-Carlo Evaluation Function
The idea behind the Monte-Carlo evaluation function is to start with a baseline function and create a new function that is guaranteed to improve the worst-case expected outcome of the baseline function. The authors construct three two-layer feed-forward baseline networks (linear evaluation functions) for comparative analysis. Input to all three baseline networks is the raw state data discussed throughout this paper. The first baseline produced a level of play comparable to a beginning intermediate player, while the second was intentionally crippled to simulate a novice or weak beginner-level evaluation function. The third baseline network was given access to hand-crafted feature sets (probably derived from Neurogammon) and thus played somewhat stronger than the first network.

Monte-Carlo players were derived from each of the three baseline networks. The algorithm is highly parallel in nature, so many Monte-Carlo simulations per benchmark can execute in tandem. After five thousand trials, the Monte-Carlo players were matched against a baseline configuration of TD-Gammon. The TD-Gammon baseline was fed only raw state data, did not have access to the expert feature sets, and was restricted to searching only a single ply. Analysis of the game data collected shows that the first two Monte-Carlo players learned to achieve a performance level about equal to the baseline TD-Gammon player. The third, knowledge-privileged, Monte-Carlo player outperformed the TD-Gammon player and is estimated to play at a level comparable to the strong expert level achieved by TD-Gammon equipped with 2-ply search and knowledge. None of the players (NN baseline, Monte-Carlo, or TD-Gammon) had knowledge of the doubling cube, so related performance measurements were not collected.
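
A minimal sketch of rollout-style move selection under stated assumptions: `legal_moves`, `apply_move`, `simulate_game`, and the baseline evaluator used to play out each trial are hypothetical stand-ins, and in the actual system the trials run in parallel across processors.

```python
def rollout_score(position, baseline_policy, simulate_game, trials=500):
    """Average outcome of many trial games played out from `position` by the baseline policy."""
    return sum(simulate_game(position, baseline_policy) for _ in range(trials)) / trials

def choose_move(position, legal_moves, apply_move, baseline_policy, simulate_game):
    """Score each candidate move with independent rollouts and keep the best."""
    return max(legal_moves,
               key=lambda m: rollout_score(apply_move(position, m),
                                           baseline_policy, simulate_game))
```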

Monte-Carlo Discussion
While the initial results of this Monte-Carlo approach show promise, the authors speculate that if their current implementation were to lead to a consistent World Class Champion (i.e., "solve" Backgammon), then on the order of 200 seconds per move would be required. This is far in excess of the 10 or so seconds seen in tournament play. The authors claim to have made some progress, however, in optimizing the Monte-Carlo algorithm for run-time performance. One such optimization, statistical pruning, ensures that when one option is clearly dominant over another, the weaker of the two options is pruned and thus ignored throughout the remainder of the simulation. A second patch to increase run-time performance is to place an upper bound on the length of play. Whereas previously the Monte-Carlo data was collected over an entire game, this new approach limits the game length. Typically, this upper bound is in the neighborhood of 10 percent of a full-length trial.
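
The statistical pruning described above might look something like the following sketch, which drops any candidate whose running average trails the current leader by more than a few standard errors; the exact criterion and margin used by the authors are not reproduced here, so this is an assumed form only.

```python
import statistics

def prune_dominated(scores_per_move, margin=3.0):
    """scores_per_move: {move: list of rollout outcomes collected so far}.
    Keep only moves whose mean is within `margin` standard errors of the leader."""
    def mean_and_se(xs):
        se = statistics.stdev(xs) / len(xs) ** 0.5 if len(xs) > 1 else float("inf")
        return statistics.mean(xs), se
    stats = {m: mean_and_se(xs) for m, xs in scores_per_move.items()}
    best_mean = max(mean for mean, _ in stats.values())
    return {m: xs for m, xs in scores_per_move.items()
            if stats[m][0] + margin * stats[m][1] >= best_mean}

# Only the first (hypothetical) move survives; the second cannot catch the leader.
survivors = prune_dominated({"8/4 6/4": [1, 1, 0, 1], "24/20 13/11": [0, 0, 0, 0]})
```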

2.3 Computer Backgammon Conclusions
A number of approaches taken to solve computer Backgammon have been discussed. The first, BKG, is clearly a knowledge engineering approach, while the second, Neurogammon, is a supervised-learning expert system. While Neurogammon does in fact generalize on the training set presented to it and thus can be said to learn, it nonetheless suffers from the same fundamental problem that plagues BKG, namely knowledge encapsulation. That is, it remains a difficult problem to abstract features and their dynamics from a human expert and represent them in such a way that a computer can use them to equal or surpass the human expert's capacity for judgment [8].

Perhaps the two more interesting approaches discussed are TD-Gammon and HC-Gammon. Recall that versions of TD-Gammon prior to 1.0 did not use hand-crafted expert knowledge and reached a level of play approximately equal to the intermediate play that Neurogammon demonstrates. Also recall that HC-Gammon reached a level of play where it won 35-40% of its matches against a popular public domain intermediate-level evaluation function. The interesting point is that both game machines were initialized with a set of random or zero weights, were shown no expert knowledge, and yet were able to learn, through self-teaching, a respectable evaluation function for positional judgment. Because the approaches taken by the two authors are technologically disparate, questions arise concerning the underpinnings of learning in judgmental tasks.

Tesauro conjectures that a stochastic element is critical to TD learning and that it is this element that empowers his reinforcement learning algorithm to successfully acquire the judgmental capacity needed to play Backgammon. HC-Gammon authors Pollack et al., however, argue that the structure of the learning task combined with the dynamics of an environment wherein relative fitness is quantifiable were the critical elements of TD-Gammon's success. To test their theory, Pollack et al. replaced the TD algorithm with simple hill-climbing and demonstrated a level of play close enough to that of the baseline TD to call into question the importance of reinforcement learning in judgmental tasks such as those required to play Backgammon.

Before we make a large-scale investment in the simple but promising approach of co-evolutionary learning with hill-climbing for complex judgmental tasks, however, a better understanding of the properties of dynamic relative-fitness environments is needed. It is, after all, these properties that the authors Pollack et al. claim allow co-evolutionary learning to proceed [9]. As the game of Backgammon represents a highly complex system, however, it may prove difficult to accurately understand, articulate, and evaluate all of the many dynamics at play. This author's question, then, is: "What minimal complexity or minimal set of dynamics is necessary and sufficient to allow co-evolutionary learning with hill-climbing to proceed?" In the section that follows, a proposal is made to cast a predominantly judgmental betting game in terms that might allow exploration for answers to this central question. The Stock Options betting game offers a somewhat less complex and dynamic environment than Backgammon yet retains a subset of the elements that Pollack et al. argue are necessary for co-evolutionary learning.

3.0 Stock Options Betting Game
Although time and space requirements do not allow for a comprehensive and in-depth discussion of the complexities and dynamics of options trading, an attempt is made here to sketch out several initial and as yet unsupported thoughts (disclaimer).

3.1 Analysis
Analysis of the Stock Options Betting (SOB) game is motivated by a need to better understand the nature of the requirement for complexity and dynamics in a co-evolutionary hill-climbing learning environment. The co-evolutionary approach met with success in the Backgammon environment, but its authors, Pollack et al., state that they are uncertain about which of the many properties of that environment contributed to the success of their project. Figure 1 outlines many of these properties (discussed in depth above) for review. If it turns out that SOB retains the most significant of these properties but is sufficiently less complex and less dynamic than Backgammon, then it should make a suitable environment in which to explore the minimal complexity and minimal dynamics necessary for co-evolutionary hill-climbing learning.

1. two-player game
2. zero-sum game
3. stochastic element from dice
4. game cannot end in a draw
5. well defined set of game play rules
6. well defined success criteria
7. several complex strategies
8. self-play learning
9. random initial strategy
10. raw state data for input
11. position features
12. extremely large search space
13. predominantly judgmental task

Figure 1. Computer Backgammon Properties

SOB is a two-player game wherein one player sells an option and the other purchases it. Unlike Backgammon, however, SOB is not necessarily a zero-sum game. Both players can win if, for example, one player sells a $70 option for $5 and the stock reaches $80 by its expiration date. In this case the seller earned the $5 premium for selling the option, and the buyer earned $10 for the right to buy at $70 and sell at $80 (a $5 net gain after the premium paid).
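
To make the arithmetic explicit, here is a small sketch of the payoff accounting under the simplified assumptions used in the example (a single call option; the seller's result is just the premium collected, as in the text's framing).

```python
def call_option_outcome(strike, premium, price_at_expiration):
    """Simplified accounting for the example above."""
    exercise_value = max(0.0, price_at_expiration - strike)   # worth of the right to buy at the strike
    buyer_net = exercise_value - premium                      # the buyer paid the premium up front
    seller_net = premium                                      # the seller keeps the premium (text's framing)
    return buyer_net, seller_net

# The example: a $70 option sold for $5, with the stock at $80 at expiration.
print(call_option_outcome(70.0, 5.0, 80.0))   # (5.0, 5.0): both sides come out ahead
```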

Perhaps the most significant of the properties listed in the figure is that of a predominantly judgmental task. The goal of a Backgammon evaluation function is to judge approximately 400 board positions and to output the one with the highest score. An evaluation function for SOB would only have to evaluate a few dozen positions, representing the different options available for a given month, and output the best several options. A reduction from 400 to about 24 positions in the search space represents a significant smoothing of the complexity of the system.

Recall from earlier discussion that a number of Backgammon features must be learned in order to evaluate a position for the most appropriate strategy. Furthermore, it was shown that the relative strengths of these features combine nonlinearly. While a SOB evaluation function is also based on a set of nonlinear features, they are fewer in number and thus create less overall complexity. This is especially true when viewed in the absence of a regular stochastic element.

The stochastic element in Backgammon is introduced by the dice and affects the regions of the search space considered. Recall also that a particularly lucky dice roll can have the effect of reversing each player's expected outcome. SOB has a stochastic-like element that affects the expected outcome, but this element is introduced to the system at irregular intervals. Experts believe, for example, that the recent Hong Kong stock market crash also caused the US market to fall. When Iraq sank an oil freighter in the Strait of Hormuz and effectively cut oil supply to the US, the stock market was significantly impacted. These are both examples of pseudo-random phenomena that exert a degree of randomness on the market.

3.2 Conclusion
Recall once again that Pollack et al. assert that the success of their hill-climbing system is due in part to the co-evolution of training and in part to the dynamics of their environment. Near removal of the stochastic element and a radical reduction in the size of the search space, combined with significantly less complex playing strategies, would necessarily bring the dynamics nearer to a minimum. By holding the co-evolution in training constant and flattening or minimizing the environmental dynamics, one can begin to more closely identify those properties that allow learning to occur. The proposed system, SOB, may present a suitable candidate environment in which to study the minimum requirements for co-evolutionary learning.


References
[1] Masters, J., A History Of Traditional Games: Backgammon, HTML Page
[2] Turner, S., The WWW Backgammon Page, HTML Page
[3] Keith, T., Backgammon Galore, HTML Page
[4] Berliner, H., 1980. Computer Backgammon. Sci. Am. 243, 64-72
[5] Tesauro, G., and Sejnowski, T.J. 1989. A parallel network that learns to play backgammon. Artificial Intelligence 39, 357-390.
[6] Tesauro, G., 1989. Neurogammon wins Computer Olympiad. Neural Computation 1, 321-323.
[7] Tesauro, G., 1994. TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play. IBM Thomas J. Watson Research Center, HTML Page
[8] Tesauro, G., 1995. Temporal Difference Learning and TD-Gammon. Communications of the ACM. 38, 58-68. HTML Page
[9] Pollack, J., Blair, A., and Land, M., 1996. Coevolution of a Backgammon Player. Proceedings of the Fifth Artificial Life Conference. May, Nara, Japan. HTML Page
[10] Galperin, G., and Tesauro, G., Parallel Monte-Carlo Simulation of Neural Network Controllers. HTML Page
[11] Pollack, J., Blair, A., and Land, M., HC-Gammon, HTML Page
[12] Schneider, A., First Internet Backgammon Server. HTML Page
[13] Sconyers, H., Bearoff Equities and Backgame Probabilities. Gammon Press. v1 HTML Page
[14] Tepandi, J., Temporal difference learning and TD-Gammon, ACM Review., Communications of the ACM. HTML Page
[15] Scott, J., Gerald Tesauro's TD-Gammon. HTML Page
[16] Tesauro, G., Public Evaluation Function (PUBEVAL) Source Code, HTML Page
[17] Berliner, H., Principal Research Scientist. HTML Page
[18] Scott, J., The Hillclimbing HC-Gammon. HTML Page


(c) Richard G. Warren - all rights reserved