Reinforcement Learning - Tic Tac Toe Game


About Reinforcement Learning


Reinforcement learning is the process by which a program interacts with an environment and learns from numerical rewards or punishments while trying to achieve a goal. In the case of games, the goal is to win. The environment is the “where” and “how” in which a program using reinforcement learning interacts with its opponents in the game.
The reward/punishment is a scalar value r, where r ∈ ℝ. The reward received at time t is written rt. Time indexes the levels of a state tree: each action the program takes moves it to the next time step. The reward is usually given at time t + 1 rather than at time t because it is received after the agent has selected an action and arrived at the next state, st+1. One basic rule about the reward system is that it should represent the goal of the game, not how you, the programmer, might want the program to go about winning. This will be explained shortly, but first a little more information on states and actions.

A state is represented by s, where s ∈ S and S is the set of all possible states. While playing the game, the program is in a state and is given a choice of actions, represented as a, where a ∈ A and A is the set of all possible actions. Depending on the program’s decisions and actions, it makes a transition from state st to state st+1. The diagram below shows an example of a possible state-action tree. An agent is the part of the program that decides which paths to take, given a state and its actions, in order to maximize the rewards it can receive. The maximization of rewards is defined through the expected return, Rt = rt+1 + rt+2 + … + rT, where T is the last time step. The agent has no control over the environment or the rewards it receives. The agent may have knowledge of everything in the environment, or its knowledge may be limited; in the case of card games, for example, the agent should not be able to see what the other players are holding. The agent follows a policy that helps it find paths that give high rewards.
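
To make the return concrete, here is a minimal Python sketch that sums the rewards of one finished game into Rt. The reward scheme (+1 for a win, 0 for every other step) is an assumption made for the example and is not taken from the program.

    # A minimal sketch of the expected return Rt = rt+1 + rt+2 + ... + rT.
    # The reward scheme (+1 for a win, 0 otherwise) is assumed for illustration.

    def episode_return(rewards):
        """Sum the rewards received after time t up to the terminal step T."""
        return sum(rewards)

    # Example: three moves with no intermediate reward, then a win at the end.
    rewards = [0, 0, 0, 1]
    print(episode_return(rewards))  # prints 1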

The policy, π, is the set of rules, a function, or a table that maps states and actions to probabilities: it gives the probability of selecting each action when the agent is in a given state. The policy is what the agent uses to find a path to a reward without examining every possible continuation of the tree from the current state. The policy function is written πt(s, a), the probability that at = a if st = s. The policy shapes the agent’s decisions while playing the game, and even subtle changes to it can change the way the agent explores or exploits the play of the game.
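
For a game as small as tic-tac-toe, the policy can be stored as a simple lookup table. The sketch below is a hypothetical illustration, not the program’s actual data structure: it maps a board state to a probability for each action and samples a move accordingly.

    import random

    # A minimal sketch of a stochastic policy pi_t(s, a) stored as a lookup table.
    # The state key and the action probabilities are made up for illustration.

    policy = {
        # state -> {action: probability of selecting that action in this state}
        "X..|.O.|...": {"top-right": 0.5, "bottom-left": 0.3, "center-left": 0.2},
    }

    def choose_action(state):
        """Sample an action according to the probabilities the policy assigns."""
        actions = list(policy[state].keys())
        weights = list(policy[state].values())
        return random.choices(actions, weights=weights, k=1)[0]

    print(choose_action("X..|.O.|..."))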

Exploration is when an agent decides to explore states and actions rather than always taking the path to the best reward. Exploration allows the agent to find most, if not all, of the paths, among which the best paths lie. The problem is that the many games played can take up a large amount of memory, or have to be played so often that it takes a long time for the policy to become competitive.

In exploitation, the agent is encouraged to take the best-known path to the highest reward. This allows fewer games to be played, but it is still necessary to explore and discover other possibilities. Exploration lets the agent visit states that improve its estimates and allow for better game play. A mixture of exploration and exploitation is necessary; either one alone will not be the best way to develop learning.

A combination of the two, usually mostly exploitation with a little exploration, is probably the best way to build competitive programs. These concepts are supported by a function that lets the agent see predicted reward values; this function is called the value function. The value function estimates what the potential reward can be in the long run if the agent follows the current policy.
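
In tic-tac-toe the value function can likewise be a table with one entry per state. The sketch below is a minimal illustration under an assumed convention (1.0 for a won position, 0.0 for a lost one, 0.5 for anything not yet known); the actual program may store and initialize its values differently.

    # A minimal sketch of a value function V(s) stored as a table.
    # The 0.5 default for unknown states and 1.0/0.0 for won/lost positions
    # are an assumed convention for illustration only.

    values = {}  # state -> estimated long-run reward under the current policy

    def value(state, is_win=False, is_loss=False):
        """Look up (or initialize) the estimated value of a state."""
        if state not in values:
            if is_win:
                values[state] = 1.0
            elif is_loss:
                values[state] = 0.0
            else:
                values[state] = 0.5  # neutral starting estimate
        return values[state]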

In any game where players alternate moves, the sequence of moves can be grouped naturally. Such a group is called, in reinforcement learning, an episode, and its last state is known as the terminal state. Most games are best modeled using episodes: one complete game, from the first move to the final position, gives one episode. Tasks that break up into episodes in this way are called episodic tasks. The set of states without the terminal state will be noted as S, and the set of states including the terminal state will be noted as S+.

Now, let’s restate reinforcement learning more precisely using the ideas introduced so far. Reinforcement learning is a stochastic problem: the value function may not represent the true values at first, because the early values are estimates and not every future event can be represented or known with certainty. By working with expected values, stochastic problems can be solved to give a solution that is optimal over a set of episodes.

-=Evaluation Feedback=-

Previously, exploitation was discussed as an extreme approach to searching the state tree, since it does no exploring, and a combination of the two concepts, with more exploitation than exploration, was suggested. This notion is captured by the ε-greedy approach.

An agent that follows only exploitation is called greedy. ε-greedy is a near-greedy approach in which the value the agent takes may not always be the very best value. This encourages some exploration while keeping a strong bias toward exploitation. Here ε is a small fraction subtracted from the greediest value. This is an improvement over being purely greedy, as it opens up paths to higher average rewards.

As an example of ε-greedy, suppose an agent is in a state whose possible actions have the values 0.9, 0.8, and 0.7. Under pure exploitation, the agent always selects the path offering the highest value, in this case 0.9. That is an understandable choice, but why not explore paths whose values are only slightly lower? In the ε-greedy approach, ε might be 0.1. The agent then chooses randomly among values ranging from the highest value (0.9) down to the highest value minus ε (0.9 − 0.1 = 0.8), so here it chooses randomly between 0.9 and 0.8. As ε increases, so does the chance of exploration, since a larger ε gives the agent a larger selection of actions to choose from.
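
Here is a minimal Python sketch of the selection rule just described: choose at random among the actions whose values lie within ε of the best value. (This is the rule as described above; the textbook ε-greedy rule in [Sutton & Barto 98] instead picks a completely random action with probability ε.) The action names and values are made up for the example.

    import random

    # A minimal sketch of the epsilon-greedy rule described above:
    # choose randomly among actions whose value is within epsilon of the best.
    # The action names and values are the ones from the example, for illustration.

    def epsilon_greedy(action_values, epsilon=0.1):
        best = max(action_values.values())
        candidates = [a for a, v in action_values.items() if v >= best - epsilon]
        return random.choice(candidates)

    action_values = {"corner": 0.9, "edge": 0.8, "center": 0.7}
    print(epsilon_greedy(action_values, epsilon=0.1))  # picks "corner" or "edge"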

-Dynamic Programming-

Now skipping ahead... Dynamic Programming is one of the many techniques used in reinforcement learning for evaluating the value of a state. In Dynamic Programming, the values of all of the successor states of st are backed up together into the value function; this is called a full backup, and all backing up in Dynamic Programming is done this way. The agent first selects the best value given the policy; once this is done, the backup process begins for that state. The values of all the possible states that could have been reached are used to update the previously seen state. This backing up is done after each move.
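
The sketch below illustrates a full backup in this spirit: the value of a state is recomputed from the values of every state it could move to. The greedy max over successors and the 0.5 default value are assumptions made for this illustration, not the program’s exact update.

    # A minimal sketch of a full backup: the value of a state is recomputed from
    # the values of ALL of its successor states at once. The greedy max over
    # successors and the 0.5 default are assumptions made for illustration.

    values = {}  # state -> current value estimate

    def full_backup(state, successor_states):
        """Back up the value of `state` from every state reachable in one move."""
        next_values = [values.get(s, 0.5) for s in successor_states]
        if next_values:                       # terminal states have no successors
            values[state] = max(next_values)  # uses all successors, not a sample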

-Monte Carlo-

In Monte Carlo methods, unlike Dynamic Programming, values are not backed up until the end of the game. Here the reward is used to average the sample returns of the episodes. This style of improvement only affects the states that were actually visited, not every state that could be seen, as in the Dynamic Programming approach.
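
A minimal sketch of this idea: once the game ends, the observed return is averaged into the value of every state that was visited during the episode. The incremental averaging and the visit counts are illustrative choices, not the program’s exact bookkeeping.

    # A minimal sketch of a Monte Carlo update: wait until the game is over,
    # then average the observed return into every state visited in the episode.
    # The incremental average and count bookkeeping are illustrative choices.

    values = {}  # state -> average return observed so far
    counts = {}  # state -> number of times the state has been visited

    def monte_carlo_update(visited_states, final_return):
        for s in visited_states:
            counts[s] = counts.get(s, 0) + 1
            old = values.get(s, 0.0)
            values[s] = old + (final_return - old) / counts[s]

    # Example: the agent won (return 1) after passing through two states.
    monte_carlo_update(["X..|...|...", "X..|.O.|X.."], final_return=1.0)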

-Temporal Difference-

Unlike Monte Carlo, Temporal-Difference learning does not have to wait until the end of the game to back up; like Dynamic Programming, it does its backup at the next time step, t + 1. The difference is that Temporal Difference, like Monte Carlo, only uses the values of states that were actually visited.
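
A minimal sketch of such a one-step backup: after each move, the value of the previous state is nudged toward the reward plus the value of the state actually reached. The step size alpha and the 0.5 default are assumptions made for the illustration.

    # A minimal sketch of a one-step Temporal-Difference (TD(0)) backup:
    # after each move, nudge V(previous state) toward reward + V(next state).
    # The step size alpha and the 0.5 default are illustrative assumptions.

    values = {}  # state -> current value estimate

    def td_update(prev_state, reward, next_state, alpha=0.1):
        v_prev = values.get(prev_state, 0.5)
        v_next = values.get(next_state, 0.5)
        values[prev_state] = v_prev + alpha * (reward + v_next - v_prev)

    # Example: no intermediate reward, so the value moves toward the next state's value.
    td_update("X..|...|...", reward=0.0, next_state="X..|.O.|...")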

Temporal-Difference learning allows experience, even when limited to a few episodes or steps, to converge quickly to an answer, and this can be used to show the difference between Monte Carlo and Temporal Difference. If only a small number of episodes is available and this training data is presented over and over again, the value function must converge to some answer. This type of updating is called batch updating, because the backing up does not occur until the end of the training data. Applying batch updating to Monte Carlo methods and to TD(0) shows how the two methods differ in what they converge to.

According to [Sutton & Barto 98], the batch Monte Carlo method converges to the values that minimize the mean-squared error on the training data. The batch TD(0) method instead converges to the value function that would be exactly correct for the model of the process estimated from that data. This is called the certainty-equivalence estimate, because it is equivalent to assuming that the estimated model is known with certainty rather than being an approximation, and this is what batch TD(0) converges to.

Below is a quick look at the differences between these reinforcement learning methods.

               Temporal Difference   Dynamic Programming   Monte Carlo   Exhaustive Search
Learning       Raw                   Model                 Raw           None
Back Up Type   Sample                Full                  Sample        Full
Back Up Depth  Shallow               Shallow               Deep          Deep



Here are a few screenshots of the program:


Download


Requirements/Recommended:

File:

Click Here to Download Version 1.2

Fixes:

  • Increased learning speed and decreased XML file size by implementing rotation and reflection of the states.
  • Fixed a minor bug in the Monte Carlo learning when recognizing a final state.
  • Fixed a minor bug in the Temporal Difference learning when updating state values.

Instructions:

Installation:

Simply unzip the file and place the folder in the desired location. Run the [TicTacToe.exe] file to start the program.

Playing:

To start playing a game, select File -> New -> Game and then select the type of game you want to play: Single Player vs. Computer, Two Player Game, or Simulation (the computer plays itself).

Choosing the Computer Opponent:

There are three different types of computer opponent you can play against: Dynamic Programming, Monte Carlo, or Temporal Difference. The default is Monte Carlo. To choose your opponent, go to Edit -> Opponent.

Edit the Computer's Learning Style:

You can also change how the program learns for each of the computer learning styles by going to 
Edit -> Agent -> Learning Style -> [and select among the three]

Simulation:

You can also change the number of games the agent will play during a simulation, but I warn you not to make the value too large; I estimated that approximately 100 games take about 1 minute.

Data:

Under Data in the menu, you can view the win percentages, the state values, the XML file, and the total number of states visited so far.

Sources


[Andrus 04] Andrus, W., (2004). Reinforcement Learning in Games. Master’s Thesis, Presented to the Department of Computer Science at the State University of New York Institute of Technology in Utica, New York

[Bellman 62] Bellman, R., (1962). Dynamic Programming. Princeton University Press.

[Brafman & Tennenholtz 03] Brafman, R.I. and Tennenholtz, M. (2003). "Learning to Coordinate Efficiently: A Model-based Approach", Volume 19, pages 11-23. JAIR. Retrieved June 25, 2004 from http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume19/brafman03b.pdf

[Conitzer & Sandholm 04] Conitzer, V., Sandholm, T., (2004). Communication Complexity as a Lower Bound for Learning in Games. In Proceedings of the 21st International Conference on Machine Learning (ICML-04), Banff, Alberta, Canada. Retrieved June 24, 2004 from http://www-2.cs.cmu.edu/~conitzer/conitzer_sandholm_comm_games.pdf

[Conitzer & Sandholm 03] Conitzer, V., Sandholm T., (2003). BL-WoLF: A Framework For Loss-Bounded Learnability In Zero-Sum Games. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 91-98, Washington, DC, USA, Retrieved June 24, 2004 from http://www-2.cs.cmu.edu/~conitzer/conitzer_sandholm_blwolf.pdf

[Meyer 98] Meyer, F., (1998). Java Black Jack and Reinforcement Learning. Logic Systems Laboratory, EPFL. Retrieved July 20, 2004, from Swiss Federal Institute of Technology-Lausanne, Switzerland. http://lslwww.epfl.ch/~aperez/BlackJack/classes/RLJavaBJ.html.

[Mitchell 97] Mitchell, T., (1997). Machine Learning. MIT Press, Singapore.

[Nilsson 98] Nilsson, N., (1998). Artificial Intelligence: a new synthesis. Morgan Kaufmann Publishers, Inc., San Francisco, California.

[Schaeffer & van den Herik 01] Schaeffer, J., and van den Herik, J., (2001). Games, computers, and artificial intelligence. Elsevier Science B. V.

[Sutton & Barto 98] Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

[Tesauro 95] Tesauro, G., (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM 38, 58-68.