Adaptive AI Engine for RTS Games

Discussing the theory and practice

Learning – Artificial Intelligence for Games – A Summary

Posted by merothehero on July 28, 2009


A summary of selected topics in the chapter

From the Book: Artificial Intelligence for Games – 2006

Chapter 7

Author: Ian Millington

  I. Introduction

Online or Offline

  • The majority of learning is done offline, either between levels of the game or back at the development studio.
  • Online learning takes place while the game is being played.

Intra-Behavior Learning Algorithm

  • The simplest kind of learning: it changes a small part of the character’s behavior.
  • Good examples: learning how to evade, or learning the cover points in a room.
  • Limitation: if a character is trying to jump over a wall, it won’t learn to use the stairs instead.

Inter-Behavior Learning Algorithm

  • The character learns “new” things, not confined to a specific scope.
  • Example: a character that learns to tie a rope across the street to stop an escaping motorbike.
  • Unfortunately, this kind of AI is almost pure fantasy.
  • It can be approximated by choosing between a range of different behaviors, but each distinct behavior still requires implementation by the developer.
  • It is not cost-effective to learn everything; basic movement, for example, is easier and faster to implement directly.
  • Developers are experimenting with replacing the ordinary decision making systems (FSM, FUSM, scripting, decision trees, and Markov systems) with learning techniques.


  • The biggest problems with learning are reproducibility and quality control.
  • Learning the wrong thing is a major problem that game studios have run into.
  • The more flexible your learning is, the less control you have over your gameplay.
  • The usual solution is to limit the kinds of things that can be learned in a game.

Over-Learning (Over-fitting)

  • Occurs when the AI learns only the specific situations or experiences it was trained on; we normally want the AI to generalize from its learning.
  • Neural networks, for example, can over-fit during learning if they are wrongly parameterized or if the network is too large for the learning task at hand (over-fitting here means the network memorizes the training examples instead of learning the underlying pattern).

The Balance of Effort

  • Learning is meant to decrease the implementation work: you don’t need to hand-code every event or trick the character faces, because it will learn to handle them itself.
  • Unfortunately, learning often increases the implementation work instead, because it requires presenting data in the correct way, making sure the results are valid, and testing the AI to avoid it learning the wrong things.
  II. Parameter Modification


  • Sometimes not considered a true “learning” technique.
  • The simplest learning technique is to calculate the value of one or more parameters, such as cost functions for pathfinding or probabilities in decision making.
  • Learning parameters like these is most commonly done offline, but it can usually be kept under control when performed online.
  • For each value of the parameter there is some energy value (or fitness value) which represents how good the value of the parameter is for the game.
  • “Fitness landscape” is the term used when we are looking for a maximum, while “energy landscape” is used when we are looking for a minimum; they are just different terminologies for the same idea.

Getting the score out of the parameter values

  • If the score can be generated from some function or formula, and that formula is simple enough to differentiate, then the best values can be found explicitly and there is no need for parameter optimization.
  • In most cases this formula doesn’t exist: the only way to find out is to try the values in the game, see how it performs, and produce a fitness or energy score for them.
  • We should have a mechanism to evaluate a set of parameter values quickly, such as running the game at high speed without rendering the screen, or using a set of heuristics that generate a value from some assessment criteria without ever running the game. If there is no way to evaluate the parameters other than running the game with the player, then none of the techniques in this chapter are likely to be practical.

Hill Climbing

  • An algorithm that tries to work out in which direction to change the parameter in order to improve its score.
  • It often fails to find the optimal solution when the score of the parameter values is noisy or random.
  • Its performance is O(n) in time, where n is the number of parameters, and O(1) in memory.

Extensions for Hill Climbing

  • There are several extensions that help hill climbing cope with noisier scores:
    • Momentum: if the search is consistently improving in one direction, it continues in that direction for a little while, even when things seem to have stopped improving. It takes longer to reach the best parameter value, but it doesn’t get stuck as easily in a local maximum on its way to the peak.
    • Adaptive resolution: when a parameter is far from its best value, small steps make learning slow, but if the steps are always large, the best value may never be reached. Adaptive resolution solves this by making long jumps early and small jumps later: as long as the score keeps improving, the steps get bigger, and when it stops improving, the steps get smaller.
    • Multiple trials: hill climbing is very dependent on the initial guess, so restarting the algorithm from different initial locations will find the optimum score after enough trials. This extension is not suitable for online learning, however, because the player doesn’t expect the AI to suddenly get worse (as it would when hill climbing restarts from a fresh, possibly poor, initial value).
  • In most problems we can’t even recognize the best solution when we find it. In an RTS, for example, there is no formula that tells us the best split of resources between construction and research; if we ran 200 trials, we couldn’t guarantee that the best possible score had appeared among them. So we must simply accept the most satisfying solution found, even though it might not be the true optimum.
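As a concrete illustration, here is a minimal hill-climbing sketch over a single parameter, incorporating the adaptive-resolution extension described above. The function names and the fitness landscape are hypothetical, not from the book.

```python
def hill_climb(fitness, initial, step=1.0, min_step=1e-3, max_iters=1000):
    """Adjust a single parameter to maximize fitness(value), growing the
    step while the score improves and shrinking it when stuck
    (the adaptive-resolution extension)."""
    value = initial
    best = fitness(value)
    for _ in range(max_iters):
        improved = False
        for candidate in (value + step, value - step):
            score = fitness(candidate)
            if score > best:
                value, best = candidate, score
                improved = True
                step *= 2.0   # still improving: take bigger steps
                break
        if not improved:
            step *= 0.5       # stuck: take smaller steps
            if step < min_step:
                break
    return value, best

# Hypothetical one-parameter fitness landscape with its peak at 4.0.
x, score = hill_climb(lambda v: -(v - 4.0) ** 2, initial=0.0)
```

The multiple-trials extension would simply call `hill_climb` from several random initial values and keep the best result.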


  • A similar idea to hill climbing; to be read later due to time constraints.

  III. Action Prediction

N-Gram Predictors

  • N-grams are used in various statistical techniques and are not limited to prediction; they also have applications in the analysis of human languages.
  • A 3-gram, for example, keeps track of the number of times each sequence of three choices is seen. For prediction, the first two choices form the window, and the probability of each third choice is the proportion of times that window was followed by it.
  • Example: in the string “LRRLRLLLRRLRLRR”, the 3-gram probabilities are:

    Window   R     L
    LL       1/2   1/2
    LR       3/5   2/5
    RL       3/4   1/4
    RR       0/2   2/2
  • Increasing the window size initially improves prediction accuracy, but beyond a point accuracy falls off, and the memory required grows.
  • As the sequence of actions gets more random, the window size needs to be reduced.
  • For predictions with more than two choices, the minimum window size needs to be increased a little, and larger window sizes give much poorer prediction results.
  • There are mathematical models that tell you how well an N-gram predictor will predict a sequence; they can be used to tune the optimal window size.
  • A hash table can be used for better memory performance, especially for a large number of choices.
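The counting scheme above can be sketched as a small predictor class (the class and method names are mine, not the book’s); it reproduces the 3-gram table from the “LRRLRLLLRRLRLRR” example.

```python
from collections import defaultdict

class NGramPredictor:
    """N-gram action predictor: counts how often each (window, action)
    pair occurs, and predicts the most frequent follow-up action."""
    def __init__(self, n):
        self.n = n                       # e.g. 3 for a 3-gram
        self.counts = defaultdict(lambda: defaultdict(int))
        self.history = []

    def register(self, action):
        self.history.append(action)
        if len(self.history) >= self.n:
            window = tuple(self.history[-self.n:-1])
            self.counts[window][action] += 1

    def predict(self, window):
        options = self.counts.get(tuple(window))
        if not options:
            return None                  # window never seen
        return max(options, key=options.get)

predictor = NGramPredictor(3)
for move in "LRRLRLLLRRLRLRR":
    predictor.register(move)

predictor.predict("LR")   # 'R' (seen after "LR" in 3 of 5 cases)
```

Using a dictionary keyed by window tuples is effectively the hash-table storage mentioned above.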

Hierarchical N-Grams

  • Several N-gram algorithms work in parallel, each with an increasingly large window size: a hierarchical 3-gram has regular 1-gram, 2-gram, and 3-gram algorithms working on the same data.
  • When the sequence “LRR” is passed to a hierarchical 3-gram, it gets registered in the 3-gram, the “RR” portion gets registered in the 2-gram, and “R” gets registered in the 1-gram.
  • When a prediction is requested, the algorithm first looks up the window actions in the 3-gram; if there have been sufficient examples of that window, it uses the 3-gram to generate its prediction. Otherwise it looks at the 2-gram, and likewise the prediction may fall through to the 1-gram. If none of the N-grams has sufficient examples, the algorithm returns no prediction or a random one.
  • There is no single correct threshold for the number of entries required for confidence; to some extent it must be found by trial and error. In online learning it is common for the AI to make decisions based on very sketchy information, so the confidence threshold is small (3 or 4).
  • The algorithm is O(n) in memory and O(n) in time, where n is the largest N among the N-grams used.
  • The most widespread application of N-grams is in combat games.
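The fall-through lookup can be sketched as follows (class and parameter names are my own): each window size keeps its own counts, and prediction uses the largest window that clears the confidence threshold.

```python
from collections import defaultdict

class HierarchicalNGram:
    """Parallel 1..N gram counters; prediction uses the largest window
    that has been seen at least `threshold` times (confidence check)."""
    def __init__(self, n, threshold=3):
        self.n, self.threshold = n, threshold
        # one table per window size 0..n-1 (size 0 is the 1-gram)
        self.tables = [defaultdict(lambda: defaultdict(int))
                       for _ in range(n)]
        self.history = []

    def register(self, action):
        self.history.append(action)
        for size in range(self.n):       # update every sub-gram
            if len(self.history) > size:
                window = tuple(self.history[-1 - size:-1])
                self.tables[size][window][action] += 1

    def predict(self, window):
        window = tuple(window)
        for size in range(len(window), -1, -1):  # largest window first
            options = self.tables[size].get(window[len(window) - size:])
            if options and sum(options.values()) >= self.threshold:
                return max(options, key=options.get)
        return None

h = HierarchicalNGram(3, threshold=3)
for move in "LRRLRLLLRRLRLRR":
    h.register(move)

h.predict("RR")   # falls back to the 1-gram: "RR" was seen only twice
```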
  IV. Decision Learning

Weak Vs Strong Supervision

  • In order to improve performance, we need to provide feedback to the learning algorithm. This feedback is called “supervision.”
  • Strong supervision takes the form of a set of correct answers; these correct answers are often provided by a human player. The developer may play the game for a while and have the AI watch. The AI keeps track of the sets of observations and the decisions that the human player makes. It can then learn to act in the same way.
  • Weak supervision doesn’t require a set of correct answers. Instead, some feedback is given as to how good its action choices are.
  • Strong supervision is easier to implement and get right, but it is less flexible: it requires somebody to teach the algorithm what is right and wrong. Weak supervision can learn right and wrong for itself, but is much more difficult to get right.
  • Each of the remaining learning algorithms in this chapter works with this kind of model. It has access to observations, and it returns a single action to take next. It is supervised either weakly or strongly.
  V. Decision Tree Learning

ID3 Algorithm

  • An algorithm for generating decision trees from a set of examples.
  • Example: the input could be the following observation–action pairs:

Healthy – In Cover – With Ammo: Attack

Hurt – In Cover – With Ammo: Attack

Healthy – In Cover – Empty: Defend

Hurt – In Cover – Empty : Defend

Hurt – Exposed – With Ammo: Defend

  • The output is a decision tree that tests ammo first, then cover (the tree figure itself is missing from this copy of the post).

  • The details of the ID3 algorithm will be discussed later.
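Since the tree figure did not survive, here is a minimal ID3 sketch (the attribute names and all code are mine, not the book’s) that rebuilds the tree from the five example rules by recursively splitting on the attribute with the highest information gain.

```python
import math
from collections import Counter

# The five example rules: (health, cover, ammo) -> action
EXAMPLES = [
    (("Healthy", "In Cover", "With Ammo"), "Attack"),
    (("Hurt",    "In Cover", "With Ammo"), "Attack"),
    (("Healthy", "In Cover", "Empty"),     "Defend"),
    (("Hurt",    "In Cover", "Empty"),     "Defend"),
    (("Hurt",    "Exposed",  "With Ammo"), "Defend"),
]
ATTRIBUTES = ["health", "cover", "ammo"]

def entropy(examples):
    counts = Counter(action for _, action in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def id3(examples, attributes):
    actions = {action for _, action in examples}
    if len(actions) == 1:                 # pure node: emit the action
        return actions.pop()
    # information gain of splitting on attribute index i
    def gain(i):
        values = {obs[i] for obs, _ in examples}
        remainder = sum(
            entropy(sub) * len(sub) / len(examples)
            for v in values
            for sub in [[e for e in examples if e[0][i] == v]]
        )
        return entropy(examples) - remainder
    best = max(attributes, key=lambda a: gain(ATTRIBUTES.index(a)))
    i = ATTRIBUTES.index(best)
    rest = [a for a in attributes if a != best]
    return {best: {v: id3([e for e in examples if e[0][i] == v], rest)
                   for v in {obs[i] for obs, _ in examples}}}

tree = id3(EXAMPLES, ATTRIBUTES)
```

On these examples the tree tests ammo first: Empty leads to Defend, while With Ammo leads to a test on cover (In Cover gives Attack, Exposed gives Defend).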
  VI. Reinforcement Learning


  • Reinforcement learning is the name given to a range of techniques for learning based on experience.
  • In its most general form a reinforcement learning algorithm has three components: an exploration strategy for trying out different actions in the game, a reinforcement function that gives feedback on how good each action is, and a learning rule that links the two together. Each element has several different implementations and optimizations, depending on the application.
  • One reinforcement learning technique is Q-learning which is a good start, easy to implement and can be tuned without a deep understanding of its theoretical properties.

Q-Learning – Introduction

  • Q-learning treats the game world as a state machine. At any point in time, the algorithm is in some state. The state should encode all the relevant details about the character’s environment and internal data.
  • In a game the states are made up of many factors: position, proximity of the enemy, health level, and so on. Q-learning doesn’t need to understand the components of a state. As far as the algorithm is concerned they can just be an integer value: the state number.
  • Q-learning is known as a model-free algorithm because it doesn’t try to build a model of how the world works. It simply treats everything as states. Algorithms that are not model-free try to reconstruct what is happening in the game from the states that it visits. Model-free algorithms, such as Q-learning, tend to be significantly easier to implement.
  • Each step of the algorithm produces four elements: the start state, the action taken, the reinforcement value, and the resulting state. Together they are called the experience tuple, often written as (s, a, r, s′).

How It Works

  • The experience tuple is split into two sections. The first two elements (the state and action) are used to look up a Q-value in the store. The second two elements (the reinforcement value and the new state) are used to update the Q-value based on how good the action was and how good it will be in the next state.
  • The update is handled by the Q-learning rule:

Q(s, a) = (1 - α) Q(s, a) + α (r + γ max Q(s′, a′))

where α is the learning rate, γ is the discount rate, s′ is the state the action led to, and the max is taken over all actions a′ available in s′. Both α and γ are parameters of the algorithm.

  • The first component Q(s, a) is simply the current Q-value for the state and action.
  • The r value is the new reinforcement from the experience tuple.
  • The term γ max Q(s′, a′) looks at the new state from the experience tuple: it considers all possible actions that could be taken from that state and chooses the highest corresponding Q-value. This brings the success (i.e., the Q-value) of a later action back to earlier actions: if the next state is a good one, then this state should share some of its glory.
  • The discount parameter controls how much the Q-value of the current state and action depends on the Q-value of the state it leads to. A very high discount will be a large attraction to good states, and a very low discount will only give value to states that are near to success. Discount rates should be in the range [0, 1]. A value greater than 1 can lead to ever-growing Q-values, and the learning algorithm will never converge on the best solution.
  • So, in summary, the Q-value is a blend between its current value and a new value, which combines the reinforcement for the action and the quality of the state the action led to.

Exploration Strategy

  • It’s the policy for selecting which actions to take in any given state.
  • The basic Q-learning exploration strategy is partially random: most of the time the algorithm selects the action with the highest Q-value from the current state, and the rest of the time it selects a random action. The degree of randomness is controlled by a parameter.
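Putting the update rule and this partially random (epsilon-greedy) strategy together, a minimal Q-learning sketch might look like the following; the toy three-state world, the reward values, and all names are hypothetical illustrations, not from the book.

```python
import random

def q_learning(num_states, num_actions, step, iterations=2000,
               alpha=0.3, gamma=0.75, epsilon=0.2, start=0):
    """step(state, action) -> (reward, new_state) is the game's
    transition function. Returns the learned Q table."""
    q = [[0.0] * num_actions for _ in range(num_states)]
    state = start
    for _ in range(iterations):
        # epsilon-greedy exploration strategy
        if random.random() < epsilon:
            action = random.randrange(num_actions)
        else:
            action = max(range(num_actions), key=lambda a: q[state][a])
        reward, new_state = step(state, action)
        # Q-learning rule: blend the old value with the reward plus the
        # discounted best Q-value of the state the action led to
        q[state][action] = ((1 - alpha) * q[state][action] +
                            alpha * (reward + gamma * max(q[new_state])))
        state = new_state
    return q

# Hypothetical 3-state corridor: action 1 moves right, action 0 moves
# left; reaching state 2 pays a reward of 1 and restarts at state 0.
def step(state, action):
    new = min(2, state + 1) if action == 1 else max(0, state - 1)
    if new == 2:
        return 1.0, 0     # reward, then restart
    return 0.0, new

q = q_learning(num_states=3, num_actions=2, step=step)
```

After enough iterations, moving right (action 1) acquires a higher Q-value than moving left in both reachable states, because the discount term propagates the reward backwards.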

Convergence and Ending

  • If the problem always stays the same, and rewards are consistent (which they often aren’t if they rely on random events in the game), then the Q-values will eventually converge. Further running of the learning algorithm will not change any of the Q-values. At this point the algorithm has learned the problem completely. For very small toy problems this is achievable in a few thousand iterations, but in real problems it can take a vast number of iterations. In a practical application of Q-learning, there won’t be nearly enough time to reach convergence, so the Q-values will be used before they have settled down. It is common to begin acting under the influence of the learned values before learning is complete.


Performance

  • O(i) in time, where i is the number of iterations of learning.
  • O(as) in memory, where a is the number of actions and s is the number of states, when the Q-values are stored in arrays.
  • Using a hash table decreases the memory requirement, but increases execution time.

Tuning the Parameters

  • Will be discussed later

Choosing Rewards

  • Will be discussed later

Temporal-Difference learning

  • Will be discussed later

Applications and Limits

  • Reinforcement learning is most suitable for offline learning. It works well for problems with lots of different interacting components, such as optimizing the behavior of a group of characters or finding sequences of order-dependent actions. Its main strength is its ability to seamlessly handle uncertainty. This allows us to simplify the states exposed to it; we don’t have to tell the algorithm everything.
  • It is not suitable for problems where there is an easy way to see how close a solution is (we can use some kind of planning here), where there are too many states, or where the strategies that are successful change over time (i.e., it requires a good degree of stability to work).
  • It is strong in board games and turn-based strategy games.
  • It is possible to use a neural network as the storage medium for Q-values, but the author is not aware of any developers who have used this combination successfully.
  VII. Artificial Neural Networks


  • Developers who have experimented with neural networks for large-scale behavior control have been left in no doubt of the approach’s weaknesses. The combined hype and disappointment has clouded the issue: AI-savvy hobbyists can’t understand why the industry isn’t using them more widely, while developers often see them as useless and a dead end.
  • It is a huge subject, full of different kinds of networks and learning algorithms specialized for very small sets of tasks. Very little neural network theory is applicable to games, however.
  • The focus here is on only one type of neural network, the multi-layer perceptron, with one particular learning rule, the back-propagation algorithm; the author chose it because it is the most common type.
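Although the details are deferred, the structure of a multi-layer perceptron is easy to sketch. The forward pass below is what back-propagation would train; the layer sizes and random weights are arbitrary illustrations, not from the book.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    """One layer of perceptrons: each row of `weights` holds one
    perceptron's input weights plus a trailing bias term."""
    return [sigmoid(sum(w * i for w, i in zip(ws, inputs)) + ws[-1])
            for ws in weights]

def mlp(inputs, hidden_weights, output_weights):
    # feed-forward: input layer -> hidden layer -> output layer
    return layer(layer(inputs, hidden_weights), output_weights)

random.seed(1)
# 3 inputs -> 4 hidden perceptrons -> 2 outputs (sizes are arbitrary)
hidden = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
output = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(2)]
out = mlp([0.5, 0.1, 0.9], hidden, output)
```

Each perceptron does w inputs’ worth of work, which is where the O(nw) time and memory figures below come from.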


  • Will be discussed later


  • Will be discussed later


  • Will be discussed later


Performance

  • O(nw) in memory, where n is the number of perceptrons and w is the number of inputs per perceptron.
  • In time, the performance is also O(nw).

Other Approaches: Radial Basis Function

  • Will be discussed later

Other Approaches: Weakly Supervised Learning

  • Will be discussed later

Other Approaches: Hebbian Learning

  • Will be discussed later

