COMP343 - Reinforcement Learning
Study 4 : Reinforcement Learning
Our final study is a quick and dirty introduction to reinforcement learning, specifically temporal-difference Q-learning. A learning agent has been started for you. You’ll polish it off, train it, and test it. The specifics of the algorithm will be discussed in the remaining class sessions. A set of quick-notes with references to the relevant sections of the text will be provided.
ToDo
- In the `TDQ` class in `learning.py`, implement an exploration function based policy in the `_selection` method.
- In the `TDQ` class in `learning.py`, implement temporal-difference updates in the `learn` method. Note that this method learns from a complete decision sequence and is, in practice, called after a game is played and given the history of that game.
- In `testlearning.py`, develop code to train and then test your `TDQ` agent.
Finishing the TDQ Agent
Much of the learning agent has been completed or stubbed out for you. You’re responsible for two things: implementing a learning selection policy that utilizes an exploration function and implementing a function that learns by updating the Q-function estimates using temporal difference updates. Below is a brief look at what’s already done and some reminders and pointers about what’s left for you to do.
What’s Done
Construction The `TDQ` agent constructor is complete and creates an agent with a Q-function table `self._Q` and a (state, action) count table `self._N`. Both of these are implemented as `defaultdict` (default dictionary) structures with a default value of 0. Keys are `(state, action)` tuples. Accessing a key that has not previously been added to the dictionary will return the default value and create an entry for that key.
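For a feel of how those tables behave, here is a small stand-alone sketch. The board-state string, the move index, and the choice of `defaultdict(int)` are illustrative assumptions, not the stub's actual code:

```python
from collections import defaultdict

# Illustrative stand-ins for self._Q and self._N; keys are (state, action) tuples.
Q = defaultdict(int)  # Q-function estimates, default value 0
N = defaultdict(int)  # (state, action) visit counts, default value 0

state, action = "---------", 4   # hypothetical board encoding and move index

print(Q[(state, action)])        # 0 -- an unseen key returns the default...
print((state, action) in Q)      # True -- ...and an entry is created for it
N[(state, action)] += 1          # counts can be bumped without initialization
```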
MDP Modelling Functions A reward function and a learning rate function have been provided for you as `_reward` and `_alpha`, respectively. You are encouraged to start with them but welcome to try different functions if needed.
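If you do experiment with your own learning rate, a commonly used shape is a count-based decay. The function below is only a plausible stand-in for illustration, not the provided `_alpha`:

```python
def alpha(n):
    """Hypothetical decaying learning rate: close to 1 for rarely visited
    (state, action) pairs and shrinking toward 0 as the visit count n grows."""
    return 60.0 / (59.0 + n)
```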
Learning and Playing Policies The `TDQ` agent can function in two modes: learning mode and playing mode. In learning mode, the agent uses its exploration-based selection policy. In playing mode, it uses a mixed strategy based on its current estimates of the Q-function. The method `gameTime` turns on playing mode, while `studyTime` turns on learning mode. The playing-mode policy has been provided as `_play`. The `getMove` function is already set up to choose a policy based on the agent's mode. You do not need to modify `getMove`. You also do not need to modify `_play`, but you are welcome to tinker with this policy if you wish. You will be implementing the learning-time policy, `_selection`.
Study Buddy The `TTTEngine` will provide a complete, move-by-move history of a game. Your agent just wants its own decisions, i.e. the states it was presented with and the actions it took in those states. Assuming your agent is player 1, the helper method `_getp1` will take a game history and extract these decisions for you. What's more, it includes with each decision the state that followed the action. That means `_getp1` returns the `(state, action, next_state)` triple for each of your agent's decisions. This lets you focus on learning and not worry about pre-processing a game history. Because that pre-processing should precede the learning updates, a line invoking `_getp1` has already been added to the stub for `learn`.
To Do 1: _selection
When learning, your agent should not just play moves that it thinks are good (exploitation) but should also try out moves of unknown quality (exploration). As discussed, the exploration function rates an action in a given state so that the agent can choose the highest-rated action, striking a balance between exploration and exploitation. Our notes offer two different exploration functions for you to choose from. The `_selection` function should rate all the available actions for the current state and return the highest-rated action according to your exploration function.
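Below is one possible shape for `_selection`, using a simple optimistic exploration function f(u, n) that returns an optimistic value while a (state, action) pair has been tried fewer than N_e times and the current Q estimate afterwards. The method signature, the `actions` argument, and the constants `R_PLUS` and `N_E` are assumptions; adapt them to the actual stub:

```python
def _selection(self, state, actions):
    """Exploration-function-based policy (sketch, not the required implementation).

    Rates every available action with f(Q[s, a], N[s, a]) and returns the
    highest-rated one.
    """
    R_PLUS = 1.0   # assumed optimistic estimate of the best achievable reward
    N_E = 5        # assumed minimum number of tries before trusting Q

    def f(u, n):
        # Optimistic exploration function: act as if under-explored actions
        # are worth R_PLUS; otherwise fall back on the current estimate u.
        return R_PLUS if n < N_E else u

    return max(actions,
               key=lambda a: f(self._Q[(state, a)], self._N[(state, a)]))
```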
To Do 2: learn
Your agent learns by first playing a complete game using its exploration-function-based selection policy and then using the decisions made in that game to update its Q-function estimates. The `learn` method in `TDQ` manages these updates. It is given the history of a game and uses `_getp1` to extract your agent's decision sequence. You need to update `self._Q` with that sequence using the temporal-difference update method. Once again, a learning rate function has been provided for you, `_alpha`. You'll need to decide on a discount rate and update the Q-function estimates based on each decision in the sequence. The discount rate should be high (one or near one), since our problem is really about the payout in the future (the end of the game). That being said, you're welcome to play with different discount rates and see what happens.
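The loop inside `learn` might look roughly like the sketch below. It assumes `_getp1` yields the `(state, action, next_state)` triples described above, that `_reward` can be called on a state, and that a `legal_moves` helper is available for the successor state; those details, along with the discount rate, are assumptions to check against the real stub:

```python
def learn(self, history):
    """Temporal-difference Q-learning over one finished game (sketch)."""
    gamma = 0.95                          # discount rate: high, since reward arrives at game's end
    decisions = self._getp1(history)      # [(state, action, next_state), ...]

    for state, action, next_state in decisions:
        self._N[(state, action)] += 1
        alpha = self._alpha(self._N[(state, action)])

        # Value of the best follow-up action in next_state; when next_state is
        # terminal there are no moves left, so this falls back to 0.
        # legal_moves is a hypothetical helper -- use whatever your code provides.
        moves = legal_moves(next_state)
        best_next = max((self._Q[(next_state, a)] for a in moves), default=0.0)

        # TD update: Q(s,a) <- Q(s,a) + alpha * (R(s') + gamma * max_a' Q(s',a') - Q(s,a))
        reward = self._reward(next_state)   # _reward's exact signature is assumed
        self._Q[(state, action)] += alpha * (reward + gamma * best_next
                                             - self._Q[(state, action)])
```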
Learning
Before your `TDQ` agent is ready for prime time, it should practice and learn the game. In `testlearning.py`, you will set up an informal test for your agent. It goes like this:
- In learning mode, play N games against another agent and learn from each game.
- In playing mode, play M games against an opponent.
- If your agent has truly learned the game and the opponent is perfect (or human), then you would expect all M games to be draws.
- If your agent has truly learned the game and the opponent is not perfect (maybe also human…), then you would expect it to win or play to a draw in all M games. No losses.
- If your agent still has things to learn, then you might expect it to lose some of the M games.
There are definitely more rigorous ways to test a machine-learning based system, but we’ll save that for Applied Machine-Learning. For now, we’ll just stick to the “looks like a duck, quacks like a duck, must be a duck” method of verifying that our agent has learned.
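A rough outline of such a test, in the spirit of `testlearning.py`, is given below. The `TDQ`, `TTTEngine`, `studyTime`, `gameTime`, `learn`, and `resetGame` names come from the assignment; the opponent, the `play_one_game` helper, and the way a game's history and outcome are obtained are placeholders for whatever your code actually provides:

```python
from learning import TDQ
# The TTTEngine import, the opponent, and play_one_game are placeholders;
# wire in your actual engine interface and agents.

def run_experiment(n_train=5000, m_test=50):
    agent = TDQ()            # a fresh, untrained agent (constructor arguments assumed)
    engine = TTTEngine()
    opponent = ...           # e.g. a random or perfect agent from your codebase

    # Step 1: learning mode -- play n_train games and learn from each one.
    agent.studyTime()
    for _ in range(n_train):
        history = play_one_game(engine, agent, opponent)   # hypothetical helper
        agent.learn(history)
        engine.resetGame()   # reset before the next game, or only one game gets played

    # Step 2: playing mode -- play m_test silent games and tally the outcomes.
    agent.gameTime()
    for _ in range(m_test):
        history = play_one_game(engine, agent, opponent)
        # Inspect the result however your engine reports it and count the
        # draws and losses; a well-trained agent should lose none of them.
        engine.resetGame()
```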
Experimentation and Validation
Your code in `testlearning.py` should be set up to run a single experiment. That means starting with an untrained `TDQ` agent, training it (step 1), then testing it (step 2). Here are some variables you should explore when your agent is learning:
- Vary the opponent when training. Try a random agent and a perfect agent. The truly patient among you can play against your agent while it learns and get the benefit of watching it improve (warning: be ready to play 100s or 1000s of games…).
- Vary the number of training games. Try to find out the minimum number needed for your agent to “get good”.
- Combine the two ideas above. Does your agent learn faster or better when playing random agents, perfect agents, or imperfect agents?
- Vary things like the learning rate, discount rate, exploration function, and reward function.
The goal here is to figure out the requirements and limitations of learning. The ideal is learning to play perfect tic-tac-toe using as little computational resource as possible. Do some informal experimentation and see what you can learn about learning tic-tac-toe.
When checking just how well your agent plays, you should always play M > 1 games. You should also check your agent's play against an opponent other than the one it trained against. The easiest, but most time-consuming, way to do this is for you to play against the `TDQ` agent yourself and mix up your own play. Play well, play poorly, play with a mixed strategy; just probe the agent with your play style. The fastest way to check your agent is to use a mixed-strategy, perfect agent, play M silent games, and count the draws and losses. Two computational agents will play 50+ games faster than you and your agent will. Try a bit of both.
Whenever your agent plays a game via a `TTTEngine`, you must reset the engine before its next game. This is done via the `TTTEngine` method `resetGame`. Do not forget this step. If you do, your agent will play one game and one game only.