Homework 4: Tabular Reinforcement Learning
This homework uses small state and action spaces together with the SARSA ('on-policy') and Q-learning ('off-policy') algorithms to do table-based reinforcement learning. This should help you gain a better feel for reinforcement learning itself, preparing you for applying it to problems that require function approximation. There are quite a few questions, but we hope you can complete the homework in about 3 hours of effort once you understand the concepts in the lectures and readings.
Setup Instructions
To run the code needed for this homework you must either log in to the server computer, or have git, python3, matplotlib, and numpy installed on your own computer. You can download the code from GitHub by running the following command in a terminal:
git clone https://github.com/zergylord/ReinforcementLearningHomework.git
Once you have a copy of the code, open a second terminal window and change into the ReinforcementLearningHomework directory (using cd on unix systems) in both windows. Because the code mirrors the underlying learning process, the ideal setup is to look at, and edit, the code in one window (using a standard text editor, e.g. vim or emacs) and run it in the other (using the command "python3 homework.py").
Lava World
Imagine you are in a 5x5 grid where you know only the unique cell that you are currently in, and can only move one cell up, down, left, or right. As shown in the figure below, you get a positive reward (+1) for reaching the goal cell, and a negative outcome (-1) called death for entering the pool of lava (both of these events are terminal; the agent goes back to the starting state). All other states are non-terminal and give 0 reward (if you try to move into a side of the space, nothing happens).
[Figure: the 5x5 grid world, showing the start cell, the pool of lava cells (-1, terminal), and the goal cell (+1, terminal).]
We’ve provided you with python code corresponding to this world. “env.reset()” initializes the agent so that it is in the start state. “env.step(a)” takes action a, which corresponds to a step in one of the cardinal directions (0=right, 1=up, 2=left, 3=down), and returns a tuple containing the identity of the next state, the reward, and whether or not the episode is over (i.e. you are now in a terminal state). The loop that actually runs the model is very simple, as you can see if you inspect the code.
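For concreteness, the loop looks roughly like the sketch below; the names other than reset and step (pick_action, num_steps, and so on) are illustrative and may not match the exact names used in homework.py.

state = env.reset()                      # start each episode in the start state
for step in range(num_steps):            # num_steps: total steps to simulate
    action = pick_action(state)          # random, or epsilon-greedy when control=True
    next_state, reward, done = env.step(action)
    # ... update the Q table here from (state, action, reward, next_state) ...
    state = env.reset() if done else next_state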
We’ve also provided you with the data structure needed to solve this problem: a simple rank 3 tensor of shape (5,5,4), which will hold the learned values of each state-action pair (i.e. the Q values). When you run the simulation, information based on this tensor will be displayed, as discussed later.
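In numpy terms this table is just a three-dimensional array of zeros, indexed by the agent’s row, column, and action, as in the minimal sketch below (the exact code in homework.py may differ).

import numpy as np

Q = np.zeros((5, 5, 4))            # Q[row, col, action]: value of each state-action pair
value_of_state = Q[2, 0].max()     # best value available in the state at row 2, column 0
greedy_action  = Q[2, 0].argmax()  # action with the highest estimated value in that state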
Two temporal difference learning algorithms are implemented, and the onpolicy variable toggles between them: when it is set to True, the SARSA algorithm is used; when it is set to False, the Q-learning algorithm is used. Remember from class that being ‘on policy’ refers to learning about the same policy that you are executing as you explore the environment.
The control variable determines whether the agent’s policy (defined by the table of Q values) depends on its value estimates (control = True), or whether instead a completely random policy is used (control = False). When control = True the agent follows an epsilon-greedy policy: That is, it takes the action with the highest value with probability 1-epsilon. Thus epsilon specifies the probability of taking a random action in the environment when control = True (setting control = False is the same as setting epsilon = 1.0 with control=True). Default value for epsilon = .2.
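As a rough sketch of how these flags interact (the exact code in homework.py may differ), epsilon-greedy action selection could be written as follows.

import numpy as np

def pick_action(Q, row, col, epsilon, control):
    # With control=False, or with probability epsilon, act randomly;
    # otherwise take the action with the highest estimated value.
    if (not control) or np.random.rand() < epsilon:
        return np.random.randint(4)
    return int(Q[row, col].argmax())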
The alpha variable is the learning-rate -- it determines how quickly existing knowledge in the value table is replaced by new knowledge. The gamma variable is the discount factor, which controls the extent to which future rewards are worth less than immediate rewards. Default values are alpha = .1; gamma = .9.
To summarize, learning onpolicy corresponds to learning the values of states under the expectation that the actions taken during use of the learned knowledge will be based on the policy in place during learning. With control = True the policy during learning is random with probability epsilon, and otherwise is based on the optimal action. With control = False the policy is completely random -- the same as setting epsilon to 1. Learning offpolicy corresponds to learning the values of states under the expectation that the actions taken during use of the learned knowledge will always be the action with the highest value. Actions during learning depend on the values of control and epsilon, but what is learned is based on what would be optimal, according to the Q-learning algorithm, if control were True and epsilon were 0 (no randomness at all).
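Concretely, the two algorithms differ only in the bootstrap target of the temporal-difference update. The sketch below illustrates this, using the Q array and the flags described above; next_row, next_col, and next_action are illustrative names, and the exact code in homework.py may differ.

# SARSA bootstraps on the action actually taken next (on-policy);
# Q-learning bootstraps on the best available next action (off-policy).
if done:
    target = reward                                               # terminal: no future value
elif onpolicy:
    target = reward + gamma * Q[next_row, next_col, next_action]  # SARSA
else:
    target = reward + gamma * Q[next_row, next_col].max()         # Q-learning
Q[row, col, action] += alpha * (target - Q[row, col, action])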
Running the simulation. Run the code (python3 homework.py), noting that the default is to learn on-policy without control (onpolicy is True and control is False). The simulator will pause after taking its very first (random) step and after every 1000 steps thereafter, so you can look at the state information before pressing return to continue. On the screen you will see the number of steps run so far, then some statistics based on the steps taken since the last refresh: rewards obtained and deaths suffered.

Two windows displaying state information will also appear, and you will need to drag one so that it does not overlap the other. The action-specific value window displays four 5-by-5 grids; each grid displays the current estimated value of one of the four actions in each state. The summary window displays the parameters of the run and presents two 5-by-5 grids: the left grid shows the max of the Q values for each state, while the other shows the number of times the agent has visited each state since the last time the display was updated.

Note that no actions are taken after reaching the reward or death states, so the values of actions in those states never change. Visits to those states are not counted, so their visit counts are always 0. To see the exact value of a cell, select the window by clicking in it and then position the mouse over the cell; the value will be displayed in the lower right corner of the window. You must move the mouse to update the value that is displayed.
Run the simulator and just observe the evolution of value knowledge once or twice. Note that with no control, the agent’s behavior during learning will never improve, although its value estimates will gradually change as it explores its environment. The values displayed will largely stabilize (they do continue to bounce around a little) before all of the steps that the code allows have been run. You can speed things up by hitting ‘enter’ repeatedly or holding down the enter key. To exit, close both display windows (not the command window!) or use ^C followed by enter in the command window. Also note that you can edit the total number of steps that can be run (num-steps), or the number run between screen refreshes (refresh). Be sure to save the file so that your next run uses the newly changed values.
Leave the parameters at their initial values for now, then consider the following questions. Each multi-part question will account for 20% of your grade.
Question 1: On-policy Prediction (Policy Evaluation)
Using the default settings (control=False, onpolicy=True), run the simulation until the table of maximum values has approximately stabilized, at about 75,000 total steps, and then answer these questions: (Q1.1) Describe the maximum values along the right edge of the space and along row 2 (the middle row), and briefly explain these values. (Q1.2) Close and restart the simulation, watching the table of max values as the simulation progresses, and report your impressions about the order in which states converged to their correct values, providing a brief explanation of your observations. (Q1.3) Considering the stabilized action-specific value table from the end of this run: if this value table were in control of the agent, which path would the agent actually follow over the first 5 steps? (Hint: consider the value of the different actions along the left edge of the space.) Briefly explain why it would do this.
You may now find it convenient to leave the windows from the above run visible and then conduct the next requested run. The best way to do this is to open a new command window, cd in it to the ReinforcementLearningHomework directory, and run the program again, after making suggested changes to model parameters. Alternatively you can suspend a job using ‘^Z’ (ctrl-z) and then run another job in the same command window. Multiple jobs can be suspended. With all jobs run from a window suspended, you can restart a suspended job by entering %n where n is its job number. To see the job numbers of the currently suspended jobs, type jobs at the command prompt.
Question 2: Off-policy Prediction
Run the code after editing homework.py and setting onpolicy=False (don’t forget to save the file before starting the run!). (Q2.1) Describe the resulting table of maximum values, and note key differences from the one you obtained with onpolicy=True. (Q2.2) Consider the action-specific value table. If this value table were in control of the agent, what policy would it follow? Explain why the values in this table are different from those obtained with onpolicy=True. (Q2.3) Which setting of onpolicy gives the agent more exact knowledge of the actual states of the world, and which setting would allow it to obtain rewards maximally efficiently if it behaved in a completely greedy fashion after learning?
Question 3: Control
Run the code with onpolicy=False and control=True. (Q3.1) What do you observe about the convergence of values for different actions in different states? Explain in terms of the agent’s experiences learning with control=True and contrast with what happens when control=False.
Run the code with onpolicy=True and control=True. (Q3.2) What path does this algorithm find to the goal, and how is it different from the one found in 3.1? Even across multiple runs, this difference should be consistent. (Q3.3) What causes this difference? (Q3.4) What are the pros and cons of the policies found by on-policy and off-policy algorithms? HINT: look at the print-outs of rewards and deaths, noting that these are obtained while acting randomly 20% of the time (epsilon = .2). Consider also what would happen if acting optimally (epsilon=0).
Question 4: Undiscounted Returns
For these questions, set gamma=1.0 (no discounting of future rewards) and set epsilon back to its default value of .2. Run the code with onpolicy=False and control=False. (Q4.1) What value are all of the states converging towards? Why?
(Q4.2) Think about whether the same phenomenon would occur with on-policy learning. After thinking about it, run the code with onpolicy=True and control=False. (Q4.3) Why does the value decrease smoothly as you move away from the goal, despite the lack of discounting?
Question 5: Further Explorations
Choose one of the following, noting that you may need to try several runs with each choice of parameter settings to get a feel for what is happening.
(Q5a) Exploring the role of epsilon. With control = True, gamma = .9, and alpha = .1, try rerunning the code with onpolicy=True and different values of epsilon, including 0. (Q5a.1) Does the resulting policy ever look like the one found by off-policy learning? (Q5a.2) What do you notice about the relationship between the time to convergence and epsilon? Your results with a given value of epsilon may vary considerably from run to run, particularly when epsilon is very small. Can you explain why? (Q5a.3) Based on what you have observed, would it make sense to gradually reduce epsilon over the course of learning? Why or why not?
(Q5b) More on undiscounted returns. With control = True, gamma = 1, alpha = .1, and epsilon = .2, run the code several times with both settings of onpolicy. (Q5b.1) Does either algorithm consistently find a good solution? Lower the learning rate to alpha=.01 and run the code with both settings of onpolicy. One algorithm should now be able to find a good solution. (Q5b.2) Report which algorithm finds a reasonable solution and which one does not. Why does the unsuccessful algorithm fail? (HINT: look at the values of the different actions in each state to help you figure out what is going on.)