Hierarchical Relational Reinforcement Learning

Meg Aycinena

Department of Computer Science

Stanford University

Stanford, California, 94305


Abstract

We present experiments with relational reinforcement learning, a combination of reinforcement learning and inductive logic programming. This technique offers greater expressive power than traditional reinforcement learning. We use it to find general solutions in the blocks world domain. We discuss some difficulties associated with relational reinforcement learning, specifically its decreased effectiveness when presented with more complex problems. Finally, we explore ways in which hierarchical methods and the learning of subroutines may overcome some of these weaknesses.

1. Introduction

Reinforcement learning can be a powerful machine learning technique, but it suffers from an inability to generalize well over large state spaces, or to reuse the results of one learning session on a similar, but slightly different or more complicated, problem. Džeroski et al. [Džeroski et al., 1998; Džeroski et al., 2001] have developed a way to combine reinforcement learning with inductive logic programming, producing a more flexible, expressive approach that circumvents many of the problems that have plagued traditional reinforcement learning. Specifically, the method modifies the traditional Q-learning technique so that it stores the Q-values of state-action pairs in a top-down induced logical decision tree. This representation allows for the use of variables, eliminates the need for an explicit, memory-expensive state-action table, and generalizes already learned knowledge to unseen parts of the state space.

Although relational reinforcement learning greatly improves upon traditional reinforcement learning, it tends to work less effectively on more complicated goals. Its problem solving ability in the classic blocks world domain, for example, is impressive with simple goals such as clearing one block or placing one block on top of another. However, when faced with a more complex goal such as building an ordered tower of blocks, relational reinforcement learning converges far less reliably. (A direct comparison with the traditional Q-learning precursor is imperfect here: the traditional approach relies on a large state-action table and can only solve problems with specific, named blocks, whereas relational reinforcement learning produces solutions that generalize over arbitrarily named blocks.)

The introduction of a hierarchical learning strategy may harness the effectiveness of relational reinforcement learning on simpler problems to solve more complicated goals. Our hypothesis is that by first learning a set of useful subroutines, and then learning to combine these subroutines to achieve a stated goal, an agent may converge more consistently and more quickly upon an optimal solution.

Some work has been done on hierarchical learning in traditional reinforcement learning [Kaelbling et al., 1996], but it has been hindered by the lack of variablized solutions to subproblems and the inability to generalize those solutions to slightly different problems. Because relational reinforcement learning can produce variablized solutions to subproblems, and because it has been shown to perform well on simpler problems, hierarchical learning is easier to implement and potentially more powerful than its traditional counterpart. Hierarchical relational reinforcement learning promises to exploit the advantages and overcome some of the weaknesses of relational reinforcement learning, improving both the learning process itself and the applicability of the learned solution. In this paper, we investigate whether this actually occurs.

The paper is organized as follows. In Section 2 we describe traditional reinforcement learning, Q-learning, and how P-learning extends it. In Section 3 we briefly summarize inductive logic programming and describe top-down induction of logical decision trees, as implemented in the TILDE system [Blockeel, 1998]. In Section 4 we describe relational reinforcement learning and how we have implemented it according to the design of [Džeroski et al., 2001]. Section 5 describes a simple implementation strategy for hierarchical relational reinforcement learning, and Section 6 presents the results of the preliminary experiments we have conducted to explore its effectiveness. Section 7 concludes and discusses possible further work.

2. Traditional reinforcement learning, Q-learning, and P-learning

The term “reinforcement learning” and some of the general characteristics of the technique have been borrowed from cognitive psychology, but in recent years it has become increasingly popular within the computer science disciplines of artificial intelligence and machine learning. See [Kaelbling et al., 1996] for a broad survey of the field as it has been applied within computer science. In its simplest form, reinforcement learning describes any machine learning technique in which a system learns a policy to achieve a goal through a series of trial-and-error training sessions, receiving a reward or punishment after each session, and learning from this “reinforcement” in future sessions. (Note: we shall use the terms “training session” and “episode” interchangeably to refer to one cycle of state-action sequences culminating in a goal state.)

Formally, [Kaelbling et al., 1996] describe the reinforcement learning model as consisting of:

  • a discrete set of environment states, S,
  • a discrete set of agent actions, A, and
  • a set of scalar reinforcement signals, typically {0, 1}, or the real numbers.

The goal of the agent is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement. We will denote the optimal policy as π*. Various techniques exist within reinforcement learning for choosing an action to execute in a particular state. Q-learning is one of the more traditional of these techniques.

Q-learning

Q-learning falls within the “model-free” branch of the reinforcement learning family, because it does not require the agent to learn a model of the world or environment (i.e. how actions lead from state to state, and how states give rewards) in order to learn an optimal policy. Instead, the agent interacts with the environment by executing actions and perceiving the results they produce. In Q-learning, each possible state-action pair is assigned a quality value, or “Q-value,” and actions are chosen by selecting, in a particular state, the action with the highest Q-value.

The process works as follows. The agent can be in one state of the environment at a time. In a given state s_i ∈ S, the agent selects an action a_i ∈ A to execute according to its policy π. Executing this action puts the agent in a new state s_{i+1}, and it receives the reward r_{i+1} associated with the new state. The policy in Q-learning is given by:

π(s) = argmax_a Q(s, a),

and thus by learning the optimal function Q*(s, a), which maximizes the total discounted reward, we can recover the optimal policy π*(s). We update Q(s, a) after taking each action so that it better approximates the optimal function Q*(s, a). We do this with the following equation:

Q(s_i, a_i) := r_{i+1} + γ max_{a'} Q(s_{i+1}, a'),

where γ is the discount factor, 0 < γ < 1, and a' ranges over the actions available in the new state s_{i+1}.

In the first few training sessions, the agent has not yet learned even a crude approximation of the optimal Q-function, so if the Q-values are initialised uniformly, the initial actions will effectively be random. Due to this randomness, the danger exists that, although the Q-function may eventually converge to a solution to the problem, it may not converge to the optimal solution, because the path the agent first stumbles upon may be a correct but sub-optimal path to the goal. Therefore every reinforcement learning agent needs some policy for exploring its environment, so that, if a better solution exists than the one it has already found, it may still discover it.

One popular method (e.g. [Džeroski et al., 2001]) for arbitrating the tradeoff between exploitation (using knowledge already learned) and exploration (acting randomly in the hopes of learning new knowledge) is to choose each action with a probability determined by its Q-value and a temperature parameter. Specifically:

Pr(a_i | s) = T^{-Q(s, a_i)} / Σ_j T^{-Q(s, a_j)}.

The temperature, T, starts at a high value and is decreased as learning goes on. At the beginning, the high value of T ensures that some actions associated with low Q-values are taken – encouraging exploration. As the temperature is decreased, most actions are those associated with high Q-values – encouraging exploitation. This is the method of choosing actions that we will use.
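As an illustration, here is a small sketch of this temperature-controlled selection rule in the same tabular setting as above; select_action is a name we introduce for illustration, and the same rule applies unchanged to the P-values introduced below:

    import random

    def select_action(q, s, actions, temperature):
        # Pr(a_i | s) = T^(-Q(s, a_i)) / sum_j T^(-Q(s, a_j))
        weights = [temperature ** (-q[(s, a)]) for a in actions]
        total = sum(weights)
        return random.choices(actions, weights=[w / total for w in weights], k=1)[0]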

P-learning

Like Q-learning, P-learning, presented by [Džeroski et al., 2001], is a technique for choosing actions in a particular state. Indeed, P-learning is almost identical to Q-learning, except that rather than remembering a real-valued scalar to represent the quality of each state-action pair, the agent only remembers a 1 if the action in the given state is optimal, and a 0 if the action is nonoptimal. Formally:

if a ∈ π*(s), then P*(s, a) = 1; else P*(s, a) = 0.

(We write P* for this optimal P-function; the learned function P approximates it.)

In practice, the P function is computed from the Q function (which must also be learned), but it can be stored more compactly and used later more easily. The P function is expressed in terms of the Q function as follows:

if a arg maxaQ(s, a) then P(s, a) = 1 else

P(s, a) = 0.
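In the tabular setting this derivation is straightforward; a minimal sketch (p_from_q is an illustrative name, and we assume for simplicity that every action is available in every state):

    def p_from_q(q, states, actions):
        # P(s, a) = 1 exactly when Q(s, a) is maximal among the actions available in s
        p = {}
        for s in states:
            best = max(q[(s, a)] for a in actions)
            for a in actions:
                p[(s, a)] = 1 if q[(s, a)] == best else 0
        return p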

We can also incorporate exploration into P-learning through the analogous equation to the one for choosing actions in Q-learning:

Pr(a_i | s) = T^{-P(s, a_i)} / Σ_j T^{-P(s, a_j)}.

We will see that P-learning, although it is built directly on top of Q-learning and may seem redundant, produces smaller, simpler, and more efficient representations of the optimal solution, and is therefore extremely valuable for our goal of creating hierarchical programs.

3. Inductive logic programming and top-down induction of logical decision trees

Inductive logic programming (ILP) refers to a subdiscipline of machine learning that deals with learning logic programs. ILP attempts to use reasoning to derive useful and general principles, rules, and relationships from a set of specific data instances, hence the use of the term “inductive”. This contrasts with the well-established field of deductive reasoning, in which a conclusion is inferred from a given set of premises through the application of accepted logical rules. In most cases, ILP has been used to learn concepts so that a set of unseen examples may be classified correctly.

Decision trees

One popular method for performing the classification of examples is the decision tree. A decision tree consists of a set of nodes, each of which is either an internal node (meaning it has a set of other nodes as its children), or a leaf node (in which case it has no children).

Each internal node contains a test that is used to split the possible set of examples into several mutually exclusive groups. The node then passes the current example down the branch corresponding to the result of the test. Often, the tests return Boolean results, so the tree is binary, with each node classifying the example as either passing or failing its particular test. Decision trees with this property are called binary decision trees.

Leaf nodes do not contain tests, but rather hold the classification label for examples that are sorted into that node. Thus an example can be deterministically classified by a decision tree.
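For concreteness, here is a small sketch of this deterministic classification procedure, using hypothetical Node and Leaf structures (internal test nodes and labelled leaves); the names are ours, purely for illustration:

    class Leaf:
        def __init__(self, label):
            self.label = label

    class Node:
        def __init__(self, test, left, right):
            self.test = test      # a Boolean function of an example
            self.left = left      # subtree for examples that pass the test
            self.right = right    # subtree for examples that fail it

    def classify(tree, example):
        # descend from the root, following the branch selected by each test
        while isinstance(tree, Node):
            tree = tree.left if tree.test(example) else tree.right
        return tree.label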

Logical decision trees

According to [Blockeel and De Raedt, 1997], logical decision trees are binary decision trees which fulfill the following two properties:

  • The tests in the internal nodes are first-order logical conjunctions of literals.
  • A variable that is introduced in a particular node (i.e. one that does not occur higher in the tree) can only occur in the subtree corresponding to a successful test.

The second requirement is necessary because, if a test referring to a newly introduced variable fails, the agent cannot be sure that the variable refers to anything at all, and therefore it does not make sense to continue to refer to it.

We illustrate the concept of logical decision trees with this simple example, taken from [Džeroski et al., 2001]:

[Figure: a logical decision tree that determines whether all the blocks are unstacked in the three-blocks world.]

The use of variables also necessitates more complicated tests than those used in the propositional case. Each internal node must be associated with a query that includes both the test introduced in the current node and the tests introduced in all of its ancestor nodes. The following process for associating clauses and queries with nodes is adapted from [Blockeel and De Raedt, 1997]. We assume that examples sorted down the tree are passed to the left subtree if the node’s test succeeds, and to the right subtree if it fails, and that N.test returns the unique test assigned to the node N:

  • With the root node T, the empty query and the clause c0 ← T.test are associated.
  • With each internal node N with associated query Q, the clause ci ← Q, N.test is associated, where i is a number unique to this node.
  • For every internal node N with associated query Q and associated clause ci ← Q, N.test, the query Q, N.test is associated with its left subtree, and the query Q, not(ci) is associated with its right subtree.
  • Each leaf node has no associated test, but only the query it inherits from its parent.

It is important to note that we associate the query Q, not(ci) with the right subtree, and not the query Q, not(N.test) that would be symmetric with the left subtree. This is necessary because the queries of the left and right subtrees must be complementary: for any example passing through a node, exactly one of the two queries associated with its children should succeed. Because N.test may share variables with Q, the two queries Q, N.test and Q, not(N.test) are not necessarily complementary for all possible substitutions. This distinction does not exist in the propositional case. For a more detailed discussion of the logical inequivalence of the two queries, see [Blockeel and De Raedt, 1997], pp. 4-6.
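As a small, purely illustrative example (the blocks-world literals here are our own, not taken from the cited papers), suppose the root node has the test clear(A) and its left child has the test on(A, B), where on/2 holds only between blocks. The root is assigned the empty query and the clause c0 ← clear(A); its left subtree receives the query clear(A) and its right subtree the query not(c0). The left child is assigned the clause c1 ← clear(A), on(A, B); its own left subtree receives the query clear(A), on(A, B), and its right subtree the query clear(A), not(c1). The right subtree deliberately does not receive clear(A), not(on(A, B)): in a state where block a is clear and sits on block b, while block c is clear and rests directly on the floor, both clear(A), on(A, B) and clear(A), not(on(A, B)) succeed, with different bindings for A. Using not(c1) instead guarantees that exactly one child’s query succeeds for any example reaching that node.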


Thus logical decision trees allow the classification of examples described in first order (rather than propositional or attribute value) logic, and therefore offer much greater expressiveness and flexibility than traditional decision trees. This characteristic makes them ideal candidates for use in relational reinforcement learning, which seeks to allow variablization and generalization over similar inputs.

Top-down induction of logical decision trees

In order to use logical decision trees for classification, however, we must have an algorithm for inducing them from a set of example instances. [Blockeel and De Raedt, 1997] offer the following algorithm, based on the classic TDIDT (top-down induction of decision trees) algorithm, but modified to induce logical decision trees. It assumes as input a set of labeled examples E, background knowledge B, and a language L specifying what kind of tests may be used in the decision tree. Both B and L are assumed to be available implicitly. The procedure buildtree(T, E, Q) is called initially with an empty tree T and the empty query Q:

    buildtree(T, E, Q):
        if E is homogeneous enough
        then
            K := most_frequent_class(E)
            T := leaf(K)
        else
            Qb := best element of ρ(Q), according to some heuristic
            T.test := C', where Qb = (Q, C')
            E1 := { e ∈ E | e ∪ B |= Qb }
            E2 := { e ∈ E | e ∪ B |≠ Qb }
            buildtree(T.left, E1, Qb)
            buildtree(T.right, E2, Q)
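For concreteness, here is a minimal Python sketch of the same recursion, with the refinement operator, the query-success test, and the splitting heuristic passed in as user-supplied functions. It reuses the Node and Leaf classes from the earlier sketch, repeated so the fragment stands alone; all names are illustrative, and the real TILDE system handles queries, background knowledge, and heuristics through its own Prolog-based machinery:

    from collections import Counter

    class Leaf:
        def __init__(self, label):
            self.label = label

    class Node:
        def __init__(self, test, left, right):
            self.test, self.left, self.right = test, left, right

    def build_tree(examples, query, refine, succeeds, heuristic, purity=0.95):
        # examples: list of (example, label) pairs; query: the query associated with this node
        labels = [label for _, label in examples]
        majority, count = Counter(labels).most_common(1)[0]
        candidates = refine(query)
        # stop when E is "homogeneous enough" or no refinement is possible
        if count / len(labels) >= purity or not candidates:
            return Leaf(majority)
        # Qb := best element of rho(Q), according to the supplied heuristic
        q_best = max(candidates, key=lambda q: heuristic(q, examples, succeeds))
        e1 = [(e, y) for e, y in examples if succeeds(e, q_best)]
        e2 = [(e, y) for e, y in examples if not succeeds(e, q_best)]
        if not e1 or not e2:          # the chosen split separates nothing
            return Leaf(majority)
        # store the refined query as the node's test (the real system stores only the added conjunction)
        return Node(q_best,
                    build_tree(e1, q_best, refine, succeeds, heuristic, purity),
                    build_tree(e2, query, refine, succeeds, heuristic, purity))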

The refinement operator ρ determines the language bias (and thus serves the role of the language L). [Blockeel et al., 2001] use a refinement operator that adds one or more literals to the clause, under θ-subsumption. A clause c1 θ-subsumes another clause c2 if and only if there is a substitution θ such that c1θ ⊆ c2. For example, the clause p(X) ← q(X) θ-subsumes p(a) ← q(a), r(a) under the substitution θ = {X/a}. In other words, a refinement operator under θ-subsumption maps a clause onto a set of clauses, each of which is subsumed by the initial clause. It is from this set of new clauses that the conjunction to be put in a new node is chosen.
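A toy sketch of such a refinement operator, in the same illustrative Python setting as above; queries are represented simply as lists of literal strings, and the candidate literal set is hypothetical (in practice it would be derived from the declared language bias):

    # candidate literals that may be conjoined to a query (illustrative only)
    CANDIDATE_LITERALS = ["clear(A)", "on(A, B)", "on(B, C)", "above(A, C)"]

    def refine(query):
        # return every query obtained by adding one new literal; each result is
        # theta-subsumed by the original query, since adding literals only specialises it
        return [query + [lit] for lit in CANDIDATE_LITERALS if lit not in query]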


The TILDE system, which [Blockeel and De Raedt, 1997], [Blockeel, 1998], and [Blockeel et al., 2001] introduce and describe, implements the algorithm above. It can be used to induce both classification trees and, via the companion regression-tree system TILDE-RT, regression trees. (Regression trees are identical in definition to classification trees, except that they contain real-valued numbers in their leaves rather than classifier labels.) We will see how logical decision trees, induced with the TILDE system, can be used to store Q- and P-values, while introducing variables and allowing for generalization in the reinforcement learning process.

4. Relational reinforcement learning

As described in section 2, traditional reinforcement learning stores the learned Q-values in an explicit state-action table, with one value for each possible combination of the state and action. The relational reinforcement learning method introduced by [Džeroski et al., 1998] and [Džeroski et al., 2001] modifies this aspect of reinforcement learning in order to introduce variables and to increase the ability of the agent to generalize learned knowledge to unseen parts of the state space. Rather than using an explicit state-action table, relational reinforcement learning stores the Q-values in a logical regression tree. There are several important advantages to this approach:

  • It allows for larger or even infinite state spaces, or state spaces of unknown size.
  • By using a relational representation, it takes advantage of the structural aspects of the task (a small illustration follows this list).
  • It provides a way to apply the learned solution of one problem to other similar or more complicated problems.
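As a rough illustration of what a relational representation buys us (this encoding is our own sketch, not the one used by the TILDE-based implementation), a blocks-world state can be written as a set of ground facts, and a variablized pattern can then match it regardless of how the blocks are named:

    # a three-blocks state: a sits on b, b and c sit on the floor
    state = {("on", "a", "b"), ("on", "b", "floor"), ("on", "c", "floor"),
             ("clear", "a"), ("clear", "c")}

    def exists_clear_on_block(state):
        # true if there exist blocks X, Y with clear(X) and on(X, Y), Y not the floor;
        # the same test matches any state, whatever the blocks are called
        clear_blocks = {f[1] for f in state if f[0] == "clear"}
        return any(f[0] == "on" and f[1] in clear_blocks and f[2] != "floor"
                   for f in state)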

Thus relational reinforcement learning offers the potential to overcome many of the serious drawbacks that traditional reinforcement learning has presented in the past.