CS188 Discussion – Utility Theory & MDP by Nuttapong Chentanez

Utilities

- A function mapping outcomes (states of the world) to real numbers that describes an agent's preferences

Maximum Expected Utility

- An agent should choose the action that maximizes its expected utility, given its knowledge

Expectation

- The expected value of a random quantity weights each outcome by its probability: E[X] = Σ_x P(X = x) · x. The expected utility of a lottery is the probability-weighted average of the utilities of its outcomes.

Preferences

An agent chooses among:

- Prizes (certain outcomes): A, B, etc.

- Lotteries (uncertain prizes): L = [p, A; 1 − p, B]

Notation: A > B means A is preferred to B; A ~ B means the agent is indifferent between them.

Constraints on the preferences of a rational agent

- The standard axioms of rationality: orderability, transitivity, continuity, substitutability, monotonicity, decomposability.

- Without these constraints, irrational behavior may result; for example, an agent with intransitive preferences can be induced to pay repeatedly to cycle among outcomes.

Theorem: Rational preferences imply behavior describable as maximization of expected utility, i.e., there exists a utility function U with the following two properties:

- U(A) ≥ U(B) if and only if A is preferred (or indifferent) to B

- U([p1, S1; … ; pn, Sn]) = Σ_i pi U(Si)

Utility is NOT unique

- With lotteries, any positive affine transformation U'(x) = a U(x) + b with a > 0 represents the same preferences.

- Without lotteries (deterministic prizes only), any monotonic transformation of a utility is a utility as well; only the ordering of prizes matters.
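A small sketch of this in Python (the prizes, numbers, and the squaring transform are illustrative assumptions, not from the notes): a positive affine transformation a·U + b with a > 0 ranks lotteries exactly as U does, while a nonlinear monotonic transformation such as squaring can flip a ranking once lotteries are involved.

```python
# Illustrative prizes and utilities (assumed numbers, chosen to show the effect).
U = {"A": 0.0, "B": 6.0, "C": 10.0}
lot1 = [(1.0, "B")]                      # a certain B
lot2 = [(0.5, "C"), (0.5, "A")]          # 50/50 between C and A

def expected_utility(lottery, util):
    return sum(p * util[x] for p, x in lottery)

affine = {x: 2 * u + 5 for x, u in U.items()}   # positive affine transform: same rankings
square = {x: u ** 2 for x, u in U.items()}      # monotonic but nonlinear: ranking can flip

for name, util in [("U", U), ("2U+5", affine), ("U^2", square)]:
    eu1, eu2 = expected_utility(lot1, util), expected_utility(lot2, util)
    print(name, "prefers", "lot1" if eu1 > eu2 else "lot2")
```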

Insurance

Consider the lottery [0.5, $1000; 0.5, $0]: its expected monetary value is $500, yet a risk-averse agent may prefer the certain alternative [1, $400]. The $100 difference is called the insurance premium, and the insurance industry uses this gap between a risk's expected monetary value and what people will accept for certain to set premiums.
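A quick sketch of why a rational agent can prefer the certain $400 (the square-root utility below is just an assumed example of a risk-averse utility, not part of the notes):

```python
import math

# The lottery and the certain alternative from the example above.
lottery = [(0.5, 1000.0), (0.5, 0.0)]
certain = 400.0

emv = sum(p * x for p, x in lottery)             # expected monetary value: 500.0

# A concave (risk-averse) utility; sqrt is an illustrative assumption.
u = math.sqrt
eu_lottery = sum(p * u(x) for p, x in lottery)   # about 15.8
eu_certain = u(certain)                          # 20.0

print(emv, eu_lottery, eu_certain)
# Even though the lottery's EMV ($500) exceeds the certain $400, the risk-averse
# agent's expected utility favors the sure thing, which is why it will pay a premium.
```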

Money != Utility

- For most people, [1, $1M] > [0.5, $3M; 0.5, $0], even though the lottery's expected monetary value is higher ($1.5M)

- A: [0.8, $4k; 0.2, $0],  B: [1.0, $3k; 0.0, $0]

C: [0.2, $4k; 0.8, $0],  D: [0.25, $3k; 0.75, $0]

For most people, B > A and C > D, but (taking U($0) = 0) these preferences contradict each other:

B > A ⇒ U($3k) > 0.8 U($4k)

C > D ⇒ 0.2 U($4k) > 0.25 U($3k) ⇒ 0.8 U($4k) > U($3k)

- Either humans are irrational, or money != utility, or both
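A sketch showing that a single expected-utility maximizer cannot hold both preferences: with an assumed square-root utility (any fixed utility with U($0) = 0 behaves consistently), preferring B over A forces preferring D over C.

```python
import math

# The four lotteries from above (amounts in dollars).
A = [(0.8, 4000.0), (0.2, 0.0)]
B = [(1.0, 3000.0)]
C = [(0.2, 4000.0), (0.8, 0.0)]
D = [(0.25, 3000.0), (0.75, 0.0)]

def eu(lottery, u):
    return sum(p * u(x) for p, x in lottery)

u = math.sqrt   # illustrative fixed utility with U($0) = 0
print("A:", eu(A, u), "B:", eu(B, u))   # B beats A ...
print("C:", eu(C, u), "D:", eu(D, u))   # ... and then D beats C, not C beats D
```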

Markov Decision Process (MDP)

- A set of states s ∈ S

- A set of actions a ∈ A

- A transition model T(s, a, s') = P(s' | s, a)

- The probability that executing action a in state s leads to s'

- A reward function R(s, a, s')

- Can also write R_enter(s') for the reward for entering s'

- Or R_in(s) for the reward for being in s

- A start state (or a distribution over start states)

- Possibly a set of terminal states
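One minimal way to encode these pieces in Python (the dictionary layout and the two-state weather example are assumed conventions, not part of the notes); the later sketches reuse the same T/R layout.

```python
# States, actions, a transition model T[s][a] = [(prob, next_state), ...],
# a reward function R(s, a, s'), a discount, a start state, and terminal states.
mdp = {
    "states": ["sun", "rain"],
    "actions": ["stay", "move"],
    "T": {
        "sun":  {"stay": [(1.0, "sun")],  "move": [(0.7, "rain"), (0.3, "sun")]},
        "rain": {"stay": [(1.0, "rain")], "move": [(0.7, "sun"), (0.3, "rain")]},
    },
    "R": lambda s, a, s2: 1.0 if s2 == "sun" else 0.0,   # here reward depends only on s'
    "gamma": 0.9,
    "start": "sun",
    "terminals": set(),
}
```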

Optimal behavior

- Suppose we have a policy π : S → A, which gives an action for each state

- The optimal policy maximizes expected utility, where utility is defined in terms of the rewards obtained by following the policy

Stationarity

To formalize utility over sequences of rewards, we need some assumptions; typically, stationary preferences:

"If one future, starting tomorrow, is preferred to another, then starting the former future today does not change the preference."

With this stationarity assumption, one can show that there are only two ways to define stationary utilities:

Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …

Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …

If 0 < γ < 1 and rewards are bounded by R_max, then U([r0, r1, …]) ≤ R_max / (1 − γ), so even infinite reward sequences have finite utility.
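A tiny sketch of the discounted sum (the reward sequence and γ are arbitrary examples):

```python
# Discounted utility of a (finite prefix of a) reward sequence.
def discounted_utility(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1.0, 1.0, 1.0, 1.0], gamma=0.5))   # 1 + 0.5 + 0.25 + 0.125 = 1.875
# With 0 < gamma < 1, even an infinite stream of rewards bounded by R_max
# sums to at most R_max / (1 - gamma).
```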

Utility of a state under a policy π

V^π(s) = expected total (discounted) reward when starting in s and following π. It satisfies the recursive equation

V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]

Idea 1: turn the recursive equation into repeated updates

Idea 2: it is just a linear system; solve it directly with Matlab (or any other linear solver)
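A sketch of both ideas on a tiny two-state example, assuming the T[s][a] = [(prob, s')] and R(s, a, s') layout from the MDP sketch above (the numbers are arbitrary):

```python
import numpy as np

states = [0, 1]
T = {0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]},
     1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]}}
R = lambda s, a, s2: [0.0, 1.0][s2]      # reward for entering state s'
gamma = 0.9
policy = {0: "go", 1: "stay"}            # the fixed policy pi to evaluate

# Idea 1: turn the recursive equation into repeated updates.
V = {s: 0.0 for s in states}
for _ in range(200):
    V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                for p, s2 in T[s][policy[s]])
         for s in states}

# Idea 2: it is a linear system, V = r_pi + gamma * P_pi V; solve it directly.
P = np.zeros((2, 2))
r = np.zeros(2)
for s in states:
    for p, s2 in T[s][policy[s]]:
        P[s, s2] += p
        r[s] += p * R(s, policy[s], s2)
V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

print(V, V_exact)   # both approaches give (approximately) the same values
```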

Optimal Utilities

Goal: Calculate the optimal utility of each state

V*(s) = expected total (discounted) reward when acting optimally starting from s

Why? Given the optimal values V*, we can extract an optimal policy using one-step lookahead (see Bellman's equation below)

Bellman’s Equation

Idea: the optimal value of a state is obtained by maximizing over the first action and then following the optimal policy:

V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

π*(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

How to solve this equation?

Value Iteration

Start with arbitrary values (e.g., V_0(s) = 0) and repeatedly apply the Bellman update:

V_{t+1}(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_t(s') ]

Theorem: value iteration converges to the unique optimal values. The policy extracted from the values may converge long before the values do.
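A sketch of value iteration, plus policy extraction from the converged values, assuming the same T[s][a] = [(prob, s')] and R(s, a, s') layout as the earlier sketches:

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-6):
    """Repeatedly apply the Bellman update until the values stop changing much."""
    V = {s: 0.0 for s in states}
    while True:
        newV = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T[s][a])
                       for a in actions)
                for s in states}
        if max(abs(newV[s] - V[s]) for s in states) < eps:   # convergence check
            return newV
        V = newV

def extract_policy(states, actions, T, R, gamma, V):
    """One-step lookahead on the (near-)optimal values gives a (near-)optimal policy."""
    return {s: max(actions,
                   key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                     for p, s2 in T[s][a]))
            for s in states}
```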

Convergence

Can be shown with the contraction property of the value iteration update: for any two value functions U_t and V_t,

max_s |U_{t+1}(s) − V_{t+1}(s)| ≤ γ max_s |U_t(s) − V_t(s)|

This inequality says that after an update, the two value functions are closer to each other (in max norm). The update applied to V* does not change its values (V* is a fixed point), so updating an arbitrary value function moves it closer to V* by a factor of γ.

The inequality can be shown by expanding max_s [U_{t+1}(s) − V_{t+1}(s)] with the value iteration update.

To check for convergence, track the maximum change between successive iterations, max_s |V_{t+1}(s) − V_t(s)|.

Once the change in our approximation is small, it must also be close to correct (close to V*)
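A numerical sanity check of the contraction property on a small random MDP (the construction below is purely illustrative):

```python
import random

random.seed(0)
states, actions, gamma = range(4), range(2), 0.9
# Random transition model in the usual T[s][a] = [(prob, s')] layout.
T = {s: {a: [(0.5, random.randrange(4)), (0.5, random.randrange(4))]
         for a in actions} for s in states}
R = lambda s, a, s2: float(s2)

def bellman_update(V):
    return {s: max(sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T[s][a])
                   for a in actions)
            for s in states}

# Two arbitrary value functions: one update shrinks their max-norm distance by gamma.
U = {s: random.uniform(-10, 10) for s in states}
V = {s: random.uniform(-10, 10) for s in states}
before = max(abs(U[s] - V[s]) for s in states)
after = max(abs(bellman_update(U)[s] - bellman_update(V)[s]) for s in states)
print(before, after, after <= gamma * before + 1e-9)   # prints True: contraction holds
```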

Policy Iteration

Policy Evaluation: Calculate utilities for a fixed policy

Policy Improvement: Update policy based on resulting converged utility

Repeat until policy does not change

In practice, there is no need to compute the "exact" utility of a policy; a rough estimate (a few evaluation sweeps) is enough.
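A sketch of the loop, again assuming the T[s][a] = [(prob, s')] and R(s, a, s') layout used in the earlier sketches; the evaluation step deliberately uses only a fixed number of sweeps, i.e., a rough estimate:

```python
def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
    """Alternate rough policy evaluation and greedy policy improvement until stable."""
    policy = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: a few sweeps of the fixed-policy update.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                        for p, s2 in T[s][policy[s]])
                 for s in states}
        # Policy improvement: one-step lookahead on the evaluated values.
        new_policy = {s: max(actions,
                             key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                               for p, s2 in T[s][a]))
                      for s in states}
        if new_policy == policy:      # repeat until the policy does not change
            return policy, V
        policy = new_policy
```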

Exercise

Suppose an agent inhabits a world with two states, S and ~S, and can do exactly one of two actions, a and b. Action a does nothing and action b flips from one state to the other.

Suppose R_in(S) = 3, R_in(~S) = 2, and γ = 0.5. Complete the following table to find the optimal policy with policy iteration:

Iteration   π(S)   π(~S)   V^π(S)   V^π(~S)
0
1
2
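One possible encoding of this exercise in the same layout; it can be passed to the policy_iteration sketch above to check your table (the variable names below are assumptions, not part of the handout):

```python
# Two states S and ~S; action a does nothing, action b flips the state.
states = ["S", "~S"]
actions = ["a", "b"]
T = {"S":  {"a": [(1.0, "S")],  "b": [(1.0, "~S")]},
     "~S": {"a": [(1.0, "~S")], "b": [(1.0, "S")]}}
R_in = {"S": 3.0, "~S": 2.0}
R = lambda s, a, s2: R_in[s]      # reward for being in the current state s
gamma = 0.5

policy, V = policy_iteration(states, actions, T, R, gamma)   # from the sketch above
print(policy, V)
```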