
Chapter 4: Mathematical Expectation

Diagram from Chapter 1: [diagram omitted]

In Chapter 3, we learned about some PDFs and how they are used to quantify the distribution of items in a population. In this chapter, we are going to use PDFs to summarize information about the population. Specifically, we are going to look at the “expected value” of a random variable and find it using its PDF. These expected values represent parameters; thus, they summarize items in the population.

4.1: Mean of a Random Variable

Given the use of a PDF for a population, we may want to know what value of X we would expect on average to obtain. This is found through the expected value of X, E(X).

Definition 4.1: Let X be a random variable with PDF f(x). The population mean or expected value of X is

μ = E(X) = ∑_x x f(x)

when X is discrete, and

μ = E(X) = ∫_-∞^∞ x f(x) dx

when X is continuous.

Notes:

  • In calculus you learned about something similar. It may have been called “moment about the y-axis” (Faires and Faires, 1988).
  • You will often see μ written as μ_X to emphasize that the expected value is for the random variable X. This notation is helpful when there are multiple random variables.
  • In the discrete case, you can think of μ as a weighted average. The weight for each x is f(x) = P(X = x).
  • μ is a parameter since it summarizes the possible values of a population.

Example: Let’s Play Plinko! (plinko.xls)

Let X be a random variable denoting the amount won for one chip. Below are 5 different PDFs for the amount won.

Drop Plinko chip in column above:
           $100        $500        $1,000      $0          $10,000
x          f(x)        f(x)        f(x)        f(x)        f(x)
$100       0.1667      0.1339      0.0822      0.0359      0.0176
$500       0.3571      0.3080      0.2204      0.1287      0.0882
$1,000     0.2976      0.2991      0.2862      0.2545      0.2353
$0         0.1429      0.1920      0.2796      0.3713      0.4118
$10,000    0.0357      0.0670      0.1316      0.2096      0.2471
μ          $850.00     $1,136.16   $1,720.39   $2,418.26   $2,751.76

For example, when the chip is dropped above the $100 slot,

μ = E(X) = ∑_{x∈T} x f(x) = 100(0.1667) + 500(0.3571) + 1000(0.2976) + 0(0.1429) + 10000(0.0357) = 850

where T = {100, 500, 1000, 0, 10000}.

This is a good example of why you can think of  as a weighted average.
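As a quick check of this weighted-average calculation, here is a minimal Maple sketch (the variable names are mine, not from plinko.xls) reproducing μ for the $100 drop position:

x_vals := [100, 500, 1000, 0, 10000]:                 # the possible winnings (the set T)
f_vals := [0.1667, 0.3571, 0.2976, 0.1429, 0.0357]:   # f(x) for the $100 drop position
mu := add(x_vals[i]*f_vals[i], i = 1..5);             # weighted average sum of x*f(x); about 850 (the small gap from $850.00 comes from the rounded probabilities)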

Questions:

  • What is the optimal place to drop the chip in order to maximize your expected winnings?
  • Compare what actually occurred in 2001 to what is expected:

Drop Plinko chip in column above:
              $100     $500     $1,000   $0       $10,000
x             Count    Count    Count    Count    Count
$100          0        3        2        5        2
$500          0        3        12       5        6
$1,000        1        3        12       19       11
$0            0        2        9        21       8
$10,000       0        1        4        9        4
Total chips   1        12       39       59       31
Average won   $1,000   $1,233   $1,492   $1,898   $1,748

Why are these averages different from what we would expect?

  • The sample size is small. As the sample size gets bigger, we would expect the sample average to approach the population mean.
  • There is variability from one sample to the next. This implies that we would not expect the same number of observed values for the years of 2002, 2003, … when the game is played. More will be discussed about variability later.
  • Possibly one of the underlying assumptions behind the calculation of the probabilities is incorrect. While we did not discuss these assumptions in class, the main one is that the probability of moving ONE slot to the left or right is 0.5 each time a Plinko chip hits a peg on the board. If you watch the Plinko game, you will notice that chips can often move more than ONE slot to the left or right after hitting a peg.

Example: Tire life (chapter4_calc_rev.mws)

The number of miles an automobile tire lasts before it reaches a critical point in tread wear can be represented by a PDF. Let X = the number of miles (in thousands) an automobile is driven until it reaches the critical tread wear point for one tire. Suppose the PDF for X is

f(x) = (1/30)e^(-x/30) for x > 0, and f(x) = 0 otherwise.

Find the expected number of miles a tire lasts until it reaches the critical tread wear point. In other words, this will be the average lifetime for all tires.

μ = E(X) = ∫_0^∞ x (1/30)e^(-x/30) dx

Use integration by parts:

Let u = x and dv = (1/30)e^(-x/30) dx

du = dx and v = ∫ (1/30)e^(-x/30) dx = -e^(-x/30)

Then

∫_0^∞ x (1/30)e^(-x/30) dx = [-x e^(-x/30)]_0^∞ + ∫_0^∞ e^(-x/30) dx

Note that by L'Hôpital's rule,

lim_{x→∞} x e^(-x/30) = lim_{x→∞} x / e^(x/30) = lim_{x→∞} 1 / [(1/30)e^(x/30)] = 0

Then

μ = (0 - 0) + ∫_0^∞ e^(-x/30) dx = [-30 e^(-x/30)]_0^∞ = 0 - (-30) = 30

Thus, one would expect the tire to last 30,000 miles on average.

In Maple,

assume(x>0);                                  # tell Maple that x is positive

f:=1/30*exp(-x/30);                           # the tire life PDF

plot(f,x=0..150, title = "Tire life PDF");

mu:=int(x*f,x=0..infinity);                   # E(X) = integral of x*f(x); returns 30

On a TI-89: [calculator screenshots omitted]

Sometimes, the expected value of a function of a random variable is of interest. Section 4.2 will discuss one very important example. In general, here is how such an expected value can be found:

Theorem 4.1: Let X be a random variable with PDF f(x). The mean or expected value of the random variable g(X) is

μ_g(X) = E[g(X)] = ∑_x g(x) f(x)

when X is discrete, and

μ_g(X) = E[g(X)] = ∫_-∞^∞ g(x) f(x) dx

when X is continuous.

Be very careful here! The function, g(X), is NOT a PDF. It is a function of a random variable. Remember that Chapter 3 often would use g(x) to denote a marginal PDF.

Example: Tire life (chapter4_calc_rev.mws)

Find E[|X-30|]. What does this mean in terms of the problem?

In Maple,

int(abs(x-30)*f,x=0..infinity);          # exact value of E|X-30|

evalf(int(abs(x-30)*f,x=0..infinity));   # decimal approximation (about 22.07)

Find E(X²). What does this mean in terms of the problem?

E(X²) = ∫_0^∞ x² (1/30)e^(-x/30) dx

Use integration by parts twice to obtain:

E(X²) = 2(30²) = 1800

Notice that E(X²) ≠ [E(X)]² = 30² = 900.

In Maple,

int(x^2*f,x=0..infinity);    # E(X^2) = 1800

Corollary: E(aX+b) = aE(X) + b for any constants a and b.

pf: (continuous case shown; the discrete case replaces the integrals with sums)

E(aX+b) = ∫_-∞^∞ (ax+b) f(x) dx = a ∫_-∞^∞ x f(x) dx + b ∫_-∞^∞ f(x) dx = aE(X) + b(1) = aE(X) + b

Example: Tire life

While this example is not necessarily realistic, we can still do it to illustrate the corollary.

Find E[2X+1] = 2E[X] + 1 = 2(30) + 1 = 61.
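As a quick Maple check of the corollary for the tire life PDF (a sketch; f is as defined earlier):

int((2*x+1)*f, x=0..infinity);   # E(2X+1) computed directly; returns 61
2*int(x*f, x=0..infinity) + 1;   # 2*E(X) + 1; also 61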

Extension of Theorem 4.1 (to two random variables): Let X and Y be random variables with joint PDF f(x,y). The mean or expected value of the random variable g(X,Y) is

μ_g(X,Y) = E[g(X,Y)] = ∑_x ∑_y g(x,y) f(x,y)

when X and Y are discrete, and

μ_g(X,Y) = E[g(X,Y)] = ∫_-∞^∞ ∫_-∞^∞ g(x,y) f(x,y) dx dy

when X and Y are continuous.

See the examples in Section 4.1 on your own.

There is one particular case which will be important in the next section - when g(X,Y) = XY. In the continuous case, notice the difference between E(XY) and E(X)E(Y):

E(XY) = ∫_-∞^∞ ∫_-∞^∞ xy f(x,y) dx dy, while E(X)E(Y) = [∫_-∞^∞ x g(x) dx][∫_-∞^∞ y h(y) dy],

where g(x) and h(y) denote the marginal PDFs of X and Y.

When are these two quantities equal?

Notice how E(X) and E(Y) are found: each uses only the corresponding marginal PDF, since E(X) = ∫_-∞^∞ ∫_-∞^∞ x f(x,y) dy dx = ∫_-∞^∞ x g(x) dx, and similarly for E(Y).
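One case where the two quantities agree is a joint PDF that factors into its marginals. As an illustration (this example PDF is mine, not from the notes), take f(x,y) = 4xy for 0 < x < 1 and 0 < y < 1, so g(x) = 2x and h(y) = 2y. Then E(X) = E(Y) = 2/3 and E(XY) = 4/9 = E(X)E(Y). A Maple sketch:

fxy := 4*x*y:                              # a joint PDF that factors into its marginals
int(int(x*y*fxy, x=0..1), y=0..1);         # E(XY) = 4/9
int(int(x*fxy, x=0..1), y=0..1) * int(int(y*fxy, x=0..1), y=0..1);  # E(X)*E(Y) = (2/3)(2/3) = 4/9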

Example: A sample for the tire life PDF (example_sample_tire.xls in Chapter 3)

Below is a sample that comes from a population characterized by the PDF

f(x) = (1/30)e^(-x/30) for x > 0

for the tire life example.

Tire Number   x
1             8.7826
2             2.8102
3             71.202
4             16.55
5             23.581
6             2.1657
7             36.784
8             14.432
9             27.69
10            10.743
...
999           34.008
1000          52.044

Questions:

  • How could you estimate E(X) using this sample, and what would you expect it to be approximately?
  • How could you estimate E(X²) using this sample, and what would you expect it to be approximately? (A simulation sketch is given after this list.)
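Here is a minimal Maple sketch of how such estimates behave (my own code, not part of example_sample_tire.xls; it assumes the Statistics package, in which Exponential(30) has PDF (1/30)e^(-x/30)):

with(Statistics):
X := RandomVariable(Exponential(30)):   # same distribution as the tire life PDF
S := Sample(X, 1000):                   # simulate a sample of 1000 tire lifetimes
Mean(S);                                # sample mean: estimates E(X), should be near 30
Mean(map(t -> t^2, S));                 # mean of the squared values: estimates E(X^2), near 1800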

4.2: Variance and Covariance

In addition to knowing what we would expect X to be on average, we may want to know something about the “variability” of X from μ. For example, in the tire tread wear example, we want to numerically quantify the average deviation of X from the mean. We already did this in terms of E[|X-30|] = E[|X-μ|]. Another way, used much more often, is based on the squared deviation, E[(X-μ)²].

Definition 4.3: Let X be a random variable with PDF f(x) and mean μ. The variance of X is

σ² = E[(X-μ)²] = ∑_x (x-μ)² f(x)

when X is discrete, and

σ² = E[(X-μ)²] = ∫_-∞^∞ (x-μ)² f(x) dx

when X is continuous. The positive square root of the variance, σ, is called the standard deviation of X.

Notes:

  • The variance is the expected squared deviation of X from μ.
  • This is an example of using Theorem 4.1 with g(x) = (x-μ)².
  • E[(X-μ)²] is equivalently written as E{[X-E(X)]²}, so that there is an expectation within an expectation. Once E(X) is found, it is a constant value, as was shown in the last section.
  • Common notation that is often used here includes: Var(X) = σ²_X = σ² = E[(X-μ)²]. The Var( ) part is just a nicer way to write the E[ ] part.
  • The subscript X on σ² is often helpful to use when there is more than one random variable under consideration. Thus, σ²_X denotes the variance of X.
  • The reason for considering the positive square root of σ² is so that we can use the original units of X.
  • σ² ≥ 0 and σ ≥ 0.

Theorem 4.2: The variance of a random variable X is

σ² = Var(X) = E(X²) - μ² = E(X²) - [E(X)]²

pf:

E[(X-μ)²] = E(X² - 2μX + μ²)

= E(X²) - 2μE(X) + E(μ²)

= E(X²) - 2μ² + μ²

= E(X²) - μ²

Example: Let’s Play Plinko! (plinko.xls)

Let X be a random variable denoting the amount won for one chip. Find the variance and standard deviation for X.

Drop Plinko chip in column above:
           $100         $500         $1,000       $0           $10,000
x          f(x)         f(x)         f(x)         f(x)         f(x)
$100       0.1667       0.1339       0.0822       0.0359       0.0176
$500       0.3571       0.3080       0.2204       0.1287       0.0882
$1,000     0.2976       0.2991       0.2862       0.2545       0.2353
$0         0.1429       0.1920       0.2796       0.3713       0.4118
$10,000    0.0357       0.0670       0.1316       0.2096       0.2471
μ          $850.00      $1,136.16    $1,720.39    $2,418.26    $2,751.76
σ          $1,799.31    $2,404.79    $3,246.57    $3,923.92    $4,170.28
μ - 2σ     -$2,748.61   -$3,673.42   -$4,772.75   -$5,429.57   -$5,588.79
μ + 2σ     $4,448.61    $5,945.74    $8,213.54    $10,266.10   $11,092.32
μ - 3σ     -$4,547.92   -$6,078.21   -$8,019.33   -$9,353.49   -$9,759.06
μ + 3σ     $6,247.92    $8,350.54    $11,460.12   $14,190.01   $15,262.59

For example, when dropping the chip above the $100 reservoir (#1 or #9 from p. 3.21 of Section 3.1-3.3),

σ² = E[(X-μ)²] = ∑_{x∈T} (x-μ)² f(x)

= (100-850)²(0.1667) + (500-850)²(0.3571) + (1000-850)²(0.2976) + (0-850)²(0.1429) + (10000-850)²(0.0357)

= 3,237,500 dollars²

where T = {100, 500, 1000, 0, 10000}.

To put this into units of just dollars, we take the positive square root to find σ = $1,799.31.
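A minimal Maple sketch of this variance calculation (the variable names are mine, not from plinko.xls):

x_vals := [100, 500, 1000, 0, 10000]:
f_vals := [0.1667, 0.3571, 0.2976, 0.1429, 0.0357]:
mu := add(x_vals[i]*f_vals[i], i = 1..5):               # about 850
sigma2 := add((x_vals[i]-mu)^2*f_vals[i], i = 1..5);    # about 3,237,500 (plinko.xls uses unrounded probabilities)
sigma := sqrt(sigma2);                                  # about 1799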

What does this value really mean?

“Rule of thumb” for the number of standard deviations within which all possible observations (data) lie from the mean: 2 or 3. This is an ad hoc interpretation of Chebyshev’s Rule (Section 4.4) and the Empirical Rule (not in our book). This rule of thumb is discussed now to help you understand standard deviations.

When someone drops one chip from the top of the Plinko board, I would generally expect the amount of money they will win to be between μ - 2σ and μ + 2σ. A more conservative expected range would be μ - 3σ to μ + 3σ.

Examine the application of the rule of thumb in the table and how it makes sense when the chip is dropped above the $100 spot.

How are these calculations done in Excel? See the Plinko Probabilities worksheet of plinko.xls. Note that the negative values are represented by a “( )” instead of a negative sign.

Example: Time it takes to get to class

What time should you leave your home for class in order to make sure that you are never (or rarely) late?

Example: Tire life (chapter4_calc_rev.mws)

The number of miles an automobile tire lasts before it reaches a critical point in tread wear can be represented by a PDF. Let X = the number of miles (in thousands) an automobile is driven until it reaches the critical tread wear point for one tire. As before, the PDF for X is

f(x) = (1/30)e^(-x/30) for x > 0, and f(x) = 0 otherwise.

Find the variance and standard deviation for the number of tire miles driven until it reaches its critical wear point.

2 = Var(X) = E[(X-)2] = = where = 30. Instead of doing this integral, it is often a little easier to work with 2 = E(X2) - 2. Previously on p. 4.12, we found E(X2) = 1800. Thus, 2 = 1800 – 302 = 900!

In Maple,

int((x-mu)^2*f,x=0..infinity);        # Var(X) directly from the definition

int(x^2*f,x=0..infinity) - mu^2;      # shortcut formula E(X^2) - mu^2; both return 900

On a TI-89: [calculator screenshots omitted]

Notice that σ = √900 = 30. Putting this in terms of thousands of miles, the standard deviation is 30,000 miles, which is the same as the mean here. Generally for random variables, the mean and the standard deviation will not be the same!!! This just happens to be a special PDF called the “Exponential PDF” where this will always occur (see p. 166 of the book). Using the rule of thumb,

μ - 2σ = 30 - 2(30) = -30 and μ + 2σ = 30 + 2(30) = 90

and

μ - 3σ = 30 - 3(30) = -60 and μ + 3σ = 30 + 3(30) = 120

Thus, one would expect all of the tire lifetimes to be between -30 and 90 (thousand miles). A more conservative range would be -60 to 120. Of course, the negative values do not make sense for this problem. However, examine where the upper bound values fall on the PDF plot. Make sure you understand why these upper values make sense in terms of the plot!
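To see numerically why the upper bounds make sense, integrate the PDF up to each bound (a quick Maple sketch, with f as defined earlier; the values follow from 1 - e^(-3) and 1 - e^(-4)):

evalf(int(f, x=0..90));    # P(X <= 90), about 0.950
evalf(int(f, x=0..120));   # P(X <= 120), about 0.982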

Compare σ = 30 to E(|X-μ|) = 22.07 found earlier.

Suppose the PDF was changed to

f(x) = (1/15)e^(-x/15) for x > 0, and f(x) = 0 otherwise.

In this case, one can show that E(X) = 15, Var(X) = σ² = 225, and σ = 15. The PDF is shown below. [plot omitted]

Below is the same plot with the original PDF of f(x) = (1/30)e^(-x/30) for x > 0, where the y- and x-axes are drawn on exactly the same scales. [plot omitted]

Notice the μ = 30 PDF is more spread out than the μ = 15 PDF. This is because the standard deviation (and variance) is larger for the μ = 30 PDF!

Question: Given the choice between a tire with μ = σ = 30 and a tire with μ = σ = 15, which would you choose?

Other common g(X) functions used with E[g(X)]:

  • The skewness of a random variable X is defined to be

E[(X-μ)³] / E[(X-μ)²]^(3/2) = E[(X-μ)³] / σ³

This quantity measures the lack of symmetry in a PDF. For example, the PDF shown on p. 4.6 is very skewed (non-symmetric). The PDF shown on p. 3.30 of the Section 3.1-3.3 notes is symmetric.

Note that E[(X-)2]3/2 E[(X-)3]  E[(X-)]3

  • The kurtosis of a random variable X is defined to be

E[(X-μ)⁴] / E[(X-μ)²]² = E[(X-μ)⁴] / σ⁴

This quantity measures the amount of peakedness or flatness of a PDF. For example, the red PDF below has a higher peak than the blue PDF, which is more flat. [plot omitted] A Maple check of both measures is given after this list.
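For the exponential tire life PDF, both measures can be computed directly from their definitions; the values 2 and 9 below are standard results for any exponential PDF. A Maple sketch:

f := (1/30)*exp(-x/30):
mu := int(x*f, x=0..infinity):                     # 30
sigma2 := int((x-mu)^2*f, x=0..infinity):          # 900
int((x-mu)^3*f, x=0..infinity) / sigma2^(3/2);     # skewness = 2 (strong right skew)
int((x-mu)^4*f, x=0..infinity) / sigma2^2;         # kurtosis = 9 (sharper peak than a normal PDF's 3)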

Theorem 4.3: Let X be a random variable with PDF f(x). The variance of the random variable g(X) is

σ²_g(X) = E{[g(X) - μ_g(X)]²} = ∑_x [g(x) - μ_g(X)]² f(x)

when X is discrete, and

σ²_g(X) = E{[g(X) - μ_g(X)]²} = ∫_-∞^∞ [g(x) - μ_g(X)]² f(x) dx

when X is continuous.

Examine the examples in Section 4.2 on your own.

Covariance and correlation

Suppose there are two random variables, X and Y. It is often of interest to determine if they are independent (remember Chapter 2). If they are dependent, then we would like to quantify the amount and strength of dependence. Also, we would be interested in the type (positive or negative) of dependence.

Positive dependence means that “large” values of X tend to occur with “large” values of Y, and “small” values of X tend to occur with “small” values of Y. If we could plot all values in a population, the dependence would look like an increasing cloud of points: [scatterplot omitted]

Negative dependence means that “large” values of X tend to occur with “small” values of Y, and “small” values of X tend to occur with “large” values of Y. If we could plot all values in a population, the dependence would look like a decreasing cloud of points: [scatterplot omitted]

Thus, positive dependence means as values of X increase, Y tends to increase as well (they move in the same direction). Negative dependence means as values of X increase, Y tends to decrease (they move in an opposite direction).

Example: High school and college GPA

Suppose I had a joint PDF which quantified the possible values that high school and college GPAs can take on. Let X = the student’s high school GPA and Y = the student’s college GPA.

Questions:

  • Would you expect there to be a relationship between X and Y? In other words, are X and Y independent or dependent?
  • If they are dependent, would you expect there to be a strong or weak amount of dependence?
  • If they are dependent, would you expect a positive or negative dependence? What would positive and negative dependence mean in terms of the problem?

The numerical measure of the dependence between two random variables is called the “covariance”. It is denoted symbolically by σ_XY when X and Y denote the two random variables. Below are a few notes about it:

  • σ_XY = 0 when there is independence.
  • σ_XY > 0 when there is positive dependence.
  • σ_XY < 0 when there is negative dependence.
  • The further σ_XY is from 0, the stronger the dependence.

Definition 4.4: Let X and Y be random variables with joint PDF f(x,y). Suppose E(X) = μ_X and E(Y) = μ_Y. The covariance of X and Y is

σ_XY = E[(X-μ_X)(Y-μ_Y)] = ∑_x ∑_y (x-μ_X)(y-μ_Y) f(x,y)

when X and Y are discrete, and

σ_XY = E[(X-μ_X)(Y-μ_Y)] = ∫_-∞^∞ ∫_-∞^∞ (x-μ_X)(y-μ_Y) f(x,y) dx dy

when X and Y are continuous.

Common notation that is often used for the covariance is Cov(X,Y) = σ_XY.

Below is a simplifying formula similar to the one used for the variance.

Theorem 4.4: The covariance of two random variables X and Y with means μ_X and μ_Y, respectively, is given by

σ_XY = E(XY) - μ_X μ_Y = E(XY) - E(X)E(Y)

pf:

E[(X-μ_X)(Y-μ_Y)] = E[XY - μ_X Y - μ_Y X + μ_X μ_Y]

= E[XY] - μ_X E[Y] - μ_Y E[X] + E[μ_X μ_Y]

= E[XY] - μ_X μ_Y - μ_Y μ_X + μ_X μ_Y

= E[XY] - μ_X μ_Y

Note that E(XY) ≠ μ_X μ_Y = E(X)E(Y) except under a particular condition to be discussed in the next section.

There is one problem with the covariance. The measure of strength of dependence (how far it is from 0) is not necessarily bounded above or below. The correlation coefficient, denoted by ρ_XY, fixes this problem. It is the covariance divided by the standard deviations of X and Y, which produces a numerical value that is always between -1 and 1. Below are a few notes about it:

  • -1 ≤ ρ_XY ≤ 1.
  • ρ_XY = 0 when there is independence.
  • ρ_XY > 0 when there is positive dependence.
  • ρ_XY < 0 when there is negative dependence.
  • The closer ρ_XY is to 1, the stronger the positive dependence.
  • The closer ρ_XY is to -1, the stronger the negative dependence.
  • When X = Y, ρ_XY = 1. More generally, when X = a + bY for constants a and b > 0, ρ_XY = 1.
  • When X = -Y, ρ_XY = -1. More generally, when X = a + bY for constants a and b < 0, ρ_XY = -1.

Definition 4.5: Let X and Y be random variables with covariance σ_XY and standard deviations σ_X and σ_Y, respectively. The correlation coefficient for X and Y is

ρ_XY = σ_XY / (σ_X σ_Y)

Sometimes one will see this denoted as Corr(X,Y).

Example: Grades for two courses (chapter4_calc_rev.mws)

Let X be a random variable denoting the grade in a math course and Y be a random variable denoting the grade in a statistics course. Suppose the joint PDF is

f(x,y) = x² + 2y² for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x,y) = 0 otherwise.

Find the covariance and correlation coefficient to numerically measure the dependence between X and Y.

To find the covariance, I am going to use the σ_XY = E(XY) - μ_X μ_Y expression.

Find X:

This could have been found a little easier using results from Section 3.4. In that section, we found g(x) = x2 + 2/3 was the marginal PDF for X. Thus,

Find Y:

Find E(XY):

Since both X and Y are involved in the expectation, the joint PDF must be used.

E(XY) = ∫_0^1 ∫_0^1 xy(x² + 2y²) dy dx = ∫_0^1 (x³/2 + x/2) dx = 1/8 + 1/4 = 3/8

Then σ_XY = E(XY) - μ_X μ_Y = 3/8 - (7/12)(2/3)

= 3/8 - 7/18 = 27/72 - 28/72 = -1/72 = -0.0139

To use integrals.com to find E(XY), one needs to split the problem into two parts. First, find the inner integral. Since I previously decided to integrate with respect to y first, I will do so again. However, integrals.com only integrates with respect to “x”. So suppose ∫_0^1 ∫_0^1 xy(x² + 2y²) dy dx is rewritten as ∫_0^1 ∫_0^1 ax(a² + 2x²) dx da. Thus, “a” represents “x” and “x” represents “y” in the original integral. The inner integral is

∫_0^1 ax(a² + 2x²) dx = a³/2 + a/2

Then

E(XY) = ∫_0^1 (a³/2 + a/2) da

Again, to use integrals.com, I need to integrate with respect to x. Thus, replace a in the above expression with x, enter ∫_0^1 (x³/2 + x/2) dx into the appropriate box on the web page, and obtain 3/8, the same result as before. This idea of splitting up the integral into two parts can also be used with calculators that do not allow for multiple integration.

From a TI-89: [calculator screenshots omitted]

All of these calculations are much easier in Maple!

fxy:=x^2+2*y^2;                                            # the joint PDF

int(int(fxy,x=0..1),y=0..1);                               # check that the PDF integrates to 1

E(XY):=int(int(x*y*fxy,x=0..1),y=0..1);                    # E(XY) = 3/8

E(X):=int(int(x*fxy,x=0..1),y=0..1);                       # mu_X = 7/12

E(Y):=int(int(y*fxy,x=0..1),y=0..1);                       # mu_Y = 2/3

Cov(X,Y):=E(XY)-E(X)*E(Y);                                 # shortcut formula: -1/72

Cov(X,Y):=int(int((x-E(X))*(y-E(Y))*fxy,x=0..1),y=0..1);   # same value from the definition

evalf(Cov(X,Y));                                           # decimal form

evalf(Cov(X,Y),4);                                         # rounded to 4 significant digits

To find the correlation coefficient, I need to find the variances of X and Y in addition to the covariance between X and Y. To do this, I am going to use the shortcut formulas Var(X) = σ²_X = E(X²) - μ²_X and Var(Y) = σ²_Y = E(Y²) - μ²_Y. Since the individual means have already been found, I just need to find E(X²) and E(Y²).

Find E(X²):

E(X²) = ∫_0^1 ∫_0^1 x²(x² + 2y²) dy dx = ∫_0^1 (x⁴ + 2x²/3) dx = 1/5 + 2/9 = 19/45

Find Var(X):

Var(X) = E(X²) - μ²_X = 19/45 - (7/12)² = 0.0819

Find E(Y²):

E(Y²) = ∫_0^1 ∫_0^1 y²(x² + 2y²) dx dy = ∫_0^1 (y²/3 + 2y⁴) dy = 1/9 + 2/5 = 23/45

Find Var(Y):

Var(Y) = E(Y²) - μ²_Y = 23/45 - (2/3)² = 3/45 = 1/15 = 0.0667

Then

ρ_XY = σ_XY / (σ_X σ_Y) = (-1/72) / √(0.0819 × 0.0667) = -0.0139 / 0.0739 = -0.188

From Maple:

Var(X):=int(int((x-E(X))^2*fxy,x=0..1),y=0..1);   # variance of X: 59/720

Var(Y):=int(int((y-E(Y))^2*fxy,x=0..1),y=0..1);   # variance of Y: 1/15

Corr(X,Y):=Cov(X,Y)/sqrt(Var(X)*Var(Y));          # correlation coefficient

evalf(Corr(X,Y),4);                               # about -0.1879

Describe the relationship between math course grade (X) and stat course grade (Y):

  • Are math and stat course grades independent or dependent? Explain.
  • If they are dependent, is there a strong or weak amount of dependence?
  • If they are dependent, is there a positive or negative relationship between math and stat course grades?

On an exam, I may just ask you to describe the relationship between two random variables instead of prompting you with the above questions. In your explanation, you should still address these types of questions!

What we have developed is a way to understand the relationship between two different random variables. Where would this be useful? Suppose you want to study the relationships between:

  • Humidity and temperature
  • ACT and SAT score
  • White and red blood cell counts
  • Winning percentage and the number of yards per game on offense for NFL teams

4.3: Means and Variances of Linear Combinations of Random Variables

This section covers some items we have already discussed as well as some new ones. The main purpose here is for you to get comfortable with finding expectations and variances of functions of random variables.

Theorem 4.5: If a and b are constants, then

E(aX+b) = aE(X) + b

See p. 4.13 for where this was first introduced. Note what happens if a and/or b is equal to 0.

Theorem 4.6: The expected value of the sum or difference of two or more functions of a random variable X is the sum or difference of the expected values of the functions. That is,

E[g(X) ± h(X)] = E[g(X)] ± E[h(X)]

For example, let g(X) = aX² + bX + c and h(X) = dX + f for some constants a, b, c, d, and f. Then

E[g(X) - h(X)] = E(aX² + bX + c - dX - f)

= aE(X²) + bE(X) + E(c) - dE(X) - E(f)

= aE(X²) + bE(X) + c - dE(X) - f
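As a closing Maple check of Theorem 4.6 with the tire life PDF (a sketch; the choice g(X) = X² and h(X) = X is mine):

f := (1/30)*exp(-x/30):
int((x^2 - x)*f, x=0..infinity);                        # E(X^2 - X) computed directly: 1770
int(x^2*f, x=0..infinity) - int(x*f, x=0..infinity);    # E(X^2) - E(X) = 1800 - 30 = 1770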