Boston Housing Data

·  Regression Tree

·  CART: Classification and Regression Trees

·  Target is CONTINUOUS

·  Split based on F statistic P-value

House value (target), N = n1 + n2 obs., split on NOx:

NOx Low:   Y1, Y2, Y3, ..., Yn1              SSE(1)

NOx High:  Yn1+1, Yn1+2, Yn1+3, ..., YN      SSE(2)

Recall:

SSE for { 1 3 2 6 } is (1-3)^2 + (3-3)^2 + (2-3)^2 + (6-3)^2 = 14, where 3 is the average.

Alternatively, 1 + 9 + 4 + 36 - (12)^2/4 = 50 - 36 = 14, where 12 is the sum and 4 is the number of observations.
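The two ways of getting SSE can be checked outside SAS. Below is a minimal Python sketch; the function names are made up for illustration and it uses only the four numbers from the example above.

```python
def sse_definition(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def sse_shortcut(values):
    """Sum of squares minus (sum)^2 / n, the computational form."""
    total = sum(values)
    return sum(y * y for y in values) - total ** 2 / len(values)


data = [1, 3, 2, 6]
print(sse_definition(data))  # 14.0
print(sse_shortcut(data))    # 14.0
```

Both forms give the same answer; the shortcut only needs the running sum and sum of squares.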

[SS(total) - SSE(1) - SSE(2)] / 1 df = F numerator

MSE = [SSE(1) + SSE(2)] / (N-2) df = F denominator

p-value = Pr > F.

(# possible splits)(p-value) = Kass adjusted p-value

-Log10[(# possible splits)(p-value)] = logworth of split
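To see the arithmetic in one place, here is a hedged Python sketch of the F statistic and the logworth. The function names and the toy house values are invented for illustration (they are not Boston data), and the p-value is taken as an input because getting Pr > F needs an F distribution routine from a statistics package.

```python
import math


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def split_f_statistic(low, high):
    """F statistic (1 and N-2 df) for a two-group split of the target."""
    n = len(low) + len(high)
    ss_total = sse(low + high)            # SS(total), ignoring the split
    sse_within = sse(low) + sse(high)     # SSE(1) + SSE(2)
    numerator = (ss_total - sse_within) / 1
    mse = sse_within / (n - 2)
    return numerator / mse


def logworth(p_value, n_possible_splits):
    """-log10 of the Kass adjusted p-value."""
    return -math.log10(n_possible_splits * p_value)


# Toy numbers, invented for illustration only.
low_nox = [30.0, 34.0, 32.0]
high_nox = [20.0, 18.0, 22.0]
print(split_f_statistic(low_nox, high_nox))  # 54.0
print(logworth(0.0001, 10))                  # about 3
```

With these toy values SS(total) = 232 and SSE(1) + SSE(2) = 16, so F = 216 / (16/4) = 54.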

Keep on splitting as usual.
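One splitting step can be sketched as a scan over candidate cut points on a single input. This is only an illustration under simple assumptions, not EM's actual search: it picks the cut with the largest F, which for one input gives the smallest p-value because every candidate has the same 1 and N-2 df and the same number of possible splits. The function name and toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Scan cut points on x; return (cut, F, number of candidate cuts)."""
    pairs = sorted(zip(x, y))
    targets = [t for _, t in pairs]
    n = len(pairs)
    ss_total = sse(targets)
    best_cut, best_f, n_cuts = None, -1.0, 0
    for i in range(1, n):
        lo_x, hi_x = pairs[i - 1][0], pairs[i][0]
        if lo_x == hi_x:
            continue  # tied input values cannot be separated
        n_cuts += 1
        within = sse(targets[:i]) + sse(targets[i:])
        if within == 0:
            # Perfect split (or a constant target, where no split helps).
            f_stat = float("inf") if ss_total > 0 else 0.0
        else:
            f_stat = (ss_total - within) / (within / (n - 2))
        if f_stat > best_f:
            best_cut, best_f = (lo_x + hi_x) / 2, f_stat
    return best_cut, best_f, n_cuts


# Toy data, invented for illustration only.
nox = [0.4, 0.5, 0.6, 0.7]
medv = [30.0, 32.0, 20.0, 22.0]
print(best_split(nox, medv))
```

Here the best cut falls midway between 0.5 and 0.6, out of 3 candidate cuts. Splitting then continues by repeating the same scan inside each of the two resulting groups.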

(1) Add the BOSTON data source from our AAEM library.

(2) Use median house value as the target, NOx (environment) and RM (avg. # rooms in houses) as inputs. Reject everything else. Explore (at least) the variables RM and NOx to get their range. What happens if you click on a histogram bar?

(3) Create a new diagram (optionally, split the data into training and validation sets).

(4) Drag in a tree node, connect, run, and view results.

(A) Click “Exported data” in the properties panel.

(B) Click “train” to view training data results

(C) Actions->Plot->3-D plot (color=NODE, X=RM, Y=NOX, Z=MEDV)

(5) (optional) Make a grid: put this in a code node and run it. Where did that funky name &em_export_score come from?

(Code Editor -> Macro Variables (subtab at top) -> Exports -> EM_EXPORT_SCORE)

data &em_export_score;
   /* one row for each (NOx, RM) pair on the grid */
   do nox = 0.35 to 0.9 by 0.025;
      do rm = 3.5 to 9 by 0.25;
         output;
      end;
   end;
run;

proc print; run;
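To get a feel for what the two nested DO loops produce, here is a small Python sketch of the same grid. It is an illustration outside SAS, the helper name is hypothetical, and it assumes both endpoints of each range are included. The values are built from an integer index so that repeated addition of 0.025 cannot drift past the upper bound.

```python
def grid_values(start, stop, step, digits=3):
    """Values start, start+step, ..., stop, built from an integer index."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, digits) for i in range(count)]


nox_values = grid_values(0.35, 0.9, 0.025)
rm_values = grid_values(3.5, 9.0, 0.25)

# Same nesting as the DO loops: NOx outer, RM inner, one row per pair.
grid = [(nox, rm) for nox in nox_values for rm in rm_values]

print(len(nox_values), len(rm_values), len(grid))  # 23 23 529
print(grid[0], grid[-1])                           # (0.35, 3.5) (0.9, 9.0)
```

So the grid has 23 NOx values by 23 RM values, 529 points to score in all.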

(6) From the ASSESS subtab, drag in a score node and connect the tree and code nodes to it. Update and run. From the properties menu, select Exported data… then select the SCORE data set and click on Explore at the bottom. Use the graphing icon to make a 3-D plot of P_MEDV (Y) versus RM and NOx. Use EM_PREDICTION as a color variable. What kind of predictions do you see?

Boston Housing II

Here I describe how to export the scoring code and use it within SAS (not EM) to score another data set that has the inputs and most likely does not have the target variable. This means that anyone with SAS can score a data set with your code. Note that the code is created within EM, so a person without EM cannot create a tree; they can only score a data set using your tree.

(1) Click on the tree node. Select results.

(2) From the top menu bar select view-> scoring -> SAS code. The created code opens in a window.

(3) Activate the window containing the code (click on its top banner). From the menu select Edit->Select All, then Edit->Copy.

(4) Get into SAS. You could be in VCL or you could launch SAS from your desktop. Go to the program editor and paste the copied code into it.

(5) Before the included code, type this:

data score;
   do rm = 3.5 to 9 by 0.25;
      do NOx = 0.35 to 0.90 by 0.025;

(6) After the included code, type this:

      output;
   end;
end;
run;

proc print data=score; run;

proc sort data=score; by _NODE_; run;

proc means data=score;
   var P_MEDV RM NOx;
   by _NODE_;
run;

(7) Make a 3D plot using this code in SAS:

PROC G3D DATA=score;
   PLOT RM*NOX=P_MEDV;
RUN;

For different views, try the PLOT statement options ROTATE= and TILT=, which go after a slash: PLOT RM*NOX=P_MEDV / ROTATE=15 TILT=30;