Homework – Decision trees.
(1) There should be a data set called Breast_cancer in your AAEM library.
If not, run the program Breast_Cancer.sas from our demos, being sure to insert your library name in the place indicated so that Enterprise Miner will find the data. Put it in the same library you use for demos (AAEM, for example). Alternatively, you can copy the data from the DATA link on our class web page if you prefer. This is real data from a study in Wisconsin that attempted to distinguish benign from malignant samples using biopsy data. The labels in the data set tell what the variables are. The idea is to use descriptive information on the cells to predict whether a sample is malignant (target=1) or not (target=0). No comments are needed for part (1) or part (2).
(2) Run a decision tree on the resulting data. In the data, Target is a binary variable with 0 for benign and 1 for malignant samples. Be sure to declare Target to be binary – otherwise EM will treat it as an interval variable even though it has just 2 values. (EM cannot tell from the values alone: if you ask people how often they have been ticketed for parking at a fire hydrant, you might well get only 0 and 1 responses even though the answer could be any nonnegative integer.) Use 75% of the data for training and the other 25% for pruning. Build the decision tree using the default settings except for the splitting.
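For anyone who wants to see the two ideas in part (2) outside Enterprise Miner, here is a minimal Python sketch. It is not part of the assignment and does not claim to reproduce what EM does internally. The toy rows are invented for illustration, and the function names (partition, summarize_as_binary, summarize_as_interval) are hypothetical. It shows a 75%/25% random partition and how the same 0/1 values are summarized differently when treated as class labels versus as numbers.

```python
import random


def partition(rows, train_fraction=0.75, seed=12345):
    """Shuffle a copy of rows and split it into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def summarize_as_binary(targets):
    """Treat 0/1 targets as class labels: majority class and error rate."""
    share_of_ones = sum(targets) / len(targets)
    predicted_class = 1 if share_of_ones >= 0.5 else 0
    errors = sum(1 for t in targets if t != predicted_class)
    return predicted_class, errors / len(targets)


def summarize_as_interval(targets):
    """Treat 0/1 targets as numbers: mean and average squared error."""
    mean = sum(targets) / len(targets)
    ase = sum((t - mean) ** 2 for t in targets) / len(targets)
    return mean, ase


# Toy data, invented for illustration only: (cell_score, target)
toy_rows = [(1, 0), (2, 0), (1, 0), (3, 0), (2, 0), (4, 1), (8, 1), (9, 1),
            (7, 1), (3, 0), (10, 1), (6, 1), (1, 0), (5, 1), (2, 0), (8, 1)]

train, validation = partition(toy_rows)
print(len(train), len(validation))  # 12 training rows, 4 held-out rows

# The same four 0/1 values, summarized two ways.
node_targets = [0, 1, 1, 1]
print(summarize_as_binary(node_targets))    # a class and a misclassification rate
print(summarize_as_interval(node_targets))  # a mean and an average squared error
```

The binary summary reports a predicted class and how often it is wrong. The interval summary reports a numeric average and a squared-error measure. This difference is why the measurement level of Target needs to be set deliberately.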
(3) This is all you hand in: Report what you did and what you found in a readable report of 4 pages or fewer. Include whatever tables and graphs you think are helpful, but don’t overdo it; most readers will not want to see every piece of output EM produces. Think of it as an executive-level report to a grant review team reviewing your work. Be sure to include a summary that gives your evaluation of whether this decision tree model is a good predictor.
Three short-answer questions to address as part of your report:
(a) What would you do differently if you were computing estimates of Pr{malignant} rather than making a decision (benign versus malignant)? (You do not need to run it or show what the results would have been.)
(b) Show the variable importance table (view->model->Variable Importance) and comment on why someone would be interested in it.