CSE 674 Introduction to Data Mining: Sample Midterm

Instructions:

Answer all questions (three sections).

You have 80 minutes for 75 points (roughly a minute a point).

State and underline any assumptions you make.

Write down final answers in the space provided.

Use back of sheets for rough work.

Good Luck!

Section I: Short Answer Questions (4 × 3 = 12 points, 12 minutes)

Note: in this section you either know how to answer a question or you do not. Do not waste time guessing, as there will be no points for guesses.

I.A. For a particular decision tree classification problem, Amy suggests using Gini as the measure of node impurity. John counters with entropy and argues that it is more efficient to compute. Do you agree with John or Amy? Defend your answer.
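For reference when comparing the two measures, the sketch below computes both for a node's class proportions. The function names and the example proportions are illustrative assumptions, not part of the course material. Entropy is taken in base 2, and a class with proportion 0 is treated as contributing nothing.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits; classes with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Class proportions at a hypothetical node (assumed values for illustration).
node = [0.5, 0.5]
print(gini(node), entropy(node))
```

Gini uses only multiplications and additions, whereas entropy evaluates one logarithm per class.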

I.B. A particular problem (say, clustering) requires one to repeatedly compute and use distances between points in the dataset. The number of data points is 100,000 and the dimensionality of the data is 5. John suggests using the “data matrix” representation, while Amy counters with the “distance matrix” representation. Who would you agree with and why? Defend your answer.

I.C. A new problem (same scenario as I.B., same suggestions by both), but here the number of data points is 1,000 and the dimensionality of the dataset is 8,000. Who would you agree with and why?
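One way to frame I.B. and I.C. is to count how many values each representation stores. The sketch below does only that; the function names are hypothetical, and storage is just one consideration (the cost of recomputing a distance, which grows with dimensionality, also matters).

```python
def data_matrix_entries(n, d):
    """An n-by-d data matrix stores one value per point per attribute."""
    return n * d

def distance_matrix_entries(n):
    """A symmetric distance matrix needs only the n*(n-1)/2 distinct pairs."""
    return n * (n - 1) // 2

# (label, number of points, dimensionality) taken from I.B. and I.C.
for label, n, d in [("I.B", 100_000, 5), ("I.C", 1_000, 8_000)]:
    print(label, data_matrix_entries(n, d), distance_matrix_entries(n))
```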

I.D. John eyeballs the above dataset and argues that if the small dot points were to be rotated by 90 degrees (i.e. the dominant eigenvector rotated by 90 degrees), the Mahalanobis distance between the two oval points (see arrows) would be higher. Amy disagrees. Who is right? Provide an intuitive explanation.

Section II: Classification (20+20 = 40 points, 35 minutes)

II.A. You are on an island (note this is a very different island than the one your past colleagues have been on). To survive you must eat mushrooms indigenous to the island. You landed as a party of eight; five of your party were adventurous and tried a mushroom. Three of them are very ill. You have the following data to consider. 1 represents yes and 0 represents no in this table. You have a resident probabilist and he suggested applying a Naïve Bayes classification algorithm to predict the toxicity of the remaining 3 mushrooms. Please apply this algorithm (showing all steps) and underline your final predictions.

Example  IsHeavy  IsSmelly  IsSpotted  IsPoisonous
A        0        0         0          0
B        0        1         1          1
C        1        1         0          0
D        1        0         0          1
E        1        1         1          1
F        0        0         1          ?
G        0        1         0          ?
H        1        0         1          ?
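The Naïve Bayes steps for the mushroom table can be sketched as follows. This is a minimal illustration, assuming plain frequency estimates with no smoothing (so a value never seen with a class gets probability zero). The names `train`, `unknown`, `score` and `predict` are hypothetical. Exact fractions are used so the intermediate steps match hand calculation.

```python
from fractions import Fraction

# Training rows from the table: (IsHeavy, IsSmelly, IsSpotted) -> IsPoisonous
train = {
    "A": ((0, 0, 0), 0),
    "B": ((0, 1, 1), 1),
    "C": ((1, 1, 0), 0),
    "D": ((1, 0, 0), 1),
    "E": ((1, 1, 1), 1),
}
unknown = {"F": (0, 0, 1), "G": (0, 1, 0), "H": (1, 0, 1)}

def score(features, label):
    """Unnormalized posterior: P(label) times each P(feature_i = value | label)."""
    rows = [x for x, y in train.values() if y == label]
    result = Fraction(len(rows), len(train))  # class prior
    for i, value in enumerate(features):
        matches = sum(1 for x in rows if x[i] == value)
        result *= Fraction(matches, len(rows))  # no smoothing (assumption)
    return result

def predict(features):
    """Return the label (0 or 1) with the larger unnormalized posterior."""
    return max((0, 1), key=lambda label: score(features, label))

for name, features in unknown.items():
    print(name, score(features, 0), score(features, 1), predict(features))
```

With Laplace smoothing the zero probabilities would disappear and the scores would change, so state which variant you use.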

II.B. You are given the following information about a classifier and its performance on a particular dataset. The classifier is a black box: you know nothing about how it works. There are two class labels, IN and OUT. The classifier has three settings (think of these as three different models), and your goal is to sort them in descending order of model performance (highest to lowest). The highest is the one you would pick as the best performing. Please include all steps for full credit.

                            Setting A  Setting B  Setting C
# correctly labeled IN      100        90         95
# correctly labeled OUT     40         50         45
# incorrectly labeled IN    20         30         35
# incorrectly labeled OUT   40         30         25

II.B.1. Compute the accuracy of each setting and sort (e.g. Setting A>B>C).

II.B.2. Compute the precision of each setting and sort.

II.B.3. Compute the recall of each setting and sort.

II.B.4. Compute the F-measure of each setting and sort.
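The four measures in II.B.1 to II.B.4 can be sketched as below. This assumes IN is the positive class and reads the table as: correctly labeled IN = TP, correctly labeled OUT = TN, incorrectly labeled IN = FP (labeled IN but actually OUT), incorrectly labeled OUT = FN. That reading is an interpretation of the table wording; state it as an assumption if you use it. The F-measure here is the balanced F1 score, and the function name `metrics` is hypothetical.

```python
def metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F-measure for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Counts per setting, read from the table under the assumption stated above.
settings = {
    "A": dict(tp=100, tn=40, fp=20, fn=40),
    "B": dict(tp=90, tn=50, fp=30, fn=30),
    "C": dict(tp=95, tn=45, fp=35, fn=25),
}
for name, counts in settings.items():
    print(name, [round(v, 3) for v in metrics(**counts)])
```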

II.B.5. Assume the following cost matrix:

TP  -1    FN  10
FP   5    TN   0

Assume TP corresponds to the situation where both the model's prediction and the actual value are IN. Compute the cost of each model and sort (remember that you want to place the model you would pick first).
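The cost computation can be sketched as a weighted sum of the four outcome counts. As an assumption, the counts below read the II.B table as: correctly labeled IN = TP, correctly labeled OUT = TN, incorrectly labeled IN = FP, incorrectly labeled OUT = FN. The names `COST`, `settings` and `total_cost` are hypothetical. Lower total cost is better, so the sort is ascending.

```python
# Cost matrix from the question: cost per TP, FN, FP and TN outcome.
COST = {"tp": -1, "fn": 10, "fp": 5, "tn": 0}

# Counts per setting, read from the II.B table under the assumption stated above.
settings = {
    "A": dict(tp=100, tn=40, fp=20, fn=40),
    "B": dict(tp=90, tn=50, fp=30, fn=30),
    "C": dict(tp=95, tn=45, fp=35, fn=25),
}

def total_cost(counts):
    """Sum each outcome count multiplied by its cost."""
    return sum(COST[outcome] * n for outcome, n in counts.items())

# Cheapest model first.
ranked = sorted(settings, key=lambda name: total_cost(settings[name]))
for name in ranked:
    print(name, total_cost(settings[name]))
```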

Section III: Clustering (20+3 = 23 points, 25 minutes)

This question relates to the BUPA Liver Disorders dataset:

Relevant information:

-- The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the data file constitutes the record of a single male individual.

Number of instances: 6

Number of attributes: 7 overall

Attribute information:

1. mcv mean corpuscular volume

2. alkphos alkaline phosphatase

3. sgpt alanine aminotransferase

4. sgot aspartate aminotransferase

5. gammagt gamma-glutamyl transpeptidase

6. drinks number of half-pint equivalents of alcoholic beverages drunk per day

7. selector field used to split data into two sets (those with disorders (1) versus those without disorders (2))

Missing values: none

SAMPLE FROM LIVER DATASET

Patient  mcv  alkphos  sgpt  sgot  gammagt  drinks  selector
PT1      85   92       45    27    31       0.0     1
PT2      85   54       47    33    22       0.5     2
PT3      96   67       26    26    36       0.5     2
PT4      98   99       57    45    65       20.0    1
PT5      91   57       33    23    12       8.0     1
PT6      91   63       25    26    15       6.0     1

III.A. Cluster the above data using min-link (single-link) hierarchical clustering. Exclude the selector attribute and use all other attributes. Use the Manhattan distance metric (do not normalize the data!). Draw the dendrogram of the resulting hierarchical clustering.
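A minimal sketch of the procedure in III.A., written in plain Python so each step is visible. The function names `manhattan` and `single_link` are hypothetical, and the naive search over all cluster pairs is chosen for readability, not efficiency. The merge list it returns gives the order and heights needed to draw the dendrogram by hand.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def single_link(points):
    """Agglomerate clusters by smallest inter-cluster distance (min-link).

    Returns a list of (cluster_a, cluster_b, distance) merges in order.
    """
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Min-link: distance between the closest pair of members.
                d = min(manhattan(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Sample rows without the selector attribute:
# (mcv, alkphos, sgpt, sgot, gammagt, drinks)
patients = {
    "PT1": (85, 92, 45, 27, 31, 0.0),
    "PT2": (85, 54, 47, 33, 22, 0.5),
    "PT3": (96, 67, 26, 26, 36, 0.5),
    "PT4": (98, 99, 57, 45, 65, 20.0),
    "PT5": (91, 57, 33, 23, 12, 8.0),
    "PT6": (91, 63, 25, 26, 15, 6.0),
}
for a, b, d in single_link(patients):
    print(a, b, d)
```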

III.B. Comment on the quality of the clustering with respect to the selector variable if (i) you had three clusters and (ii) you had two clusters. What measure did you use?