Data Mining and Knowledge Discovery (KSE525)

Assignment #3 (April 25, 2018, due: May 9)

1. [10 points] Consider the data set shown below.

Record / A / B / C / Class
1 / 0 / 0 / 0 / +
2 / 0 / 0 / 1 / -
3 / 0 / 1 / 1 / -
4 / 0 / 1 / 1 / -
5 / 0 / 0 / 1 / +
6 / 1 / 0 / 1 / +
7 / 1 / 0 / 1 / -
8 / 1 / 0 / 1 / -
9 / 1 / 1 / 1 / +
10 / 1 / 0 / 1 / +

1) Estimate the conditional probabilities P(A|+), P(B|+), P(C|+), P(A|-), P(B|-), and P(C|-).

2) Use the estimates of the conditional probabilities given in the previous question to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naïve Bayes approach.

3) Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4; the relevant formulas are recalled after 5).

4) Repeat 2) using the conditional probabilities given in 3).

5) Compare the two methods for estimating probabilities. Which method is better and why?
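
For reference in 1) through 4), these are the standard formulas (not specific to this assignment): the naïve Bayes decision rule for this two-class problem, and the m-estimate of a conditional probability, where n is the number of training records of class y and n_c is the number of those records having the attribute value x.

    \hat{y} = \arg\max_{y \in \{+,\,-\}} P(y)\, P(A \mid y)\, P(B \mid y)\, P(C \mid y)

    P(x \mid y) = \frac{n_c + m\,p}{n + m}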

2. [6 points] Suppose that you have the following data set. How does the k-nearest neighbor algorithm classify instances 7 and 8, with k=1 and k=3 respectively? Use simple majority voting and the Manhattan distance; a short worked distance example follows the table.

Instance / x1 / x2 / Class
1 / 0.25 / 0.25 / +
2 / 0.25 / 0.75 / +
3 / 0.50 / 0.25 / -
4 / 0.50 / 0.75 / -
5 / 0.75 / 0.50 / -
6 / 0.75 / 1.00 / +
7 / 0.25 / 0.55 / ?
8 / 0.75 / 0.80 / ?
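
As a quick reminder (not a full solution), the Manhattan distance between two instances is the sum of the absolute coordinate differences; in R it can be computed by hand or with the built-in dist() function, for example:

    # Manhattan (L1) distance between instance 1 and instance 7 from the table
    p <- c(0.25, 0.25)   # instance 1: (x1, x2)
    q <- c(0.25, 0.55)   # instance 7: (x1, x2)
    sum(abs(p - q))      # |0.25 - 0.25| + |0.25 - 0.55| = 0.30

    # The same distance via the built-in dist() function
    dist(rbind(p, q), method = "manhattan")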

3. [4 points] A well-known problem with the information gain is that it favors attributes with a large number of distinct values. Explain why the information gain suffers from this problem and why the gain ratio does not; the standard definitions are recalled below.
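
For reference (standard definitions, with S the training set, A an attribute, and S_v the subset of records in S for which A takes the value v):

    \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)

    \mathrm{SplitInfo}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}

    \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}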

Submit a single R Notebook (student-id.nb.html) that includes the answers to both Q4 and Q5.

4. [20 points] Install R and then the two packages party and randomForest. Answer the following questions using R; a starter code sketch is provided after sub-question 4).

1) Build a decision tree for the "GlaucomaM" data set and plot it. Use the whole data set for training. The data set is included in the "TH.data" package. [Hint: use the "ctree" function in the party package.]

2) Show the confusion matrix for the predictions of the decision tree. Note that there are two class labels: glaucoma and normal.

3) Run the random forest algorithm with ntree=100 on the same data set and plot the error rates as the number of trees grows. [Hint: use the "plot" function for showing the error rates.]

4) Find the most important variable for classification in this data set, according to the random forest you built. [Hint: use the "varImpPlot" function.]
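
The following sketch covers 1) through 4); it assumes the packages install from CRAN with their defaults and uses only the functions named in the hints above, plus predict() and table() for the confusion matrix. Adapt it as needed for your notebook.

    # install.packages(c("party", "randomForest", "TH.data"))  # run once if needed
    library(party)
    library(randomForest)
    data("GlaucomaM", package = "TH.data")

    # 1) Decision tree trained on the whole data set, then plotted
    ct <- ctree(Class ~ ., data = GlaucomaM)
    plot(ct)

    # 2) Confusion matrix: predicted labels vs. true labels (glaucoma / normal)
    table(predicted = predict(ct), actual = GlaucomaM$Class)

    # 3) Random forest with 100 trees; plotting the fitted object shows the
    #    error rates as the number of trees grows
    rf <- randomForest(Class ~ ., data = GlaucomaM, ntree = 100)
    plot(rf)

    # 4) Variable importance; the variable at the top of the plot is the most important
    varImpPlot(rf)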

5. [10 points] Install R and then the package e1071. Answer the following questions using R; a starter code sketch is provided after sub-question 2).

1) Build an SVM model for the "iris" data set. Use the whole data set for training.

2) Plot the decision boundary over "Petal.Width" and "Petal.Length". Set "Sepal.Width" to 3 and "Sepal.Length" to 4 for the plot. [Hint: use the "plot" function and provide a formula and a slice, because the data set has four attributes rather than two.]
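
A minimal sketch for 1) and 2), using the svm() defaults from e1071; the slice values follow the question above.

    # install.packages("e1071")  # run once if needed
    library(e1071)

    # 1) SVM trained on the whole iris data set
    model <- svm(Species ~ ., data = iris)

    # 2) Decision regions over Petal.Width and Petal.Length, with the remaining
    #    two attributes fixed through the slice argument
    plot(model, iris, Petal.Width ~ Petal.Length,
         slice = list(Sepal.Width = 3, Sepal.Length = 4))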

Manuals for R packages:

party: https://CRAN.R-project.org/package=party

randomForest: https://CRAN.R-project.org/package=randomForest

e1071: https://CRAN.R-project.org/package=e1071