COSC 6342 “Machine Learning” Homework 1 Spring 2014

Dr. Eick

Group Homework

Due date: March 1, 2014 at 11pm

Last updated: February 11, 2014 at 10am

1. Reinforcement Learning

a) What are the main differences between supervised learning and reinforcement learning? [4]

b) Give the Bellman equation for all states of the RST world[1]! [3] (A generic form is sketched in the hint below.)

c) Assume you use temporal difference learning in conjunction with a random policy that chooses actions uniformly at random. Do you believe that the estimates obtained are a good measure of the “goodness” of states, telling an intelligent agent (assume the agent is smart!!) which states it should or should not visit? (A TD(0) sketch is given in the hint below.)
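
Hint for b): the reward and transition structure of the RST world is on the linked project page. As a generic sketch (not the RST-specific instantiation), the Bellman optimality equation for a state s with discount factor γ is V*(s) = max_a [R(s,a) + γ·Σ_s' P(s'|s,a)·V*(s')]; instantiate it once per state.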
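
Hint for c): the sketch below shows tabular TD(0) value estimation under a uniform-random policy. The chain-shaped toy world, its rewards, and the parameters are placeholder assumptions, not the actual RST world.

import random

# Placeholder chain world (NOT the RST world): states 0..4, reward 1.0 for
# entering state 4. Substitute the actual RST dynamics from the project page.
states = range(5)
V = {s: 0.0 for s in states}          # state-value estimates
alpha, gamma = 0.1, 0.9               # learning rate, discount factor

def random_step(s):
    # Uniform-random policy: move to the left or right neighbor with equal probability.
    s_next = random.choice([max(s - 1, 0), min(s + 1, 4)])
    return s_next, (1.0 if s_next == 4 else 0.0)

s = 0
for _ in range(10000):
    s_next, r = random_step(s)
    # TD(0) update: move V(s) toward the bootstrapped target r + gamma*V(s')
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = s_next

print(V)                              # value estimates under the random policy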

2. Show that the VC dimension of the hypothesis class of circles in 2 dimensions is 3!
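
Hint: the VC dimension of a hypothesis class H is the size of the largest point set H can shatter, i.e. for which every one of the 2^n possible labelings is realized by some hypothesis in H. So exhibit 3 points in the plane that circles shatter, and argue that no set of 4 points can be shattered.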
3. One major challenge when learning prediction and classification models is to avoid overfitting. What is overfitting? What factors contribute to overfitting? What is the generalization error? What is the challenge in determining the generalization error? Briefly describe one approach to determine the generalization error.
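
Hint: one common approach estimates the generalization error on held-out data. The sketch below (the synthetic data and polynomial models are illustrative assumptions, not course material) shows training error shrinking as model complexity grows while held-out error eventually rises, the signature of overfitting.

import numpy as np

# Synthetic data (illustrative assumption): a noisy sine curve.
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.2, 30)
x_tr, y_tr = x[:20], y[:20]                       # training split
x_te, y_te = x[20:], y[20:]                       # held-out split

for degree in (1, 3, 9, 12):
    p = np.polyfit(x_tr, y_tr, degree)            # fit on training data only
    tr = np.mean((np.polyval(p, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(p, x_te) - y_te) ** 2)
    print("degree %2d: train MSE %.3f, held-out MSE %.3f" % (degree, tr, te))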
4. A lot of decision making systems use Bayes’ theorem relying on conditional independence assumptions. What are those assumptions exactly? Why are they made? What is the problem with making those assumptions? 3-6 sentences!
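
Hint: for attributes x1,…,xd the assumption reads P(x1,…,xd|C) = P(x1|C)·…·P(xd|C), i.e. the attributes are independent of one another once the class C is known; consider how many probabilities have to be estimated with and without it.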
5. Assume we have a problem in which you have to choose between 3 decisions D1, D2, D3. The loss function is: λ11=0, λ22=0, λ33=0, λ12=1, λ13=10, λ21=10, λ31=2, λ23=10, λ32=2. Write the optimal decision rule! (λik is the cost of choosing Ci when the correct answer is Ck.) If you visualize the decision rule nicely you get 25% extra credit.
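
Hint: under Bayes decision theory the expected risk of decision Di given x is R(Di|x) = Σk λik·P(Ck|x), and the optimal rule picks the decision of minimum risk; your rule should state, for each region of posterior values, which Di wins.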
6. What does bias measure; what does variance measure? Assume we have a model with a high bias and a low variance; what does this mean? 3-4 sentences!
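
Hint: recall the decomposition E[(d−θ)²] = (E[d]−θ)² + E[(d−E[d])²] for an estimator d of θ, i.e. squared bias plus variance.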
7. Assume we have a single-attribute classification problem involving two classes C1 and C2 with the following priors: P(C1)=0.8 and P(C2)=0.2. Give the decision rule[2] assuming p(x|C1)~N(0,10) and p(x|C2)~N(1,1). Decision rule!
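
Hint: choose C1 exactly when P(C1)·p(x|C1) > P(C2)·p(x|C2), with each p(x|Ci) a univariate normal density; taking logarithms on both sides turns this into a quadratic inequality in x whose solutions yield the thresholds for the if-then-else rule of footnote [2].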
8. What are the main differences between the MAP and the Maximum Likelihood approach to parametric density estimation? 2-4 sentences!
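
Hint: the Maximum Likelihood estimate is θ_ML = argmax_θ p(X|θ), whereas the MAP estimate is θ_MAP = argmax_θ p(X|θ)·p(θ); focus on the role of the prior p(θ).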
9. Assume we have a dataset with 3 attributes and the following covariance matrix Σ:

1  0  0
0  4 -3
0 -3  9

a) What are the correlations between the three attributes?

b) Assume we construct a 3-dimensional normal distribution for this dataset by using equation 5.9, assuming that the mean is μ=(0,0,0). Compute the probability[3] of the three vectors (1,1,0), (1,0,1) and (0,1,1)!

c) Compute the Mahalanobis distance between the vectors (1,1,0), (1,0,1) and (0,1,1). Also compute the Mahalanobis distance between (1,1,-1) and the three vectors (1,0,0), (0,1,0), (0,0,-1). How do these results differ from using Euclidean distance? Try to explain why particular pairs of vectors are closer to or further away from each other when using Mahalanobis distance. What advantages do you see in using Mahalanobis distance over Euclidean distance? (See the hints below.)
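
Hint for a)-c): the correlation between attributes i and j is ρij = σij/(σi·σj); equation 5.9 is the d-dimensional normal density p(x) = (2π)^(−d/2)·|Σ|^(−1/2)·exp(−½·(x−μ)ᵀ·Σ⁻¹·(x−μ)); and the Mahalanobis distance between vectors u and v is D(u,v) = sqrt((u−v)ᵀ·Σ⁻¹·(u−v)).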
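
If you want to double-check your hand computations for b) and c), the sketch below does so numerically (numpy/scipy are an assumption; the assignment does not require them).

import numpy as np
from scipy.stats import multivariate_normal
from scipy.spatial.distance import mahalanobis

Sigma = np.array([[1., 0., 0.],
                  [0., 4., -3.],
                  [0., -3., 9.]])       # covariance matrix of Problem 9
Sigma_inv = np.linalg.inv(Sigma)
mvn = multivariate_normal(mean=np.zeros(3), cov=Sigma)

for v in [(1, 1, 0), (1, 0, 1), (0, 1, 1)]:
    print(v, mvn.pdf(v))                # density values for part b)

u, v = np.array([1, 1, 0]), np.array([1, 0, 1])
print(mahalanobis(u, v, Sigma_inv))     # Mahalanobis distance for part c)
print(np.linalg.norm(u - v))            # Euclidean distance, for comparison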

10. PCA: compute the three principal components for the covariance matrix of Problem 9 (use any procedure you like). What percentage of the variance do the first 2 principal components capture? Analyze the coordinate system that is formed by the first 2 principal components and try to explain why the third principal component would be removed, based on the knowledge you gained by answering Problem 9!
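
One acceptable procedure (the eigendecomposition route below is just one option) takes the eigenvectors of Σ as principal components and the eigenvalues as the variances they capture:

import numpy as np

Sigma = np.array([[1., 0., 0.],
                  [0., 4., -3.],
                  [0., -3., 9.]])           # covariance matrix of Problem 9

eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder, largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("principal components (columns):")
print(eigvecs)
print("fraction of variance in first 2 PCs: %.4f"
      % (eigvals[:2].sum() / eigvals.sum()))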


[1] It was used in Project 3 in 2013; see the course website for a link!

[2] Write the rule in the form: If x>… then … else if … else …!

[3] Actually, you report the values of the multivariate density function for the 3 points, which is “kind of” a proxy for the likelihood near the query point.