
Exercises: Linear Classifiers

Show that LDA can be seen as a least squares method for classification. In particular, prove the following lemma:

Lemma 1. The projection direction obtained by maximizing the Fisher criterion is proportional to the weight vector obtained by minimizing the least squares loss with the affine function

Then
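Before attempting the proof, the claim can be checked numerically. The sketch below is only a sanity check under stated assumptions, not a proof: it assumes two classes in 2-D, labels coded as -1 and +1, the Fisher direction computed as S_W^{-1}(m1 - m0), and a least squares fit of w^T x + b. The synthetic data and all variable names are illustrative and do not come from the exercise.

```python
import numpy as np

# Synthetic two-class data (illustrative assumption, not part of the exercise).
rng = np.random.default_rng(0)
cov = [[2.0, 0.5], [0.5, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)  # class with label -1
X1 = rng.multivariate_normal([2.0, 1.0], cov, size=150)  # class with label +1
X = np.vstack([X0, X1])
t = np.concatenate([-np.ones(len(X0)), np.ones(len(X1))])

# Fisher direction: within-class scatter inverse times the mean difference.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
w_fisher = np.linalg.solve(S_w, m1 - m0)

# Least squares fit of the affine function w^T x + b to the labels.
A = np.hstack([X, np.ones((len(X), 1))])  # last column carries the bias b
coef, *_ = np.linalg.lstsq(A, t, rcond=None)
w_ls = coef[:2]

# Cosine of the angle between the two directions; proportional vectors give 1.
cosine = w_fisher @ w_ls / (np.linalg.norm(w_fisher) * np.linalg.norm(w_ls))
print(f"cosine between Fisher and least squares directions: {cosine:.12f}")
```

If the lemma holds, the printed cosine should equal 1 up to floating-point error, for unequal class sizes as well as equal ones.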

Sketch two multimodal distributions (i.e., each class should have multiple areas of concentration) for which a linear discriminant could give excellent (or even optimal) classification accuracy. Sketch two unimodal distributions (i.e., each class is concentrated in a single area) for which even the best linear discriminant would give poor classification accuracy. You may need to look up the ideas "multimodal distributions" and "unimodal distributions" to complete this problem.

For an SVM, if we remove one of the support vectors from the training set, does the size of the maximum margin decrease, stay the same, or increase for that dataset? Why? Also justify your answer by providing a simple dataset (no more than 2-D) in which you identify the support vectors, draw the location of the maximum margin hyperplane, remove one of the support vectors, and draw the location of the resulting maximum margin hyperplane.

The quadratic kernel is equivalent to mapping each x into a higher dimensional space where

for the case where. Now consider the cubic kernel. Note that this kernel adds 1 to the dot product. What is the corresponding function, again for the case where?
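A kernel and a candidate feature map can be checked against each other numerically before you trust a derivation. The sketch below does this for the quadratic case only, and rests on assumptions not stated in the text: the kernel is taken to be K(x, z) = (x . z)^2 with no added constant, the inputs are 2-D, and the feature map is the commonly used phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). The function names are hypothetical.

```python
import math

def quadratic_kernel(x, z):
    """Assumed kernel: squared dot product of two 2-D points."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Assumed explicit feature map for the kernel above (2-D input)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = quadratic_kernel(x, z)  # kernel evaluated in the input space
k_mapped = dot(phi(x), phi(z))     # dot product in the feature space

# The two values should agree up to floating-point rounding.
print(k_direct, k_mapped, math.isclose(k_direct, k_mapped))
```

The same check can be reused for the cubic kernel: write down a candidate feature map, then compare its dot product with the kernel value on a few points.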

Suppose that we believe some training points are more important than others. That is, as usual, we have data with corresponding labels; however, we also have importance weights. There are two ways we can try to incorporate these weights into the SVM formulation: (1) by rescaling the margin; (2) by rescaling the loss. We will look at both in this exercise.
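For reference, both parts ask for a comparison with the standard soft-margin SVM, whose primal problem is commonly written as below. The notation is an assumption rather than the exercise's own: weight vector $w$, bias $b$, slack variables $\xi_n$, a single trade-off constant $C$, and labels $y_n \in \{-1, +1\}$.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\lVert w \rVert^2 + C \sum_{n} \xi_n \\
\text{subject to} \quad & y_n \left( w^\top x_n + b \right) \ge 1 - \xi_n, \qquad \forall n, \\
& \xi_n \ge 0, \qquad \forall n.
\end{aligned}
```

In this baseline, every point is held to the same target margin of 1 and every slack variable is penalized by the same constant $C$.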

a) By "rescaling the margin", we mean that instead of forcing each data point n to achieve a margin of one, we force each data point to have a margin of. (For simplicity, you may leave off the bias term.) Write down the corresponding primal optimization problem. How does this compare to the standard SVM formulation?

b) By "rescaling the loss", we mean that each data point gets a separate slack control. In other words, our soft-margin classifier will have the form. Repeat the previous sequence, eventually getting down to the dual formulation. How does this compare to the standard SVM?

Finally, discuss (in a few sentences) the difference between rescaling the margin and rescaling the loss.