
Machine Learning

Final Exam.

Student No.: Name: 104/6

I. True/False Questions (30%)

( )1. Principal components analysis (PCA) and linear discriminant analysis (LDA) are both supervised dimensionality reduction methods.

( )2. Clustering methods find correlations between variables and thus group variables. Dimensionality reduction methods find similarities between instances and thus group instances.

( )3. k-means clustering is a local search procedure, and the final cluster means mi depend highly on the initial mi. Moreover, the k-means clustering algorithm is used to solve the supervised learning problem.

( )4. Locally linear embedding recovers global nonlinear structure from locally linear fits. Its assumptions are that (1) each local patch of the manifold can be approximated linearly, and (2) given enough data, each point can be written as a linear, weighted sum of its neighbors.

( )5. Isomap uses the geodesic distances between all pairs of data points. For neighboring points that are close in the input space, Euclidean distance can be used. For faraway points, geodesic distance is approximated by the sum of the distances between the points along the way over the manifold.

( )6. “Similar inputs have similar outputs” is one assumption of nonparametric methods.

( )7.In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.

( )8. The impurity measure φ(p, 1 − p) of a classification tree should satisfy the following properties:
(1) φ(1/2, 1/2) ≥ φ(p, 1 − p) for any p ∈ [0, 1], (2) φ(0, 1) = φ(1, 0) = 0, and (3) φ(p, 1 − p) is increasing in p on [0, 1/2] and decreasing in p on [1/2, 1].

( )9. A decision tree is a hierarchical model using a divide-and-conquer strategy.

( )10. To remove subtrees in a decision tree, postpruning is faster and prepruning is more accurate.

( )11. Being a discriminant-based method, the SVM cares only about the instances close to the boundary and discards those that lie in the interior.

( )12. Entropy in information theory specifies the maximum number of bits needed to encode the classification accuracy of an instance.

( )13. Knowledge of any sort related to the application should be built into the network structure whenever possible. These are called hints.

( )14. In a multilayer perceptron, if the number of hidden units is less than the number of inputs, the first layer performs a dimensionality reduction.

( )15. In SIMD machines, different processors may execute different instructions on different data.

II. Short Answer Questions

  1. (3%) What is the difference between feature selection methods and feature extraction methods?
  2. (2%) Can you explain what Isomap is?
  3. (3%) What are the differences between parametric density estimation methods and nonparametric density estimation methods?
  4. (4%) LDA (Linear Discriminant Analysis) is a supervised method for dimensionality reduction for classification problems. What are the assumptions of LDA used to find the transformation matrix w?
  5. (4%) Draw two-class, two-dimensional data such that
    (a) PCA and LDA find the same direction and
    (b) PCA and LDA find totally different directions.
  6. (3%) Please explain the following algorithm.
  7. (2%) The Condensed Nearest Neighbor algorithm is used to find a subset Z of X that is small and accurate in classifying X. Please finish the following Condensed Nearest Neighbor algorithm.
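For reference, a minimal Python sketch of the standard condensed 1-NN procedure (keep adding to Z any instance that 1-NN on Z misclassifies, until nothing changes); the function name, random visiting order, and Euclidean distance are illustrative assumptions, not the exam's original listing.

```python
import numpy as np

def condensed_nearest_neighbor(X, y, seed=None):
    """Find a small subset Z of (X, y) such that 1-NN on Z still classifies X correctly."""
    rng = np.random.default_rng(seed)
    Z_idx = [int(rng.integers(len(X)))]       # start Z with one random instance
    changed = True
    while changed:                            # repeat until a full pass adds nothing
        changed = False
        for t in rng.permutation(len(X)):     # visit the instances in random order
            Z = X[Z_idx]
            nearest = np.argmin(np.linalg.norm(Z - X[t], axis=1))
            if y[Z_idx[nearest]] != y[t]:     # misclassified by 1-NN on the current Z
                Z_idx.append(int(t))          # ...so move it into the condensed set
                changed = True
    return X[Z_idx], y[Z_idx]
```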

  8. (3%) Given a two-dimensional dataset as follows, please show the dendrogram of the complete-link clustering result. The complete-link distance between two groups Gi and Gj is

d(Gi, Gj) = max_{x^r ∈ Gi, x^s ∈ Gj} d(x^r, x^s),

where d(x^r, x^s) is the Euclidean distance between x^r and x^s.
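As an illustration only, a short SciPy sketch of complete-link agglomerative clustering and its dendrogram; the five 2-D points are made-up placeholders, not the dataset in the exam figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder 2-D points standing in for the exam's dataset.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5], [4.0, 4.0], [4.5, 4.2]])

# 'complete' linkage merges, at each step, the two groups Gi, Gj with the
# smallest d(Gi, Gj) = max over x^r in Gi, x^s in Gj of ||x^r - x^s||.
Z = linkage(X, method="complete", metric="euclidean")
dendrogram(Z)
plt.show()
```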

  9. (3%) Given a k-nearest neighbor density estimate as follows:

p̂(x) = k / (2 N d_k(x)),

where d_k(x) is the distance from x to its k-th nearest sample, and N is the total sample number. Given the result of the k-nearest neighbor density estimator as follows, what is the value of k?
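A minimal sketch of this estimator for one-dimensional data, assuming the form p̂(x) = k / (2 N d_k(x)) given above; the sample values are illustrative placeholders.

```python
import numpy as np

def knn_density(x, sample, k):
    """k-NN density estimate p(x) = k / (2 * N * d_k(x)) for 1-D data."""
    dists = np.sort(np.abs(sample - x))            # distances from x to all samples
    return k / (2 * len(sample) * dists[k - 1])    # dists[k-1] is d_k(x)

sample = np.array([1.0, 1.2, 1.9, 3.5, 3.6, 4.0])  # placeholder data
print(knn_density(2.0, sample, k=3))               # 3 / (2 * 6 * 1.0) = 0.25
```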

  10. (4%) In nonparametric regression, given a running mean smoother as follows, please finish the graph with h = 1:

ĝ(x) = Σ_t w((x − x^t)/h) r^t / Σ_t w((x − x^t)/h),

where w(u) = 1 if |u| < 1 and w(u) = 0 otherwise.
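A minimal sketch of the running mean smoother under the box-kernel definition above; the training pairs are illustrative placeholders.

```python
import numpy as np

def running_mean_smoother(x, X, r, h=1.0):
    """g(x) = sum_t w((x - x^t)/h) r^t / sum_t w((x - x^t)/h) with the box kernel."""
    w = (np.abs((x - X) / h) < 1).astype(float)    # w(u) = 1 if |u| < 1 else 0
    return np.dot(w, r) / w.sum() if w.sum() > 0 else np.nan

# Placeholder training pairs (x^t, r^t).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([0.5, 1.0, 1.5, 1.2, 2.0])
print(running_mean_smoother(2.3, X, r, h=1.0))     # averages r^t of the x^t with |x - x^t| < 1
```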

  11. (6%) Given a regression tree as follows.
    (1) Please draw its corresponding regression result.
    (2) Could you show one rule which is extracted from this regression tree?

    (3) In this case, what are the meanings of a leaf node and an internal node in a decision tree?

  12. (4%) In the pairwise separation example as follows, Hij indicates the hyperplane that separates the examples of Ci from the examples of Cj. Please decide which class each region belongs to.

  13. (6%) Given a classification tree construction algorithm as follows.

where I_m = −Σ_i p_m^i log2 p_m^i (eq. 9.3) and I'_m = −Σ_j (N_mj / N_m) Σ_i p_mj^i log2 p_mj^i (eq. 9.8).

Can you explain what the functions “GenerateTree” and “SplitAttribute” do?
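For orientation, a minimal Python sketch of what a GenerateTree / SplitAttribute pair typically does for a binary classification tree with entropy impurity; the structure, names, and stopping rule are assumptions for illustration, not the exam's original pseudocode.

```python
import numpy as np

def entropy(y):
    """Node impurity I_m = -sum_i p_i * log2(p_i) over the class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def split_attribute(X, y):
    """Return the (feature, threshold) of the binary split that minimizes the
    weighted impurity of the two child nodes."""
    best_j, best_thr, best_im = None, None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.all() or not left.any():
                continue
            im = left.mean() * entropy(y[left]) + (~left).mean() * entropy(y[~left])
            if im < best_im:
                best_j, best_thr, best_im = j, thr, im
    return best_j, best_thr

def generate_tree(X, y, theta=0.0):
    """Recursively grow the tree: stop with a leaf labelled by the majority class
    when the node impurity is <= theta, otherwise split on the best attribute."""
    if entropy(y) <= theta:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    j, thr = split_attribute(X, y)
    if j is None:  # no attribute separates this node: fall back to a majority leaf
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    left = X[:, j] <= thr
    return {"feature": j, "threshold": thr,
            "left": generate_tree(X[left], y[left], theta),
            "right": generate_tree(X[~left], y[~left], theta)}
```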

  14. (3%) Please assign the weights of the multilayer perceptron to solve the following problem.
  15. (3%) In a neural network, can we have more than one hidden layer? Why or why not?
  16. (3%) Why does a neural network overtrain (or overfit)?
  17. (4%) (1) What are support vectors in a support vector machine?

(2) Given an example as follows, please show the support vectors.

III. Calculation and Proof Questions

  1. (5%) Using principal components analysis, we can find a low-dimensional space such that when x is projected there, information loss is minimized. Let the projection of x on the direction of w be z = w^T x. PCA finds w such that Var(z) is maximized,

Var(z) = w^T Σ w, where Var(x) = E[(x − μ)(x − μ)^T] = Σ.

If z1 = w1^T x with Cov(x) = Σ, then Var(z1) = w1^T Σ w1, and we maximize Var(z1) subject to ||w1|| = 1. Please show that the first principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue.
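A sketch of the usual Lagrange-multiplier argument for the constrained maximization stated above:

```latex
% Maximize Var(z_1) = w_1^T \Sigma w_1 subject to w_1^T w_1 = 1,
% introducing a Lagrange multiplier \alpha:
\begin{aligned}
  L &= w_1^{T}\Sigma w_1 - \alpha \left( w_1^{T} w_1 - 1 \right) \\
  \frac{\partial L}{\partial w_1} &= 2\Sigma w_1 - 2\alpha w_1 = 0
    \quad\Longrightarrow\quad \Sigma w_1 = \alpha w_1 .
\end{aligned}
% Hence w_1 is an eigenvector of \Sigma with eigenvalue \alpha, and since
% Var(z_1) = w_1^{T}\Sigma w_1 = \alpha w_1^{T} w_1 = \alpha, taking the
% eigenvector with the largest eigenvalue maximizes the variance.
```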

  2. (5%) Given a sample of two classes, X = {x^t, r^t}, where r^t = 1 if x^t ∈ C1 and r^t = 0 if x^t ∈ C2. In logistic discrimination for the two-class case, assume that the log likelihood ratio is linear, so the estimator of P(C1 | x) is the sigmoid function

y = P̂(C1 | x) = 1 / (1 + exp(−(w^T x + w0))).

We assume that r^t, given x^t, is Bernoulli distributed. Then the sample likelihood is

l(w, w0 | X) = Π_t (y^t)^(r^t) (1 − y^t)^(1 − r^t), and the cross-entropy error is

E(w, w0 | X) = −Σ_t [ r^t log y^t + (1 − r^t) log(1 − y^t) ].

Please find the update equations of w_j and w_0,

where Δw_j = −η ∂E/∂w_j, Δw_0 = −η ∂E/∂w_0, and η is the learning factor.
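For reference, a sketch of where the gradient-descent derivation leads under the definitions above (the standard two-class result):

```latex
% With y^t = sigmoid(w^T x^t + w_0), differentiating the cross-entropy E gives
\begin{aligned}
  \Delta w_j &= -\eta \frac{\partial E}{\partial w_j}
              = \eta \sum_t \left( r^t - y^t \right) x_j^t , \\
  \Delta w_0 &= -\eta \frac{\partial E}{\partial w_0}
              = \eta \sum_t \left( r^t - y^t \right) .
\end{aligned}
```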
