A former student in data mining, Adam Morris, supplied this information on 7 types of bats. As you may know, bats are flying mammals that are active at night. The challenge is to identify the species of a bat from its calls. As you also may know, bats navigate by sending out calls and listening for their echoes. Some information on this “echolocation” process is given at

The calls of several bats of each of the 7 types have been characterized by a set of features, as described below. In Adam’s e-mail, the data are described as follows:

The target variable is species (there are 7 species).

Labo = Eastern red bat

Nyhu = Evening bat

Pisu = Tricolored bat

Epfu = Big brown bat

hoary = Hoary bat

Myau = Northern Long-eared bat

Tabr = LeConte Free-tail bat

There are 11 continuous variables (features), which are parameters measured from sonograms of the echolocation recordings. The species identifications were made by comparing each sonogram by eye with reference calls from known species. Since this is incredibly tedious, a quantitative sorting algorithm would be quite useful. (Note the large number of bats for which this was done.)

Measured Call Characteristics:

dur = duration

pre = preceding interval

highf = high frequency

lowf = low frequency

band = bandwidth

fmaxamp = frequency of maximum amplitude

maxamp = maximum amplitude (% duration)

slope = overall slope

heel = location of heel if present

upper = upper slope (if heel is present)

lower = lower slope (if heel present)

Click on the link to get the SAS program that reads in the data and gets you started on part 2. Note that it assumes proportional priors (i.e. a representative sample). You do not need to organize a nice report this time but please use complete sentences/paragraphs to answer these questions. The tasks are:

(1) Describe the data: What are the counts and percentages of the 7 species of bat? Assuming this is a representative sample, which species are the most common and the rarest?

(2) Run the Fisher Linear Discriminant function in the program for identifying the species of bat. Look at the SAS log window. What is the message that seems to indicate a problem? (optional for stat majors – make a guess at what is going on before continuing)

Note the regression of BAND on HIGHF and LOWF. Report the two coefficients and t tests for these two variables. What strikes you as unusual here? What is the estimate of the error standard deviation? The variable BAND is rounded to the nearest ______ (complete the sentence). Is it possible that there is just roundoff error rather than random statistical error? Explain. Put all of this together into an explanation of the error message in the SAS log.
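A small simulation may help in thinking about this. The sketch below uses synthetic numbers, not the bat data, and it assumes purely for illustration that BAND is recorded as HIGHF minus LOWF rounded to one decimal place. Both that relationship and the rounding unit are assumptions to be checked against your own SAS output. It shows what a regression and a covariance matrix look like when one feature is almost an exact linear combination of two others.

```python
import numpy as np

# Synthetic illustration only: none of these numbers come from the bat data.
rng = np.random.default_rng(0)
n = 500
lowf = rng.uniform(20.0, 40.0, size=n)
highf = lowf + rng.uniform(1.0, 30.0, size=n)

# ASSUMPTION: band is highf - lowf, rounded to one decimal place.
band = np.round(highf - lowf, 1)

# Least-squares regression of band on an intercept, highf and lowf.
X = np.column_stack([np.ones(n), highf, lowf])
coef, *_ = np.linalg.lstsq(X, band, rcond=None)
resid = band - X @ coef
resid_sd = np.sqrt(resid @ resid / (n - X.shape[1]))

# Covariance matrix of the three features and its condition number.
cov = np.cov(np.column_stack([highf, lowf, band]), rowvar=False)
cond = np.linalg.cond(cov)

print("coefficients (intercept, highf, lowf):", coef)
print("residual standard deviation:", resid_sd)
print("condition number of covariance matrix:", cond)
```

In this toy setting the only "error" in the regression is the rounding itself, so the residual standard deviation is tiny compared with the rounding unit and the covariance matrix is close to singular. Inverting a matrix like that is what a discriminant analysis has to do.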

(3) SAS was apparently sophisticated enough to work around the problem above, but since we understand what is going on, we realize that we can just leave out BAND and eliminate the problem without affecting our ability to discriminate. You should understand that statement. Now run the discriminant analysis using only the variables dur pre highf lowf fmaxamp maxamp slope heel upper lower and answer these questions:

(A) For the Fisher Linear discriminant function, what assumptions are made about the seven covariance matrices?

(B) How many rows and columns does each of these seven covariance matrices have?

(C) Besides the intercept, how many coefficients does each discriminant function involve? Would this answer change if there were more features? Would it change if there were more species?

(D) Why are the discriminant numbers (Fj in our notes) different for different individual bats from the same species? Is it the coefficients, the features, or both?

(E) Suppose (for simplicity) that a bat’s discriminant functions were F1 = 2 for comparing to Labo and Fj = 1 for j = 2, 3, …, 7 for comparing to each of the other 6 species. What is the (posterior) probability that this bat is a Labo bat (Eastern red bat)?
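As a reminder of the mechanics, here is a minimal sketch of how discriminant scores are turned into posterior probabilities. It assumes that each Fj in the notes already includes the log-prior term (as the constant in SAS's linear discriminant function does), so that the posterior for group j is exp(Fj) divided by the sum of exp(Fk) over all groups. The function name `posterior` and the three-group scores are made up for illustration and deliberately differ from the numbers in the question.

```python
import math

def posterior(scores):
    """Convert discriminant scores F_1..F_g into posterior probabilities.

    ASSUMPTION: each score already contains the log prior for its group.
    """
    top = max(scores)                                  # subtract the max for numerical stability
    weights = [math.exp(f - top) for f in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up three-group example, not the seven-species question above.
toy_scores = [1.0, 0.0, 0.0]
toy_post = posterior(toy_scores)
print([round(p, 3) for p in toy_post])
```

Only the differences between scores matter: adding the same constant to every Fj leaves the posterior probabilities unchanged.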

(F) Suppose (again for simplicity) that we have a bat whose sonogram trace has highf = 10 and all other features equal to 0. Find, from your Fisher Linear Discriminant function output, the discriminant numbers (Fj in the notes) for comparing this bat to each of the seven species’ distributions (7 numbers).
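It may help to write out what evaluating a linear discriminant function means. The symbols below are notation introduced here, not taken from the notes: c_j is the constant for species j and b_{jk} is the coefficient of feature k, both as printed in the linear discriminant function table. With every feature except highf set to 0, all but one term of the sum drops out:

```latex
F_j(x) = c_j + \sum_{k=1}^{p} b_{jk}\, x_k
\quad\Longrightarrow\quad
F_j = c_j + b_{j,\mathrm{highf}} \cdot 10 ,
\qquad j = 1, \dots, 7 .
```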

(G) How many Epfu bats were accidentally classified as Labo and how many Labo bats were accidentally classified as Epfu using your linear discriminant function?

(H) How would you change your code to force PROC DISCRIM to run a quadratic discriminant function? Under what conditions would you prefer quadratic to linear? (Note the relationship of this question to question 3A.)

(I) Test to see if a quadratic discriminant function is needed by changing your SAS code appropriately. Report the result. Run a quadratic discriminant function and compare the misclassification rate to that of the linear discriminant function by showing both rates.