A former student in data mining, Adam Morris, supplied this information on 7 types of bats. As you may know, bats are flying mammals that are active at night. The challenge is to identify the species of a bat from its calls. As you also may know, bats navigate by sending out calls and listening for their echoes. Some information on this “echolocation” process is given at
The calls of several bats of each of the 7 types have been characterized by a set of their features as described below. In Adam’s e-mail, the data are described as follows:
The target variable is species (there are 7 species).
Labo = Eastern red bat
Nyhu = Evening bat
Pisu = Tricolored bat
Epfu = Big brown bat
hoary = Hoary bat
Myau = Northern Long-eared bat
Tabr = LeConte Free-tail bat
There are 11 continuous variables (features), which are parameters measured from sonograms of the echolocation recordings. The species identifications were made by eye, by comparing each sonogram with known-species reference calls. Since this is incredibly tedious, a quantitative sorting algorithm would be quite useful. (Note the large number of bats for which this was done.)
Measured Call Characteristics:
dur = duration
pre = preceding interval
highf = high frequency
lowf = low frequency
band = bandwidth
fmaxamp = frequency of maximum amplitude
maxamp = maximum amplitude (% duration)
slope = overall slope
heel = location of heel if present
upper = upper slope (if heel is present)
lower = lower slope (if heel present)
Click on the link to get the SAS program that reads in the data and gets you started on part 2. Note that it assumes proportional priors (i.e. a representative sample). You do not need to organize a nice report this time but please use complete sentences/paragraphs to answer these questions. The tasks are:
(1) Describe the data: What are the counts and percentages of the 7 species of bat? Assuming this is a representative sample, what are the most common and rarest species?
(2) Run the Fisher Linear Discriminant function in the program for identifying the species of bat. Look at the SAS log window. What is the message that seems to indicate a problem? (optional for stat majors – make a guess at what is going on before continuing)
Note the regression of BAND on HIGHF and LOWF. Report the two coefficients and t tests for these two variables. What strikes you as unusual here? What is the estimate of the error standard deviation? The variable BAND is rounded to the nearest ______ (complete the sentence). Is it possible that there is just roundoff error rather than random statistical error? Explain. Put all of this together into an explanation of the error message in the SAS log.
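As background for this kind of log message, here is a minimal Python sketch (not part of the SAS program) of what happens when one feature is almost exactly a linear combination of others. The data are synthetic and the names `x1`, `x2`, `x3`, along with the weights 2 and 3, are made up for illustration; they are not the bat features. The regression recovers the weights almost perfectly, the residual standard deviation is about the size of the rounding error, and the covariance matrix of the three features is nearly singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features (hypothetical, not the bat data).
x1 = rng.uniform(0.0, 10.0, n)
x2 = rng.uniform(0.0, 10.0, n)
# Assumed for illustration: x3 is an exact linear combination of x1 and x2,
# recorded to two decimal places, so the only "error" is rounding.
x3 = np.round(2.0 * x1 + 3.0 * x2, 2)

# Least-squares regression of x3 on an intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, x3, rcond=None)
resid = x3 - X @ coef
resid_sd = np.sqrt(resid @ resid / (n - X.shape[1]))

# Eigenvalues of the 3x3 covariance matrix, in ascending order.
cov = np.cov(np.column_stack([x1, x2, x3]), rowvar=False)
eigvals = np.linalg.eigvalsh(cov)

print("coefficients (intercept, x1, x2):", coef)
print("residual SD:", resid_sd)
print("smallest / largest eigenvalue:", eigvals[0] / eigvals[-1])
```

A ratio of eigenvalues this close to zero means the covariance matrix cannot be inverted reliably, which is the operation a linear discriminant analysis depends on.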
(3) SAS was apparently sophisticated enough to work around the problem above, but since we understand what is going on, we realize that we can simply leave out BAND and eliminate the problem without affecting our ability to discriminate. You should understand that statement. Now run the discriminant analysis using only the variables dur pre highf lowf fmaxamp maxamp slope heel upper lower and answer these questions:
(A) For the Fisher Linear discriminant function, what assumptions are made about the seven
covariance matrices?
(B) How many rows and columns does each of these seven covariance matrices have?
(C) Besides the intercept, how many coefficients does each discriminant function involve? Would this
answer change if there were more features? Would it change if there were more species?
(D) Why are the discriminant numbers (Fj in our notes) different for different individual bats from the
same species? Is it the coefficients, the features, or both?
(E) Suppose (for simplicity) that a bat’s discriminant functions were F1=2 for comparing to Labo and
Fj=1 for j=2,3,…,7 for comparing to each of the other 6 species. What is the (posterior) probability
that this bat is a Labo bat (Eastern Red bat)?
(F) Suppose (again for simplicity) that we have a bat whose sonogram trace has highf=10 and all other
features equal to 0. Find, from your Fisher Linear Discriminant function output, the discriminant
numbers (Fj in the notes) for comparing this bat to each of the seven species’ distributions (7
numbers).
(G) How many Epfu bats were accidentally classified as Labo and how many Labo bats were
accidentally classified as Epfu using your linear discriminant function?
(H) How would you change your code to force PROC DISCRIM to run a quadratic discriminant function?
Under what conditions would you prefer quadratic to linear? (Note the relationship of this question
to question 3A).
(I) Test to see if a quadratic discriminant function is needed by changing your SAS code appropriately.
Report the result. Run a quadratic discriminant function and compare the misclassification rate to
that of the linear discriminant function by showing both rates.
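For reference, the mechanics behind the discriminant numbers and posterior probabilities in the questions above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS computation itself: the three classes, two features, and all coefficient values are hypothetical, and it assumes the usual formulation in which the log prior is already folded into each intercept, so that the posterior for class j is exp(Fj) divided by the sum of exp(Fk) over all classes.

```python
import numpy as np

def linear_scores(intercepts, coefs, x):
    """Return F_j = intercept_j + coefs_j . x for every class j."""
    return intercepts + coefs @ x

def posteriors(scores):
    """Return exp(F_j) / sum_k exp(F_k), shifted by the max for stability."""
    weights = np.exp(scores - np.max(scores))
    return weights / weights.sum()

# Hypothetical output for 3 classes and 2 features (not the bat results).
intercepts = np.array([-1.0, 0.5, 0.0])
coefs = np.array([[0.2, 1.0],    # class 0 coefficients
                  [0.1, -0.5],   # class 1 coefficients
                  [0.0, 0.3]])   # class 2 coefficients

x = np.array([10.0, 0.0])        # one observation's feature vector
F = linear_scores(intercepts, coefs, x)
post = posteriors(F)

print("discriminant numbers:", F)
print("posterior probabilities:", post)
print("assigned class:", int(np.argmax(post)))
```

The coefficients are fixed per class, while `x` changes from one observation to the next; the observation is assigned to the class with the largest score, which is also the class with the largest posterior.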