CPSC 445b/545b (2008) Term Projects

Written Term Project Reports are due by Monday, April 28. They may be submitted earlier. You should also plan to make a 15-20 minute project presentation to the class on April 1, 10, 15, 17, 22, or 24. We’re happy to take reservations for speaking times now.

Please turn in your written reports to Jiang Du. Make sure your name is on the cover sheet and that you include an “executive summary” that outlines the problem you addressed, your approach, and a summary of your results/conclusions.

If you wish, you can work in teams of 2-4 people. If you choose to do a multi-person project, however, you must accomplish proportionally more than a single person would. Each team may turn in one project report or individual reports. In either case, the team members should be clearly listed on the first page.

The following projects are examples. Some of them are very loosely defined, which gives you the opportunity to be creative in driving the projects in directions that you find interesting. You may also completely define your own project. For example, you may wish to explore the use of datamining for an appropriate problem of your choice, or you may wish to investigate a particular datamining algorithm or implementation.

Please send us email by Thursday, March 27 with a one-paragraph description of your project. If it is a team project, identify all the team members.

Some example potential projects (in no particular order)

(1) Bioinformatics application. Try to use R or Weka to reproduce the datamining results in one or more of the following famous bioinformatics papers. How do the techniques proposed in the paper compare to other datamining techniques discussed this semester?

(2) Clinical datamining. A research group in Wisconsin has advocated using datamining approaches based on linear programming. Try to mine the Wisconsin Breast Cancer data set using some of the best approaches discussed this semester and compare with the results obtained by the linear programming approach. See the following web page for details:

How does the performance of the linear programming approach compare with existing Weka modules such as discrimination, Logistic Regression, etc.?

(3) Explore fast algorithms for computing Support Vector Machines, cf.

and apply to some interesting data sets.

(4) Explore fast algorithms for computing decision trees for large training sets, see

(5) Explore techniques for handling problems whose training sets have missing data, see

Try out these techniques on some training sets of interest by simulating the loss of data. Compare the results with how well the same algorithms do on the complete data sets.
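As a starting point for the simulation step in project (5), here is a minimal sketch in Python. It assumes values go missing completely at random and uses column-mean imputation as a simple baseline; the function names are illustrative and are not taken from any of the referenced material. A real project would substitute the techniques under study for the imputation step.

```python
import random


def simulate_missing(rows, fraction, seed=0):
    """Return a copy of `rows` in which each cell is replaced by None
    with probability `fraction` (values missing completely at random)."""
    rng = random.Random(seed)
    return [[None if rng.random() < fraction else value for value in row]
            for row in rows]


def impute_column_means(rows):
    """Fill each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Arbitrary fallback of 0.0 if a column lost every value.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic stand-in for a real training set: 200 rows, 4 attributes.
    complete = [[rng.gauss(5.0, 2.0) for _ in range(4)] for _ in range(200)]
    damaged = simulate_missing(complete, fraction=0.2)
    repaired = impute_column_means(damaged)
    n_missing = sum(cell is None for row in damaged for cell in row)
    print("cells removed:", n_missing, "of", 200 * 4)
```

Training the same classifier on `complete` and on `repaired` then gives the comparison the project asks for.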

(6) Bayesian networks are a fashionable approach to many bioinformatics problems. Read the following papers and explore the use of WinMine on some problems of interest. See and

(7) Attribute (feature) selection is an approach for dealing with problems whose training sets have a large number of attributes. Read the following survey paper and explore the effectiveness of this approach for some interesting problems. See

Explore the use of Genetic Algorithms for feature selection.
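To make the idea concrete, the following is a minimal sketch of a genetic algorithm that searches over feature subsets encoded as bit masks. It is not taken from the survey paper. The operators (tournament selection, one-point crossover, bit-flip mutation, and keeping the best individual each generation) are common choices, and `toy_fitness` is a hypothetical stand-in for what would normally be the cross-validated accuracy of a classifier trained on the selected features.

```python
import random

INFORMATIVE = {0, 3}  # hypothetical: pretend only features 0 and 3 carry signal


def toy_fitness(mask):
    """Reward selecting informative features, penalize large subsets."""
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         mutation_rate=0.05, seed=0):
    """Search for a high-fitness 0/1 feature mask (needs n_features >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        elite = max(population, key=fitness)
        history.append(fitness(elite))
        next_gen = [elite[:]]  # elitism: the best mask survives unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: the better of two random individuals.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best), history


if __name__ == "__main__":
    best, score, history = ga_feature_selection(8, toy_fitness)
    print("selected features:", [i for i, bit in enumerate(best) if bit])
    print("fitness:", score)
```

Because the best mask is carried over unchanged, the best fitness per generation never decreases. With a real classifier the fitness function becomes the expensive part, so caching fitness values per mask is worth considering.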

(8) (a) Explore the use of an OLAP repository with R. For example, Mondrian is an open source OLAP server that already has an R connector. Use Mondrian as a source of a training set for datamining packages in R.

(b) Explore the use of EXCEL/XLMINER as a front end to datamining packages in R.

(9) (a) Explore the use of Nonnegative Matrix Factorizations (NMF) for clustering in text mining or other problems. Develop and test an NMF package in R or Weka. See and

(b) Explore clustering algorithms for very large data sets, see and
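For item (9)(a), the sketch below shows one standard way to compute an NMF: the multiplicative update rules of Lee and Seung for the squared (Frobenius) error. It is written in plain Python for readability, not speed, and it is only an illustration of the technique, not the method of the references above. The small matrix in the usage example is made-up data, and the `eps` guard against division by zero is an implementation choice.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for row_v, row_wh in zip(V, WH)
               for v, wh in zip(row_v, row_wh))


def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a nonnegative n x m matrix V as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


if __name__ == "__main__":
    # Made-up 4 documents x 3 terms count matrix with two obvious groups.
    V = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 1.0, 3.0], [0.0, 2.0, 6.0]]
    W, H = nmf(V, rank=2)
    # Cluster each document by its largest coefficient in W.
    clusters = [max(range(2), key=lambda k: row[k]) for row in W]
    print("clusters:", clusters)
    print("reconstruction error:", frobenius_error(V, W, H))
```

For text mining, the rows of V would be documents and the columns term counts. Each row of H can then be read as a topic, and each row of W as a document's topic weights.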

(10) Investigate the use of global optimization, e.g. the genetic algorithm, in datamining. Examples include feature selection and algorithm optimization. See

Explore the integration of this toolbox into R or Weka. How does it compare to the GAs already in R and Weka?

(11) Explore the method of “random projections” for classification problems. Compare the accuracy with methods in Weka. Implement a random projection method with your favorite machine learning algorithm in Matlab/Octave or R and benchmark on some typical problems. See
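The core of the method can be sketched in a few lines. The example below is in Python rather than Matlab/Octave or R, purely for illustration, and assumes the common Gaussian variant: each of the k output coordinates is a random linear combination of the d inputs, with weights drawn from N(0, 1) and scaled by 1/sqrt(k) so that distances between points are approximately preserved. The function names and the dimensions used are arbitrary choices.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """k x d matrix with i.i.d. N(0, 1/k) entries."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)]
            for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional point to k dimensions."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


if __name__ == "__main__":
    d, k = 1000, 200
    rng = random.Random(1)
    x = [rng.random() for _ in range(d)]
    y = [rng.random() for _ in range(d)]
    R = random_projection_matrix(d, k)
    ratio = distance(project(R, x), project(R, y)) / distance(x, y)
    # The ratio should be close to 1 when k is reasonably large.
    print("projected / original distance:", ratio)
```

For a classification benchmark, every training and test point would be passed through the same matrix before being handed to the learning algorithm. Accuracy and running time can then be compared against the unprojected data.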

(12) Explore datamining for one of the following applications:

(a) Sports. See

(b) Using datamining for spam filtering, see

(c) Social sciences. See Richard Berk's working papers on

(13) Explore the use of Rattle/R for some datamining problems of interest. Investigate the possibility of using RWeka for integrating Weka modules (in Java) into Rattle.

(14) See for numerous ideas for projects and references.

(15) Explore the use of R for Geographical Information Systems (GIS). See

(16) Explore the visualization of classification boundaries using the rggobi package in R. See and related papers.

(17) Visualization of "animated bubbles" à la Gapminder along with R packages. Hint: use "flash".

(18) Data mining of digital music. An active area, e.g. see:

and the following for material on processing music in R: