Designing a Database to Store Data Mining Datasets and Data Mining Results 6340-Lab2-Sp05
Christoph F. Eick
Lab2 Spring 2005
Design of a Database for Data Mining
Last updated: March 29, 2005
Due date: Sa., April 2, 9:00pm (electronic submission, no extensions)
Remark: this is an evolving document; this is an individual project (each student must develop his/her own solution; collaborating with other students is not allowed!)
The goal of the Lab2 project is to design and implement a relational database that is used to store the following information:
1. “Raw” UCI[1] and other datasets and their associated metadata. Dataset metadata include attributes in the datasets, classes used, and textual information that further describes what the dataset contains and how the dataset was created. Datasets are subdivided into numerical datasets that contain d numerical attributes and 1 nominal class attribute[2], and symbolic datasets[3] that contain d symbolic attributes and a class variable. Insert the Vehicle[4], IRIS Plant, the Wyoming Poverty Status Dataset[5] (http://www2.cs.uh.edu/~kwee/research/datasets/index.htm), and one symbolic dataset (of your own choice) into the database you create[6].
2. The results of particular data mining algorithms, namely:
o clusters obtained using a given clustering algorithm (e.g. K-means)
o results of using 2 additional data mining techniques chosen from the following set of data mining techniques: decision trees, association rules, neural networks, k-nearest neighbor classifiers, support vector machines, and belief networks. In addition, the database must store experimental metadata, such as the datasets used, experimental evaluation parameters (e.g. accuracy or cluster tightness), and the data mining algorithms used in the experiments (including their parameter settings).
o Moreover, populate the database with some useful data that illustrate how meta-data and results of the application of your 3 supported data mining tasks are stored.
3. Moreover, add capabilities for similarity assessment to your system: the creation and storage of a distance matrix for a given dataset should be supported[7].
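As an illustration of requirements 1 and 3, the following is a minimal sketch of how a numerical dataset, its metadata, and a distance matrix might be stored. It rests on several assumptions that the assignment does not make: SQLite (through Python's sqlite3 module) as the DBMS, Euclidean distance as the similarity measure, and hypothetical table, column, and function names. The toy data values are made up. It is one possible starting point, not a reference solution.

```python
import math
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE dataset (
    dataset_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    kind        TEXT NOT NULL CHECK (kind IN ('numerical', 'symbolic')),
    description TEXT                     -- textual metadata about the dataset
);
CREATE TABLE example (
    example_id  INTEGER PRIMARY KEY,     -- surrogate key: duplicate examples (bags) are allowed
    dataset_id  INTEGER NOT NULL REFERENCES dataset(dataset_id),
    class_label TEXT NOT NULL
);
CREATE TABLE attribute_value (
    example_id  INTEGER NOT NULL REFERENCES example(example_id),
    attr_name   TEXT NOT NULL,
    num_value   REAL NOT NULL,
    PRIMARY KEY (example_id, attr_name)
);
CREATE TABLE distance (
    example_a   INTEGER NOT NULL REFERENCES example(example_id),
    example_b   INTEGER NOT NULL REFERENCES example(example_id),
    dist        REAL NOT NULL,
    PRIMARY KEY (example_a, example_b),
    CHECK (example_a < example_b)        -- store only the upper triangle
);
"""


def add_dataset(conn, name, kind, description, attr_names, rows):
    """Insert a dataset; rows are (values, class_label) pairs. Returns its id."""
    dataset_id = conn.execute(
        "INSERT INTO dataset (name, kind, description) VALUES (?, ?, ?)",
        (name, kind, description),
    ).lastrowid
    for values, label in rows:
        example_id = conn.execute(
            "INSERT INTO example (dataset_id, class_label) VALUES (?, ?)",
            (dataset_id, label),
        ).lastrowid
        conn.executemany(
            "INSERT INTO attribute_value VALUES (?, ?, ?)",
            [(example_id, a, v) for a, v in zip(attr_names, values)],
        )
    return dataset_id


def build_distance_matrix(conn, dataset_id):
    """Store Euclidean distances for all example pairs; returns the pair count."""
    query = """
        SELECT e.example_id, v.num_value
        FROM example e JOIN attribute_value v ON v.example_id = e.example_id
        WHERE e.dataset_id = ?
        ORDER BY e.example_id, v.attr_name
    """
    vectors = {}
    for example_id, value in conn.execute(query, (dataset_id,)):
        vectors.setdefault(example_id, []).append(value)
    ids = sorted(vectors)
    pairs = [
        (a, b, math.dist(vectors[a], vectors[b]))
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    ]
    conn.executemany("INSERT INTO distance VALUES (?, ?, ?)", pairs)
    return len(pairs)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up toy data; the first and third examples are deliberate duplicates.
toy_rows = [([0.0, 0.0], "a"), ([3.0, 4.0], "b"), ([0.0, 0.0], "a")]
ds = add_dataset(conn, "toy", "numerical", "Made-up example", ["x", "y"], toy_rows)
n_pairs = build_distance_matrix(conn, ds)
```

Storing attribute values in a separate table keeps the schema independent of the number of attributes d, at the cost of a join when examples are read back; a design with one table per dataset is an alternative worth weighing in the report.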
Remarks: Lab3 will center either on implementing a particular data mining algorithm or on adding other capabilities on top of the database you design in Lab2. Moreover, Lab2 is more open in its specification. Feel free to ask Dr. Eick for permission if you would like to support other capabilities within the database you design in lieu of the capabilities that are mentioned above.
Deliverables: Write a report that discusses the major design decisions you made, including the design alternatives you did not choose, and mention all assumptions you made in your database design. Include an E-R diagram of the designed database with an explanation, as well as the relational database schema; explain how the storing of data mining results for the 3 chosen data mining techniques is supported in your database, and explain how meta-data can be associated with datasets in the database you designed. Moreover, discuss what similarity assessment capabilities are supported, and discuss any other supported capabilities of the system you designed. Also give a brief description of all software you developed on top of the designed database, including a printout of all program code you developed (the code does not need to be documented). Also prepare a 5-10 minute demo that demonstrates the benefits and capabilities of the database designed (demos will likely be scheduled in the April 11 week).
[1] http://www.ics.uci.edu/~mlearn/MLRepository.html
[2] Also be aware of the fact that datasets are usually bags and not sets.
[3] If you support storing association rules in the database to be designed, store a dataset of your own choice that is suitable for association rule mining as your “symbolic” dataset.
[4] ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/vehicle/
[5] Either use the original or the modified poverty dataset (either one is fine).
[6] Store all the examples for at least 2 datasets in the database.
[7] It is okay if this capability is only supported for one kind of dataset (either symbolic or numerical).