Assignment 6: 674: Introduction to Data Mining
This assignment requires you to write your own code. I strongly recommend working in teams of two and starting ASAP on this assignment.
For this assignment, your goal is to build a classifier based on association patterns for the Reuters dataset. The first algorithm is quite simple, and for it you are asked to use any freely available association rule mining software (see kdnuggets.com).
ALGORITHM 1:
Step 0 : Data transformation -- you need to transform the dataset into a transactional model -- you have probably already done this in the past (time to dust off the cobwebs). Each transaction has the form:
Article ID | Set of Keywords | Class Label
(Note: here you should represent the feature vector not as a binary vector, but simply as the list of keywords you selected.)
Please check the documentation of the free software package carefully to see what additional information you need to provide.
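The transformation in Step 0 can be sketched as follows. This is only a minimal illustration: the sample articles are made up, the `class=` prefix is an assumed convention for keeping class labels distinct from keywords, and the one-transaction-per-line, space-separated layout is an assumption that you should check against the input format of the package you choose.

```python
# Hypothetical records: (article_id, selected keywords, class label).
articles = [
    (1, ["oil", "price", "opec", "oil"], "crude"),
    (2, ["wheat", "export"], "grain"),
]

# Assumed marker so that a class label can never be confused with a keyword.
CLASS_PREFIX = "class="


def to_transaction(keywords, label):
    """Return one transaction: sorted unique keywords plus a tagged class item."""
    return sorted(set(keywords)) + [CLASS_PREFIX + label]


def format_line(transaction):
    """Join items with spaces (assumes keywords contain no spaces themselves)."""
    return " ".join(transaction)


transactions = [to_transaction(kws, label) for _id, kws, label in articles]
lines = [format_line(t) for t in transactions]
```

Each entry of `lines` can then be written to a text file, one transaction per line, if that is what your chosen tool expects.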
Step 1 : Generate all association rules above a minimum support and confidence threshold (use the free apriori software available at kdnuggets.com) on the training dataset. Prune the rule set, keeping only those rules of the form A -> B where B is a legal class label and A is a set of one or more keywords. Rank the rules in decreasing order of confidence.
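The pruning and ranking in Step 1 might look like the sketch below. It assumes the mined rules have already been parsed into tuples of (antecedent, consequent, support, confidence) and that class labels carry a `class=` prefix; both are illustrative assumptions, not the output format of any particular package. Breaking confidence ties by support is also just one reasonable choice.

```python
CLASS_PREFIX = "class="  # assumed tag that marks class-label items


def class_rules(rules):
    """Keep rules of the form keywords -> one class label, ranked by confidence."""
    kept = []
    for antecedent, consequent, support, confidence in rules:
        if len(consequent) != 1:
            continue  # B must be a single class label
        (item,) = consequent
        if not item.startswith(CLASS_PREFIX):
            continue  # consequent is a keyword, not a class label
        if not antecedent or any(a.startswith(CLASS_PREFIX) for a in antecedent):
            continue  # A must be a non-empty set of keywords only
        label = item[len(CLASS_PREFIX):]
        kept.append((frozenset(antecedent), label, support, confidence))
    # Highest confidence first; ties broken by higher support (an assumption).
    kept.sort(key=lambda r: (-r[3], -r[2]))
    return kept


# Made-up rules standing in for the output of the mining software.
mined = [
    ({"oil"}, {"class=crude"}, 0.30, 0.90),
    ({"oil", "opec"}, {"class=crude"}, 0.20, 0.95),
    ({"oil"}, {"opec"}, 0.25, 0.70),
    ({"class=grain"}, {"wheat"}, 0.10, 0.80),
    ({"wheat"}, {"class=grain"}, 0.15, 0.85),
]
ranked = class_rules(mined)
```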
Step 2 : For each transaction of the test data, identify the most confident rule that applies (i.e. the entire antecedent lies in the transaction). The consequent of that rule is the class label that will be assigned to that particular transaction. If no rule applies, either reduce the support and/or confidence threshold and redo all steps, or use a default class label.
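Step 2 can be sketched as a scan over the rule list in rank order. The rule tuples (antecedent, label, support, confidence), the sample rules, and the default label "earn" are all hypothetical; the sketch takes the default-label option for the case where no rule applies.

```python
def classify(keywords, ranked_rules, default_label):
    """Return the label of the first (most confident) rule whose antecedent
    is entirely contained in the transaction, else the default label."""
    items = set(keywords)
    for antecedent, label, _support, _confidence in ranked_rules:
        if antecedent <= items:  # subset test: whole antecedent is present
            return label
    return default_label


# Made-up rules, already sorted by decreasing confidence.
ranked_rules = [
    (frozenset({"oil", "opec"}), "crude", 0.20, 0.95),
    (frozenset({"wheat"}), "grain", 0.15, 0.85),
    (frozenset({"price"}), "grain", 0.10, 0.60),
]

predictions = [
    classify(["oil", "opec", "price"], ranked_rules, "earn"),
    classify(["oil", "price"], ranked_rules, "earn"),
    classify(["gold"], ranked_rules, "earn"),
]
```

Because the list is sorted once up front, the first match is by construction the most confident applicable rule.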
ALGORITHM 2:
For the second algorithm I would like you to first cluster the dataset twice, once into 8 clusters and once into 16 clusters (using only the feature vector to cluster). You are expected to simply re-use code from the previous assignment. You may use any clustering algorithm you like. You should not use the class labels as part of the clustering step. Then I would like you to build a separate classifier (Steps 0-2 above) for each of the clusters.
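One way to organize Algorithm 2 is to keep one rule list per cluster and route each test transaction to the rule list of its cluster. The sketch below is an illustration under several assumptions: cluster ids are taken as given (from your own clustering code), how a test article is assigned to a cluster is left to you, and `mine_single_keyword_rules` is a toy stand-in for the external mining software that only produces single-keyword rules.

```python
from collections import Counter, defaultdict


def mine_single_keyword_rules(train, min_support, min_confidence):
    """Toy stand-in for the real miner: only rules {keyword} -> label."""
    n = len(train)
    kw_count, pair_count = Counter(), Counter()
    for keywords, label in train:
        for kw in set(keywords):
            kw_count[kw] += 1
            pair_count[(kw, label)] += 1
    rules = []
    for (kw, label), count in pair_count.items():
        support = count / n
        confidence = count / kw_count[kw]
        if support >= min_support and confidence >= min_confidence:
            rules.append((frozenset([kw]), label, support, confidence))
    rules.sort(key=lambda r: (-r[3], -r[2]))  # most confident first
    return rules


def train_per_cluster(train, cluster_ids, min_support, min_confidence):
    """Build one ranked rule list per cluster id."""
    by_cluster = defaultdict(list)
    for record, cid in zip(train, cluster_ids):
        by_cluster[cid].append(record)
    return {cid: mine_single_keyword_rules(recs, min_support, min_confidence)
            for cid, recs in by_cluster.items()}


def classify_in_cluster(keywords, cid, models, default_label):
    """Apply only the rules learned for the transaction's own cluster."""
    items = set(keywords)
    for antecedent, label, _s, _c in models.get(cid, []):
        if antecedent <= items:
            return label
    return default_label


# Made-up training data and cluster assignments.
train = [(["oil", "opec"], "crude"), (["oil", "price"], "crude"),
         (["wheat", "export"], "grain"), (["wheat", "price"], "grain")]
cluster_ids = [0, 0, 1, 1]
models = train_per_cluster(train, cluster_ids, 0.5, 0.8)
```

In this toy data the keyword "price" leads to "crude" inside cluster 0 and to "grain" inside cluster 1, which is the kind of local pattern a per-cluster classifier is meant to capture.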
Finally, I would like you to provide a detailed comparison of the three approaches in terms of classification accuracy.
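For the comparison, accuracy here is taken to mean the fraction of test transactions whose predicted label matches the true label. A minimal sketch follows, assuming you have collected one list of predictions per approach; the labels below are placeholders for illustration, not results.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Placeholder labels only, to show how the function is called.
actual = ["crude", "grain", "crude", "earn"]
predicted = ["crude", "grain", "earn", "earn"]
score = accuracy(predicted, actual)
```

Calling the same function on the predictions of each approach, with the same test set, gives numbers that can be compared directly.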
What you need to turn in:
A description of the data transformations, pseudo-code, and the free software used. A description of the parameter values used and their impact on the execution time of the classification process. For example, a lower minimum support value will significantly increase both the time to process the training data and the time to classify new transactions, since more rules are generated. A comparison of the two approaches should be clearly laid out in the report.
Due Date: December 6th, 11:59AM
Submit Code: lab 6