Lab assignment 1: 671: Introduction to Databases II Due: Tuesday Nov 10, Midnight

The goal of this assignment is to read a transactional dataset and a list of input patterns, compute the support and confidence of each listed pattern, and write the results to an output file. This is an individual lab assignment. The basic objectives are to:

1.  Read in the files (InputAssocA, InputAssocB), interpret the items, and store them in a data structure.

2.  Compute the support and confidence of each pattern as described in class.

3.  Output results to the file OutputAssocA – the format of this file is listed below.
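As a rough illustration of objective 2, the sketch below assumes the usual definitions: the support of LHS → RHS is the fraction of all transactions that contain every item of both sides, and the confidence is that same count divided by the number of transactions containing the LHS. If the definitions given in class differ, adjust accordingly. The function name `support_and_confidence` and the sample data are hypothetical and used only for illustration.

```python
def support_and_confidence(transactions, lhs, rhs):
    """Return (support, confidence) of lhs -> rhs as fractions in [0, 1].

    transactions: a list of sets (or frozensets) of item names.
    Assumes the standard definitions; not taken from the assignment text.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    both = lhs | rhs
    n_total = len(transactions)
    # Count transactions containing the LHS, and those containing LHS and RHS.
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Made-up sample data, not from the assignment.
transactions = [
    frozenset({"A1", "A2", "D5"}),
    frozenset({"A1", "A2"}),
    frozenset({"A5", "D2"}),
    frozenset({"A1", "A2", "D5", "B1"}),
]
sup, conf = support_and_confidence(transactions, ["A1", "A2"], ["D5"])
print(f"{sup:.0%} {conf:.0%}")  # prints "50% 67%"
```

This is the simple scan-every-transaction approach: each pattern costs one pass over the dataset.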

The specifications for the input and output files are:

InputAssocA will be of the form

TID   #items in transaction   List of items in transaction

1  4 A1, A5, B1, B5

2  3 A4, D25, D3

……………..

We will assume up to 100 items (A1-A25 … D1-D25) and at most 1000 transactions. A transaction can contain up to 100 items (all items bought).
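One way to read a transaction line of this form is sketched below. It assumes whitespace separates the TID, the item count, and the item list, and that items are comma-separated; whether the real file contains the header row is not stated, so the sketch simply skips any line that does not start with two integers. The name `parse_transaction` is hypothetical.

```python
def parse_transaction(line):
    """Parse a line such as '1  4 A1, A5, B1, B5' into (tid, items).

    Returns None for blank lines, header rows, or anything that does not
    begin with an integer TID followed by an integer item count.
    """
    parts = line.split(None, 2)  # TID, item count, remainder of the line
    if len(parts) < 2 or not parts[0].isdigit() or not parts[1].isdigit():
        return None
    tid = int(parts[0])
    rest = parts[2] if len(parts) == 3 else ""
    # The declared item count (parts[1]) is read but not enforced here.
    items = frozenset(tok.strip() for tok in rest.split(",") if tok.strip())
    return tid, items


print(parse_transaction("1  4 A1, A5, B1, B5"))
print(parse_transaction("TID #items in transaction List of Items in transaction"))  # None
```

Storing each transaction as a set makes the later "does this transaction contain the pattern" check a simple subset test.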

InputAssocB will be of the form

Pattern ID   Pattern

1  [A1, A2] [D5]

2  [A5] [D2, C3]

……….

We will assume that up to 1000 patterns can be listed. The first pattern listed above should be interpreted as [A1, A2] → D5. Note that the LHS and RHS of a pattern cannot overlap (their intersection is empty). Both the LHS and RHS can have multiple items listed.
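A pattern line could be parsed along the following lines. This is only a sketch: it assumes each line holds an integer pattern ID followed by exactly two bracketed, comma-separated item lists, and it rejects patterns whose two sides overlap. The names `PATTERN_LINE` and `parse_pattern` are hypothetical.

```python
import re

# Pattern ID, then two bracketed lists: LHS and RHS.
PATTERN_LINE = re.compile(r"^\s*(\d+)\s*\[([^\]]*)\]\s*\[([^\]]*)\]\s*$")


def parse_pattern(line):
    """Parse a line such as '1  [A1, A2] [D5]' into (pid, lhs, rhs).

    Returns None for lines that do not match (blank lines, header rows).
    Raises ValueError if the LHS and RHS share an item.
    """
    m = PATTERN_LINE.match(line)
    if m is None:
        return None
    pid = int(m.group(1))
    # Keep lists (not sets) so the original item order can be echoed in output.
    lhs = [tok.strip() for tok in m.group(2).split(",") if tok.strip()]
    rhs = [tok.strip() for tok in m.group(3).split(",") if tok.strip()]
    if set(lhs) & set(rhs):
        raise ValueError(f"pattern {pid}: LHS and RHS overlap")
    return pid, lhs, rhs


print(parse_pattern("1  [A1, A2] [D5]"))  # (1, ['A1', 'A2'], ['D5'])
```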

OutputAssocA will be of the form

Pattern ID   Pattern   Support   Confidence

1  [A1, A2] [D5] 23% 50%

2  [A5] [D2, C3] 29% 30%

………………………………………………………………………….
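An output row in the layout shown above might be produced as follows. The sketch assumes support and confidence are held as fractions in [0, 1] and are rounded to whole percentages, which matches the sample rows but is not stated explicitly in the specification. The name `format_result` is hypothetical.

```python
def format_result(pid, lhs, rhs, support, confidence):
    """Format one output row, e.g. '1  [A1, A2] [D5] 23% 50%'.

    support and confidence are fractions in [0, 1]; rounding to whole
    percentages is an assumption based on the sample rows.
    """
    lhs_text = "[" + ", ".join(lhs) + "]"
    rhs_text = "[" + ", ".join(rhs) + "]"
    return f"{pid}  {lhs_text} {rhs_text} {support:.0%} {confidence:.0%}"


print(format_result(1, ["A1", "A2"], ["D5"], 0.23, 0.5))  # 1  [A1, A2] [D5] 23% 50%
```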

You have complete choice over the programming environment, with the only caveat being that your code should be readily executable on the stdsun environment. The TA should be able to compile and execute your code. Note that the TA will have other sample test files, so your code should also run on those as long as they conform to the specification above.

You must submit your files using the submit command (see below for a sample invocation).

submit c671aa lab1 <files-to-submit>

The files-to-submit should include:

i)  a README file (describing your code and describing any optimizations)

ii)  source code file(s) (e.g. assoc.cpp)

iii)  a Makefile (if you have one – not necessary)

iv)  sample test inputs that you used for testing (if you include several, label them InputAssocB.1, InputAssocB.2, and so on)

Note that:

i)  All of the files must be submitted using one command. Submitting a second time will erase everything you submitted earlier.

ii)  DO NOT SUBMIT object files.

iii)  You will receive NO score if your code does not compile/execute on stdsun.

Design questions that should be included in the README file:

i)  How do you represent the data internally in your program?

ii)  What is the pseudocode of your approach?

iii)  How did you test your approach?

iv)  Is there a more efficient way to compute results than the simple naïve approach? (this one is optional – if you find a nice solution you will be given bonus points).
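As one possible direction for the optional question (not necessarily the intended answer), the sketch below builds an inverted index from each item to the set of TIDs containing it, so that counting the transactions that contain a pattern becomes a set intersection instead of a scan over every transaction. The names `build_tid_index` and `count_containing`, and the sample data, are hypothetical.

```python
from collections import defaultdict


def build_tid_index(transactions):
    """Map each item to the set of TIDs whose transaction contains it.

    transactions: dict of TID -> set of item names.
    """
    index = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            index[item].add(tid)
    return index


def count_containing(index, items):
    """Count the transactions that contain every item in a non-empty list."""
    if not items:
        raise ValueError("items must be non-empty")
    # Intersect the TID sets, smallest first to keep intermediate sets small.
    tid_sets = sorted((index.get(item, set()) for item in items), key=len)
    return len(set.intersection(*tid_sets))


# Made-up sample data, not from the assignment.
transactions = {
    1: {"A1", "A2", "D5"},
    2: {"A1", "A2"},
    3: {"A5", "D2"},
    4: {"A1", "A2", "D5", "B1"},
}
index = build_tid_index(transactions)
n_lhs = count_containing(index, ["A1", "A2"])         # 3
n_both = count_containing(index, ["A1", "A2", "D5"])  # 2
print(n_both / len(transactions), n_both / n_lhs)
```

The index is built once in a single pass over the dataset; each pattern then touches only the TID sets of the items it mentions.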