ITCS 6161/ITCS 8162: Knowledge Discovery in Databases
Assignment Instructions
Instructions:
Software required:
- PuTTY:
- WinSCP:
- Oracle VirtualBox:
- Cloudera:
For a detailed description of how to install Cloudera, watch this video:
By default, Cloudera comes with Eclipse and the Hadoop packages installed, which can be used to write MapReduce programs. Cloudera contains a single-node cluster. Use Cloudera to test your code on small inputs; for large inputs, use the DSBA cluster. Once you are confident that your code works correctly, run it on the cluster.
******************************************************************************
To log in to the DSBA Hadoop cluster, follow the instructions below:
TASK – 1: Logging into the Hadoop cluster and running simple commands
1. To log in to Hadoop via the FTP client (in order to copy and paste data and to view files):
Open your FTP client (WinSCP)
Choose Session | New Session
File protocol: SFTP
Host name: dsba-hadoop.uncc.edu
Type your username and password
Check the Save Password checkbox and click Save
2. Log in to dsba-hadoop.uncc.edu via PuTTY (in order to run commands).
3. Run sample text processing on ListOfInputActionRules using the GREP command.
ListOfInputActionRules is a text file containing one action rule per line.
For example:
(a, a1->a2) ^ (c = c2) -> (f, f1->f0) [2, 50%]
(a, a1->a3) ^ (b, ->b1) -> (f, f1->f0) [3, 75%]
(a, a1->a3) ^ (c = c2) -> (f, f1->f0) [1, 80%]
(a, ->a3) ^ (b, b2->b1) -> (f, f1->f0) [3, 50%]
4. To move files from the client to the cluster, use the following command:
hadoop fs -put path-of-the-file-in-client path-of-the-destination-folder-in-cluster
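For example, assuming ListOfInputActionRules sits in your home directory on the client and your HDFS home directory is /user/yourid (both paths here are hypothetical):
hadoop fs -put ~/ListOfInputActionRules /user/yourid/ListOfInputActionRules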
5. Run the following GREP command on ListOfInputActionRules to return all lines of text (action rules) that contain the word 'a1':
hadoop org.apache.hadoop.examples.Grep input-path-of-ListOfInputActionRules-file path-of-destination-folder ".*a1.*"
NOTE: The destination folder must not exist before running this command. To remove a folder, use the following command:
hadoop fs -rm -r path-of-the-folder
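For example, continuing with the hypothetical paths above, and assuming /user/yourid/grep-out does not yet exist:
hadoop org.apache.hadoop.examples.Grep /user/yourid/ListOfInputActionRules /user/yourid/grep-out ".*a1.*"
If a previous run left the output folder behind, remove it first:
hadoop fs -rm -r /user/yourid/grep-out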
6. To copy the output folder back to the client, use the following command:
hadoop fs -get path-of-the-output-folder-in-cluster path-of-the-folder-in-client
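For example, to copy the grep output folder from the sketch above back to your home directory on the client:
hadoop fs -get /user/yourid/grep-out ~/grep-out
The results are plain-text files inside that folder (typically named part-00000 or similar), which you can then open in WinSCP.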
7. Repeat steps 4-6 for the Mammals book text file and return all lines of text that contain the word "mammal". Download the Mammals book text file here:
For TASK-2 and TASK-3, use the Mammals book as the input file.
TASK – 2: Running WordCount
Read the "MapReduce Tutorial" from:
Basic procedure to follow when executing a MapReduce program on a Hadoop cluster:
- The inputs must be transferred from the local system to HDFS
- The JAR file can stay on the FTP client side (i.e., in WinSCP)
- The output of MapReduce programs is written to HDFS, from which it can be transferred back to the local system (see the example commands after this list)
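Putting those three points together, a minimal end-to-end run might look like this (the jar name, file names, and all paths are hypothetical, and the jar's manifest is assumed to name the main class):
hadoop fs -mkdir -p /user/yourid/wc-input
hadoop fs -put ~/mammals.txt /user/yourid/wc-input/
hadoop jar ~/wordcount.jar /user/yourid/wc-input /user/yourid/wc-output
hadoop fs -get /user/yourid/wc-output ~/wc-output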
To understand how MapReduce works, see the following links along with an example:
- for Hadoop version 1.0
All basic HDFS commands can be found here:
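For quick reference, a few of the commands you will use most often (the paths are placeholders): hadoop fs -ls lists a directory, hadoop fs -mkdir -p creates one, and hadoop fs -cat prints a file to the terminal:
hadoop fs -ls /user/yourid
hadoop fs -mkdir -p /user/yourid/wc-input
hadoop fs -cat /user/yourid/grep-out/part-00000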
- Create a new Java project in Cloudera Eclipse
- Cloudera Eclipse contains a sample MapReduce project that includes all of the required MapReduce .jar files. Import all of those .jar files into your project.
- Copy WordCount v2.0 from the MapReduce Tutorial into your project
- Export your project as a .jar file
- Run the .jar file on the cluster and produce the output
- Use the following command to run the .jar file:
- hadoop jar path-of-the-jar-file path-of-input-folder path-of-output-folder
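For example, assuming the exported jar is saved as wordcount.jar in your client home directory and its main class was not set during export (in the tutorial, WordCount v2.0's class is named WordCount2, so pass it explicitly):
hadoop jar ~/wordcount.jar WordCount2 /user/yourid/wc-input /user/yourid/wc-output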
TASK – 3: Running a modified version of WordCount
- Download the .jar file from
- Save it in the client
- Run it and produce the output (an example invocation is sketched below)
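For example, assuming the downloaded file is saved as modified-wordcount.jar in your home directory and its manifest names the main class (both names are assumptions, since the actual jar is linked above):
hadoop jar ~/modified-wordcount.jar /user/yourid/wc-input /user/yourid/wc-output-modified
Remember that the output folder must not already exist.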
TASK – 4: Write-up comparing the results of TASK-2 and TASK-3
Submit all your source code, all your output files, and a comparison write-up for TASK-2 and TASK-3. We need the following outputs:
- GREP command output for the ListOfInputActionRules file
- GREP command output for the Mammals book
- WordCount v2.0 output
- Modified WordCount output