ITCS 6161/ITCS 8162: Knowledge Discovery in Databases
Assignment Instructions
Instructions:
Software required:
- PuTTY:
- WinSCP:
- Oracle VirtualBox:
- Cloudera:
For a detailed description of how to install Cloudera, watch this video:
By default, Cloudera comes with Eclipse and the Hadoop packages installed, which can be used to write MapReduce programs. Cloudera contains a single-node cluster. Use Cloudera to test your code on small inputs; for large inputs, use the DSBA cluster. Once you are confident that your code works correctly, run it on the cluster.
******************************************************************************
To log in to the DSBA Hadoop cluster, follow the instructions below:
TASK – 1: Logging into the Hadoop cluster and running simple commands
1. To log in to Hadoop via the FTP client (in order to copy and paste data and to view files):
Open your FTP client (WinSCP)
Choose Session | New Session
File protocol: SFTP
Host name: dsba-hadoop.uncc.edu
Type your username and password
Check the Save Password checkbox and click Save
2. Log in to dsba-hadoop.uncc.edu via PuTTY (in order to run commands).
3. Run sample text processing on ListOfInputActionRules using the GREP command.
ListOfInputActionRules is a text file containing one action rule per line.
For example:
(a, a1->a2) ^ (c = c2) -> (f, f1->f0) [2, 50%]
(a, a1->a3) ^ (b, ->b1) -> (f, f1->f0) [3, 75%]
(a, a1->a3) ^ (c = c2) -> (f, f1->f0) [1, 80%]
(a, ->a3) ^ (b, b2->b1) -> (f, f1->f0) [3, 50%]
4. To move files from the client to the cluster, use the following command:
hadoop fs -put path-of-the-file-in-client path-of-the-destination-folder-in-cluster
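For example, assuming ListOfInputActionRules sits in your home directory on the client and your HDFS home directory is /user/yourid (both paths here are hypothetical):
hadoop fs -put ~/ListOfInputActionRules /user/yourid/ListOfInputActionRules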
5. Run the following GREP command on ListOfInputActionRules to return all lines of text (action rules) that contain the word 'a1':
hadoop org.apache.hadoop.examples.Grep input-path-of-ListOfInputActionRules-file path-of-destination-folder ".*a1.*"
NOTE: The destination folder must not exist before running this command. To remove a folder, use the following command:
hadoop fs -rm -r path-of-the-folder
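For example, continuing with the hypothetical paths above, and assuming /user/yourid/grep-out does not yet exist:
hadoop org.apache.hadoop.examples.Grep /user/yourid/ListOfInputActionRules /user/yourid/grep-out ".*a1.*"
If a previous run left the output folder behind, remove it first:
hadoop fs -rm -r /user/yourid/grep-out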
6. To copy the output folder back to the client, use the following command:
hadoop fs -get path-of-the-output-folder-in-cluster path-of-the-folder-in-client
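For example, to copy the grep output folder from the sketch above back to your home directory on the client:
hadoop fs -get /user/yourid/grep-out ~/grep-out
The results are plain-text files inside that folder (typically named part-00000 or similar), which you can then open in WinSCP.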
7. Repeat steps 4-6 for the Mammals book text file and return all lines of text that contain the word "mammal". Download the Mammals book text file here:
For TASK-2 and TASK-3, use the Mammals book as the input file.
TASK – 2: Running WordCount
Read the "MapReduce Tutorial" from:
Basic procedure to follow when executing a MapReduce program on a Hadoop cluster:
- The inputs must be transferred from the local system to HDFS
- The JAR file can stay on the FTP client side (i.e., in WinSCP)
- The output of MapReduce programs is written to HDFS, from which it can be transferred back to the local system (see the example commands after this list)
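Putting those three points together, a minimal end-to-end run might look like this (the jar name, file names, and all paths are hypothetical, and the jar's manifest is assumed to name the main class):
hadoop fs -mkdir -p /user/yourid/wc-input
hadoop fs -put ~/mammals.txt /user/yourid/wc-input/
hadoop jar ~/wordcount.jar /user/yourid/wc-input /user/yourid/wc-output
hadoop fs -get /user/yourid/wc-output ~/wc-output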
To understand how MapReduce works, see the following links along with an example:
- for Hadoop version 1.0
All basic HDFS commands can be found here:
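For quick reference, a few of the commands you will use most often (the paths are placeholders): hadoop fs -ls lists a directory, hadoop fs -mkdir -p creates one, and hadoop fs -cat prints a file to the terminal:
hadoop fs -ls /user/yourid
hadoop fs -mkdir -p /user/yourid/wc-input
hadoop fs -cat /user/yourid/grep-out/part-00000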
- Create a new Java project in Cloudera Eclipse
- Cloudera Eclipse contains a sample MapReduce project that includes all of the required MapReduce .jar files. Import all of those .jar files into your project.
- Copy WordCount v2.0 from the MapReduce Tutorial into your project
- Export your project as a .jar file
- Run the .jar file on the cluster and produce the output
- Use the following command to run the .jar file:
- hadoop jar path-of-the-jar-file path-of-input-folder path-of-output-folder
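For example, assuming the exported jar is saved as wordcount.jar in your client home directory and its main class was not set during export (in the tutorial, WordCount v2.0's class is named WordCount2, so pass it explicitly):
hadoop jar ~/wordcount.jar WordCount2 /user/yourid/wc-input /user/yourid/wc-output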
TASK – 3: Running a modified version of WordCount
- Download the .jar file from
- Save it in the client
- Run it and produce the output (an example invocation is sketched below)
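For example, assuming the downloaded file is saved as modified-wordcount.jar in your home directory and its manifest names the main class (both names are assumptions, since the actual jar is linked above):
hadoop jar ~/modified-wordcount.jar /user/yourid/wc-input /user/yourid/wc-output-modified
Remember that the output folder must not already exist.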
TASK – 4: Write-up comparing the results of TASK-2 and TASK-3
Submit all your source code, all your output files, and a comparison write-up for TASK-2 and TASK-3. We need the following outputs:
- GREP command output for the ListOfInputActionRules file
- GREP command output for the Mammals book
- WordCount v2.0 output
- Modified WordCount output