B649 Project 1 Part 2 Report

Bingjing Zhang

Abstract

This report is for Project 1 Part 2 in B649 Cloud Computing. The reports gives testsand analysis to the performance of Hadoop MapReduce [1],a software framework for distributed processing of large data sets on compute clustersby using traditional Linux clusters in CS Department [2] and Eucalyptus in FutureGrid [3]. The tests provide abasic methodology of doing Hadoop performance testsand the results show that the environments can affect Hadoop performance differently.

Introduction

Hadoop MapReduce is an open source implementationbased on Google’s MapReduce. It works on HDFS [4], an open source implementation of Google File System [5]. As a result, it is important to know how to run Hadoop MapReduce and how its performance is like in learning MapReduce.

By the requirement of the assignment, the famous MapReduce program example WordCount [6] is used as an example to do performance test. In the tests, WordCount is firstly executed on one compute node, and then executed on two nodes. By recording the execution time, a speed-up chart can be drawn to show how much performance is gained by using more compute resources. The tests are done on two different environments. One is the traditional Linux clusters in CS department, the compute nodes here are physical machines. Another is Eucalyptus, which is a cloud environment similar to Amazon Web Services [7] therefore the compute nodes provided are virtual machine.

The report includes three parts, the first part is the Hadoop cluster configuration used in the test, and the second part is about the test results, the third part is a feedback to FutureGridenvironment.

Hadoop Cluster Configuration

The Hadoop version used here is 0.20.203.0, the current stable version introduced by Hadoop official website [8]. Before Hadoop cluster configuration, there are several steps required to be done first.

Install Java, set JAVA_HOME in ~/.bashrc and conf/hadoop-env.sh.
Setup passphraseless SSH.
Set Hostname. This requires adding “ip hostname” lines in /etc/hosts, and calling “hostname” command.

These steps are optional if the compute nodes have already been provided with these features.Without Step 1 and 2, one cannot start a Hadoop environment properly. For step 3, if you referring machines by their IP addresses in configuration files, you can start a Hadoop environment, but you will meet errors in shuffling stage of MapReduce job execution. In this performance test, Installing Hadoop in CS Linux cluster only requires Step 1 and 2, but for virtual machines in FutureGrid, all 3 steps are required.

The key part of Hadoop Configuration is editing the following 5 files: conf/master, conf/slaves, conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml.

File conf/master lists the hostnames of master nodes, each with one line. Master nodes are used as NameNode and JobTracker. NameNode is central node to manage and control HDFS, while JobTracker is the program to manage MapReduce jobs. Typically one machine in the cluster is designated as the NameNode and another machine the as JobTracker, exclusively. But in this test, only one node is used as master. It acts as NameNode and JobTracker at the same time.

File conf/slave lists the hostnames of slave nodes, each with one line. A slave node acts as both DataNode and TaskTracker.DataNode is the slave of NameNode. It is responsible to manage local data storage.TaskTracker is the slave of JobTracker. It manages local Map/Shuffle/Reduce tasks. In this performance test, master node is also used as slave node.

To further configureNameNode and JobTracker on master node, you need to editconf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml.In core-site.xml, youare required to set NameNodeURI, with the value in a format of hdfs://hostname:port. In this performance test, the port is set to 9000.

In conf/hdfs-site.xml, you can set dfs.name.dir to a path on a Namenode where it stores the namespace and transactions logs and set dfs.data.dir to a path on a DataNode where it should store its data blocks.

In conf/mapred-site.xml, you can set the URI of JobTracker in a format of hostname:port. The maximum number of Map and Reduce tasks per node is also set here, by editing mapred.tasktracker.{map|reduce}.tasks.maximum. In this performance test, the number of Map tasks is set to 2, so is the number of Reduce tasks.

The ports used in edit conf/core-site.xml and conf/mapred-site.xml should be unique available ports on a shared environment such as CS Linux Cluster. Otherwise, the other users cannot start Hadoop environment properly if there has already been one environment existing.

Performance Test

In performance test, a 133.8 MB input is used for WordCount. Under two different environments, CS Linux Cluster and Eucalyptus in FutureGrid, firstly the test is done on one compute node, and then is expanded to two nodes. By the difference of execution times, speed up can be learnt.

The machine settings are following:

Table 1. Machine CPU and memory specs

number of cores per node / core specs / memory
CS Linux Cluster / 2 / Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz / 4 GB
Eucalyptus in Future Grid / 1 / Intel(R) Xeon(R) CPU X5570 @ 2.93GHz / 1 GB

In CS Linux Cluster, two sets of machines are used in test. The first set of machines, iceman and wolverine, are both under Shark domain. The second set of machines, rogue and bandicoot are from different domains. Rogue is in Shark domain, but bandicoot is in burrow domain. The following are performance results:

Table 2. Execution time on iceman and wolverine

number of mappers on each node / number of reducers on each node / total number of nodes / execution time (an average of 10 runs, unit: milliseconds) / speed up (between single node and two nodes mode)
2 / 2 / 1 / 78383.5 / 1
2 / 2 / 2 / 68525.6 / 1.14

Table 3. Execution time on rogue and bandicoot

From two test cases above, we see that the speed up is very small, probably due to the fact that CS Linux Cluster is a shared cluster environment; it has lots of daily workload, and also has lots of network traffic inside of the cluster.

For Eucalyptus in FutureGrid, though each virtual machine only has one core, the test still sets 2 cores for each node by the requirement of the assignment.For performance results, though the virtual machines have longer execution times, they have much better speed up values.

Table 4. Execution time on Eucalyptus

For execution time chart, see Figure 1. And for speed up chart, see Figure 2.

FutureGrid Feedback

Eucalyptus in FutureGrid provides functionalities comparable to Amazon EC2. It is excellent that you can have some virtual resources only own by you and without pay. Under the context of building Hadoop environment, it will be easier for Hadoop configuration if the virtual machines have their own default unique hostnames once they get running.

Conclusion

This report shows a basic methodology to do performance test on Hadoop. The results show that the environment is important in performance test. If the test is only done in CS Linux Cluster, you may conclude that the speed up of Hadoop is bad. But after doing another test in FutureGrid, you may find the speed up of Hadoop can be much better. Since CS Linux Cluster is a shared cluster environment, the results there is misleading to some extent. The results on Eucalyptus are much better, but we still need to understand its hidden infrastructure before drawing the final conclusion.

Figure 1. Execution time chart

Figure 2. Speed up chart

Reference

[1] Hadoop MapReduce,

[2] Burrow and Shark,

[3] FutureGrid,

[4] HDFS,

[5] GFS,

[6] WordCount,

[7] AWS,

[8] Hadoop Common,