Report of B649 Project 0 Part 2

Author : Fei Teng

Date : 9/22/2011

This report is organized as follows: Section 1 gives a brief introduction to the report content; Section 2 describes the Hadoop configuration details based on the homework requirements; Section 3 presents the performance analysis results of the experiments on both the CS machines and the Eucalyptus virtual machines; Section 4 offers some suggestions for improving FutureGrid.

Section 1 Introduction

This report presents a performance analysis of the Hadoop MapReduce computing framework on a simple word-count application. Two execution environments were prepared for this purpose. The first experiment was deployed on local physical Linux machines at Indiana University. The second one runs on virtual machines provided by the FutureGrid [1] computing grid. The performance analysis is based on the processes' real (wall-clock) execution time. Details about the environment settings and problem sizes are listed in Section 3.

Section 2 Hadoop Configuration

According to the report requirements, I will explain my understanding of the Hadoop configuration with reference to several essential files, item by item. For the program implementation details, please refer to [2].

Hadoop configuration is controlled by a number of .xml files. Most parameter tuning can be done simply by revising the value of each item in these files. In this experiment, the following files played major roles:

1) conf/masters: specifies the nodes that will be used as the NameNode of HDFS and the JobTracker of Hadoop, in the form of IP addresses or hostnames.

2) conf/slaves: specifies the nodes that will be used as the DataNodes of HDFS and the TaskTrackers of Hadoop, in the form of IP addresses or hostnames.
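For illustration, both files simply list one node identity per line. A sketch of a two-node layout (the hostnames below are hypothetical placeholders) could be:

```
node01.cs.indiana.edu
```

for conf/masters, and

```
node01.cs.indiana.edu
node02.cs.indiana.edu
```

for conf/slaves, which reuses the master node as a worker so that both machines run map and reduce tasks.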

3) conf/core-site.xml: site-specific configuration for a given Hadoop installation. There is just one property in it for my experiment, fs.default.name, which tells Hadoop the URI of the NameNode of HDFS.
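A minimal sketch of this file, assuming a hypothetical master hostname and the common NameNode port 9000, might look like:

```xml
<!-- conf/core-site.xml (sketch; hostname and port are placeholders) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node01.cs.indiana.edu:9000</value>
  </property>
</configuration>
```

Every DataNode and client reads this URI to locate the NameNode, so it must be reachable from all nodes in conf/slaves.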

4) conf/hdfs-site.xml: the HDFS-related configurations for the experiment are included here.

  • dfs.http.address: the address and base port where the DFS NameNode web UI will listen. If the port is 0 then the server will start on a free port.
  • dfs.name.dir: path to the HDFS name table directory on the MasterNode/NameNode.
  • dfs.data.dir: path for storing blocks on the DataNodes/slaves; must be unique.
  • dfs.secondary.http.address: the secondary NameNode HTTP server address and port. If the port is 0 then the server will start on a free port.
  • dfs.datanode.address: the address where the DataNode server will listen. If the port is 0 then the server will start on a free port.
  • dfs.datanode.http.address: the DataNode HTTP server address and port. If the port is 0 then the server will start on a free port.
  • dfs.datanode.ipc.address: the DataNode IPC server address and port. If the port is 0 then the server will start on a free port.
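As a hedged sketch of how a few of these properties fit together (the directory paths are hypothetical placeholders; 50010 is Hadoop's default DataNode data-transfer port):

```xml
<!-- conf/hdfs-site.xml (sketch; paths are placeholders) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/user/hdfs/name</value>   <!-- name table on the NameNode -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/user/hdfs/data</value>   <!-- block storage on each DataNode -->
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>          <!-- DataNode listen address -->
  </property>
</configuration>
```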

5) conf/mapred-site.xml: MapReduce runtime parameter settings.

  • mapred.job.tracker: IP/hostname:port of the Hadoop JobTracker.
  • mapred.job.tracker.http.address: the JobTracker HTTP server address and port the server will listen on. If the port is 0 then the server will start on a free port.
  • mapred.local.dir: the data node's local temporary directory.
  • mapred.task.tracker.http.address: the TaskTracker HTTP server address and port. If the port is 0 then the server will start on a free port.
  • mapred.tasktracker.map.tasks.maximum: maximum map tasks per node. We set the number of mappers here.
  • mapred.tasktracker.reduce.tasks.maximum: maximum reduce tasks per node. We set the number of reducers here.
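A sketch of this file for the experiment's setting of 2 mappers and 2 reducers per node (the JobTracker hostname and port are placeholders):

```xml
<!-- conf/mapred-site.xml (sketch; hostname/port are placeholders) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node01.cs.indiana.edu:9001</value>  <!-- JobTracker endpoint -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>                           <!-- 2 map slots per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>                           <!-- 2 reduce slots per node -->
  </property>
</configuration>
```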

Note that in a shared environment, we must choose different ports for each MapReduce-related process; otherwise the processes will fail to start because of port collisions among different users.
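For instance, two users sharing the same machines could avoid collisions by shifting each default port by a per-user offset; a hypothetical sketch for one HDFS property (the offset value is an assumption, not a Hadoop convention):

```xml
<!-- in user B's conf/hdfs-site.xml: default port 50010 shifted by +1000 -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:51010</value>
</property>
```

The same offset would be applied to every other address/port property listed above.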

Section 3 Performance analysis

This section gives the performance data of the WordCount program running in the different environments.

3.1 Local physical machines performance results

Environment settings:

  • CPU: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
  • Memory: 4GB
  • Mappers: 2 per node
  • Reducers: 2 per node
  • Input file size: 230MB

Table 1 gives the execution times of the one-node and two-node Hadoop clusters, and Figure 1 shows the line chart of the performance results.

Run / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / Avg
One node / 91.55 / 89.53 / 91.46 / 101.78 / 93.67 / 90.66 / 90.61 / 89.50 / 96.59 / 91.49 / 92.68
Two nodes / 55.47 / 54.66 / 54.53 / 54.53 / 58.4 / 58.4 / 55.46 / 52.45 / 55.37 / 52.52 / 55.18

Table 1. Local physical machine running results

Figure 1. Execution time (local machines)

The parallel speed-up (92.68/55.18 ≈ 1.68) can be easily noticed; see Figure 4.

3.2 FutureGrid virtual machines performance results

Environment settings:

  • CPU: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
  • Memory: 1GB
  • Mappers: 2 per node
  • Reducers: 2 per node
  • Input file size: 230MB

Table 2 gives the execution times of the one-node and two-node Hadoop clusters, and Figure 2 shows the line chart of the performance results.

Run / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / Avg
One node / 170 / 160 / 170 / 165 / 161 / 162 / 163 / 162 / 169 / 169 / 165.1
Two nodes / 109.4 / 106.7 / 108.8 / 107.9 / 101.9 / 101.1 / 103.8 / 98.8 / 106.0 / 104.0 / 104.8

Table 2. FutureGrid virtual machine running results

Figure 2. Execution time (virtual machines)

As we can see from Figures 3 and 4, because of the difference in hardware settings, the WordCount performance on the local machines is much better than on the virtual machines.

Figure 3. Average execution time in both environments

Figure 4. Parallel speed-up from both tests

Section 4 Suggestions

I believe the medium images from FutureGrid are not sufficient for thorough tests, because the disk space is just 2GB. Perhaps we could be allocated virtual machines with better specifications to run more complex experiments. In addition, the network connection of FutureGrid is not very reliable; for example, it took everyone at least two days to gain access to the Eucalyptus resources because of a network issue.

References:

[1]

[2]