Report of B649 Project 0, Part 2
Author: Fei Teng
Date: 9/22/2011
This report is organized as follows: Section 1 gives a brief introduction to the report content; Section 2 describes the Hadoop configuration details based on the homework requirements; Section 3 presents the performance analysis results of the experiments, both from the CS machines and from the virtual machines of Eucalyptus; Section 4 offers some suggestions for improving FutureGrid.
Section 1 Introduction
The report presents a performance analysis of the Hadoop MapReduce computing framework using a simple word count application. Two execution environments were prepared for this purpose. The first experiment was deployed on the local physical Linux machines of Indiana University; the second runs on the virtual machines provided by the FutureGrid [1] computing grid. The performance analysis is based on the real (wall-clock) execution time of the process. Details about the environment settings and problem size are listed in Section 3.
Section 2 Hadoop Configuration
As required by the report, I explain my understanding of the Hadoop configuration with reference to several essential files, item by item. For program implementation details, please refer to [2].
Hadoop configuration is controlled by a number of .xml files, and most parameter tuning can be done simply by revising the value of each item in these files. In this experiment, the following files played the major roles (a combined sketch of these files follows the list):
1) conf/masters: specifies the nodes that will be used as the NameNode of HDFS or the JobTracker of Hadoop, in the form of IP addresses or hostnames.
2) conf/slaves: specifies the nodes that will be used as DataNodes of HDFS or TaskTrackers of Hadoop, in the form of IP addresses or hostnames.
3) conf/core-site.xml: site-specific configuration for a given Hadoop installation. There is just one property in it for my experiment, fs.default.name, which tells Hadoop the URI of the NameNode of the HDFS.
4) conf/hdfs-site.xml: the HDFS-related configurations used in the experiment:
- dfs.http.address: the address and base port the DFS NameNode web UI will listen on. If the port is 0, the server starts on a free port.
- dfs.name.dir: path to the HDFS name table directory, stored on the master node/NameNode.
- dfs.data.dir: path where the DataNodes/slaves store blocks; must be unique.
- dfs.secondary.http.address: the secondary NameNode HTTP server address and port. If the port is 0, the server starts on a free port.
- dfs.datanode.address: the address the DataNode server will listen on. If the port is 0, the server starts on a free port.
- dfs.datanode.http.address: the DataNode HTTP server address and port. If the port is 0, the server starts on a free port.
- dfs.datanode.ipc.address: the DataNode IPC server address and port. If the port is 0, the server starts on a free port.
5) conf/mapred-site.xml: MapReduce runtime parameter settings:
- mapred.job.tracker: IP/hostname:port of the Hadoop JobTracker.
- mapred.job.tracker.http.address: the JobTracker HTTP server address and port the server will listen on. If the port is 0, the server starts on a free port.
- mapred.local.dir: the DataNode's local temporary directory.
- mapred.task.tracker.http.address: the TaskTracker HTTP server address and port. If the port is 0, the server starts on a free port.
- mapred.tasktracker.map.tasks.maximum: maximum map tasks per node. We set the number of mappers here.
- mapred.tasktracker.reduce.tasks.maximum: maximum reduce tasks per node. We set the number of reducers here.
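To make the descriptions above concrete, the following is a minimal sketch of how these files might look for this setup. The hostname masternode, the slave hostnames, the ports, and the local paths are placeholders rather than the exact values used in the experiment; the mapper and reducer maxima match the settings given in Section 3. Each <configuration> element corresponds to a separate file, as indicated by the comment above it.

<!-- conf/masters and conf/slaves are plain text files, one hostname per line, e.g.:
     masters: masternode
     slaves:  slavenode1
              slavenode2 -->

<!-- conf/core-site.xml: URI of the HDFS NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://masternode:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: NameNode/DataNode storage paths and ports -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/myuser/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/myuser/hdfs/data</value>
  </property>
  <property>
    <!-- in a shared environment, pick a port no other user has taken -->
    <name>dfs.http.address</name>
    <value>masternode:50070</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: JobTracker address and per-node task limits -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>masternode:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>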
It is worth noting that in a shared environment, each user must choose different ports for every MapReduce-related process; otherwise the processes will fail to start because of port collisions among users.
Section 3 Performance Analysis
This section gives the performance data of the WordCount program running in the different environments.
3.1 Local physical machine performance results
Environment settings:
- CPU: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
- Memory: 4GB
- Mappers: 2 per node
- Reducers: 2 per node
- Input file size: 230MB
Table 1 gives the execution times of the one-node and two-node Hadoop clusters, and Figure 1 shows a line chart of the performance results.
Run / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / Avg
One node / 91.55 / 89.53 / 91.46 / 101.78 / 93.67 / 90.66 / 90.61 / 89.50 / 96.59 / 91.49 / 92.68
Two nodes / 55.47 / 54.66 / 54.53 / 54.53 / 58.4 / 58.4 / 55.46 / 52.45 / 55.37 / 52.52 / 55.18
Table 1. Local physical machine running results (execution time in seconds)
Figure 1. Execution time (local machines)
The parallel speed-up can easily be noticed in Figure 4.
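As a quick check from Table 1, the average two-node speed-up is 92.68 / 55.18 ≈ 1.68, against an ideal value of 2.0 for two nodes.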
3.2 FutureGrid virtual machine performance results
Environment settings:
- CPU: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
- Memory: 1GB
- Mappers: 2 per node
- Reducers: 2 per node
- Input file size: 230MB
Table 2 gives the execution times of the one-node and two-node Hadoop clusters, and Figure 2 shows a line chart of the performance results.
Run / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / Avg
One node / 170 / 160 / 170 / 165 / 161 / 162 / 163 / 162 / 169 / 169 / 165.1
Two nodes / 109.4 / 106.7 / 108.8 / 107.9 / 101.9 / 101.1 / 103.8 / 98.8 / 106.0 / 104.0 / 104.8
Table 2. FutureGrid virtual machine running results (execution time in seconds)
Figure 2. Execution time (virtual machines)
As Figures 3 and 4 show, because of the difference in hardware settings, the WordCount performance on the local machines is much better than on the virtual machines.
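From Table 2, the corresponding two-node speed-up on the virtual machines is 165.1 / 104.8 ≈ 1.58, slightly lower than the 1.68 measured on the physical machines.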
Figure 3. Average execution time in both environments
Figure 4. Parallel speed-up from both tests
Section 4 Suggestions
I find that the medium images from FutureGrid are not sufficient for thorough tests because their disk space is only 2GB. Perhaps we could be allocated virtual machines with better specifications so that we can run more complex experiments. In addition, the network connection to FutureGrid is not entirely reliable; for example, it took everyone at least two days to gain access to the Eucalyptus resources because of a network issue.
References:
[1]
[2]