ENHANCE THE PERFORMANCE OF CLOUD COMPUTING WITH HADOOP


Dalya Raad Abbas

Kirkuk University, College of Science, Computer Science Department

Ministry of Higher Education and Scientific Research, Iraq

Email:


Abstract— Cloud computing started as an option and is now slowly becoming a necessity. The cloud provides quick solutions and can be considered cheap compared with other solutions, but, like anything else, it has disadvantages that need to be removed in order to enhance the cloud environment. In the cloud, and especially in IaaS services, the goal is to realize the words "pay less, get more profit". Beyond IaaS, keeping the client's data safe is equally important; data safety may be the client's first requirement for trusting the cloud, among many other requirements.

On the other side there is a technology called Hadoop, which can be considered a new technology in the world of cloud computing. Hadoop relies on a smart strategy: it runs on cheap hardware but provides much more, offering fast data processing relative to that inexpensive environment, more storage than the raw hardware requirements would suggest, and a technique for protecting data from loss. Hadoop has proven its success with many successful web services such as Twitter, Facebook, etc.

In the previous paper, the performance of data analysis with the help of both Hadoop and the cloud was examined using the basic tools of Hadoop. The results were impressive: up to 86% less processing time for large processed files.

In this paper another strategy for enhancing the performance of both Hadoop and cloud computing is used; compared with the previous paper, the results obtained reach up to 98% less processing time for large processed files.

Keywords—Cloud Computing; Hadoop; Java;

I.  Introduction

Based on "National institute standards and technology"(NIST) in 2013[10] the Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Cloud computing offers three major types of services. SaaS is the model in which an application is hosted as a service for customers who access it via the Internet. PaaS is the second type of cloud service; here the provider is responsible for delivering another kind of application, namely software resources: the PaaS vendor supplies all the resources required to build applications and services entirely over the Internet, without having to download or install software. IaaS simply offers the hardware so that your organization can use it; rather than purchasing servers and racks, you use the provider's data center without building or buying it, as the service provider rents those resources to you. IaaS can be dynamically scaled up or down based on the application's resource needs.

Hadoop [8] is an open-source framework that deals with distributed systems and has the ability to analyze and process big data, i.e., files of huge size (terabytes and more).

Hadoop consists of two major components: the first is the set of servers responsible for storage, and the second is the technique used to connect the servers inside the Hadoop cluster.

The first part is called HDFS. HDFS is distinguished from other distributed file systems such as NFS. NFS provides one machine to store the clients' files by making that portion visible to them, so if there is a heavy load on this server from the clients, the server can crash; other distributed systems suffer from similar problems. HDFS has the ability to solve these problems.

HDFS divides the storage role among three major components: the NameNode, the DataNode, and the Secondary NameNode. The NameNode works as an index: it receives the incoming files that clients need to store for later processing, divides the data, and then distributes the partitioned data to the other servers in the Hadoop cluster. The DataNode is the server that stores the divided data coming from the NameNode. Hadoop also provides a backup in case any kind of failure happens to the NameNode: the Secondary NameNode, which works as a backup for the NameNode server.

The second component, called MapReduce, is the heart of Hadoop. MapReduce is a programming framework originally created by Google and later developed by Apache Hadoop. It consists of a JobTracker and TaskTrackers: the client contacts the JobTracker and sends a request for processing the data, the JobTracker forwards the request to one of the TaskTrackers, and then the processing starts.
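To make the MapReduce model concrete, the following is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API; the class names TokenMapper and SumReducer are illustrative and not taken from this paper. The mapper emits a (word, 1) pair for every token in its input split, and the reducer sums the counts for each word after the shuffle:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in its input split.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts for each word after the shuffle phase.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}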

II.  RELATED WORKS

[1] The researchers in this dissertation merged Hadoop and cloud computing and implemented secure access to the cloud using fingerprint identification and face identification. The Hadoop cloud used in this dissertation was connected to mobile and thin clients over wired or wireless networks, and the master server was connected to the slave servers through the Hadoop cloud. The implemented cloud produced the SaaS, PaaS, and IaaS services. The server operating system was a Linux platform connected via Ethernet, Wi-Fi, and 3G to reach the clients, and the cloud was built using the (JamVM) virtual machine as a platform for J2ME, which works as a Java platform for mobile clients.

The results showed that fingerprint identification and face identification were processed within only 2.2 seconds to identify a person.

[2] The researchers in this dissertation studied and proved that the performance of VMs can be increased by load balancing across all VMs in the cloud. They implemented Hadoop among these VMs to obtain the load-balancing feature. The example cloud they used was the Eucalyptus cloud, and the system they proposed was the EuQoS system. They tested the proposed system on real-time data and then compared it with the normal Eucalyptus cloud running Hadoop; they found that their proposed system improved performance by 6.94% compared with Hadoop.

The proposed EuQoS system consists of HDFS, MapReduce, HBase, and load balancing. The MapReduce task is responsible for mapping the jobs, and HDFS is responsible for saving the status of the mapping and delivering it to the reduce task. For HDFS they inherited the basic idea of HDFS and created a DFS that contains a namenode, responsible for opening, naming, and closing the files, and a datanode, responsible for read and write requests and for handling the actual files stored in it. In the HBase phase, the EuQoS system is merged with the Eucalyptus cloud: the HMaster server assigns work to the HRegion servers, and if an HRegion fails to do the work, the HMaster reassigns it to another HRegion; the HRegion servers are responsible for the reads, writes, and requests that come from the HMaster. The last component of the EuQoS system is load balancing, which consists of two parts, a load balancer and an agent-based monitor; the load balancer is responsible for three functions: balance triggering, EuQoS scheduling, and VM controlling.

As a result, the researchers used the IPV for the connection between the VMs and found that the performance of the Eucalyptus cloud with Hadoop was lower than that of the EuQoS system.

[3] The researchers in this dissertation compared the PaaS options used in the cloud (Hadoop, Dryad, HPCC), and, using cloud computing and Hadoop, they proposed an application that handles one of the most complex kinds of data: vision computing. They created a Rhizome cloud for vision detection and computing with the help of Hadoop, so that data captured from video could be analyzed quickly. A representation was used as the input to the cloud, and the nodes in the Hadoop cluster were treated as grains, scheduled with a FIFO algorithm. Each grain works independently of the other grains; Hadoop offers this feature, which increases the ability to protect the whole cloud and ensures that a failure in one grain does not affect the other grains. The grains work under a grain manager (the master node), which takes responsibility for watching the work of the other grains; the grains are responsible for computing the vision through video capture, motion detection, or even matrix multiplication. The performance of Hadoop with JNI Rhizome was tested against another application using Hadoop with Rhizome; the input was 200,000 frames from a two-hour video. JNI with Rhizome took around 80 minutes for the 200,000 frames, while Hadoop with Rhizome alone took around 31 minutes for the same input. In another comparison, using matrix multiplication, the high-quality video (1920 x 1080) took 8 M/Sec, while using Hadoop to analyze the surveillance cameras improved the speed, yielding results within only a few milliseconds.

[4] In the previous paper, an investigation was carried out to improve the performance of Hadoop and cloud computing; the performance of cloud computing with the basic tools of Hadoop achieved a success of 86%. In this paper another investigation is made based on the previous paper: a different scenario is built according to the needs of the client, where some needed to increase the tools they used and others needed to decrease them. To make this investigation fair, the same program was used, but it was built differently from the previous scenario.

The first part of this paper provides rich material on cloud computing and Hadoop, in order to understand the characteristics, architecture, and working nature of each of them separately. The second part lists a group of related works from previous years and studies them in some detail. The third part lists the objectives of this paper. The fourth part sets up a tiny virtual cloud, then installs a single-cluster Hadoop on one machine and builds two different cases to investigate the behavior of cloud computing and Hadoop. The fifth part presents the results of processing 72 files of different sizes and contents from the two cases built in the fourth part, followed by a discussion of the results to determine whether there is an improvement based on merging cloud computing with Hadoop. The sixth part presents the final conclusion.

III.  Objective

The objective of this paper is to build a tiny virtual cloud and merge Hadoop with it, then build two different kinds of cases: one case includes the enhanced Hadoop and the cloud together, and the other case is the cloud by itself without Hadoop. The performance of both is then tested, and the results of the enhanced Hadoop with the cloud on one side are compared with the results of the cloud without Hadoop on the other side, to determine whether the performance can be increased.

IV.  METHODOLOGY

First of all, a tiny cloud with normal resource requirements was established and a single-cluster Hadoop was installed. This cloud includes all the components of Hadoop in the same cloud; a virtual data center was built inside it, which means the master server and the slave server run in the same cloud.
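For illustration, a pseudo-distributed (single-machine) Hadoop cluster of this kind is typically configured with entries like the following. This is a minimal sketch assuming Hadoop 1.x property names and conventional local ports, not the actual configuration files used in this work:

<!-- core-site.xml: points the filesystem at the local HDFS NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: points MapReduce clients at the local JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>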

The administrator of the cloud can monitor the performance of the running servers in the virtual data center using two window interfaces, which provide complete information about the virtual data center.

The health of HDFS and its servers can be monitored using a special web page; the administrator can connect to the HDFS interface through port number (50070).

The performance of MapReduce can also be monitored, including every request coming from a client, how the request is served, and how much time each request takes to finish; this job can be accomplished through port number (50030).

Figure 1: The HDFS of Hadoop

Figure 2: The MapReduce of Hadoop
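As a small illustration, these two web interfaces can also be polled programmatically. The following is a minimal Java sketch, assuming the cluster runs on localhost (the class name ClusterHealthCheck and the helper isUp are hypothetical, for illustration only); it checks whether the NameNode web UI on port 50070 and the JobTracker web UI on port 50030 respond:

import java.net.HttpURLConnection;
import java.net.URL;

public class ClusterHealthCheck {

    // Hypothetical helper: returns true if the given web UI answers with HTTP 200.
    static boolean isUp(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(3000);
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Standard Hadoop 1.x web UI ports, as used in this paper.
        System.out.println("HDFS NameNode UI (50070): " + (isUp("http://localhost:50070") ? "up" : "down"));
        System.out.println("MapReduce JobTracker UI (50030): " + (isUp("http://localhost:50030") ? "up" : "down"));
    }
}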

Since Hadoop is built with the Java language, and the second case of the cloud uses only Java, the Java programming language was also installed. The platform used to work with Java, and to connect Java with Hadoop, was Eclipse, one of the platforms used to run the Java language; what makes this platform special is that it is characterized by simplicity and a friendly interface.

Two different kinds of scenarios were built to test the performance of Hadoop and cloud computing: the first case tests the performance of Hadoop with cloud computing, and the second case tests the performance of cloud computing by itself, without Hadoop. The performance in both cases targets the ability to process big data, i.e., files of big sizes.
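The paper does not reproduce the code of the second case, so the following is a hypothetical sketch of a plain Java program of this kind (here a simple word count over a single input file, with no Hadoop involved), streaming the file so that large inputs are not loaded into memory at once:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class PlainWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Stream the input file line by line and tally word frequencies in memory.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        // Print each word with its frequency, tab-separated.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}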

A.  Case (1)

In this case, another two different MapReduce programs were created to test the performance of Hadoop: one program uses more tools, tools that were not used in the previous paper, while the other program reduces the tools that were used in the previous paper. This case was therefore carried out in two ways.
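The added tools are not named at this point in the paper; one plausible example of "using more tools" in a MapReduce program is registering a combiner in the job driver, as in the following sketch (EnhancedJobDriver is a hypothetical name, and TokenMapper and SumReducer refer to the earlier word-count sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EnhancedJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration());   // Hadoop 1.x style job setup
        job.setJarByClass(EnhancedJobDriver.class);
        job.setMapperClass(TokenMapper.class);
        // The extra "tool": a combiner runs a local reduce on each mapper's
        // output before the shuffle, cutting the data moved between nodes.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}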