Hadoop Scheduler with Deadline Constraint

Geetha J1, N Uday Bhaskar2, P Chenna Reddy3

1Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, Karnataka 560054.

2Department of Computer Science, GDC Pattikonda, Kurnool district, AP.

3Department of Computer Science and Engineering, JNTU College of Engineering, Pulivendula, Kadapa district, AP.

Abstract—MapReduce has become a popular programming model for running data-intensive applications on the cloud. In Hadoop, jobs are scheduled in FIFO order by default. Many MapReduce applications require strict deadlines, but the Hadoop framework does not provide a scheduler with deadline constraints. Existing schedulers do not guarantee that a job will be completed by a specific deadline; some address the issue of deadlines but focus more on improving system utilization. We propose an algorithm that lets the user specify a job's deadline and evaluates whether the job can be finished before that deadline. Our scheduler with deadlines for Hadoop ensures that only jobs whose deadlines can be met are scheduled for execution. If a submitted job cannot satisfy the specified deadline, physical or virtual nodes can be added dynamically to complete the job within the deadline.

Keywords– Hadoop, MapReduce, Hadoop schedulers, HDFS, Namenode, Datanode, JobTracker, TaskTracker, deadline.

I. INTRODUCTION

Apache Hadoop is an open-source implementation of Google's MapReduce [1] parallel processing framework. The details of parallel processing, including distribution of data to processing nodes, restarting failed subtasks, and consolidation of results after computation, are hidden by the Hadoop framework. This allows developers to write parallel processing programs that focus on the computation itself. Hadoop includes: 1) the Hadoop Distributed File System (HDFS) [2], a distributed file system that stores large amounts of data and provides high-throughput access to it on clusters; and 2) Hadoop MapReduce, a software framework for processing distributed data on clusters.

These problems are well handled by Hadoop, a reliable, scalable, distributed computing platform developed by Apache [5]. It is an open-source implementation of Google's MapReduce framework that allows for the distributed processing of huge data sets across clusters of computers using simple programming models. It is designed to scale up from a single data center to thousands of machines, each offering local computation and storage. The library is designed to detect and handle failures at the application layer, instead of relying on hardware to deliver high availability, so a highly available service can be delivered on top of a cluster of computers, each of which may be prone to failure.

Hadoop has an underlying storage system called HDFS, the Hadoop Distributed File System. To process the data in HDFS, Hadoop provides a MapReduce engine that runs on top of HDFS. This engine has a master-slave architecture: the master node is called the JobTracker and the slaves are called TaskTrackers. MapReduce jobs are automatically parallelized across a large set of TaskTrackers. The JobTracker splits the job into several map and reduce tasks. Hadoop divides the input to the job into fixed-size pieces called input splits. The outputs of the map tasks are stored on the local disk; this intermediate output serves as input to the reduce tasks. The consolidated output of the reduce tasks is the output of the MapReduce job and is stored in HDFS.
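To illustrate this flow, the sketch below shows a minimal job driver written against Hadoop's org.apache.hadoop.mapreduce API (assuming a Hadoop 2.x classpath). It is not code from this paper; it uses the stock TokenCounterMapper and IntSumReducer classes that ship with Hadoop and assumes the input already resides in HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The driver only declares the job; the JobTracker splits it into
        // map tasks (one per input split) and reduce tasks.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class); // map: emit (token, 1)
        job.setReducerClass(IntSumReducer.class);     // reduce: sum counts per token
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input already in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}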

As Hadoop jobs have to share the cluster resources, a scheduling strategy is used to determine when a job can execute its tasks. Since the scheduler was made pluggable, several scheduling algorithms have been developed for it, such as the FIFO Scheduler, Fair Scheduler [3], and Capacity Scheduler [4].

II. FOUNDATION

A. Problem Definition

Can a given task that translates to a MapReduce job J, and that has to process data of size N, be completed within a deadline D when run in a MapReduce cluster having M nodes with map task slots, reduce task slots, and possibly k jobs executing at the time?

B. Deadline Estimation Model

We develop an initial estimation model based on a set of assumptions:

Assumption 1: The cluster consists of homogeneous nodes, so that the unit cost of processing for each map or reduce node is equal.

Assumption 2: The key distribution of the input data is uniform, so that each reduce node gets an equal amount of reduce data to process.

Assumption 3: Reduce tasks start after all map tasks have completed.

Assumption 4: The input data is already available in HDFS.

C. Working

The job submission process implemented by JobSubmitter does the following:

The user submits the job (step 1). The scheduler computes the minimum number of map and reduce slots required for the job's completion (step 2). If the schedulability test fails, the user is notified to enter another deadline value. If it passes, the job is scheduled (step 4).

Fig. 1. Working of Deadline Scheduler
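In code, the schedulability test described above amounts to comparing the job's minimum slot demand against the slots currently free in the cluster. The sketch below is hypothetical (the paper does not publish its source); names such as schedulable, minMapSlots and freeMapSlots are illustrative.

public class SchedulabilityTest {
    /**
     * Step 2 computes the minimum map/reduce slots (see Section II-D);
     * this check then admits the job only if both demands fit, otherwise
     * the user is asked for another deadline value.
     */
    static boolean schedulable(long minMapSlots, long minReduceSlots,
                               int freeMapSlots, int freeReduceSlots) {
        return minMapSlots <= freeMapSlots && minReduceSlots <= freeReduceSlots;
    }
}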

D. Equations

To derive the expressions for the minimum number of map tasks nmmin and reduce tasks nrmin, we extend the model used in [6] for the Equal Load Partitioning technique by introducing the MapReduce-specific notations J, f, Cm, Cr, Sm and Sr described below.

q = (A, σ, D): a query q, where A is the arrival time (the time when the query is submitted), σ is the input data size, and D is the relative deadline.

J = (t^m_1, t^m_2, t^m_3, \ldots, t^m_u, t^r_1, t^r_2, \ldots, t^r_v): a Hadoop job that is run to perform query q. t^m_i is the i-th map task and t^r_j is the j-th reduce task, where 1 \le i \le u and 1 \le j \le v. The Hadoop job J has the same arrival time, input data and deadline as the query q = (A, σ, D).

n: total number of slots assigned to the job in the cluster; n = nm + nr, where nm is the number of map slots and nr the number of reduce slots.

α = (α_1, α_2, \ldots, α_u): map data distribution vector, where u is the total number of map tasks of job J. α_i is the data fraction allocated to the i-th map task. Map input data is equally distributed among the map nodes, so α_i = 1/u.

f: filter ratio. The fraction of input that the map process produces as output. For most practical purposes, 0 <= f <= 1 holds.

fσ: reduce input (map output), the product of the filter ratio and the map input size.

Cm: cost of processing a unit of data in a map task.

Cr: cost of processing a unit of data in a reduce task.

Cd: communication cost of transferring a unit of data.

Sm: start time of the first map task for the job.

Sr: start time of the first reduce task for the job.

To estimate the duration of the job J we consider the map completion time, the reduce completion time, and the data transfer during the reduce copy phase. The expression for the cost can be written as:

\frac{\sigma C_m}{n_m} + \frac{f \sigma C_r}{n_r} + f \sigma C_d

Since the job has an arrival time A and deadline D,

S_m + \frac{\sigma C_m}{n_m} + \frac{f \sigma C_r}{n_r} + f \sigma C_d \le A + D

Let S_r^{max} be the maximum value for the reduce start time:

S_r^{max} = A + D - \frac{f \sigma C_r}{n_r} - f \sigma C_d

The map phase must complete before the reduce phase can begin, so

S_m + \frac{\sigma C_m}{n_m} \le S_r^{max}

which gives

n_m \ge \frac{\sigma C_m}{S_r^{max} - S_m}

Therefore,

n_m^{min} = \left\lceil \frac{\sigma C_m}{S_r^{max} - S_m} \right\rceil

Similarly,

n_r^{min} = \left\lceil \frac{f \sigma C_r}{A + D - S_r - f \sigma C_d} \right\rceil

The deadline constraint scheduler uses these criteria to schedule Hadoop jobs.
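To make the formulas concrete, the following sketch plugs invented parameter values (not measurements from this paper; units are arbitrary but consistent, e.g. seconds and MB) into the expressions for S_r^{max}, n_m^{min} and n_r^{min}:

// Worked example of the minimum-slot formulas above.
// All parameter values are invented for illustration.
public class MinSlotsExample {
    public static void main(String[] args) {
        double A = 0;         // arrival time of the job
        double D = 600;       // relative deadline
        double sigma = 500;   // input data size
        double f = 0.7;       // filter ratio (map output / map input)
        double cm = 0.5;      // cost per unit of data in a map task (Cm)
        double cr = 0.4;      // cost per unit of data in a reduce task (Cr)
        double cd = 0.1;      // communication cost per unit of data (Cd)
        double sm = 5;        // start time of the first map task (Sm)
        double nr = 2;        // reduce slots assumed when bounding Srmax

        // Srmax = A + D - f*sigma*Cr/nr - f*sigma*Cd
        double srMax = A + D - (f * sigma * cr) / nr - f * sigma * cd;  // 495

        // nmmin = ceil(sigma*Cm / (Srmax - Sm))
        long nmMin = (long) Math.ceil(sigma * cm / (srMax - sm));       // 1

        // nrmin = ceil(f*sigma*Cr / (A + D - Sr - f*sigma*Cd)), taking Sr = Srmax
        double sr = srMax;
        long nrMin = (long) Math.ceil(
                f * sigma * cr / (A + D - sr - f * sigma * cd));        // 2

        System.out.printf("Srmax=%.1f nmmin=%d nrmin=%d%n", srMax, nmMin, nrMin);
    }
}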

E. Invalidating Assumptions

Assumption 1: We assumed that nodes are homogeneous and that the values of Cm and Cr were uniform across all nodes. To relax this assumption, we allow the values of Cm and Cr to differ among the nodes, making the job dependent on the slowest node, i.e. the node with the highest values of Cm and Cr. Therefore, we modify the above expressions for nmmin, nrmin and Srmax by substituting Cm and Cr with the values of the slowest running node.
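A minimal sketch of this substitution, assuming per-node cost measurements are available as arrays (all names and values here are illustrative):

import java.util.Arrays;

// Relaxing Assumption 1: pick the per-unit costs of the slowest node,
// then reuse them in the Srmax/nmmin/nrmin formulas above.
public class SlowestNodeCosts {
    static double slowest(double[] perNodeCosts) {
        return Arrays.stream(perNodeCosts).max().getAsDouble();
    }

    public static void main(String[] args) {
        double[] perNodeCm = {0.45, 0.50, 0.62};  // map cost per unit, per node
        double[] perNodeCr = {0.38, 0.40, 0.55};  // reduce cost per unit, per node
        double cm = slowest(perNodeCm);           // 0.62, used as Cm
        double cr = slowest(perNodeCr);           // 0.55, used as Cr
        System.out.printf("Cm=%.2f Cr=%.2f%n", cm, cr);
    }
}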

Assumption 2: When the key distribution across the reduce tasks is not uniform, the worst case is that a single node processes the entire reduce input fσ. In such a scenario, the value of S_r^{max} will be:

S_r^{max} = A + D - f \sigma C_r - f \sigma C_d

We intend to address the effect of various data distribution models in future work. Starting reduce tasks early might affect other jobs' ability to obtain reduce task slots. In this paper, we do not address the implications of invalidating this assumption.

Assumption 4: We think it is reasonable to assume that the input is available in HDFS, because MapReduce processing starts only after the data is available. If it becomes important to consider the cost of transferring the input to HDFS from the user's storage, that cost can be derived from the transfer network link capacity and the replication cost; these costs do not depend on the nature of the MapReduce job.

III. DESIGN AND IMPLEMENTATION

A. Sequence Diagram

Fig. 2. Working of Deadline Scheduler

B. Design Goals

The design goals for the Constraint Scheduler were:

1) To give users immediate feedback on whether the job can be completed within the given deadline, and to proceed with execution if the deadline can be met. Otherwise, users have the option to resubmit with modified deadline requirements.

2) To maximize the number of jobs that can be run in the cluster while satisfying the time requirements of all jobs.

C. Algorithms

ScheduleJob(Cluster c)

  freemapslots <- c.freeMapSlots
  freereduceslots <- c.freeReduceSlots

  REPEAT
    j <- nextJob(Pool)
    arrivaltime <- getStartTime(j)
    mapcostperunit <- getMapCostPerUnit()
    reducecostperunit <- getReduceCostPerUnit()
    shufflecostperunit <- getShuffleCostPerUnit()
    mapinputoutputfraction <- getMapInputOutputCostPerUnit()
    reducestarttime <- getReduceStartTime()
    deadline <- getFloat()
    inputsize <- 0

    REPEAT
      inputsize <- inputsize + length(inputsplit)
    UNTIL (endOf(inputsplits))

    Compute MaxReduceStartTime
    IF MaxReduceStartTime < arrivaltime
      THROW EXCEPTION
    Compute minMapSlots
    Compute minReduceSlots
    IF minMapSlots > freemapslots OR minReduceSlots > freereduceslots
      THROW ConstraintFailException
  UNTIL (endOf(jobs in Pool))
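A hedged Java rendering of the admission test this pseudocode performs is shown below. A production version would plug into Hadoop's pluggable scheduler framework; all names here are illustrative, not the authors' actual implementation.

public class ConstraintCheck {
    static class ConstraintFailException extends Exception {
        ConstraintFailException(String msg) { super(msg); }
    }

    /** Admit the job only if the deadline-derived slot demand fits the cluster. */
    static void scheduleJob(double arrivalTime, double deadline, double inputSize,
                            double f, double cm, double cr, double cd,
                            double mapStartTime, double reduceStartTime,
                            int freeMapSlots, int freeReduceSlots)
            throws ConstraintFailException {
        // Max reduce start time, bounding nr by the currently free reduce slots.
        double srMax = arrivalTime + deadline
                - (f * inputSize * cr) / freeReduceSlots
                - f * inputSize * cd;
        if (srMax < arrivalTime) {
            // No slot count can save this job: the deadline is infeasible.
            throw new ConstraintFailException("max reduce start time precedes arrival");
        }
        long minMapSlots = (long) Math.ceil(
                inputSize * cm / (srMax - mapStartTime));
        long minReduceSlots = (long) Math.ceil(
                f * inputSize * cr
                / (arrivalTime + deadline - reduceStartTime - f * inputSize * cd));
        if (minMapSlots > freeMapSlots || minReduceSlots > freeReduceSlots) {
            throw new ConstraintFailException("not enough free slots to meet deadline");
        }
    }
}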

ConstraintFairSchedulerParamsEstimator(job wordcount)

  Run the random-generator job
  Output <- "random output"

  REPEAT n times
    Run the wordcount job, taking the random-generator output as its input
    Compute the parameters: total map cost, total reduce cost, total io fraction,
      total time to start reduce, total time for job completion
  Compute the average of each parameter
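The averaging step can be sketched as follows, assuming each calibration run reports its measured costs (all names are invented; the record syntax requires Java 16+):

import java.util.List;

// Averages per-run measurements from the calibration jobs above.
public class ParamsEstimator {
    // One wordcount calibration run's measurements (hypothetical fields).
    record Run(double mapCost, double reduceCost, double ioFraction,
               double reduceStart, double completion) {}

    static double[] average(List<Run> runs) {
        double[] sums = new double[5];
        for (Run r : runs) {
            sums[0] += r.mapCost();
            sums[1] += r.reduceCost();
            sums[2] += r.ioFraction();
            sums[3] += r.reduceStart();
            sums[4] += r.completion();
        }
        for (int i = 0; i < sums.length; i++) sums[i] /= runs.size();
        return sums;  // averaged Cm, Cr, f, reduce start, completion time
    }
}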

D. Set-up

Experiments were conducted on a physical cluster. It consisted of 4 physical nodes, with one JobTracker and 3 TaskTrackers. Each machine had 4 GB of memory, a 64-bit Intel i5 processor, and ran Ubuntu Server. We also had a virtual node, installed in Oracle VirtualBox on one of the nodes, which could be added to the existing cluster dynamically. Both environments had two map slots and two reduce slots on each machine.

E. Results

After the setup, we estimated the values of parameters such as map cost, reduce cost, shuffle cost, and filter ratio. We then submitted a job with a deadline and took observations for different deadline values keeping the input size constant, which shows the variation in required map and reduce slots (Figs. 3 and 4). We also submitted the job with different input sizes and took observations. Our observations show that for a particular data size, resource demand decreases with increasing deadlines.

Fig. 3. Graph for 500 MB input data size

Fig. 4. Graph for 1.2 GB input data size

IV. CONCLUSION

We extended the approach of real-time cluster scheduling to derive minimum map and reduce task count criteria for performing task scheduling with deadline constraints in Hadoop. We computed the amount of resources required to complete a job by a particular deadline. For this, we proposed estimating the values of the parameters: filter ratio, cost of processing a unit of data in a map task, cost of processing a unit of data in a reduce task, and communication cost of transferring a unit of data.

Our observations show that for a particular data size, resource demand decreases with increasing deadlines. Also, if the data size increases and the deadline is kept constant, resource demand increases. If resource demand increases, we can meet the demand by adding physical or virtual nodes to the existing cluster dynamically, or by providing a feasible deadline. We have implemented the same.

REFERENCES

[1] Ricky Ho, "How MapReduce Works," DZone publishing, retrieved from

[2] Robert C, Hirang K and Sanjay R, "Architecture of Open Source Applications: HDFS."

[3] Apache Hadoop publishing, "Hadoop MapReduce Next Generation - Fair Scheduler," December 2013, Version 2.0.4-alpha.

[4] Arun M, "Understanding Capacity Scheduler," July 2012, unpublished.

[5] Shafer J, Rixner S, and Cox, "Balancing Portability and Performance," March 2010, Performance Analysis of Systems and Software (ISPASS), 2010 IEEE International Symposium on.

[6] Deadline: www4.ncsu.edu/∼kkc/papers/rev2.pdf

[7] Multi Node Cluster: