Hadoop Scheduler with Deadline Constraint
Geetha J1, N Uday Bhaskar2, P Chenna Reddy3
1Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, Karnataka 560054.
2Department of Computer Science, GDC Pattikonda, Kurnool dist., AP.
3Department of Computer Science and Engineering, JNTU College of Engineering, Pulivendula, Kadapa dist., AP.
Abstract—MapReduce has become a popular programming model for running data-intensive applications on the cloud. In Hadoop, jobs are scheduled in FIFO order by default. However, many MapReduce applications require strict deadlines, and the Hadoop framework provides no scheduler with deadline constraints: existing schedulers do not guarantee that a job will be completed by a specific deadline, and those that do address deadlines focus more on improving system utilization. We propose an algorithm that lets the user specify a job's deadline and evaluates whether the job can be finished before that deadline. Our scheduler ensures that only jobs whose deadlines can be met are scheduled for execution. If a submitted job cannot satisfy the specified deadline, physical or virtual nodes can be added dynamically to complete the job within the deadline.
Keywords—Hadoop, MapReduce, Hadoop Schedulers, HDFS, Namenode, Datanode, Jobtracker, Tasktracker, Deadline.
I. INTRODUCTION
Apache Hadoop is an open-source implementation of Google's MapReduce [1] parallel processing framework. The details of parallel processing, including the distribution of data to processing nodes, the restarting of failed subtasks, and the consolidation of results after computation, are hidden by the Hadoop framework, which allows developers to write parallel programs that focus on the computation itself. Hadoop includes: 1) the Hadoop Distributed File System (HDFS) [2], a distributed file system that stores large amounts of data and provides high-throughput access to that data on clusters; and 2) Hadoop MapReduce, a software framework for processing distributed data on clusters.
These problems are well handled by Hadoop, a reliable, scalable, distributed computing platform developed by Apache [5]. It is an open-source implementation of Google's MapReduce framework that allows for the distributed processing of huge data sets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library is designed to detect and handle failures at the application layer, so that a highly available service can be delivered on top of a cluster of computers, each of which may be prone to failure.
Hadoop has an underlying storage system called HDFS, the Hadoop Distributed File System. To process the data in HDFS, Hadoop provides a MapReduce engine that runs on top of it. This engine has a master-slave architecture: the master node is called the JobTracker and the slaves are called TaskTrackers. MapReduce jobs are automatically parallelized across a large set of TaskTrackers. The JobTracker splits the job into several map and reduce tasks, and Hadoop divides the job's input into fixed-size pieces called input splits. The output of each map task is stored on local disk; this intermediate output serves as input to the reduce tasks. The consolidated output of the reduce tasks is the output of the MapReduce job and is stored in HDFS.
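As a concrete illustration of this programming model, the following is a minimal word-count job written against the standard Hadoop Java MapReduce API. It is a sketch for illustration only, not the scheduler code described in this paper: map tasks emit intermediate (word, 1) pairs to local disk, and reduce tasks aggregate them into the final HDFS output.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map task: invoked once per input record; output spills to local disk.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        word.set(tok);
        ctx.write(word, ONE); // intermediate output, input to reduce tasks
      }
    }
  }

  // Reduce task: consumes the copied and merged map output.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum)); // final output written to HDFS
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}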
As Hadoop jobs have to share the cluster's resources, a scheduling strategy is used to determine when a job can execute its tasks. Since the scheduler was made pluggable, several scheduling algorithms have been developed for it, such as the FIFO Scheduler, the Fair Scheduler [3], and the Capacity Scheduler [4].
II. FOUNDATION
A. Problem Definition
Can a given task that translates to a MapReduce job J, and that has to process data of size N, be completed within a deadline D when run in a MapReduce cluster having M nodes with map task slots, reduce task slots and possibly k jobs executing at the time?
B. Deadline Estimation Model
We develop an initial estimation model based on the following set of assumptions:
Assumption 1: The cluster consists of homogeneous nodes, so that the unit cost of processing for each map or reduce node is equal.
Assumption 2: The key distribution of the input data is uniform, so that each reduce node gets an equal amount of reduce data to process.
Assumption 3: Reduce tasks start after all map tasks have completed.
Assumption 4: The input data is already available in HDFS.
C. Working
The job submission process implemented by JobSubmitter works as follows: the user submits the job (step 1); the scheduler computes the minimum number of map and reduce slots required for the job to complete by its deadline (step 2); if this schedulability test fails, the user is notified and asked to enter another deadline value (step 3); if it passes, the job is scheduled (step 4).
Fig. 1. Working of Deadline Scheduler
D. Equations
To derive the expressions for the minimum number of map tasks $n_m^{min}$ and reduce tasks $n_r^{min}$, we extend the model used in [6] for the Equal Load Partitioning technique by introducing the MapReduce-specific notations $J$, $f$, $C_m$, $C_r$, $C_d$, $S_m$ and $S_r$ described below.
$q = (A, \sigma, D)$: A query $q$, where $A$ is the arrival time (the time when the query is submitted), $\sigma$ is the input data size, and $D$ is the relative deadline.
$J = (t_{m_1}, t_{m_2}, \ldots, t_{m_u}, t_{r_1}, t_{r_2}, \ldots, t_{r_v})$: A Hadoop job that is run to perform query $q$. $t_{m_i}$ is the $i$-th map task and $t_{r_j}$ is the $j$-th reduce task, where $1 \le i \le u$ and $1 \le j \le v$. The Hadoop job $J$ has the same arrival time, input data and deadline as the query $q = (A, \sigma, D)$.
$n$: Total slots assigned to the job in the cluster; $n = n_m + n_r$, where $n_m$ is the number of map slots and $n_r$ the number of reduce slots.
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_u)$: Map data distribution vector, where $u$ is the total number of map tasks of job $J$ and $\alpha_i$ is the data fraction allocated to the $i$-th map task. Map input data is distributed equally among the map nodes, so $\alpha_i = 1/u$.
$f$: Filter ratio, the fraction of its input that the map phase produces as output. For most practical purposes $0 \le f \le 1$ holds.
$f\sigma$: Reduce input (map output), the product of the filter ratio and the map input.
$C_m$: Cost of processing a unit of data in a map task.
$C_r$: Cost of processing a unit of data in a reduce task.
$C_d$: Communication cost of transferring a unit of data.
$S_m$: Start time of the first map task of the job.
$S_r$: Start time of the first reduce task of the job.
To estimate the duration of the job $J$, we consider the map completion time, the reduce completion time, and the data transfer during the reduce copy phase. The expression for the cost can be written as:

$$\frac{\sigma C_m}{n_m} + \frac{f \sigma C_r}{n_r} + f \sigma C_d$$

Since the job has an arrival time $A$ and deadline $D$,

$$S_m + \frac{\sigma C_m}{n_m} + \frac{f \sigma C_r}{n_r} + f \sigma C_d \le A + D$$

Let $S_r^{max}$ be the maximum value for the reduce start time:

$$S_r^{max} = A + D - \frac{f \sigma C_r}{n_r} - f \sigma C_d$$

All map tasks must complete before the reduce tasks start, so

$$S_m + \frac{\sigma C_m}{n_m} \le S_r^{max}$$

which gives

$$n_m \ge \frac{\sigma C_m}{S_r^{max} - S_m}$$

Therefore,

$$n_m^{min} = \left\lceil \frac{\sigma C_m}{S_r^{max} - S_m} \right\rceil$$

Similarly,

$$n_r^{min} = \left\lceil \frac{f \sigma C_r}{A + D - S_r - f \sigma C_d} \right\rceil$$
The deadline constraint scheduler uses these criteria to schedule Hadoop jobs.
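As a worked illustration with assumed values (hypothetical numbers, not taken from our experiments): let $A = 0$, $D = 100\,s$, $\sigma = 1000\,MB$, $C_m = 0.2\,s/MB$, $C_r = 0.1\,s/MB$, $C_d = 0.02\,s/MB$, $f = 0.5$, $S_m = 5\,s$, and $n_r = 2$ reduce slots. Then $f\sigma C_r = 50\,s$ and $f\sigma C_d = 10\,s$, so $S_r^{max} = 100 - 50/2 - 10 = 65\,s$. The minimum map slot count follows as $n_m^{min} = \lceil 1000 \cdot 0.2 / (65 - 5) \rceil = \lceil 3.33 \rceil = 4$, and with the reduces starting at $S_r = S_r^{max} = 65\,s$, $n_r^{min} = \lceil 50 / (100 - 65 - 10) \rceil = 2$. The job is therefore schedulable only if at least 4 map slots and 2 reduce slots are free in the cluster.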
E. Invalidating Assumptions
Assumption 1: We assumed that nodes are homogeneous and that the values of $C_m$ and $C_r$ were uniform across all nodes. To relax this assumption, we allow the values of $C_m$ and $C_r$ to differ among the nodes, making the job dependent on the slowest node, i.e., the node with the highest $C_m$ and $C_r$. Therefore, we modify the above expressions for $n_m^{min}$, $n_r^{min}$ and $S_r^{max}$ by substituting $C_m$ and $C_r$ with the values of the slowest running node.
Assumption 2: When the key distribution across the reduce tasks is not uniform, the worst case is that a single node processes all of the reduce input $f\sigma$. In such a scenario, the value of $S_r^{max}$ becomes:

$$S_r^{max} = A + D - f \sigma C_r - f \sigma C_d$$
We intend to address the effect of various data distribution models in future work. Starting reduce tasks early might affect other jobs' ability to obtain reduce task slots; in this paper, we do not address the implications of invalidating Assumption 3.
Assumption 4: We think it is reasonable to assume that the input is available in HDFS, because MapReduce processing starts only after the data is available. If it becomes important to consider the cost of transferring the input to HDFS from the user's storage, that cost can be derived from the transfer network link capacity and the replication cost, and these costs do not depend on the nature of the MapReduce job.
III. DESIGN AND IMPLEMENTATION
A. Sequence Diagram
Fig. 2. Working of Deadline Scheduler
B. Design Goals
The design goals for the Constraint Scheduler were:
1) To give users immediate feedback on whether the job can be completed within the given deadline, and to proceed with execution if the deadline can be met. Otherwise, users have the option to resubmit with modified deadline requirements.
2) To maximize the number of jobs that can be run in the cluster while satisfying the time requirements of all jobs.
C. Algorithms
ScheduleJob(Cluster c)
  freemapslots <- c.freeMapSlots
  freereduceslots <- c.freeReduceSlots
  REPEAT
    j <- nextJob(Pool)
    arrivaltime <- getStartTime(j)
    mapcostperunit <- getMapCostPerUnit()
    reducecostperunit <- getReduceCostPerUnit()
    shufflecostperunit <- getShuffleCostPerUnit()
    mapinputoutputfraction <- getMapInputOutputCostPerUnit()
    reducestarttime <- getReduceStartTime()
    deadline <- getFloat()
    inputsize <- 0
    REPEAT
      inputsize <- inputsize + length(inputsplit)
    UNTIL (endOf(inputsplits))
    Compute MaxReduceStartTime
    IF MaxReduceStartTime < arrivaltime THEN
      THROW EXCEPTION
    Compute minMapSlots
    Compute minReduceSlots
    IF minMapSlots > freemapslots OR minReduceSlots > freereduceslots THEN
      THROW ConstraintFailException
  UNTIL (endOf(jobs in Pool))
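A minimal Java sketch of the schedulability test performed inside this loop is shown below. It assumes the per-unit costs have already been estimated; the class and method names are illustrative and are not part of Hadoop's actual scheduler API.

// Hypothetical helper mirroring the ScheduleJob pseudocode above.
public class DeadlineCheck {
  public static class ConstraintFailException extends Exception {
    public ConstraintFailException(String msg) { super(msg); }
  }

  // sigma: input size (MB); f: filter ratio; cm, cr, cd: costs in s/MB;
  // a: arrival time; d: relative deadline; sm: first map start; sr: reduce start.
  public static void checkSchedulable(double sigma, double f, double cm,
      double cr, double cd, double a, double d, double sm, double sr,
      int freeMapSlots, int freeReduceSlots) throws ConstraintFailException {
    // Maximum reduce start time, given the currently free reduce slots.
    double srMax = a + d - (f * sigma * cr) / freeReduceSlots - f * sigma * cd;
    // Infeasible if the reduces would have to start before the maps can run
    // (the pseudocode above compares S_r^max against the arrival time).
    if (srMax <= sm) {
      throw new ConstraintFailException("deadline infeasible: resubmit with a later deadline");
    }
    int minMapSlots = (int) Math.ceil((sigma * cm) / (srMax - sm));
    // Denominator is positive whenever sr <= srMax, by construction.
    int minReduceSlots = (int) Math.ceil((f * sigma * cr) / (a + d - sr - f * sigma * cd));
    if (minMapSlots > freeMapSlots || minReduceSlots > freeReduceSlots) {
      throw new ConstraintFailException("insufficient slots: need "
          + minMapSlots + " map and " + minReduceSlots + " reduce");
    }
  }
}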
ConstraintFairSchedulerParamsEstimator(job wordcount)
  Run the random-generator job; output <- "random output"
  REPEAT n times
    Run the wordcount job, taking the random-generator output as its input
    Compute the parameters: total map cost, total reduce cost, total io fraction, total time to start reduce, total time for job completion
  UNTIL (n runs complete)
  Compute the average of each parameter
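Concretely, the filter ratio and the unit map cost can be estimated from a finished job's built-in counters. The sketch below uses counter names from the Hadoop 2.x mapreduce API; it assumes the aggregate map running time (mapSeconds) has been measured separately, and the averaging over the n runs is elided.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter;

public class ParamEstimator {
  // Estimate f and Cm from one finished job; average over n runs in practice.
  public static double[] estimate(Job job, double mapSeconds) throws Exception {
    long bytesRead = job.getCounters()
        .findCounter(FileInputFormatCounter.BYTES_READ).getValue();
    long mapOutputBytes = job.getCounters()
        .findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
    double f = (double) mapOutputBytes / bytesRead; // filter ratio
    double cm = mapSeconds / (bytesRead / 1e6);     // seconds per MB of map input
    return new double[] { f, cm };
  }
}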
D. Set-up
Experiments were conducted on a physical cluster consisting of 4 machines: one JobTracker and 3 TaskTrackers. Each machine had 4 GB of memory and a 64-bit Intel i5 processor, with Ubuntu Server installed on each node. We also had a virtual node, installed in Oracle VirtualBox on one of the machines, which could be added to the existing cluster dynamically. Both environments had two map slots and two reduce slots on each machine.
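In the MR1 generation of Hadoop used here, the number of slots per TaskTracker is set in mapred-site.xml on each node; a configuration along the following lines (a sketch, assuming Hadoop 1.x property names) gives the two map and two reduce slots per machine:

<!-- mapred-site.xml (Hadoop 1.x / MR1): 2 map + 2 reduce slots per node -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>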
E. Results
After the setup, we estimated the values of the parameters (map cost, reduce cost, shuffle cost, filter ratio), submitted jobs with deadlines, and took observations for different deadline values while keeping the input size constant; the resulting variation in required map and reduce slots is shown in Figs. 3 and 4. We also submitted the job with different input sizes and took observations. Our observations show that for a particular data size, resource demand decreases as the deadline increases.
Fig. 3. Graph for 500 MB input data size
Fig. 4. Graph for 1.2 GB input data size
IV. CONCLUSION
We extended the approach of real-time cluster scheduling to derive minimum map and reduce task count criteria for performing task scheduling with deadline constraints in Hadoop, and we computed the amount of resources required to complete a job by a particular deadline. For this, we proposed a way to estimate the values of the model parameters: the filter ratio, the cost of processing a unit of data in a map task, the cost of processing a unit of data in a reduce task, and the communication cost of transferring a unit of data.
Our observations show that for a particular data size, resource demand decreases as the deadline increases, and that if the data size increases while the deadline is kept constant, resource demand increases. When resource demand increases, we can meet it by dynamically adding physical or virtual nodes to the existing cluster, or by suggesting a feasible deadline. We have implemented both.
REFERENCES
[1] Ricky Ho, "How MapReduce works," Dzone publishing, retrieved from
[2] Robert C, Hirang K and Sanjay R, "The Architecture of Open Source Applications: HDFS."
[3] Apache Hadoop publishing, "Hadoop MapReduce Next Generation - Fair Scheduler," December 2013, Version 2.0.4-alpha.
[4] Arun M, "Understanding Capacity Scheduler," July 2012, unpublished.
[5] Shafer J, Rixner S, and Cox, "Balancing portability and performance," March 2010, Performance Analysis of Systems and Software (ISPASS), 2010 IEEE International Symposium on.
[6] Deadline: www4.ncsu.edu/~kkc/papers/rev2.pdf
[7] MultiNodeCluster: