Chapter 6. Big Data Processing Concepts


principle. It is detailed within this chapter as well.
To further the discussion of Big Data processing, each of the following concepts will be examined in turn:
• parallel data processing
• distributed data processing
• Hadoop
• processing workloads
• cluster

Parallel Data Processing
Parallel data processing involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task. The goal is to reduce the execution time by dividing a single larger task into multiple smaller tasks that run concurrently.
Although parallel data processing can be achieved through multiple networked machines, it is more typically achieved within the confines of a single machine with multiple processors or cores, as shown in Figure 6.1.

Figure 6.1 A task can be divided into three sub-tasks that are executed in parallel on three different processors within the same machine.
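
The idea can be made concrete with a minimal Python sketch (not part of the original text): a single summation task is divided into three smaller tasks that the standard multiprocessing module executes concurrently, ideally on three separate cores of the same machine. The dataset and the three-way split are arbitrary illustrations.

from multiprocessing import Pool

def sub_task(chunk):
    # Each sub-task independently processes its own slice of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Divide the single larger task into three smaller tasks.
    third = len(data) // 3
    chunks = [data[:third], data[third:2 * third], data[2 * third:]]
    with Pool(processes=3) as pool:
        partial_results = pool.map(sub_task, chunks)  # sub-tasks run concurrently
    print(sum(partial_results))  # combine the partial results into the answer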

Distributed Data Processing
Distributed data processing is closely related to parallel data processing in that the same principle of "divide-and-conquer" is applied. However, distributed data processing is always achieved through physically separate machines that are networked together as a cluster. In Figure 6.2, a task is divided into three sub-tasks that are then executed on three different machines sharing one physical switch.

Figure 6.2 An example of distributed data processing.
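
The same divide-and-conquer principle, applied across physically separate machines, can be sketched with Python's standard multiprocessing.managers module, which shares task and result queues over TCP. This is only a rough illustration under stated assumptions: the host name coordinator-host, port 50000 and authkey are placeholders, and a real cluster would use a dedicated framework rather than hand-rolled queues.

# coordinator.py – runs on one machine; publishes task and result
# queues over TCP so that networked worker machines can pull sub-tasks.
from multiprocessing.managers import BaseManager
from queue import Queue

task_queue, result_queue = Queue(), Queue()

class QueueManager(BaseManager):
    pass

QueueManager.register("get_tasks", callable=lambda: task_queue)
QueueManager.register("get_results", callable=lambda: result_queue)

if __name__ == "__main__":
    for chunk in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):  # three sub-tasks
        task_queue.put(chunk)
    manager = QueueManager(address=("", 50000), authkey=b"cluster")
    manager.get_server().serve_forever()

# worker.py – runs on each of the other machines in the cluster.
from multiprocessing.managers import BaseManager
from queue import Empty

class QueueManager(BaseManager):
    pass

QueueManager.register("get_tasks")
QueueManager.register("get_results")

if __name__ == "__main__":
    manager = QueueManager(address=("coordinator-host", 50000), authkey=b"cluster")
    manager.connect()
    tasks, results = manager.get_tasks(), manager.get_results()
    while True:
        try:
            chunk = tasks.get_nowait()    # claim the next sub-task
        except Empty:
            break                         # no sub-tasks left
        results.put(sum(chunk))           # execute the sub-task locally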

Hadoop
Hadoop is an open-source framework for large-scale data storage and data processing that is compatible with commodity hardware. The Hadoop framework has established itself as a de facto industry platform for contemporary Big Data solutions. It can be used as an ETL engine or as an analytics engine for processing large amounts of structured, semi-structured and unstructured data. From an analysis perspective, Hadoop implements the MapReduce processing framework. Figure 6.3 illustrates some of Hadoop's features.

Figure 6.3 Hadoop is a versatile framework that provides both processing and storage capabilities.

Processing Workloads
A processing workload in Big Data is defined as the amount and nature of data that is processed within a certain amount of time. Workloads are usually divided into two types:
• batch
• transactional

Batch
Batch processing, also known as offline processing, involves processing data in batches and usually imposes delays, which in turn results in high-latency responses. Batch workloads typically involve large quantities of data with sequential reads/writes and comprise groups of read or write queries.
Queries can be complex and involve multiple joins. OLAP systems commonly process workloads in batches. Strategic BI and analytics are batch-oriented as they are highly read-intensive tasks involving large volumes of data. As shown in Figure 6.4, a batch workload comprises grouped reads/writes that have a large data footprint, may contain complex joins and provide high-latency responses.

Figure 6.4 A batch workload can include grouped reads/writes to INSERT, SELECT, UPDATE and DELETE.

Transactional
Transactional processing is also known as online processing. Transactional workload processing follows an approach whereby data is processed interactively without delay, resulting in low-latency responses. Transactional workloads involve small amounts of data with random reads and writes.
OLTP and operational systems, which are generally write-intensive, fall within this category. Although these workloads contain a mix of read/write queries, they are generally more write-intensive than read-intensive.
Transactional workloads comprise random reads/writes that involve fewer joins than business intelligence and reporting workloads. Given their online nature and operational significance to the enterprise, they require low-latency responses with a smaller data footprint, as shown in Figure 6.5.

Figure 6.5 Transactional workloads have few joins and lower latency responses than batch workloads.

Cluster
In the same manner that clusters provide necessary support to create horizontally scalable storage solutions, clusters also provide the mechanism to enable distributed data processing with linear scalability. Since clusters are highly scalable, they provide an ideal environment for Big Data processing, as large datasets can be divided into smaller datasets and then processed in parallel in a distributed manner. When leveraging a cluster, Big Data datasets can either be processed in batch mode or realtime mode (Figure 6.6). Ideally, a cluster will be comprised of low-cost commodity nodes that collectively provide increased processing capacity.

Figure 6.6 A cluster can be utilized to support batch processing of bulk data and realtime processing of streaming data.
An additional benefit of clusters is that they provide inherent redundancy and fault tolerance, as they consist of physically separate nodes. Redundancy and fault tolerance allow resilient processing and analysis to occur if a network or node failure occurs. Due to fluctuations in the processing demands placed upon a Big Data environment, leveraging cloud-hosted infrastructure services or ready-made analytical environments as the backbone of a cluster is sensible due to their elasticity and pay-for-use model of utility-based computing.

Processing in Batch Mode
In batch mode, data is processed offline in batches, and the response time could vary from minutes to hours. As well, data must be persisted to disk before it can be processed. Batch mode generally involves processing a range of large datasets, either on their own or joined together, essentially addressing the volume and variety characteristics of Big Data datasets.
The majority of Big Data processing occurs in batch mode. It is relatively simple, easy to set up and low in cost compared to realtime mode. Strategic BI, predictive and prescriptive analytics and ETL operations are commonly batch-oriented.

Batch Processing with MapReduce
MapReduce is a widely used implementation of a batch processing framework. It is highly scalable and reliable and is based on the principle of divide-and-conquer, which provides built-in fault tolerance and redundancy. It divides a big problem into a collection of smaller problems that can each be solved quickly. MapReduce has roots in both distributed and parallel computing. It is a batch-oriented processing engine (Figure 6.7) used to process large datasets with parallel processing deployed over clusters of commodity hardware.

Figure 6.7 The symbol used to represent a processing engine.
MapReduce does not require that the input data conform to any particular data model. Therefore, it can be used to process schema-less datasets. A dataset is broken down into multiple smaller parts, and operations are performed on each part independently and in parallel. The results from all operations are then summarized to arrive at the answer. Because of the coordination overhead involved in managing a job, the MapReduce processing engine generally only supports batch workloads, as this work is not expected to have low latency. MapReduce is based on Google's research paper on the subject, published in 2004.
The MapReduce processing engine works differently compared to the traditional data processing paradigm. Traditionally, data processing requires moving data from the storage node to the processing node that runs the data processing algorithm. This approach works fine for smaller datasets; however, with large datasets, moving data can incur more overhead than the actual processing of the data.
With MapReduce, the data processing algorithm is instead moved to the nodes that store the data. The data processing algorithm executes in parallel on these nodes, thereby eliminating the need to move the data first. This not only saves network bandwidth but also results in a large reduction in processing time for large datasets, since processing smaller chunks of data in parallel is much faster.
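
The flow just described can be illustrated with a small single-process Python simulation of a word count, the canonical MapReduce example. This is a sketch of the concept, not Hadoop's implementation: the names map_fn and reduce_fn are hypothetical, and a real engine would run the map calls in parallel on the nodes holding each split.

from collections import defaultdict

def map_fn(record):
    # Emit one (word, 1) pair per word in the input record.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Summarize all values observed for one key.
    return (key, sum(values))

splits = ["big data big", "data processing", "big processing"]

# Map phase: each split is processed independently (on a cluster,
# in parallel on the node that stores the split).
intermediate = []
for split in splits:
    intermediate.extend(map_fn(split))

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase: summarize each group to arrive at the answer.
print([reduce_fn(k, v) for k, v in sorted(groups.items())])
# [('big', 3), ('data', 2), ('processing', 2)]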

Map and Reduce Tasks
A single processing run of the MapReduce processing engine is known as a MapReduce job. Each MapReduce job is composed of a map task and a reduce task, and each task consists of multiple stages. Figure 6.8 shows the map and reduce tasks, along with their individual stages.

Figure 6.8 An illustration of a MapReduce job with the map stage highlighted.
Map tasks
• map
• combine (optional)
• partition
Reduce tasks
• shuffle and sort
• reduce

Map
The first stage of MapReduce is known as map, during which the dataset file is divided into multiple smaller splits. Each split is parsed into its constituent records as key-value pairs. The key is usually the ordinal position of the record, and the value is the actual record.
The parsed key-value pairs for each split are then sent to a map function or mapper, with one mapper function per split. The map function executes user-defined logic. Each split generally contains multiple key-value pairs, and the mapper is run once for each key-value pair in the split.
The mapper processes each key-value pair as per the user-defined logic and further generates a key-value pair as its output. The output key can either be the same as the input key, a substring value from the input value, or another serializable user-defined object. Similarly, the output value can either be the same as the input value, a substring value from the input value, or another serializable user-defined object.
When all records of the split have been processed, the output is a list of key-value pairs where multiple key-value pairs can exist for the same key. It should be noted that for an input key-value pair, a mapper may not produce any output key-value pair (filtering) or can generate multiple key-value pairs (demultiplexing). The map stage can be summarized by the equation shown in Figure 6.9.

Figure 6.9 A summary of the map stage.
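
A hypothetical mapper for a word count, sketched in Python below, shows both behaviors: the (key, value) input pairs use the record's ordinal position as the key, an empty record produces no output (filtering), and a record with several words produces several output pairs (demultiplexing).

def mapper(key, value):
    # Filtering: an empty record yields no output key-value pair.
    if not value.strip():
        return []
    # Demultiplexing: emit one output pair per word in the record.
    return [(word, 1) for word in value.split()]

split = ["big data", "", "data processing"]
for position, record in enumerate(split):   # key = ordinal position
    for out_key, out_value in mapper(position, record):
        print(out_key, out_value)
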
Combine
Generally, the output of the map function is handled directly by the reduce function. However, map tasks and reduce tasks are mostly run over different nodes. This requires moving data between mappers and reducers. This data movement can consume a lot of valuable bandwidth and directly contributes to processing latency.
With larger datasets, the time taken to move the data between map and reduce stages can exceed the actual processing undertaken by the map and reduce tasks. For this reason, the MapReduce engine provides an optional combine function (combiner) that summarizes a mapper's output before it gets processed by the reducer. Figure 6.10 illustrates the consolidation of the output from the map stage by the combine stage.

Figure 6.10 The combine stage groups the output from the map stage.
A combiner is essentially a reducer function that locally groups a mapper's output on the same node as the mapper. A reducer function can be used as a combiner function, or a custom user-defined function can be used.
The MapReduce engine combines all values for a given key from the mapper output, creating multiple key-value pairs as input to the combiner where the key is not repeated and the value exists as a list of all corresponding values for that key. The combiner stage is only an optimization stage, and may therefore not even be called by the MapReduce engine.
For example, a combiner function will work for finding the largest or the smallest number, but will not work for finding the average of all numbers, since it only works with a subset of the data. The combine stage can be summarized by the equation shown in Figure 6.11.

Figure 6.11 A summary of the combine stage.
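
A brief Python sketch of a hypothetical combiner: each key arrives with the list of values the local mapper produced for it, and the combiner summarizes them before anything crosses the network. Finding a maximum combines safely, whereas an average would not, since the combiner only ever sees a local subset of the values.

def combiner(key, values):
    # Safe to combine: the max of local maxima equals the overall max.
    return (key, max(values))

mapper_output = {"sensor-a": [3, 9, 4], "sensor-b": [7, 2]}
combined = [combiner(k, v) for k, v in mapper_output.items()]
print(combined)  # [('sensor-a', 9), ('sensor-b', 7)] – far less data to shuffle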

Partition
Duringthepartitionstage,ifmorethanonereducerisinvolved,apartitionerdividesthe
outputfromthemapperorcombiner(ifspecifiedandcalledbytheMapReduceengine)
intopartitionsbetweenreducerinstances.Thenumberofpartitionswillequalthenumber
ofreducers.Figure6.12showsthepartitionstageassigningtheoutputsfromthecombine
stagetospecificreducers.

Figure6.12Thepartitionstageassignsoutputfromthemaptasktoreducers.
Althougheachpartitioncontainsmultiplekey-valuepairs,allrecordsforaparticularkey
areassignedtothesamepartition.TheMapReduceengineguaranteesarandomandfair
distributionbetweenreducerswhilemakingsurethatallofthesamekeysacrossmultiple
mappersendupwiththesamereducerinstance.
Dependingonthenatureofthejob,certainreducerscansometimesreceivealargenumber
ofkey-valuepairscomparedtoothers.Asaresultofthisunevenworkload,somereducers
willfinishearlierthanothers.Overall,thisislessefficientandleadstolongerjob
executiontimesthaniftheworkwasevenlysplitacrossreducers.Thiscanberectifiedby
customizingthepartitioninglogicinordertoguaranteeafairdistributionofkey-value
pairs.
Thepartitionfunctionisthelaststageofthemaptask.Itreturnstheindexofthereducer
towhichaparticularpartitionshouldbesent.Thepartitionstagecanbesummarizedby
theequationinFigure6.13.

Figure6.13Asummaryofthepartitionstage.
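
A hypothetical partition function can be sketched in a few lines of Python. Hashing the key modulo the number of reducers, similar in spirit to Hadoop's default HashPartitioner, returns a reducer index and guarantees that every pair with a given key, from every mapper, lands in the same partition.

NUM_REDUCERS = 3

def partitioner(key):
    # Return the index of the reducer that should receive this key;
    # within one job run, the same key always maps to the same index.
    return hash(key) % NUM_REDUCERS

pairs = [("big", 1), ("data", 1), ("big", 1), ("processing", 1)]
partitions = {i: [] for i in range(NUM_REDUCERS)}
for key, value in pairs:
    partitions[partitioner(key)].append((key, value))
print(partitions)  # both ("big", 1) pairs sit in the same partition
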
Shuffle and Sort
During the first stage of the reduce task, output from all partitioners is copied across the network to the nodes running the reduce task. This is known as shuffling. The list-based key-value output from each partitioner can contain the same key multiple times.
Next, the MapReduce engine automatically groups and sorts the key-value pairs according to the keys so that the output contains a sorted list of all input keys and their values, with the same keys appearing together. The way in which keys are grouped and sorted can be customized.
This merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values. This stage can be summarized by the equation in Figure 6.14.

Figure 6.14 A summary of the shuffle and sort stage.
Figure 6.15 depicts a hypothetical MapReduce job that is executing the shuffle and sort stage of the reduce task.
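
The following Python sketch, an illustration rather than engine internals, reproduces the merge: the shuffled key-value pairs are sorted by key, and each group is collapsed into a single (key, list-of-values) pair ready for the reducer.

from itertools import groupby
from operator import itemgetter

shuffled = [("data", 1), ("big", 1), ("big", 1), ("processing", 1)]

# Sort by key, then merge all values for the same key into one list.
shuffled.sort(key=itemgetter(0))
merged = [(key, [v for _, v in group])
          for key, group in groupby(shuffled, key=itemgetter(0))]
print(merged)  # [('big', [1, 1]), ('data', [1]), ('processing', [1])]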