Hour 7. Exploring Typical Components of HDFS Cluster

FIGURE 7.1 Components of an HDFS cluster.

The name node also keeps track of alive and available nodes in a cluster by means of a heartbeat it receives from individual data nodes at periodic intervals.
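
The heartbeat bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not HDFS code: the class name HeartbeatMonitor, the 30-second timeout, and the node names are assumptions made for the example.

```python
class HeartbeatMonitor:
    """Toy model of how a name node might track live data nodes (illustrative only)."""

    def __init__(self, timeout_seconds=30.0):
        # Hypothetical timeout; a real cluster takes this from its configuration.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # data node name -> time of most recent heartbeat

    def record_heartbeat(self, node, now):
        """Remember when a data node last reported in."""
        self.last_seen[node] = now

    def live_nodes(self, now):
        """Return the nodes whose last heartbeat falls within the timeout window."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.record_heartbeat("datanode0", now=0.0)
monitor.record_heartbeat("datanode1", now=0.0)
monitor.record_heartbeat("datanode0", now=25.0)  # datanode1 stays silent
print(monitor.live_nodes(now=40.0))  # prints ['datanode0']
```

A node that misses heartbeats for longer than the timeout simply drops out of the live list, which is why the name node never needs to poll the data nodes itself.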

The name node also responds to client requests. When client applications reach out to the name node with data read/write requests, the name node responds by providing clients with the necessary metadata and a list of data nodes that store the required data. Clients subsequently talk directly to relevant data nodes (see Figure 7.2). The name node itself does not store the actual data; it is only a metadata provider for client data requests.
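
To make the division of labor concrete, here is a minimal Python sketch of a client read. The classes, the block IDs, the file path, and the node names dn0 and dn1 are all hypothetical; the real HDFS client protocol is far more involved.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of data node names

    def get_block_locations(self, path):
        """Answer a client request with metadata only, never file contents."""
        return [(block, self.block_locations[block]) for block in self.file_blocks[path]]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes


def read_file(path, name_node, data_nodes):
    """Client-side read: ask the name node where blocks are, then fetch them directly."""
    content = b""
    for block, locations in name_node.get_block_locations(path):
        first_replica = data_nodes[locations[0]]  # any listed replica would do
        content += first_replica.blocks[block]
    return content


# Hypothetical two-block file replicated on two data nodes.
name_node = NameNode()
name_node.file_blocks["/logs/app.txt"] = ["blk_1", "blk_2"]
name_node.block_locations = {"blk_1": ["dn0", "dn1"], "blk_2": ["dn1", "dn0"]}

data_nodes = {"dn0": DataNode(), "dn1": DataNode()}
for node in data_nodes.values():
    node.blocks = {"blk_1": b"hello ", "blk_2": b"world"}

print(read_file("/logs/app.txt", name_node, data_nodes))  # prints b'hello world'
```

Notice that the NameNode class has no place to keep file contents at all; the bytes travel only between the client and the data nodes.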



FIGURE 7.2 The client gets a list of data nodes from the name node and interacts directly with data nodes.



Why the Secondary Name Node Is Not a Standby Node

Contrary to its name, the secondary name node is not a standby name node. To understand the purpose of a secondary name node, you must understand how a name node stores metadata related to the addition or removal of blocks in the file system.

The name node stores the HDFS metadata information in a metadata file titled fsimage. This image file is not updated on every addition or removal of a block in the file system. Instead, these add/remove operations are logged and maintained in a separate log file. Appending updates to a separate log achieves faster I/O.

The primary purposes of the secondary name node are to periodically download the name node image and log files, create a new image by merging the image and log files, and upload the new image back to the name node. The process of generating a new fsimage from a merge operation is called the checkpoint (see Figure 7.3). Without a secondary name node, the name node itself would have to perform this time-consuming work every time it restarted. Because the secondary name node performs this task periodically, the name node can restart faster.



FIGURE 7.3 A secondary name node generating a new fsimage.

The secondary name node is also responsible for backing up the name node image.
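
The checkpoint merge can be pictured with a short Python sketch. It assumes, purely for illustration, that the image is a set of block IDs and that the log is a list of add and remove operations; the real fsimage holds the full namespace, and the function name checkpoint is made up for this example.

```python
def checkpoint(fsimage, edit_log):
    """Merge logged add/remove operations into a copy of the image."""
    new_image = set(fsimage)  # work on a copy so the old image survives as a backup
    for operation, block in edit_log:
        if operation == "add":
            new_image.add(block)
        elif operation == "remove":
            new_image.discard(block)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return new_image


fsimage = {"blk_1", "blk_2"}
edit_log = [("add", "blk_3"), ("remove", "blk_1"), ("add", "blk_4")]

new_fsimage = checkpoint(fsimage, edit_log)
edit_log = []  # once the log is merged, it can start fresh
print(sorted(new_fsimage))  # prints ['blk_2', 'blk_3', 'blk_4']
```

The longer the log grows between checkpoints, the more operations must be replayed at startup, which is why performing the merge periodically shortens name node restarts.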



Standby Name Node

Before Hadoop 2.0.0, a cluster could have only one name node. Thus, the name node was the single point of failure in a Hadoop cluster. Name node failure gravely impacted cluster availability. Likewise, when a name node had to be taken down for maintenance or upgrades, the entire cluster was unavailable.

The HDFS high availability feature introduced in Hadoop 2.0 addressed this problem. Now a cluster can have two name nodes in an active-passive configuration (one node is active and the other node is in standby mode). The active node and the standby node remain in sync. If the active node fails, the standby node takes over and promotes itself to the active state.

The data nodes are configured with the location of both name nodes. Data nodes send periodic heartbeats to both name nodes as a confirmation that the data node is operational and that the blocks hosted by it are available. This is important in ensuring faster failover.

Figure 7.4 illustrates this major difference between the secondary name node and the standby name node. As you can see, data nodes are not aware of the secondary name node.



FIGURE 7.4 The differences between the secondary name node and the standby name node.

Tip

A cluster configured for high availability does not require a secondary name node. The task of performing checkpoint and backup becomes redundant with the presence of a standby name node, so the cluster is not required to have a separate secondary name node. In fact, you can reuse the hardware used for the secondary name node to host the standby node.



HDInsight Cluster Architecture

HDInsight deviates from the conventional Hadoop architecture by separating the storage from the cluster. HDInsight relies on Azure blob storage (wasb) as the default file system for storing data instead of using HDFS. Use of blob storage provides the following additional benefits:



Blob storage allows safe decommissioning of an HDInsight cluster without the loss of user data.

Because data storage is not dependent on the cluster, you can easily decommission the cluster whenever it is not in use, thus providing additional cost benefits.

Multiple clusters and other apps can access the same blob storage.

GO TO Hour 8, “Storing Data in Microsoft Azure Storage Blob,” explores Azure Storage Blob and looks at how locality of data to compute nodes is maintained, in spite of the separation of data from the cluster.

Note: HDFS Is Still Supported

The traditional approach of storage using HDFS is still supported, but it is more transitory in nature, and any data stored in HDFS is lost when a cluster is decommissioned. Hence, using HDFS for user data storage is not recommended. A better solution is to use traditional HDFS to store any temporary job data. Furthermore, HDInsight also uses traditional HDFS to store intermediate results and temporary data from MapReduce jobs and other processes.

Figure 7.5 provides a visual representation of the typical HDInsight cluster architecture.



FIGURE 7.5 HDInsight cluster architecture.

The HDInsight head node is conceptually equivalent to the traditional Apache Hadoop name node, discussed in the previous sections. The head node runs the following core Hadoop services:



Name node

Secondary name node

Resource Manager and MapReduce Job History Server (previously part of the Job Tracker service)

Note: Where Are the Job Tracker and Task Tracker?

In Hadoop 2.0 (with the YARN framework), Job Tracker capabilities are divided into components: the Resource Manager (responsible for resource management) and the MapReduce Application Master (manages the application’s life cycle and terminates when an application/MapReduce job is complete). The Job History Server has the task of providing information about completed jobs. Similarly, the Node Manager has replaced the Task Tracker on the compute nodes.

Try It Yourself: Exploring the Services on HDInsight Name Node

From the Services console on the name node and compute nodes, you can view the HDInsight services corresponding to the Hadoop daemons listed earlier:

1. Log in to the Azure management portal.

2. Click HDInsight on the left pane to bring up the list of HDInsight clusters.

3. Click the HDInsight cluster of interest.

4. From the top of the page, click Configuration.

5. If a remote desktop connection is not enabled, from the bottom of the page, click Enable Remote to configure a user for remote login (see Figure 7.6); otherwise, click Connect to connect to the name node.



FIGURE 7.6 Enabling a remote desktop connection to the HDInsight name node.

6. This brings up the desktop of the cluster’s name node (see Figure 7.7).



FIGURE 7.7 HDInsight name node desktop.



7. Launch the Services console by typing services.msc in the Run window or selecting View Local Services from Administrative tools in the Control Panel.

8. Examine the Apache Hadoop services in the Services window; look for the namenode, secondarynamenode, resourcemanager, and MapReduce Job History Server services (see Figure 7.8).



FIGURE 7.8 HDInsight name node services.

9. To view the services running on data nodes, open the Hadoop Name Node Status link on the desktop.

10. Click the Datanodes table to view the data node information (see Figure 7.9).



FIGURE 7.9 HDInsight data node information.

11. Use a remote desktop connection to remotely log in to one of the data nodes from within the name node.

12. Launch the Services console again on the data node and look for the datanode and nodemanager Apache Hadoop services (see Figure 7.10).



FIGURE 7.10 HDInsight data node services.



High Availability in HDInsight

HDInsight clusters (from HDI Version 2.1 onward) also support a second standby/passive head node, to support the name node high availability feature discussed in the earlier part of this hour. HA relies on quorum-based storage and failover detection using ZooKeeper, as the following sections explain.



HA Based on Quorum-Based Storage

HDInsight 3.1 is based on HDP 2.1, which utilizes the Quorum Journal Manager to achieve high availability. In this configuration, the active node writes edit log modifications to the journal machines.

To be considered successful, a journal log modification should be written to the majority of the journal nodes. The standby, or passive, name node keeps its state in sync with the active node by consuming the file system journal logged by the active name node.

If a failover occurs, the standby name node promotes itself to active state only after ensuring that it has read all the edits from the journal nodes.
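
The majority rule for journal writes can be sketched as follows. This is a simplified model under stated assumptions: the JournalNode class and the write_edit function are hypothetical names, and the sketch ignores the recovery and cleanup that a real quorum journal performs when a write fails.

```python
class JournalNode:
    """Toy journal node that may be unavailable."""

    def __init__(self, available=True):
        self.available = available
        self.edits = []

    def write(self, edit):
        """Store the edit and acknowledge it, unless this node is down."""
        if not self.available:
            return False
        self.edits.append(edit)
        return True


def write_edit(journal_nodes, edit):
    """Return True only if a strict majority of journal nodes accepted the edit."""
    acknowledgements = sum(1 for node in journal_nodes if node.write(edit))
    return acknowledgements > len(journal_nodes) // 2


journals = [JournalNode(), JournalNode(), JournalNode(available=False)]
print(write_edit(journals, "add blk_7"))  # prints True: 2 of 3 acknowledge

journals[1].available = False
print(write_edit(journals, "add blk_8"))  # prints False: only 1 of 3 acknowledges
```

With three journal nodes, the cluster keeps accepting edits when one of them is down, but not when two are, which is why an odd number of journal nodes is the usual choice.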



Failover Detection Using ZooKeeper

The ZooKeeper Failover Controller (ZKFC) service running on the name nodes is responsible for detecting a failure and recognizing a need to fail over to the standby node (see Figure 7.11). ZKFC uses the ZooKeeper service for coordination and to detect a need for failover.



FIGURE 7.11 High availability in HDInsight using Quorum-based storage and ZooKeeper.

Because ZKFC runs on both active and standby name nodes, it risks a split-brain scenario in which both nodes try to achieve active state at the same time. To prevent this, ZKFC tries to obtain an exclusive lock on the ZooKeeper service. The service that successfully obtains a lock is responsible for failing over to its respective name node and promoting it to active state.
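
The way an exclusive lock prevents split brain can be illustrated with a small in-memory stand-in. The LockService class and the elect_active function below are hypothetical and do not use the ZooKeeper API; the host name headnode1 is likewise an assumption made for the example.

```python
class LockService:
    """Stand-in for the coordination service: at most one holder at a time."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, candidate):
        """Grant the lock only if it is free; report whether candidate holds it."""
        if self.holder is None:
            self.holder = candidate
        return self.holder == candidate

    def release(self, candidate):
        """Free the lock, for example when the holder's session ends."""
        if self.holder == candidate:
            self.holder = None


def elect_active(lock_service, candidates):
    """Each failover controller tries for the lock; only the winner becomes active."""
    return [node for node in candidates if lock_service.try_acquire(node)]


lock = LockService()
print(elect_active(lock, ["headnode0", "headnode1"]))  # prints ['headnode0']

lock.release("headnode0")  # simulate the active node failing
print(elect_active(lock, ["headnode1"]))  # prints ['headnode1']
```

Because the lock has a single holder, two controllers that race for it can never both conclude that their name node should be active.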

An examination of the entries in the hosts file reveals names and IP addresses of the machines in the cluster (see Figure 7.12).



FIGURE 7.12 Examining the host entries in the name node.

If you log in to zookeepernode0 and examine the services running in the Services console, you can see that zkServer (the ZooKeeper Service) is indeed running on the machine (see Figure 7.13).



FIGURE 7.13 Windows Service for ZooKeeper server.

The Hadoop Service Availability Status page helps determine the active name node and the status of services on the node. To examine the status of services and determine the active name node, double-click the Hadoop Service Availability Status icon on the name node desktop. This launches the Hadoop Service Availability Status page (see Figure 7.14). You can see in Figure 7.14 that headnode0 is the active name node and is running the following services:

Name node

Resource manager

Job history

Templeton

Oozie service

Metastore

HiveServer2



FIGURE 7.14 Checking the Hadoop service availability status.



Summary

In this hour, you explored the typical components of a Hadoop cluster and understood their importance in the context of HDInsight. You also saw why the secondary name node is not actually a standby name node. In addition, you learned why blob storage is recommended over HDFS for user data storage in an HDInsight cluster. Traditional HDFS is transitory in nature and is better suited for storing intermediate processing results. The hour concluded with an overview of the name node HA feature supported by the HDInsight service.



Q&A

Q. Are any other mechanisms available for attaining high availability?

A. Yes, a similar approach is based on shared storage instead of the journal nodes. In a shared storage mechanism, instead of logging to journal nodes, the active node logs a record to a file in shared storage that the standby node then reads.

Q. Does switching to a high-availability configuration incur extra cost because it involves provisioning a separate standby name node?

A. Switching to a high-availability configuration does not incur extra costs if the default large-size (A3) head node is used for provisioning. Choosing a higher configuration for the head node (extra-large and above) does involve additional costs with high-availability configuration. Refer to http://azure.microsoft.com/en-us/pricing/details/hdinsight/ for more details on pricing.

Q. How can data be transferred to and from traditional HDFS?

A. You can access traditional HDFS using hdfs:///.

The HDFS commands for data transfer and retrieval work as expected. For example, you can use the following command to copy files to HDFS:

hadoop fs -copyFromLocal C:\data\*.txt hdfs:///



Without the URI specified, this command copies the files to the default file system (blob storage, in an HDInsight cluster).
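
The effect of including or omitting the URI can be seen in a small Python helper that only assembles the command line. The function name copy_from_local_args and its use_hdfs parameter are hypothetical; the sketch builds the arguments and does not run Hadoop.

```python
def copy_from_local_args(local_path, destination="/", use_hdfs=False):
    """Build the argument list for `hadoop fs -copyFromLocal` (does not run it)."""
    if use_hdfs:
        # Explicit scheme: target the cluster's traditional HDFS.
        target = "hdfs://" + destination
    else:
        # No scheme: the path resolves against the default file system.
        target = destination
    return ["hadoop", "fs", "-copyFromLocal", local_path, target]


print(" ".join(copy_from_local_args(r"C:\data\*.txt", use_hdfs=True)))
print(" ".join(copy_from_local_args(r"C:\data\*.txt")))
```

The first line printed matches the command shown above; the second omits the scheme, so on an HDInsight cluster the files would land in blob storage.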



Quiz

1. What is the purpose of the secondary name node, and how does it differ from the standby name node?

2. What major change does HDInsight bring about in storing user data on the cluster?

3. What are the main functions of a name node in an HDFS cluster?



Answers

1. The primary purpose of the secondary name node is to perform checkpoints and back up the name node image. The secondary name node is not actually a standby name node; it does not contribute to attaining name node high availability.

2. HDInsight relies on Azure blob storage as the default file system and separates user data from the cluster.

3. The name node is responsible for handling namespace and block management, keeping track of active data nodes, and responding to client requests.


