Hour 11. Customizing the HDInsight Cluster with Script Action


Developing Script Action

Script Action scripts are developed in PowerShell. To better understand the process of script development, consider the following steps for installing Giraph on an HDInsight cluster:

1. Obtain Giraph source code and build Giraph for a specific version of HDInsight.

2. Download and import the HDInsight Utilities module file, which contains the helper methods for performing common tasks. (The PowerShell script developed later in Listing 11.1 includes the download URI.)

3. Download the zipped Giraph binary files from a private or public file share accessible from the cluster to a temporary location.

4. Unzip the downloaded file to the installation directory under C:\apps (C:\hdp for an HDInsight emulator).

5. Optionally, copy the samples to /example/jars/ on the default file system.

6. Configure the environment variable GIRAPH_HOME to point to the installation directory.

Note: Building Giraph from Source Is Optional

Building Giraph from source is optional because Microsoft has already provided Script Actions and binaries for installing several popular Hadoop projects, including Giraph. For example, you can obtain zipped Giraph binaries targeting HDInsight 3.1 from https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-1.2.0.zip.



Using the HDInsight Utilities Module

To ease the process of Script Action development, Script Action provides helper methods to perform common installation tasks. The HDInsight Utilities module file defines these methods. Table 11.1 lists some commonly used methods and their purposes.



TABLE 11.1 Common Helper Methods in Script Action

Listing 11.1 provides the script to install Giraph, leveraging helper methods in Table 11.1.

LISTING 11.1 Script to Install Giraph




# Source and destination paths
$HDInsightUtilitiesDownloadLoc = "https://hdiconfigactions.blob.core.windows.net/configactionmodulev01/HDInsightUtilities-v01.psm1"
$giraphSrcLoc = "https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-1.2.0.zip"
$installationFolder = (Get-Item "$env:HADOOP_HOME").parent.FullName;
$HDInsightUtilitiesLoc = $installationFolder + "\HDInsightUtilities.psm1"

# Download HDInsightUtilities module
$utilWebClient = New-Object System.Net.WebClient;
$utilWebClient.DownloadFile($HDInsightUtilitiesDownloadLoc, $HDInsightUtilitiesLoc);

# Import HDInsightUtilities module.
# This happens before the check below because Write-HDILog is defined in the module.
Import-Module $HDInsightUtilitiesLoc;

# Stop execution if the Giraph installation directory already exists.
# This ensures that the script can be executed again safely when a VM is reimaged.
if (Test-Path ($installationFolder + '\' + 'giraph-1.2.0'))
{
    Write-HDILog "Installation directory already exists!";
    exit;
}

# Download Giraph binary file to a temporary location
$temporaryZipFile = $env:temp + '\' + [guid]::NewGuid() + '.zip';
Save-HDIFile -SrcUri $giraphSrcLoc -DestFile $temporaryZipFile;

# Unzip the downloaded file to the installation directory
Expand-HDIZippedFile -ZippedFile $temporaryZipFile -UnzipFolder $installationFolder;

# Delete temporary files
Remove-Item $temporaryZipFile;

# Point GIRAPH_HOME at the installation directory for all users on the machine
[Environment]::SetEnvironmentVariable('GIRAPH_HOME', $installationFolder + '\' + 'giraph-1.2.0', 'Machine');



Remember the following considerations when developing Script Action:

The location containing script files should be accessible to cluster nodes. Using blob storage is best.

Safely executing the same script more than once on a node should be possible. This is helpful when a VM in the cluster gets reimaged and requires re-execution of the Script Action.

Use either C:\Apps or D:\ as the installation path. Other locations on the C: drive are reserved locations and should not be used.

If a script modifies the Hadoop configuration or an operating system setting, restart HDInsight services, if needed.
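The requirement that a script can run more than once safely boils down to a check-before-install guard, the same idea as the Test-Path check in Listing 11.1. The following is a minimal Python sketch of that pattern, not part of the Script Action tooling: the function name install_once, the throwaway archive, and the folder layout are all assumptions made for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def install_once(zip_path, install_root, folder_name):
    """Extract zip_path under install_root unless folder_name already exists there.

    Returns True if files were extracted, False if the install was skipped.
    """
    target = Path(install_root) / folder_name
    if target.exists():
        return False  # already installed, so a rerun changes nothing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(install_root)
    return True


# Demo with a throwaway archive standing in for the real binaries (hypothetical content).
with tempfile.TemporaryDirectory() as work:
    work = Path(work)
    archive_path = work / "package.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("giraph-1.2.0/README.txt", "placeholder")
    root = work / "apps"
    root.mkdir()
    first_run = install_once(archive_path, root, "giraph-1.2.0")
    second_run = install_once(archive_path, root, "giraph-1.2.0")
    print(first_run, second_run)
```

The first call extracts the archive and the second call returns without touching the folder, which is the behavior a reimaged node relies on when the script is executed again.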



Consuming Script Action

Script Action can be configured for execution from the Windows Azure Management Portal, PowerShell, or the HDInsight .NET SDK when you provision a new cluster.



Using Script Action with the Azure Management Portal

When provisioning the HDInsight cluster with a custom create option, you can add Script Action for cluster customization by specifying the Script Action name and script and then selecting the nodes on which the script is to be executed (see Figure 11.1). You can add multiple Script Actions by clicking the Add Script Action button.



FIGURE 11.1 Adding Script Actions to customize the HDInsight cluster.



Using Script Action with PowerShell

You can also invoke Script Actions using the Add-AzureHDInsightScriptAction PowerShell cmdlet while provisioning the HDInsight cluster with PowerShell. The following code snippet illustrates how to use Add-AzureHDInsightScriptAction to add Script Action to a cluster configuration object:


$clusterConfig = Add-AzureHDInsightScriptAction -Config $clusterConfig -Name "GiraphInstaller" -ClusterRoleCollection HeadNode -Uri <uri>



The Add-AzureHDInsightScriptAction cmdlet accepts as arguments the cluster configuration object, the Script Action name, the nodes on which the customization script is to be run (HeadNode, DataNode, or both), the script URI, and, optionally, any input parameters that the script requires.

The cluster configuration object created is supplied as an argument to the New-AzureHDInsightCluster cmdlet. The rest of the cluster-provisioning process is similar to the process described in Hour 6, "Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning."



Using Script Action with HDInsight .NET SDK

In addition, you can invoke Script Actions when provisioning the cluster with the HDInsight .NET SDK. Recall from Hour 6 that you can use the ClusterCreateParameters class to specify cluster configuration properties during cluster provisioning. You can add Script Actions to the ConfigActions property of the ClusterCreateParameters class using the ScriptAction class object, as the following code snippet illustrates:


clusterConfig.ConfigActions.Add(new ScriptAction("GiraphInstaller",
    new ClusterNodeType[] { ClusterNodeType.HeadNode },
    new Uri("<uri>"), null));



Here the ScriptAction constructor accepts as arguments the Script Action name, the nodes on which the customization script is to be run, the script URI, and, optionally, any input parameters the script requires.

The cluster configuration object created is supplied as an argument to the CreateCluster function (defined in the HDInsightClient class), and the rest of the cluster-provisioning process is similar to the process described in Hour 6.



Running a Giraph Job on a Customized HDInsight Cluster

A customized cluster with Giraph installed can run Giraph jobs. To understand the steps involved in running a Giraph job, consider the SimpleShortestPathsComputation example from the Apache Giraph documentation, at http://giraph.apache.org/quick_start.html. The example calculates the length of the shortest path from a given source node to all the nodes in a graph using the following directed graph data (see Figure 11.2):

[0,0,[[1,1],[3,3]]]

[1,0,[[0,1],[2,2],[3,1]]]

[2,0,[[1,2],[4,4]]]

[3,0,[[0,3],[1,1],[4,4]]]

[4,0,[[3,4],[2,4]]]



FIGURE 11.2 Visual representation of graph data used in the SimpleShortestPathsComputation example.

This data is in the format [source_node, source_node_value, [[dest_node, edge_value], ...]]. For example, the first line states that source node 0 has a value of 0 and is connected to destination node 1 via an edge with a weight of 1, and to destination node 3 via an edge with a weight of 3. The simple shortest path example treats node 1 as the source node and calculates the shortest path from it to all other nodes.
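To make the data format concrete, the following Python sketch parses the five lines above and computes shortest distances from node 1 on a single machine. It is only a local cross-check and does not use Giraph: Dijkstra's algorithm stands in for Giraph's vertex-centric computation, and the function names parse_graph and shortest_paths are assumptions made for this illustration.

```python
import heapq
import json

# Graph data from the text: [source_node, source_node_value, [[dest_node, edge_value], ...]]
TINY_GRAPH = """\
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
"""


def parse_graph(text):
    """Return {node: [(dest, weight), ...]} from one JSON array per line."""
    graph = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        node, _value, edges = json.loads(line)
        graph[node] = [(dest, float(weight)) for dest, weight in edges]
    return graph


def shortest_paths(graph, source):
    """Dijkstra's algorithm; returns {node: distance from source}."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for dest, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(dest, float("inf")):
                dist[dest] = candidate
                heapq.heappush(queue, (candidate, dest))
    return dist


graph = parse_graph(TINY_GRAPH)
distances = shortest_paths(graph, source=1)
for node in sorted(distances):
    print(node, distances[node])
```

Run locally, the sketch reports a distance of 0.0 for node 1 itself, 1.0 for nodes 0 and 3, 2.0 for node 2, and 5.0 for node 4 (reached through node 3).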

Listing 11.2 provides the PowerShell script to submit a Giraph job. The simple shortest path computation algorithm has been implemented as one of the examples in giraph-examples.jar. Table 11.2 lists the arguments the Giraph job requires. The GiraphRunner helper class runs Giraph jobs by consuming the arguments provided.



TABLE 11.2 Parameters the Giraph Job Requires

Before submitting the Giraph job, copy the giraph-examples.jar from %GIRAPH_HOME%\giraph-examples.jar to /example/jars/giraph-examples.jar in blob storage. Also, save the graph vertex data to a text file and copy it to /example/data/tiny_graph.txt in blob storage. You can use a cloud storage explorer tool or PowerShell for this purpose (see Hour 8, "Storing Data in Microsoft Azure Storage Blob"). Save the PowerShell script in Listing 11.2 to a file and execute the script using Azure PowerShell. When job execution completes, the results are stored in /giraphoutput/shortestpath in two files: part-m-00001 and part-m-00002.

LISTING 11.2 Submitting a Giraph Job


$hdInsightClusterName = "HDInsightClusterName"
$giraphExamplesJarFile = "/example/jars/giraph-examples.jar"

# Arguments passed to GiraphRunner (see Table 11.2)
$giraphJobParameters = "org.apache.giraph.examples.SimpleShortestPathsComputation",
    "-ca", "mapred.job.tracker=headnodehost:9010",
    "-vif", "org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat",
    "-vip", "/example/data/tiny_graph.txt",
    "-vof", "org.apache.giraph.io.formats.IdWithValueTextOutputFormat",
    "-op", "/giraphoutput/shortestpath",
    "-w", "2"

# Create the definition
$giraphJobDefinition = New-AzureHDInsightMapReduceJobDefinition -JarFile $giraphExamplesJarFile `
    -ClassName "org.apache.giraph.GiraphRunner" -Arguments $giraphJobParameters

# Get HDInsight credentials
$creds = Get-Credential

# Submit job for execution
$giraphJob = Start-AzureHDInsightJob -Cluster $hdInsightClusterName `
    -JobDefinition $giraphJobDefinition -Credential $creds

# Print job ID
Write-Host "JobID:" $giraphJob.JobId -ForegroundColor Green



The following is the combined output from the two output files:

0    1.0

4    5.0

2    2.0

1    0.0

3    1.0

You can infer that the shortest path from node 1 to node 0 has a value (travel cost or weight) of 1. Similarly, the shortest path from node 1 to node 4 has a value of 5 (the sum of the shortest distance from node 1 to node 3 and from node 3 to node 4).



Testing Script Action with HDInsight Emulator

You can test most Script Actions on an HDInsight emulator by manually invoking the PowerShell script. In the HDInsight emulator, Hadoop components are installed in the C:\hdp folder instead of the C:\apps folder used with an HDInsight cluster. You can accommodate this in the script by extracting the correct installation path from Hadoop home by using the expression (Get-Item "$env:HADOOP_HOME").parent.FullName.

Caution

You might not be able to test scripts that have a dependency on specific HDInsight cluster services that are not available in an HDInsight emulator. Such scripts can be tested only on an HDInsight cluster.
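The (Get-Item "$env:HADOOP_HOME").parent.FullName expression discussed in this section simply takes the parent folder of Hadoop home. As a rough illustration only, the following Python sketch mirrors that lookup with pathlib.PureWindowsPath. Both HADOOP_HOME values are hypothetical placeholders (the hadoop-x.y.z folder name is invented), and real installations may nest the folder more deeply.

```python
from pathlib import PureWindowsPath


def installation_folder(hadoop_home):
    """Return the parent folder of hadoop_home as a Windows path string."""
    return str(PureWindowsPath(hadoop_home).parent)


# Hypothetical HADOOP_HOME values, used only to show the idea.
cluster_folder = installation_folder(r"C:\apps\hadoop-x.y.z")
emulator_folder = installation_folder(r"C:\hdp\hadoop-x.y.z")
print(cluster_folder)
print(emulator_folder)
```

Because the installation folder is derived from the environment rather than hard-coded, the same script can target either layout without modification.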



Summary

This hour explored the process of Script Action development to install custom Hadoop projects on an HDInsight cluster. Giraph is a Hadoop project that is well suited to graph computations. The hour demonstrated the steps involved in developing a Script Action to install Giraph on an HDInsight cluster and use it to process a graph problem.



Q&A

Q. When is a Script Action invoked?

A. Script Actions are invoked during the cluster provisioning process, after the cluster creation is complete but before the cluster becomes operational. Script Actions are also invoked when a VM in the cluster is reimaged.

Q. What are the advantages of using Giraph for graph processing, compared to directly writing MapReduce programs?

A. Graph problems often involve multiple iterations and state transitions. Hopping and transmission of messages between vertices are other common computation operations. Modelling such problems in conventional MapReduce is not a trivial task. Having a MapReduce job for each iteration requires creating multiple jobs, which leads to multiple key/value pair-based read/write operations for saving and retrieving state among multiple graph iterations. Giraph attempts to solve these problems by providing a MapReduce-based graph-processing solution to conveniently model graph problems. Modelling a graph problem as a set of vertices and edges instead of mappers and reducers allows for a simpler and more elegant implementation of graph problems. Vertices can both send messages to other vertices and receive messages sent from previous iterations in the computation, also called supersteps. Initial Giraph implementations kept graph state in memory for the entire time during a computation, to accomplish this with minimum disk read/write operations. However, the implementation changed later. With out-of-core capability implemented, beyond a certain limit, Giraph partitions and messages now get written to disk. Partitions are swapped between disk and memory, based on usage. Hence, Giraph delivers when it comes to performing graph computations.
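The superstep model described in this answer can be sketched in a few lines. The following Python simulation is an illustration under stated assumptions, not Giraph's implementation: it runs in one process, the function name run_supersteps is invented, and the edge list is the tiny graph used earlier in this hour. In each superstep, a vertex reads the messages sent to it in the previous superstep, keeps the smallest distance, and sends updated distances along its outgoing edges only if its own value improved.

```python
import math

# Directed, weighted edges from the tiny graph used earlier in this hour.
EDGES = {
    0: [(1, 1.0), (3, 3.0)],
    1: [(0, 1.0), (2, 2.0), (3, 1.0)],
    2: [(1, 2.0), (4, 4.0)],
    3: [(0, 3.0), (1, 1.0), (4, 4.0)],
    4: [(3, 4.0), (2, 4.0)],
}


def run_supersteps(edges, source):
    """Simulate vertex-centric shortest paths; returns (values, superstep_count)."""
    values = {vertex: math.inf for vertex in edges}
    inbox = {source: [0.0]}  # messages delivered at the start of the first superstep
    supersteps = 0
    while inbox:  # the computation halts when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < values[vertex]:
                values[vertex] = best
                # Tell each neighbor the distance it would have via this vertex.
                for dest, weight in edges[vertex]:
                    outbox.setdefault(dest, []).append(best + weight)
        inbox = outbox
        supersteps += 1
    return values, supersteps


values, supersteps = run_supersteps(EDGES, source=1)
print(values, supersteps)
```

All state lives in the values dictionary between supersteps, which is the in-memory behavior the answer contrasts with writing key/value pairs to storage after every MapReduce job.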



Quiz

1. What physical locations can you use to install custom components?

2. Which helper method can you use to check whether an HDInsight service is in running state?



Answers

1. Use either C:\Apps or D:\ as the installation path. Other locations on the C: drive are reserved and should not be used.

2. You can use the Get-HDIServiceRunning helper method to verify whether an HDInsight service is in running state, specifying the name as the parameter.



Part IV: Querying and Processing Big Data in HDInsight

Hour 12. Getting Started with Apache Hive and Apache Tez in HDInsight

What You'll Learn in This Hour:

Introduction to Apache Hive

Getting Started with Apache Hive in HDInsight

Azure HDInsight Tools for Visual Studio

Programmatically Using the HDInsight .NET SDK

Introduction to Apache Tez

In the last few hours, you learned in detail about Hadoop (HDFS for data storage and MapReduce as a programming framework). You also looked into writing programs (Mapper, Reducer, and Driver) that target the MapReduce framework. But you might have noticed that writing a MapReduce program, whether in Java or any other programming language of your choice using Hadoop Streaming, is not easy. It requires a great deal of expertise, a different approach to programming, and a significant amount of development time. The task becomes even more difficult if you have two or more data sets to join to get a result.

This is where Apache Hive comes in handy. Apache Hive runs on top of the Hadoop framework and enables you to write your data processing logic (with joins, groups, sorts, and so on) in a Structured Query Language (SQL)-like declarative language (which you probably have been familiar with for several years). With the help of Apache Hive, you can write your queries in just a few lines, saving you time and effort. For example, a simple query of less than 10 lines translates to a MapReduce program with more than 100 lines of code.

In this hour, you delve into how to use Apache Hive, the different ways of writing and executing HiveQL queries in HDInsight, and how Apache Tez improves the overall performance severalfold for HiveQL queries.



Introduction to Apache Hive

You can think of Hive as a SQL abstraction layer over Hadoop MapReduce with a SQL-like query engine. Hive enables you to write data processing logic or queries in a declarative language called HiveQL, which resembles the SQL used in relational database systems. As you can see in Figure 12.1, when you execute a HiveQL query, Hive translates the query into a series of equivalent MapReduce jobs, saving you the time and effort of writing actual MapReduce jobs on your own. Then Hive executes the query.
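To give a feel for what such a translation involves, the following Python sketch spells out the map, shuffle, and reduce phases that a one-line grouped count would need. It is a conceptual, single-process illustration and not how Hive generates jobs: the rows, the column names, and the query in the comment (against a hypothetical employees table) are all assumptions.

```python
from collections import defaultdict

# Hypothetical rows of a table with columns (name, dept).
# Conceptually comparable HiveQL: SELECT dept, COUNT(*) FROM employees GROUP BY dept;
ROWS = [("ann", "sales"), ("bob", "ops"), ("cy", "sales"), ("di", "ops"), ("ed", "sales")]


def map_phase(rows):
    """Emit one (dept, 1) pair per row, as a mapper would."""
    for _name, dept in rows:
        yield dept, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(ROWS)))
print(counts)
```

Even this toy version needs three functions for what the query states in one line, and a real MapReduce program adds job configuration, input and output formats, and serialization on top.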


