Hour 21. Performing Big Data Analytics with Spark

MLlib, a machine learning library

GraphX, for graph analytics

FIGURE 21.1 Higher-level programming tools Spark offers.

Note: Spark Holds the World Speed Record for Sorting 100 TB of Data

According to benchmark numbers released by Databricks (the company founded by the creators of Spark), Spark performed a distributed sort of 100 TB of data in 23 minutes, three times faster than the previous record, set by MapReduce, while using 10 times less computing power.



Installing Spark on HDInsight

The custom Script Action feature introduced in Hour 11, "Customizing the HDInsight Cluster with Script Action," enables you to customize the HDInsight cluster and install Spark.

Note

Microsoft provides a sample Spark installation script at https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/sparkinstaller-v02.ps1.

To install Spark on Hadoop, provision a new HDInsight cluster with the Custom Create option from the Microsoft Azure Management portal. On the Script Actions screen, provide the URL of the Spark installation script (see Figure 21.2).



FIGURE 21.2 Installing Spark on an HDInsight cluster.

The script installs Spark version 1.0.2 on all nodes of the cluster.



Spark Programming Model

MapReduce has proven to be a highly successful programming model for performing data analysis on distributed Big Data sets. With its increased adoption and popularity, expectations of the MapReduce framework have grown. However, the MapReduce framework has not been able to satisfy analysts in the following problem domains:

Interactive querying and ad hoc data analysis

Problems involving multiple iterations over a dataset, such as graph processing and machine learning

Real-time stream-processing scenarios

These problem scenarios share a similar processing pattern that involves multiple MapReduce jobs running in succession, with each step accessing intermediate results produced by the previous step. The MapReduce framework writes the intermediate results to the file system, which the next computation step then consumes. Sharing data between iterations through the disk involves considerable time spent writing intermediate results and working sets to the distributed file system and then reading them back for processing by the next step.

Spark offers a new programming model that uses memory instead of the disk for data sharing between iterations. Reducing disk I/O enables Spark to achieve processing speeds up to 100 times faster than those of conventional MapReduce.

However, using memory for data sharing introduces new challenges for making the programming model fault tolerant. One solution for achieving fault tolerance is to replicate the memory contents to another node over the network. This solution is likely to cause performance problems, even over the fastest networks, because replication over a network is a time-consuming process.

Spark instead makes use of resilient distributed datasets (RDDs) for fault tolerance. RDDs are distributed objects that can be cached in memory across cluster nodes. Instead of replicating data, fault tolerance is achieved by keeping track of the high-level parallel operators applied to each RDD (its lineage). If a node fails, the RDD is recomputed on a different node by reapplying the sequence of operations to the affected subset of the data.

Note

More than 80 parallel operators can be applied to RDDs. Common ones include map, reduce, groupBy, filter, join, count, distinct, max, and min.
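The following minimal sketch, which you can type into the Spark shell for Scala, chains a couple of these operators on a tiny RDD built with sc.parallelize (the sample words are made up purely for illustration). The toDebugString call prints the lineage of operators that Spark tracks and would replay to recompute a lost partition:

// A tiny illustrative RDD; the sample values are hypothetical.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

// Chain two common parallel operators.
val sparkHits = words.filter(_ == "spark").distinct()

// Print the lineage (the chain of operators) that Spark tracks for fault tolerance.
println(sparkHits.toDebugString)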

When working with MapReduce, programmers primarily focus on breaking the problem into a set of Map and Reduce functions. With Spark, the approach is different. A Spark application consists of a driver program in which the user either creates a new RDD by reading data from a file residing on a distributed file system or works with existing RDDs by applying various transformation operations.

Users can also choose to cache RDDs in memory and reuse them across operations, thus saving on disk I/O. The two types of RDD operations are transformations and actions, defined as follows:

Transformation operation—Creates a new RDD from an existing one. In Spark, transformations are lazily evaluated: they are not executed until an action requires a result to be computed. For example, applying a filter operation to a dataset to select records matching a filter criterion does not cause any computation or data processing to be performed. The filter is not carried out until an action operation that requires a result, such as count, is applied to the dataset. Lazy evaluation enables the Spark engine to optimize the steps in a data processing workflow and reduces the memory footprint, because expressions are evaluated and data is read only when needed (see the sketch that follows this list).

Action operation—Causes a computation to be performed.
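To make the distinction concrete, here is a minimal sketch for the Spark shell in Scala; the dataset is fabricated for illustration. The two transformations return immediately without touching the data, and only the final count action triggers any computation:

// Transformations only build an execution plan; nothing runs yet.
val numbers = sc.parallelize(1 to 1000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation (lazy)
val squares = evens.map(n => n * n)        // transformation (lazy)

// The action below is what actually evaluates the pipeline.
println(squares.count())                   // action (eager); prints 500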



Log Mining with the Spark Shell

To better understand the Spark programming model, consider a scenario that involves interactively loading log data (the sample log data provided with the HDInsight cluster) into memory and querying for error messages that match certain criteria. Figure 21.3 shows the contents of the sample log data provided with an HDInsight cluster, located at /example/data/sample.log in the blob store.



FIGURE 21.3 Sample log data provided with an HDInsight cluster.

Tip

The file is only a few kilobytes in size, but the same kind of analysis easily extends to much larger data files.

To begin interactive analysis, launch the Spark shell by navigating to the %SPARK_HOME%\bin directory on the name node and issuing the command spark-shell.cmd. This launches the Spark shell for Scala. (To use the Spark shell for Python, issue the command pyspark.cmd instead.) Figure 21.4 shows the Spark shell for Scala.



FIGURE 21.4 Spark shell for Scala.

Spark's Scala interpreter requires some familiarity with the Scala programming language. Scala is an object-oriented language that incorporates full-blown functional programming features such as type inference, immutability, lazy evaluation, and pattern matching. Developers can use it to write concise, cleaner, and better-performing code. As a beginner, you can get started by treating Scala as "Java without semicolons."

Start by creating a new RDD named logFile from the sample.log file, using the following Scala code snippet:


val logFile = sc.textFile("/example/data/sample.log")



Next, apply a filter transformation on the logFile RDD to create a new RDD named errorsEntries that contains only the error messages from the log file, using the following code snippet:


val errorsEntries = logFile.filter(_.contains("[ERROR]"))



Recall that transformation operations are lazily evaluated; so far, Spark has not performed any data processing or computation.

Because the errorsEntries RDD will be queried multiple times, you can ask Spark to persist the RDD after evaluation, using the following code snippet:

errorsEntries.persist()



Next, use the count() action operator to count the number of error log entries:

errorsEntries.count()



This causes the Spark engine to evaluate all the transformation operations applied so far and compute the result of the count operation. Because you persisted the errorsEntries RDD, further action and transformation operators applied to the RDD will not require reevaluating the RDD from the flat file.

Perform the following actions one by one on the persisted RDD to count the number of error log entries related to SampleClass1 and SampleClass4 in the log:


errorsEntries.filter(_.contains("SampleClass1")).count()
errorsEntries.filter(_.contains("SampleClass4")).count()
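When the interactive analysis is complete, you can optionally release the memory held by the cached RDD. This housekeeping step is not part of the original walkthrough, just a small follow-up sketch:

// Remove the cached RDD's blocks from the executors' memory.
errorsEntries.unpersist()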



Blending SQL Querying with Functional Programs

As discussed earlier, Spark also provides SQL querying capabilities on top of the Spark engine through Spark SQL. Spark SQL enables relational queries to be embedded in functional programs. This way, programmers can easily access the best of both worlds from the same programming environment.

Note: Shark Was the First Project That Allowed SQL Querying on Spark

Shark was the first project that supported running SQL queries on Spark. However, Shark worked by converting the MapReduce operators generated by the Hive optimizer into Spark operators. Because the Hive optimizer was designed to work with MapReduce, this approach did not perform well. Spark SQL deviates from this approach and relies on an RDD-aware query optimizer for better performance.

Following are the main components of Spark SQL:

Catalyst Optimizer—Optimizes a logical query plan to achieve better query performance

Spark SQL Core engine—Executes optimized logical query plans as operations on RDDs

Hive support layer—Enables Spark SQL to interact with Hive

Spark SQL operates on SchemaRDDs, which are simply RDDs with schema information attached. The schema information is essential for RDDs to be used efficiently in declarative SQL queries.



Hive Compared to Spark SQL

Hive already provides SQL querying capabilities on Big Data, but it is not well suited for interactive querying and ad hoc analysis. This can be attributed to the fact that the Hive execution engine relies on MapReduce running on top of the distributed file system.

Spark SQL, on the other hand, converts SQL queries to Spark operators and thus uses Spark running on a distributed file system as the execution engine. This makes Spark SQL better suited for interactive querying and ad hoc analysis scenarios involving multiple iterations over the same set of data.

Still, Spark can also work with existing Hive tables. This makes switching from Hive to Spark SQL easy—you simply use a different client to query the same data.
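As a hedged illustration of that point, the following sketch queries an existing Hive table from the Spark shell through Spark SQL's Hive support layer. It assumes the cluster's Hive metastore is reachable and uses hivesampletable, the sample table that HDInsight clusters typically include; note that in the Spark 1.0.x line the query method is hql(), while later releases use sql():

// Hive-aware context; requires Spark to be configured with Hive support.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Query an existing Hive table directly, without moving the data.
val sample = hiveContext.hql("SELECT * FROM hivesampletable LIMIT 10")
sample.collect().foreach(println)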



Using SQL Blended with Functional Code to Analyze Crime Data

To appreciate the benefits of Spark SQL and Spark's capability to blend SQL queries with functional programs, consider the following scenario, which involves analyzing robbery-related crimes by location in Chicago. The end goal of the scenario is to list robbery locations sorted by the number of incidents, in descending order.

The scenario uses the "Crimes—One year prior to present" dataset, available at https://data.cityofchicago.org/Public-Safety/Crimes-One-year-prior-to-present/x2n5-8w5q. The dataset reflects reported incidents of crime (with the exception of murders, for which data exists for each victim) that have occurred in the city of Chicago over the past year.

To download the data, click the Export button and select the CSV option from the Download As menu (see Figure 21.5).



FIGURE 21.5 Downloading city of Chicago crime data.

The CSV file has the following schema:


CASE#, DATE OF OCCURRENCE, BLOCK, IUCR, PRIMARY DESCRIPTION, SECONDARY DESCRIPTION,
LOCATION DESCRIPTION, ARREST, DOMESTIC, BEAT, WARD, FBI CD, X COORDINATE, Y COORDINATE,
LATITUDE, LONGITUDE, LOCATION



The columns PRIMARY DESCRIPTION (representing the primary crime type, such as ROBBERY) and LOCATION DESCRIPTION are of interest in this scenario.

Tip

Before you upload the CSV file to the blob store, remove the file header. This is easier than trying to skip it programmatically during processing.

After you trim the header, upload the CSV file to /CrimeData/Crimes__One_year_prior_to_present.csv in the default blob store that the HDInsight cluster uses.

Launch the Spark shell and start by creating a SQLContext. A SQLContext provides the functionality for working with SchemaRDDs, turning them into tables and running SQL queries against them.


val sqlContext = new org.apache.spark.sql.SQLContext(sc)



Import the sqlContext members to get access to all the public SQL functions and implicit conversions, using the following statement:

import sqlContext._



Create a case class to define the schema of the crime and location columns:


case class Crime(PRIMARY_DESCRIPTION: String, LOCATION_DESCRIPTION: String)



Next, create a new RDD from the CSV file that you uploaded to the blob store earlier, splitting each line on the comma character to separate the columns. Extract columns 4 (primary description) and 6 (location description), counting from zero, to create a SchemaRDD.


val crimes = sc.textFile("/CrimeData/Crimes__One_year_prior_to_present.csv")
  .map(line => line.split(","))
  .map(p => Crime(p(4), p(6)))



Register the SchemaRDD as a table. This makes it available for SQL querying.


crimes.registerAsTable("crimes")



Next, use sqlContext to invoke a SQL query against the registered table and extract robbery locations sorted by the number of incidents, in descending order.


val robberiesByLocation = sqlContext.sql(
  "SELECT LOCATION_DESCRIPTION, COUNT(*) AS RobberyCount FROM crimes " +
  "WHERE PRIMARY_DESCRIPTION = 'ROBBERY' " +
  "GROUP BY LOCATION_DESCRIPTION ORDER BY RobberyCount DESC")



The SQL query returns a collection of Row objects as the result. From the result set, you can access the individual columns (location and count of robberies) and print them to the console using the following code snippet:


robberiesByLocation
  .map(rob => "Location: " + rob(0) + " Robbery Count: " + rob(1))
  .collect()
  .foreach(println)



Figure 21.6 shows the results of the analysis. Most robberies seem to happen on sidewalks and streets; the fewest appear to happen at airports and police facilities.



FIGURE 21.6 Analyzing robbery-related crimes by location in the city of Chicago.

This scenario illustrates the integration of Spark SQL with functional programming. Spark SQL makes it easy to mix and match SQL queries with functional code, enabling users to work with the best of both worlds without having to switch platforms.



Summary

This hour explored the in-memory computing capabilities of Spark. Keeping intermediate working sets in memory can make a big difference in processing times. The hour also explored the Spark programming model. Spark makes use of resilient distributed datasets (RDDs) for fault tolerance. Instead of replicating data, fault tolerance is achieved by keeping track of the high-level parallel operators applied to RDDs. The hour concluded with a discussion of the SQL querying capabilities of Spark and examined how to blend SQL with functional programs.



Q&A

Q. Can Spark work without Apache Hadoop?

A. Yes. Spark can be run outside the Hadoop ecosystem. Spark works well in standalone mode, where it requires a shared file system for storage.

Q. How does Spark process datasets that don't fit in memory?

A. Optimal performance is achieved when a distributed dataset fits in memory on each node. When a dataset does not fit in memory, Spark uses external operators to process the data on disk.
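Related to this, you can influence spill behavior from user code by choosing a storage level when persisting an RDD. The following is a small sketch (reusing the errorsEntries RDD from the log-mining example earlier this hour) that keeps partitions in memory when possible and spills the rest to local disk:

import org.apache.spark.storage.StorageLevel

// Keep partitions in memory when possible; spill whatever does not fit to local disk.
errorsEntries.persist(StorageLevel.MEMORY_AND_DISK)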



Quiz

1. What are some scenarios in which Spark outperforms Hadoop MapReduce?

2. What is an RDD?



Answers

1. Spark outperforms MapReduce in scenarios that involve interactive querying and ad hoc data analysis, problems involving multiple iterations over a dataset (for example, graph processing and machine learning), and real-time stream-processing scenarios.

2. Resilient distributed datasets (RDDs) are distributed objects that can be cached in memory across cluster nodes. Instead of replicating data, Spark achieves fault tolerance by keeping track of the high-level parallel operators applied to RDDs. When a node fails, the RDD is recomputed on a different node by reapplying the sequence of operations to the affected subset of the data. More than 80 parallel operators can be applied to RDDs. Common ones include map, reduce, groupBy, filter, join, count, distinct, max, and min.


