Hour 20. Performing Statistical Computing with R


shortcut to launch the R console on your desktop.



Loading External Data

Knowledge of R data types is essential before you can start loading data in R for analysis. Understanding R data types is vital to selecting the most appropriate data type for loading and storing data.

The vector is the most basic object in R. A vector is a sequence of objects that belong to the same data type. The function c() defines vectors. The symbol <- is commonly used as the assignment operator in R. For example, the expression x <- c(1, 2, 3, 4) defines a vector of numeric values and assigns it to the variable x. Similarly, the statement c(TRUE, FALSE, FALSE) defines another vector of logical values.
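As a minimal sketch of the vector basics described above (the variable names flags and mixed are illustrative and do not come from the text):

```r
# Define a numeric vector with c() and assign it with <-
x <- c(1, 2, 3, 4)

# A logical vector built the same way
flags <- c(TRUE, FALSE, FALSE)

# Every element of a vector shares one data type
print(class(x))      # "numeric"
print(class(flags))  # "logical"

# Mixing types forces R to coerce everything to a common type
mixed <- c(1, "a", TRUE)
print(class(mixed))  # "character"
```

The last line shows why a vector is described as holding a single data type: R silently converts mixed input instead of storing it as-is.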

R also supports the creation of a two-dimensional object called a matrix. A matrix can be created in several different ways, for example with the matrix() function. For a matrix m, the expression m[1,1] <- 2 assigns a value of 2 to the first element of the matrix (the element at the intersection of the first row and the first column).

Vectors contain objects belonging to the same data type, but R also supports a special type of vector, called a list, that can contain objects of different classes. For example, a list x can hold elements of several types, with its second element of the numeric data type and its final element of the character data type.
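The following sketch shows one way to build a matrix and a list like those described above. The matrix dimensions and the list contents are illustrative assumptions, not values from the text.

```r
# Create an empty 2 x 3 matrix (filled with NA), then set the first element
m <- matrix(nrow = 2, ncol = 3)
m[1, 1] <- 2
print(dim(m))    # 2 3
print(m[1, 1])   # 2

# A list can hold elements of different classes
items <- list(TRUE, 42, "flight")
print(sapply(items, class))  # "logical" "numeric" "character"
```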

Data analysis generally is performed on structured data in tabular form, so R also supports the creation of data frames for storing tabular data. For example, the following R code snippet creates a simple data frame for storing flight data (flight number, origin city, and destination city):


flightNumber <- c(1, 2, 3)
originCity <- c("New York", "Chicago", "Los Angeles")
destinationCity <- c("Chicago", "Los Angeles", "New York")
flightData <- data.frame(flightNumber, originCity, destinationCity)



Figure 20.1 illustrates the creation of the flight data frame in the R console using this code snippet. The print function prints the flight data frame to the console.



FIGURE 20.1 Creating a flight data frame in R.

You can conveniently create data frames from CSV files using the read.csv() function. For example, the following code snippet reads the on-time performance of airline flights data into an R data frame named onTimePerfData:


# Backslashes in an R string must be doubled (or replaced with forward slashes)
onTimePerfData <- read.csv("C:\\Data\\On_Time_On_Time_Performance_2014_1.csv",
                           header = TRUE)



Note

The Research and Innovative Technology Administration, Bureau of Transportation Statistics (RITA) tracks the on-time performance of airline flights. This information is available for download and analysis at http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time. The previous code snippet uses the prezipped data file for January 2014.

The parameter header = TRUE in the read.csv function specifies that the first line of the CSV file contains the header information. R is case sensitive, so the parameter name must be written in lowercase.
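To see the effect of the header parameter without downloading the airline data, the sketch below writes a tiny CSV file to a temporary location and reads it back. The file contents and the variable names are made up for illustration.

```r
# Write a small CSV file whose first line holds the column names
csvPath <- tempfile(fileext = ".csv")
writeLines(c("FlightNumber,Origin,Dest",
             "1,JFK,ORD",
             "2,ORD,LAX"), csvPath)

# header = TRUE: the first line becomes the column names
withHeader <- read.csv(csvPath, header = TRUE)
print(names(withHeader))  # "FlightNumber" "Origin" "Dest"
print(nrow(withHeader))   # 2

# header = FALSE: the first line is treated as a data row
noHeader <- read.csv(csvPath, header = FALSE)
print(names(noHeader))    # "V1" "V2" "V3"
print(nrow(noHeader))     # 3
```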



Performing Rudimentary Data Analysis

After the data is loaded into memory from the CSV file, it is available for analysis. To get acquainted with the data set, the str function comes in handy. This function displays the structure of the data set. For example, the str(onTimePerfData) command displays such information as the number of observations (or rows), variables (or columns), and data types, along with sample values for each column (see Figure 20.2).



FIGURE 20.2 Using the str function to view the structure of the data frame.
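The same inspection works on any data frame. The sketch below applies str and a few related base R functions to a small data frame shaped like the flight example from earlier in this hour, so the on-time performance file itself is not needed.

```r
# Rebuild a small flight data frame like the one shown earlier
flightData <- data.frame(
  flightNumber    = c(1, 2, 3),
  originCity      = c("New York", "Chicago", "Los Angeles"),
  destinationCity = c("Chicago", "Los Angeles", "New York")
)

# str() reports the number of observations, the variables, and their types
str(flightData)

# Related helpers for a quick first look
print(nrow(flightData))     # number of rows: 3
print(ncol(flightData))     # number of columns: 3
print(head(flightData, 2))  # first two rows
```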

R also has good graphics capabilities. ggplot2 is a popular visualization package in the R community. Its quick plot (qplot) function is a good starting point for first-time R users and provides most of the graphics capabilities, with a simpler syntax. The following code snippet installs the ggplot2 package and plots a bar chart showing the number of flights by carrier, stacked by cancellation status (see Figure 20.3):


install.packages("ggplot2")
library(ggplot2)
qplot(x = onTimePerfData$UniqueCarrier, data = onTimePerfData,
      fill = factor(onTimePerfData$Cancelled),
      xlab = "Carrier", ylab = "Flight Count") +
  labs(fill = "Flt Cancelled (1=Yes)")



FIGURE 20.3 Using graphics in R.
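A stacked bar chart is a picture of a simple cross-tabulation. As a sketch in base R (no ggplot2 needed), the counts behind such a chart can be computed with table(). The small data frame below is invented for illustration and only mimics the UniqueCarrier and Cancelled columns.

```r
# Invented sample: six flights across two carriers, 1 = cancelled
sampleFlights <- data.frame(
  UniqueCarrier = c("AA", "AA", "AA", "DL", "DL", "DL"),
  Cancelled     = c(0, 0, 1, 0, 1, 1)
)

# Rows are carriers, columns are cancellation flags; each cell is a flight count
counts <- table(sampleFlights$UniqueCarrier, sampleFlights$Cancelled)
print(counts)

# Total flights per carrier (the full height of each stacked bar)
print(rowSums(counts))
```

Checking the table first is a cheap way to confirm that a chart shows what you expect before plotting a large data set.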



Integrating R with Hadoop

Because R fetches data into memory, data analysis capabilities are limited by the amount of memory available. Furthermore, the central idea of Hadoop has been to move computing to data, not data to computing. Loading data persisted elsewhere into memory doesn't fit well with this core principle. This makes R unsuitable for Big Data analytics.

With these limitations, why is there so much interest in enabling R on Hadoop? Why is R not being replaced by any other component from the Hadoop ecosystem for Big Data analytics?

The simple answer to the question is the overwhelming popularity of R among data scientists and statistical researchers. These groups of users are comfortable with R and will not easily switch to a different platform. More than 5,000 R packages are available on CRAN, contributed by a strong R user community. Furthermore, R is unmatched when it comes to performing statistical computations and data visualization.

Enabling R on Hadoop also provides the following benefits:

Overcoming the memory limitation enables users to run statistical computations on full data sets instead of sample data.

Distributed statistical computations can be run from within the R environment.



Enabling R on HDInsight

You can make R work with Big Data in a couple of ways. A naive approach is to integrate R with Hadoop MapReduce via Hadoop streaming. Hadoop streaming, introduced in Hour 9, "Working with Microsoft Azure HDInsight Emulator," enables developers to create MapReduce programs in programming languages other than Java. However, writing MapReduce in R is not a trivial task for data scientists.

RHadoop aims to make the process easier by providing a collection of R packages with application programming interfaces (APIs) for Hadoop integration. It includes the following R packages, among others:

The rmr package provides APIs for Hadoop MapReduce functionality in R.

The rhdfs package provides functions for file management on HDFS/Azure blob storage.

The rhbase package provides functions for interaction with HBase.

The rmr package enables R users to elegantly work with MapReduce programming from within the familiar R console. The mapreduce() function defined in the rmr package is the fundamental function for writing simplified MapReduce programs in R using RHadoop.

The function expects an input data set, along with mapper and reducer function definitions, to be passed in as arguments. Later sections cover RHadoop in more detail.



Installing R on HDInsight

The custom Script Action feature introduced in Hour 11, "Customizing HDInsight Cluster with Script Action," enables you to customize an HDInsight cluster and install R. Microsoft provides a sample R installation script at https://hdiconfigactions.blob.core.windows.net/rconfigactionv02/r-installer-v02.ps1. To install R on Hadoop, provision a new HDInsight cluster with the Custom Create option from the Windows Azure management portal. On the Script Actions screen, provide the URL of the R installation script (see Figure 20.4).



FIGURE 20.4 Installing R on an HDInsight cluster.

The script installs R version 3.1.1, the rmr2_3.1.2 package, and the rhdfs_1.0.8 package on all nodes of the cluster.



Using R with HDInsight

As discussed earlier, the mapreduce() function defined in the rmr package is the fundamental function for writing simplified MapReduce programs in R using RHadoop. The following code snippet describes the signature of the mapreduce function:


mapreduce(input, input.format, output, output.format, map, reduce)



The function expects an input data set, along with mapper and reducer function definitions, to be passed in as arguments. Specifying the output location and output format parameters to the mapreduce function is optional. When these parameters are not specified, the output data set is stored in a temporary location on the default file system; you can obtain a reference to this location by assigning the return value of the function to a variable, using the following syntax:


output <- mapreduce(input, input.format, map, reduce)



The output data set is read from the output location using the from.dfs function. The function reads the data set that the output variable refers to from the file system into memory; when it is called at the R console, the result is printed to the console. Listing 20.1 shows this in action. The mapreduce function provides a reference to the output data set stored in a temporary location on the default file system.

LISTING 20.1 Using the rmr Package to Determine Flights, by Carrier




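library(rmr2)

# Count flights per carrier; the result is a reference to the job output
output <- mapreduce(
  input = "/OnTimePerformance/Data/OTP_Without_Header.csv",
  # Describe the input as comma-separated text and name its columns
  input.format = make.input.format(
    format = 'csv', mode = 'text', streaming.format = NULL, sep = ',',
    col.names = c('Year', 'FlightDate', 'UniqueCarrier', 'TailNum', 'Origin', 'Dest',
                  'CRSDepTime', 'DepTime', 'DepDelay', 'DepDelayMinutes', 'DepDel15',
                  'ArrDelay', 'ArrDelayMinutes', 'ArrDel15'),
    stringsAsFactors = FALSE),
  # Map: emit (carrier, 1) for every flight record
  map = function(k, fields) {
    keyval(fields$UniqueCarrier, 1)
  },
  # Reduce: the number of values per carrier is the flight count
  reduce = function(carrier, vv) {
    keyval(carrier, length(vv))
  }
)

# Read the job output into memory
from.dfs(output)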



To understand the steps involved in using rmr2, consider the following scenario that determines the number of flights, by unique carrier, from a data file of flight information. Figure 20.5 illustrates the structure of the flight data file.

FIGURE 20.5 Flight data file.

Tip

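Although the figure shows the CSV file with a header, the input format defined in Listing 20.1 treats every line of the file as a data record, so a header line would be counted as a flight. You must remove the header from the CSV file before executing the rmr2 workload.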
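One way to strip the header line before uploading the file is sketched below in base R. The file paths are temporary placeholders, and the short sample stands in for the real flight data.

```r
# Placeholder input: a CSV file whose first line is a header
withHeaderPath <- tempfile(fileext = ".csv")
writeLines(c("Year,FlightDate,UniqueCarrier",
             "2014,2014-01-01,AA",
             "2014,2014-01-01,DL"), withHeaderPath)

# Read all lines, drop the first one, and write the rest to a new file
allLines <- readLines(withHeaderPath)
withoutHeaderPath <- tempfile(fileext = ".csv")
writeLines(allLines[-1], withoutHeaderPath)

print(readLines(withoutHeaderPath))
```

Note that readLines loads the whole file into memory, so this approach suits files that fit comfortably in RAM; for larger files, a streaming tool may be a better choice.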

Listing 20.1 provides the source code for using the mapreduce function defined in the rmr2 package to calculate the flight count, by UniqueCarrier. These parameters are passed to the function:

The first parameter to the function specifies the location of the input file in blob storage.

The second parameter specifies the format of the input file, indicating that it's a CSV file, using a comma as the field separator. Column names are also specified.

The third parameter specifies the map function. This function uses the keyval function to create key-value pairs (with a unique carrier as the key and 1 as the value) that are passed on to the reduce function.

The fourth parameter specifies the reduce function. This function counts the number of elements in the value vector (length of the value vector), grouped by unique carrier. It is returned as the final output and stored in a temporary location on the default file system.

The output variable stores a reference to the temporary default file system location that contains the final output. The from.dfs function reads the output variable into memory from the file system and prints the result to the R console.
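To make the map and reduce steps concrete without a cluster, the following sketch imitates them in plain base R on a handful of invented records. It is not rmr2 code: mapStep, reduceStep, and the sample data are hypothetical stand-ins, and the grouping that Hadoop performs between the two phases is approximated here with split().

```r
# Invented input records; only the carrier column matters for this count
fields <- data.frame(
  UniqueCarrier = c("AA", "DL", "AA", "UA", "DL", "AA"),
  stringsAsFactors = FALSE
)

# Map: emit one (carrier, 1) pair per input record
mapStep <- function(fields) {
  data.frame(key = fields$UniqueCarrier, val = 1, stringsAsFactors = FALSE)
}

# Reduce: for one carrier, the flight count is the length of its value vector
reduceStep <- function(carrier, vv) {
  length(vv)
}

pairs <- mapStep(fields)

# Shuffle: group the emitted values by key, as Hadoop does between phases
grouped <- split(pairs$val, pairs$key)

# Apply the reducer to every group
flightCounts <- mapply(reduceStep, names(grouped), grouped)
print(flightCounts)
```

Because every emitted value is 1, taking the length of the value vector and summing it give the same count; the listing uses the length.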

To execute the code in Listing 20.1 on an HDInsight cluster, use a remote desktop connection to log in to the name node of the cluster and launch the R console using the R console launcher icon on the desktop (see Figure 20.6).



FIGURE 20.6 Launching the R console from the name node.

Paste the code in Listing 20.1 into the R console (see Figure 20.7). This triggers a Hadoop streaming MapReduce job from within the R console.



FIGURE 20.7 Using the rmr2 package with HDInsight.

When the job completes, you can print the final output to the R console by using the from.dfs function (see Figure 20.8).

FIGURE 20.8 Displaying MapReduce output in the R console.



Summary

This hour explored integration of the R programming language, used for statistical computing, with Hadoop. You can set up R on an HDInsight cluster using the Script Action feature. RHadoop provides packages to integrate R with Hadoop. The rmr2 package simplifies the process of writing MapReduce programs using R.



Q&A

Q. Is it possible to program a map-only job using rmr?

A. Yes, specifying the reduce function argument in the mapreduce function is optional. Leaving out the reduce argument makes the job map only.

Q. How can I get help on issues related to R and RHadoop?

A. Subscribing to R mailing lists is the best way to discuss related issues and get help. You can find more details on the R mailing lists at http://www.r-project.org/mail.html. For RHadoop, the RHadoop Google group https://groups.google.com/forum/#!forum/rhadoop is a public forum for discussions and questions. Apart from these resources, you can post related questions on Stack Overflow. HDInsight-specific issues and questions can also be posted on the MSDN forum for HDInsight at https://social.msdn.microsoft.com/forums/azure/en-US/home?forum=hdinsight.


