# Hour 20. Performing Statistical Computing with R

[…] shortcut to launch the R console on your desktop.

[…] and storing data.

The vector is the most basic object in R. A vector is a sequence of objects that belong to the same data type. The function c() defines vectors. The symbol <- is commonly used as the assignment operator in R. For example, the expression x <- c(1, 2, 3, 4) defines a vector of numeric values and assigns it to the variable x. Similarly, the statement c(TRUE, FALSE, FALSE) defines another vector of logical values.
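The vector behavior described above can be tried directly in the R console. The following is a minimal sketch; the variable names flags and mixed are illustrative and do not come from the text.

```r
# Numeric vector created with c() and assigned with <-
x <- c(1, 2, 3, 4)

# Logical vector
flags <- c(TRUE, FALSE, FALSE)

# All elements of a vector share one data type
class(x)
class(flags)

# Mixing types in c() coerces every element to a common type
mixed <- c(1, "a", TRUE)
class(mixed)

# Vectors are indexed starting at 1
length(x)
x[2]
```

Because a vector holds a single data type, combining a number, a string, and a logical value produces a character vector rather than an error.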

R also supports the creation of a two-dimensional object called a matrix. A matrix can be created in several different ways. As one example, the expression m […] expression m[1,1] <- 2 assigns a value of 2 to the first element of the matrix (the element at the intersection of the first row and the first column).
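As a rough illustration of matrix creation and element assignment, the sketch below uses the base R functions matrix() and rbind(). The dimensions and values are assumptions chosen for the example, not necessarily the ones the text had in mind.

```r
# Create an empty 2 x 2 matrix (filled with NA), then assign by [row, column]
m <- matrix(nrow = 2, ncol = 2)
m[1, 1] <- 2
dim(m)

# A matrix can also be built from a vector; values fill column by column
m2 <- matrix(1:6, nrow = 2, ncol = 3)
m2[1, 2]

# Or by binding vectors together as rows
m3 <- rbind(c(1, 2), c(3, 4))
m3[2, 1]
```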

Vectors contain objects belonging to the same data type, but R also supports a special type of vector, called a list, that can contain objects of different classes. For example, x […] second element of the numeric data type, and its final element of the character data type.
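A short sketch of a list holding elements of different classes follows. The specific values, and the named list flight, are hypothetical and serve only to show the idea.

```r
# A list may mix classes: logical, numeric, and character here
x <- list(TRUE, 2.5, "a")

# Double brackets extract a single element with its own class
class(x[[1]])
class(x[[2]])
class(x[[3]])

# List elements can also be named and accessed with $
flight <- list(number = 1, origin = "New York", onTime = TRUE)
flight$origin
```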

Data analysis generally is performed on structured data in tabular form, so R also supports the creation of data frames for storing tabular data. For example, the following R code snippet creates a simple data frame for storing flight data (flight number, origin city, and destination city):

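# One vector per column
flightNumber <- c(1, 2, 3)
originCity <- c("New York", "Chicago", "Los Angeles")
destinationCity <- c("Chicago", "Los Angeles", "New York")
# Combine the vectors into a data frame and print it to the console
flightData <- data.frame(flightNumber, originCity, destinationCity)
print(flightData)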

Figure 20.1 illustrates the creation of the flight data frame in the R console using this code snippet. The print function prints the flight data frame to the console.

FIGURE 20.1 Creating a flight data frame in R.

[…] airline flights data into an R data frame named onTimePerfData:

Note

[…] Transportation Statistics (RITA) tracks the on-time performance of airline […] http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time. The previous code snippet uses the prezipped data file for January 2014.
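Delimited files are commonly loaded into a data frame with the base R function read.csv. The following is a minimal sketch under stated assumptions: it writes a tiny stand-in file to a temporary location, and the column names and values are hypothetical rather than taken from the actual on-time performance download.

```r
# Write a small stand-in CSV file (hypothetical columns and values)
csvPath <- tempfile(fileext = ".csv")
writeLines(c(
  "FlightDate,UniqueCarrier,Origin,Dest,Cancelled",
  "2014-01-01,AA,JFK,LAX,0",
  "2014-01-01,DL,ATL,ORD,1"
), csvPath)

# Read the file into a data frame; keep text columns as character vectors
onTimePerfData <- read.csv(csvPath, header = TRUE, stringsAsFactors = FALSE)

nrow(onTimePerfData)
names(onTimePerfData)
```

With a real download, csvPath would point at the extracted CSV file instead of a temporary one.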

Performing Rudimentary Data Analysis

[…] acquainted with the data set, the str function comes in handy. This function displays the structure of the data set. For example, the str(onTimePerfData) command displays such information as the number of observations (or rows), variables (or columns), and data types, along with sample values for each column (see Figure 20.2).

FIGURE 20.2 Using the str function to view the structure of the data frame.
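The same inspection can be tried on a small data frame. This sketch rebuilds the three-row flight data frame from earlier in the hour so that it runs on its own; nrow, ncol, and summary are standard base R functions shown here as companions to str.

```r
flightData <- data.frame(
  flightNumber = c(1, 2, 3),
  originCity = c("New York", "Chicago", "Los Angeles"),
  destinationCity = c("Chicago", "Los Angeles", "New York"),
  stringsAsFactors = FALSE
)

# Show the number of observations and variables, each column's type,
# and the first few values of each column
str(flightData)

# The row and column counts are also available individually
nrow(flightData)
ncol(flightData)

# Basic descriptive statistics for a numeric column
summary(flightData$flightNumber)
```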

R also has good graphics capabilities. ggplot2 is a popular visualization package in the R community. Its quick plot (qplot) function is a good starting point for first-time R users and provides most of the graphics capabilities, with a simpler syntax. The following code snippet installs the ggplot2 package and plots a bar chart showing the number of flights by carrier, stacked by cancellation code (see Figure 20.3):

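# Install and load the ggplot2 package
install.packages("ggplot2")
library(ggplot2)
# Bar chart of flight counts by carrier, stacked by the Cancelled flag
qplot(x = onTimePerfData$UniqueCarrier,
      data = onTimePerfData, fill = factor(onTimePerfData$Cancelled), xlab = "Carrier",
      ylab = "Flight Count") + labs(fill = "Flt Cancelled (1=Yes)")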

FIGURE 20.3 Using graphics in R.
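The counts behind such a stacked bar chart can also be computed with base R alone. The sketch below swaps in table() and barplot() in place of ggplot2 and uses a small made-up data frame (flights), so the carriers and values are assumptions for illustration only.

```r
# Stand-in flight records (hypothetical values)
flights <- data.frame(
  UniqueCarrier = c("AA", "AA", "DL", "DL", "DL", "UA"),
  Cancelled = c(0, 1, 0, 0, 1, 0)
)

# Cross-tabulate: one row per cancellation flag, one column per carrier
counts <- table(Cancelled = flights$Cancelled, Carrier = flights$UniqueCarrier)
counts

# Each column becomes a bar; the rows are stacked within it
barplot(counts, xlab = "Carrier", ylab = "Flight Count", legend.text = TRUE)
```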

Because R fetches data into memory, data analysis capabilities are limited by the amount […] doesn’t fit well with this core principle. This makes R unsuitable for Big Data analytics.

[…] analytics?

[…] scientists and statistical researchers. These groups of users are comfortable with R and […] CRAN, contributed by a strong R user community. Furthermore, R is unmatched when it comes to performing statistical computations and data visualization.

Overcoming the memory limitation enables users to run statistical computations on […] Distributed statistical computations can be run from within the R environment.

Enabling R on HDInsight

You can make R work with Big Data in a couple of ways. A naive approach is to integrate R […] “Working with Microsoft Azure HDInsight Emulator,” enables developers to create MapReduce programs in programming languages other than Java. However, writing […] following R packages, among others:

The rhdfs package provides functions for file management on HDFS/Azure blob storage.

The rhbase package provides functions for interaction with HBase.

The rmr package enables R users to elegantly work with MapReduce programming from within the familiar R console. The mapreduce() function defined in the rmr package is the fundamental function for writing simplified MapReduce programs in R using […] The function expects an input data set and also mapper and reducer function definitions to be […]

Installing R on HDInsight

The custom Script Action feature introduced in Hour 11, “Customizing HDInsight Cluster with Script Action,” enables you to customize an HDInsight cluster and install R. Microsoft provides a sample R installation script at https://hdiconfigactions.blob.core.windows.net/rconfigactionv02/r-installer-v02.ps1. To […] from the Windows Azure management portal. On the Script Actions screen, provide the URL of the R installation script (see Figure 20.4).

FIGURE 20.4 Installing R on an HDInsight cluster.

The script installs R version 3.1.1, the rmr2_3.1.2 package, and the rhdfs_1.0.8 package on all nodes of the cluster.

Using R with HDInsight

As discussed earlier, the mapreduce() function defined in the rmr package is the […] The following code snippet describes the signature of the MapReduce function:

mapreduce(input, input.format, output, output.format, map, reduce)

The function expects an input data set and also mapper and reducer function definitions to be passed in as arguments. Specifying the output location and output format parameters to the mapreduce function is optional. When these parameters are not specified, the output data set is stored in a temporary location on the default file system; you can obtain a reference to this location using the following syntax:

output <- mapreduce(input, input.format, output, output.format, map, reduce)

[…] the R console. Listing 20.1 shows this in action. The mapreduce function provides a reference to the output data set stored in a temporary location on the default file system.

LISTING 20.1 Using the rmr Package to Determine Flights, by Carrier

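library(rmr2)

# inputPath is a placeholder: the location of the flight data file in blob
# storage does not appear in this listing.
output <- mapreduce(
  input = inputPath,
  # Describe the input as comma-separated text with named columns
  input.format = make.input.format(
    format = 'csv', mode = 'text', streaming.format = NULL, sep = ',',
    col.names = c('Year', 'FlightDate', 'UniqueCarrier', 'TailNum', 'Origin', 'Dest',
                  'CRSDepTime', 'DepTime', 'DepDelay', 'DepDelayMinutes', 'DepDel15',
                  'ArrDelay', 'ArrDelayMinutes', 'ArrDel15'),
    stringsAsFactors = FALSE),
  # Map: emit a (carrier, 1) pair for each flight record
  map = function(k, fields) {
    keyval(fields$UniqueCarrier, 1)
  },
  # Reduce: count the values collected for each carrier
  reduce = function(carrier, vv) {
    keyval(carrier, length(vv))
  }
)

# Read the job output from the file system into memory and print it
from.dfs(output)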

To understand the steps involved in using rmr2, consider the following scenario that […] Figure 20.5 illustrates the structure of the flight data file.

FIGURE 20.5 Flight data file.

Tip

Listing 20.1 provides the source code for using the mapreduce function defined in the rmr2 package to calculate the flight count, by UniqueCarrier. These parameters are passed to the function:

The first parameter to the function specifies the location of the input file in blob storage.

The second parameter specifies the format of the input file, indicating that it’s a CSV file, using a comma as the field separator. Column names are also specified.

The third parameter specifies the map function. This function uses the keyval function to create key-value pairs (with a unique carrier as the key and 1 as the value) that are passed on to the reduce function.

The fourth parameter specifies the reduce function. This function counts the number of elements in the value vector (length of the value vector), grouped by unique carrier. It is returned as the final output and stored in a temporary location on the default file system.

The output variable stores a reference to the temporary default file system […] variable into memory from the file system and prints the result to the R console.
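The map, group, and reduce flow described above can be imitated in plain R without a cluster. The sketch below is an illustration of the logic only: it does not use rmr2, the helper names mapFlights and reduceFlights are hypothetical, and the carrier values are made up.

```r
# Stand-in for the parsed input records (hypothetical values)
fields <- data.frame(
  UniqueCarrier = c("AA", "DL", "AA", "UA", "DL", "AA"),
  stringsAsFactors = FALSE
)

# Map step: emit one (carrier, 1) pair per flight record
mapFlights <- function(fields) {
  list(key = fields$UniqueCarrier, val = rep(1, nrow(fields)))
}

# Reduce step: count the values collected for a single carrier
reduceFlights <- function(carrier, vv) {
  list(key = carrier, val = length(vv))
}

pairs <- mapFlights(fields)

# Shuffle step: group the emitted values by key
grouped <- split(pairs$val, pairs$key)

# Apply the reducer once per key and collect the counts
reduced <- Map(reduceFlights, names(grouped), grouped)
flightCounts <- vapply(reduced, function(kv) kv$val, integer(1))
flightCounts
```

In a real job, the framework performs the grouping between the map and reduce steps across the nodes of the cluster; here split() stands in for that step on a single machine.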

To execute the code in Listing 20.1 on an HDInsight cluster, use a remote desktop […] using the R console launcher icon on the desktop (see Figure 20.6).

FIGURE 20.6 Launching the R console from the name node.

[…] streaming MapReduce job from within the R console.

FIGURE 20.7 Using the rmr2 package with HDInsight.

When the job completes, you can print the final output to the R console by using the from.dfs function (see Figure 20.8).

FIGURE 20.8 Displaying MapReduce output in the R console.

Summary

This hour explored integration of the R programming language, used for statistical […] package simplifies the process of writing MapReduce programs using R.

Q&A

Q. Is it possible to program a map-only job using rmr?

A. Yes, specifying the reduce function argument in the mapreduce function is optional. Leaving out the reduce argument makes the job map only.

A. Subscribing to R mailing lists is the best way to discuss related issues and get help. […] and questions. Apart from these resources, you can post related questions on Stack Overflow. HDInsight-specific issues and questions can also be posted on the MSDN forum for HDInsight at https://social.msdn.microsoft.com/forums/azure/en-US/home?forum=hdinsight.
