Hour 13. Programming with Apache Hive, Apache Tez in HDInsight, and Apache HCatalog


available Azure Storage Explorer tools
(http://blogs.msdn.com/b/windowsazurestorage/archive/2014/03/11/windows-azurestorage-explorers-2014.aspx).

Note

For demonstration purposes, we used the CloudXplorer tool here to upload these files to the folder /OnTimePerformance/Data/ (see Figure 13.1).

FIGURE 13.1 Uploading on-time performance flight data to Azure Storage Blob.

CloudXplorer has a simple and intuitive Windows File Explorer-like interface to assist you in exploring your Windows Azure storage (it supports copy and paste, drag and drop, and so on). You can download an evaluation edition of this tool here: http://clumsyleaf.com/products/cloudxplorer.

Download lookup data related to airline codes and descriptions from http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_AIRLINE_ID. Then upload it to the folder /OnTimePerformance/AirlineLookup/ in the Azure Storage Blob (see Figure 13.2).

FIGURE 13.2 Uploading lookup data to Azure Storage Blob.

Download lookup data related to airport codes and descriptions from http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_AIRPORT_ID. Then upload it to the folder /OnTimePerformance/AirportLookup/ in the Azure Storage Blob (see Figure 13.2).

Download lookup data related to flight cancellation codes and descriptions from http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_CANCELLATION. Then upload it to the folder /OnTimePerformance/CancellationLookup/ in the Azure Storage Blob (see Figure 13.2).

The HDInsight cluster in Azure leverages Azure Storage Blob as the default file system for data storage, whereas the HDInsight emulator uses HDFS on the local disk as the default file system.



Running Examples on HDInsight Emulator

All examples in this hour have been tested on an Azure HDInsight cluster and should run there as is.

We recommend that you use the Azure HDInsight cluster service whenever possible so you have access to all the features of HDInsight.

GO TO Refer to Hour 6 to review how to provision an HDInsight cluster easily and quickly.

Remember to delete the cluster whenever you are not using it, to save on cost. You can provision it again when you need it. In the meantime, store your data on the default Azure Storage Blob and use a SQL database as the Hive metastore so you don't lose anything after cluster deletion.

Try It Yourself: Running Flight Performance Data on HDInsight Emulator

If you are running examples from this hour on HDInsight Emulator, perform the following steps to create the required folders and copy files to the local HDFS:

1. Download and copy the on-time flight performance data files (using the link provided earlier in this section) to the C:\OnTimePerformance\Data folder on the machine where you have HDInsight emulator running.

2. Download and copy the file L_AIRLINE_ID.csv (using the link provided earlier in this section) to the C:\OnTimePerformance\AirlineLookup folder on the machine where you have HDInsight emulator running.

3. Download and copy the file L_AIRPORT_ID.csv (using the link provided earlier in this section) to the C:\OnTimePerformance\AirportLookup folder on the machine where you have HDInsight emulator running.

4. Download and copy the file L_CANCELLATION.csv (using the link provided earlier in this section) to the C:\OnTimePerformance\CancellationLookup folder on the machine where you have HDInsight emulator running.

5. Download the CSV SerDe file from http://ogrodnek.github.io/csv-serde/ and copy it to the C:\OnTimePerformance folder on the machine where you have HDInsight emulator running.

6. Now execute these commands at the Hadoop command prompt to upload these files appropriately to HDFS:

hadoop fs -mkdir /OnTimePerformance
hadoop fs -copyFromLocal C:\OnTimePerformance\AirlineLookup /OnTimePerformance/
hadoop fs -copyFromLocal C:\OnTimePerformance\AirportLookup /OnTimePerformance/
hadoop fs -copyFromLocal C:\OnTimePerformance\CancellationLookup /OnTimePerformance/
hadoop fs -copyFromLocal C:\OnTimePerformance\Data /OnTimePerformance/
hadoop fs -copyFromLocal C:\OnTimePerformance\csv-serde-1.1.2.jar /OnTimePerformance/



7. Replace ADD JAR wasb:///OnTimePerformance/csv-serde-1.1.2.jar; with ADD JAR /OnTimePerformance/csv-serde-1.1.2.jar; wherever it is referenced in the scripts provided in this hour.

8. Copy the example folder from this hour's content folder to the C: drive of the machine where you have HDInsight emulator running.

9. Create these folders in HDFS if they are not already available:

hadoop fs -mkdir /example/data/internaldemo01
hadoop fs -mkdir /example/data/internaldemo02
hadoop fs -mkdir /example/data/externaldemo01



10. Copy these files for internal and external table examples:

hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/internaldemo01/sample01.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/internaldemo01/sample02.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/internaldemo02/sample01.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/internaldemo02/sample02.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/internaldemo02/sample03.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/externaldemo01/sample01.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/externaldemo01/sample02.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/externaldemo01/sample03.log
hadoop fs -copyFromLocal c:\example\data\sample.log /example/data/externaldemo01/sample04.log



Comparison with RDBMS Databases

Apache Hive includes HiveQL, a declarative query language much like the Structured Query Language (SQL) of an RDBMS. It has some differences from an RDBMS, however:

When you create a Hive table, the data lives in unstructured files (unlike the structured table storage of an RDBMS).

CREATE TABLE in Hive provides a way to give structure to the unstructured data stored in these data files on WASB or the local HDFS.

An external table in Hive provides a relational view on existing files in either WASB or the local HDFS. Basically, it references the files and does not create another copy of the data or control it (more on this later in this hour).

Data for a table in Hive can be stored either as text files or as sequence files. When storing data in a file, you can choose to separate values with delimiters or to use a custom serializer/deserializer (more on this later in this hour).

Like a relational table, a Hive table can be partitioned, clustered, and sorted.

Hive enables you to define indexes and maintains limited statistics (only file size). Hence, it supports limited optimizations only (such as partition elimination).

Hive supports inserts only, with no updates and no transaction support (however, as of this writing, upcoming Apache Hive releases are expected to offer these features).

Hive is designed for batch execution. (With Apache Tez, interactive execution of queries is possible, although it is still not comparable to the interactive response time that RDBMS systems provide.)

HiveQL is limited to what can be executed using MapReduce jobs (limited joins).
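
To make the points about partitioning, clustering, sorting, and partition elimination more concrete, here is a minimal HiveQL sketch. The table name flights_demo and all of its columns are hypothetical, chosen only for illustration; they are not part of this hour's scripts.

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE IF NOT EXISTS flights_demo
(
    carrier   STRING,
    origin    STRING,
    dep_delay INT
)
PARTITIONED BY (flight_year INT)
CLUSTERED BY (carrier) SORTED BY (carrier ASC) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Filtering on the partition column lets Hive skip the folders
-- of every other partition (partition elimination).
SELECT carrier, COUNT(*) AS flights
FROM flights_demo
WHERE flight_year = 2014
GROUP BY carrier;
```

Each value of flight_year is stored in its own subfolder under the table's directory, which is why a filter on that column can avoid reading the rest of the data.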



Database or Schema

Apache Hive includes a default database where you can create objects. You can also create your own databases for better segregation, to avoid table name collisions, or for better manageability. This is especially relevant if you are working on multiple applications and you need to create an application-specific database or schema. Apache Hive includes commands to create, alter, and drop databases whenever needed. For example, with the help of the CREATE DATABASE command, you can create a database named airlinesdb:

CREATE DATABASE IF NOT EXISTS airlinesdb
COMMENT 'This database or namespace or schema contains all the tables related to on-time-performance flight information';



Note the use of the optional IF NOT EXISTS keyword. When specified, it creates the database only if a database with the same name does not exist already. Also, by default, the root directory for the database is set as /hive/warehouse/<database name>.db. If you want a different folder location, use the LOCATION clause when creating the database.
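
As a minimal sketch of the LOCATION clause, the following statement creates a database whose root directory is outside the default warehouse folder. The database name airlinesdb_custom and the path /OnTimePerformance/hive/airlinesdb_custom.db are hypothetical, used only for illustration; substitute your own.

```sql
-- Hypothetical database name and path, used only to illustrate LOCATION.
CREATE DATABASE IF NOT EXISTS airlinesdb_custom
COMMENT 'Demo database stored outside the default warehouse folder'
LOCATION '/OnTimePerformance/hive/airlinesdb_custom.db';
```

Tables created in this database without their own LOCATION clause are then placed under that folder rather than under /hive/warehouse/.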

The SHOW DATABASES command shows all the databases available in Hive:

SHOW DATABASES;

If you want to filter databases based on some pattern, you can use the LIKE clause. For example, this command lists all the databases whose names start with air:

SHOW DATABASES LIKE 'air*';



Although the ALTER DATABASE command exists in Hive, its role is limited to modifying database properties and the database owner. You cannot modify other properties, such as the database name and location.

ALTER DATABASE airlinesdb SET DBPROPERTIES ('CreatedBy' = 'Arshad', 'ModifiedBy' = 'Manpreet');



The DESCRIBE DATABASE command shows metadata information about the database, such as its name, comment (if specified while creating it), root directory location on the WASB or HDFS where data for tables will be stored, and database creator.

DESCRIBE DATABASE airlinesdb;
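
If you also want to see the key/value pairs set with SET DBPROPERTIES, Hive accepts an optional EXTENDED keyword on this command. The following is a small sketch that assumes the airlinesdb database exists and has had properties set on it:

```sql
-- EXTENDED adds the database properties (for example, CreatedBy and
-- ModifiedBy) to the regular DESCRIBE DATABASE output.
DESCRIBE DATABASE EXTENDED airlinesdb;
```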



You can use the DROP DATABASE command to delete a database. The IF EXISTS clause is optional; it drops the database only if it exists. If you don't specify the IF EXISTS clause and you try to delete a database that doesn't exist, you get the error FAILED: SemanticException [Error 10072]: Database does not exist: <database name>:

DROP DATABASE IF EXISTS airlinesdb;



By default, the DROP DATABASE command uses the RESTRICT clause, which means that it will drop the database only if it has no tables. If you want to drop a database by first deleting its tables and then deleting the database, you need to use the CASCADE clause:

DROP DATABASE IF EXISTS airlinesdb CASCADE;



Figure 13.3 shows the execution result of these commands on the Hive command-line interface (CLI).

FIGURE 13.3 Execution result for database management scripts.

GO TO The scripts in Figure 13.3 can be executed using any method of Hive query execution, as discussed in Hour 12, "Getting Started with Apache Hive and Apache Tez in HDInsight."

We discussed how you can create databases in your cluster. If you have multiple databases, you must switch the database context so that you work on the right set of tables in the right database. The USE command changes the database context:

USE airlinesdb;
SET hive.cli.print.current.db=true;
SHOW TABLES;
USE default;
SHOW TABLES;
SET hive.cli.print.current.db=false;



When you are working on the Hive command-line interface, you might find it confusing to work with multiple databases. In this case, you can use the SET hive.cli.print.current.db=true; command to include your current database name as part of the Hive prompt (see Figure 13.4). You can use the SET hive.cli.print.current.db=false; command to revert this setting.

FIGURE 13.4 Execution result of scripts for database usage.

Note: Two-Part Naming Convention

Instead of using USE to change the context or working database, you can use a two-part naming convention such as <database name>.<table name>. This convention works perfectly fine in Hive.
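
A short sketch of the two-part convention follows. The table name some_table is hypothetical; replace it with a table that actually exists in your airlinesdb database.

```sql
-- Stay in the default database but query a table in airlinesdb.
-- some_table is a hypothetical name used only for illustration.
USE default;
SELECT * FROM airlinesdb.some_table LIMIT 10;

-- SHOW TABLES can likewise target another database without switching context.
SHOW TABLES IN airlinesdb;
```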

Note: Database Versus Schema

Unlike RDBMS, in which a schema is an object inside a database, SCHEMA and DATABASE are interchangeable in Hive; they both refer to the same thing. This means that the commands CREATE DATABASE adventureworks and CREATE SCHEMA adventureworks are the same. If a database with the adventureworks name already exists and you try executing CREATE SCHEMA adventureworks, the command will throw an error.
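
The equivalence can be checked with a few statements. This sketch uses the adventureworks name from the note and adds IF NOT EXISTS and IF EXISTS so that it does not fail when rerun:

```sql
-- Created through the SCHEMA keyword...
CREATE SCHEMA IF NOT EXISTS adventureworks;

-- ...but listed by the DATABASE form of the command.
SHOW DATABASES LIKE 'adv*';

-- Either keyword can be used to drop it again.
DROP SCHEMA IF EXISTS adventureworks;
```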



Using Tables in Hive

Apache Hive enables you to create two types of tables: internal (or managed) tables, so called because they are actually managed by Hive, and external tables. The type of table you create depends on where the data should reside and how it should be managed and controlled when the table is deleted.

The next sections discuss internal and external tables.



Internal Table

When you create an internal or managed table, Hive manages the data. This means that, at the time data is loaded, Hive moves data from the source files (or copies it, when you load from the local file system with LOAD DATA LOCAL INPATH) to a subdirectory of the root directory of the database (by default, location /hive/warehouse/). Each table has a subdirectory with the same name as the table.

Tip

Specifying a different location than the default location does not copy data; it only points to that location.

When you drop an internal or managed table, Hive deletes the associated data and the metadata information about the table.

Note that the TRUNCATE TABLE command is applicable to internal or managed tables only. It deletes all the rows of the table or of specific partitions of the table.
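
As a rough sketch of both forms of TRUNCATE TABLE, consider the statements below. The table names my_managed_table and my_partitioned_table and the partition column flight_year are hypothetical, used only for illustration.

```sql
-- Remove all rows from a managed table; the table definition is kept.
-- my_managed_table is a hypothetical name.
TRUNCATE TABLE my_managed_table;

-- For a partitioned managed table, remove the rows of one partition only.
-- my_partitioned_table and flight_year are hypothetical names.
TRUNCATE TABLE my_partitioned_table PARTITION (flight_year = 2014);
```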

Before you see how an internal table works, let's create two folders (internaldemo01 and internaldemo02) in the /example/data/ location and then copy the sample.log file (/example/data/sample.log) to these folders (see Figure 13.5).

FIGURE 13.5 Data sets for an internal table demo.

Tip

You can create multiple copies of this file, for better clarity. For example, we have created multiple copies of sample.log, with names such as sample01.log, sample02.log, and so on.

Next, create an internal table (the absence of the EXTERNAL clause in the CREATE TABLE command indicates an internal table) in the airlinesdb database without specifying the LOCATION clause. This means that this table will be created by default in the /hive/warehouse/airlinesdb.db folder:

USE airlinesdb;
DROP TABLE IF EXISTS log4jLogsInternal01;
CREATE TABLE log4jLogsInternal01
(
    col1 string,
    col2 string,
    col3 string,
    col4 string,
    col5 string,
    col6 string,
    col7 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE;



Tip

If you don't have this database already, create it with the script shown earlier in the section "Database or Schema."

Using the following command, you can load data from the internaldemo01 folder into this table. Note that this moves the data from the internaldemo01 folder to the log4jlogsinternal01 subfolder (same name as the table) inside the /hive/warehouse/airlinesdb.db folder (see Figure 13.6).

LOAD DATA INPATH '/example/data/internaldemo01' INTO TABLE log4jLogsInternal01;

FIGURE 13.6 Internal table stored in the default location.
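
One way to confirm the load from the Hive CLI is sketched below. It assumes that the table was created and loaded as shown and that col4 holds the log severity in sample.log; the second point is an assumption about the file's layout, so adjust the column if your data differs. The dfs command is a Hive CLI feature for running file system commands, not part of HiveQL itself.

```sql
USE airlinesdb;

-- Count the rows that are now visible through the table.
SELECT COUNT(*) FROM log4jLogsInternal01;

-- Assumption: col4 carries the severity level in sample.log.
SELECT col4, COUNT(*) AS cnt
FROM log4jLogsInternal01
GROUP BY col4;

-- List the files that were moved into the table's warehouse subfolder.
dfs -ls /hive/warehouse/airlinesdb.db/log4jlogsinternal01;
```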

Now let's create another internal table by specifying the LOCATION clause, as in this script:

USE airlinesdb;
DROP TABLE IF EXISTS log4jLogsInternal02;
CREATE TABLE log4jLogsInternal02
(
    col1 string,
    col2 string,
    col3 string,
    col4 string,
    col5 string,
    col6 string,
    col7 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/example/data/internaldemo02/';



The script creates an internal table, but it points to the original location of /example/data/internaldemo02/ instead of pointing to the default /hive/warehouse/ location (see Figure 13.7). Unlike the previous internal table you created, in this case, no new folder gets created in the default folder /hive/warehouse/ with the name of the table.

FIGURE 13.7 Internal table stored outside the default location.

You can execute this script to verify the internal tables and their properties:

SHOW TABLES;
DESCRIBE EXTENDED log4jLogsInternal01;
DESCRIBE EXTENDED log4jLogsInternal02;
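
DESCRIBE EXTENDED prints the table details as one dense block. As an alternative sketch, DESCRIBE FORMATTED presents the same metadata as labeled rows, which makes the Location and Table Type entries easier to compare between the two tables:

```sql
-- Table Type should read MANAGED_TABLE for both tables;
-- Location shows the warehouse subfolder for the first table
-- and /example/data/internaldemo02 for the second.
DESCRIBE FORMATTED log4jLogsInternal01;
DESCRIBE FORMATTED log4jLogsInternal02;
```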



Now let's drop these two tables and analyze the impact on the data. To drop these tables, you can run this script:

DROP TABLE IF EXISTS log4jLogsInternal01;
DROP TABLE IF EXISTS log4jLogsInternal02;

As you can see in Figure 13.8, the associated data was also deleted when you dropped the internal tables.


