Chapter 7. Big Data Storage Technology

explained. The last major topic of the chapter is in-memory storage, which facilitates the
processing of streaming data and can hold entire databases. These technologies enable a
shift from traditional on-disk, batch-oriented processing to in-memory realtime
processing.

On-Disk Storage Devices
On-disk storage generally utilizes low cost hard-disk drives for long-term storage. On-disk
storage can be implemented via a distributed file system or a database as shown in Figure
7.1.

Figure 7.1 On-disk storage can be implemented with a distributed file system or a
database.

Distributed File Systems
Distributed file systems, like any file system, are agnostic to the data being stored and
therefore support schema-less data storage. In general, a distributed file system storage
device provides out-of-the-box redundancy and high availability by copying data to multiple
locations via replication.
A storage device that is implemented with a distributed file system provides simple, fast
access data storage that is capable of storing large datasets that are non-relational in
nature, such as semi-structured and unstructured data. Although based on straightforward
file locking mechanisms for concurrency control, it provides fast read/write capability,
which addresses the velocity characteristic of Big Data.
A distributed file system is not ideal for datasets comprising a large number of small files
as this creates excessive disk-seek activity, slowing down the overall data access. There is
also more overhead involved in processing multiple smaller files, as dedicated processes
are generally spawned by the processing engine at runtime for processing each file before
the results are synchronized from across the cluster.

Due to these limitations, distributed file systems work best with fewer but larger files
accessed in a sequential manner. Multiple smaller files are generally combined into a
single file to enable optimum storage and processing. This allows the distributed file
systems to have increased performance when data must be accessed in streaming mode
with no random reads and writes (Figure 7.2).
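
As a rough illustration of combining many small files into one larger file, the following Python sketch packs several small payloads into a single container and keeps an offset index so that each original file can still be recovered. The names `pack_files` and `read_packed` and the in-memory container are assumptions made for this example; they are not the API of any particular distributed file system.

```python
import io


def pack_files(files):
    """Concatenate small files into one container.

    `files` maps a file name to its bytes. Returns the container bytes and
    an index that maps each name to (offset, length) inside the container.
    """
    container = io.BytesIO()
    index = {}
    for name, payload in files.items():
        index[name] = (container.tell(), len(payload))
        container.write(payload)
    return container.getvalue(), index


def read_packed(container, index, name):
    """Recover one original file from the container by its index entry."""
    offset, length = index[name]
    return container[offset:offset + length]


# Hypothetical small files, used only for illustration.
small_files = {"a.log": b"alpha\n", "b.log": b"bravo\n", "c.log": b"charlie\n"}
container, index = pack_files(small_files)
```

The container can then be written and read sequentially as one large file, while the index preserves access to the individual entries.
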

Figure 7.2 A distributed file system accessing data in streaming mode with no random
reads and writes.

A distributed file system storage device is suitable when large datasets of raw data are to
be stored or when archiving of datasets is required. In addition, it provides an inexpensive
storage option for storing large amounts of data over a long period of time that needs to
remain online. This is because more disks can simply be added to the cluster without
needing to offload the data to offline data storage, such as tapes. It should be noted that
distributed file systems do not provide the ability to search the contents of files as standard
out-of-the-box capability.

RDBMS Databases
Relational database management systems (RDBMSs) are good for handling transactional
workloads involving small amounts of data with random read/write properties. RDBMSs
are ACID-compliant, and, to honor this compliance, they are generally restricted to a
single node. For this reason, RDBMSs do not provide out-of-the-box redundancy and fault
tolerance.
To handle large volumes of data arriving at a fast pace, relational databases generally need
to scale. RDBMSs employ vertical scaling rather than horizontal scaling, and vertical
scaling is the more costly and disruptive strategy. This makes RDBMSs less than ideal for
long-term storage of data that accumulates over time.
Note that some relational databases, for example IBM DB2 pureScale, Sybase ASE
Cluster Edition, Oracle Real Application Clusters (RAC) and Microsoft Parallel Data
Warehouse (PDW), are capable of being run on clusters (Figure 7.3). However, these
database clusters still use shared storage that can act as a single point of failure.

Figure 7.3 A clustered relational database uses a shared storage architecture, which is a
potential single point of failure that affects the availability of the database.

Relational databases need to be manually sharded, mostly using application logic. This
means that the application logic needs to know which shard to query in order to get the
required data. This further complicates data processing when data from multiple shards is
required.
The following steps are shown in Figure 7.4:
1. A user writes a record (id = 2).
2. The application logic determines which shard it should be written to.
3. It is sent to the shard determined by the application logic.
4. The user reads a record (id = 4), and the application logic determines which shard
contains the data.
5. The data is read and returned to the application.
6. The application then returns the record to the user.
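
The steps above can be sketched in Python as follows. This is a minimal illustration, assuming two shards held as in-memory dictionaries and a range-based routing rule (ids 1 and 2 on Shard A, higher ids on Shard B). The class `ShardedStore`, its method names and the routing rule are hypothetical; the point is only that the routing decision lives in the application rather than in the database.

```python
class ShardedStore:
    """Hypothetical application-side router over two shards (plain dicts)."""

    def __init__(self):
        self.shards = {"A": {}, "B": {}}

    def shard_for(self, record_id):
        # Assumed routing rule: ids up to 2 live on Shard A, the rest on Shard B.
        return "A" if record_id <= 2 else "B"

    def write(self, record_id, record):
        # Steps 1-3: the application picks the shard and sends the write to it.
        self.shards[self.shard_for(record_id)][record_id] = record

    def read(self, record_id):
        # Steps 4-6: the application picks the shard, reads, and returns the record.
        return self.shards[self.shard_for(record_id)].get(record_id)


store = ShardedStore()
store.write(2, {"id": 2, "name": "second"})
store.write(4, {"id": 4, "name": "fourth"})
record = store.read(4)
```

If the routing rule changes, every piece of application code that relies on it must change as well, which is one reason manual sharding is costly to maintain.
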

Figure 7.4 A relational database is manually sharded using application logic.

The following steps are shown in Figure 7.5:
1. A user requests multiple records (id = 1, 3) and the application logic is used to
determine which shards need to be read.
2. It is determined by the application logic that both Shard A and Shard B need to be
read.
3. The data is read and joined by the application.
4. Finally, the data is returned to the user.
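
A multi-shard read of this kind might look like the sketch below. The shard contents, the `shard_for` routing rule and the `read_many` helper are all assumptions made for illustration; the application groups the requested ids by shard, reads each shard, and merges the results itself.

```python
# Hypothetical shard contents, held in memory for illustration.
SHARDS = {
    "A": {1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Bob"}},
    "B": {3: {"id": 3, "name": "Cy"}, 4: {"id": 4, "name": "Dee"}},
}


def shard_for(record_id):
    # Assumed routing rule: ids up to 2 live on Shard A, the rest on Shard B.
    return "A" if record_id <= 2 else "B"


def read_many(record_ids):
    """Read records that may live on different shards and merge them."""
    # Steps 1-2: work out which shards hold the requested ids.
    ids_by_shard = {}
    for record_id in record_ids:
        ids_by_shard.setdefault(shard_for(record_id), []).append(record_id)

    # Step 3: read from each shard and join the results in the application.
    results = []
    for shard_name, ids in ids_by_shard.items():
        shard = SHARDS[shard_name]
        results.extend(shard[record_id] for record_id in ids if record_id in shard)

    # Step 4: return the merged records in a stable order.
    return sorted(results, key=lambda rec: rec["id"])


records = read_many([1, 3])
```
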

Figure 7.5 An example of the use of the application logic to join data retrieved from
multiple shards.

Relational databases generally require data to adhere to a schema. As a result, storage of
semi-structured and unstructured data whose schemas are non-relational is not directly
supported. Furthermore, with a relational database, schema conformance is validated at the
time of data insert or update by checking the data against the constraints of the schema.
This introduces overhead that creates latency.
This latency makes relational databases a less than ideal choice for storing high velocity
data that needs a highly available database storage device with fast data write capability.
As a result of its shortcomings, a traditional RDBMS is generally not useful as the primary
storage device in a Big Data solution environment.

NoSQL Databases
Not-only SQL (NoSQL) refers to technologies used to develop next generation
non-relational databases that are highly scalable and fault-tolerant. The symbol used to
represent NoSQL databases is shown in Figure 7.6.

Figure 7.6 The symbol used to represent a NoSQL database.

Characteristics
Below is a list of the principal features of NoSQL storage devices that differentiate them
from traditional RDBMSs. This list should only be considered a general guide, as not all
NoSQL storage devices exhibit all of these features.
• Schema-less data model – Data can exist in its raw form.
• Scale out rather than scale up – More nodes can be added to obtain additional
storage with a NoSQL database, in contrast to having to replace the existing node
with a better, higher performance/capacity one.
• Highly available – This is built on cluster-based technologies that provide fault
tolerance out of the box.
• Lower operational costs – Many NoSQL databases are built on Open Source
platforms with no licensing costs. They can often be deployed on commodity
hardware.
• Eventual consistency – Data read from different nodes may not be consistent
immediately after a write. However, all nodes will eventually be in a consistent state.
• BASE, not ACID – BASE compliance requires a database to maintain high
availability in the event of network/node failure, while not requiring the database to
be in a consistent state whenever an update occurs. The database can be in a
soft/inconsistent state until it eventually attains consistency. As a result, in
consideration of the CAP theorem, NoSQL storage devices are generally AP or CP.
• API driven data access – Data access is generally supported via API based queries,
including RESTful APIs, whereas some implementations may also provide SQL-like
query capability.
• Auto sharding and replication – To support horizontal scaling and provide high
availability, a NoSQL storage device automatically employs sharding and replication
techniques where the dataset is partitioned horizontally and then copied to multiple
nodes.
• Integrated caching – This removes the need for a third-party distributed caching
layer, such as Memcached.
• Distributed query support – NoSQL storage devices maintain consistent query
behavior across multiple shards.
• Polyglot persistence – The use of NoSQL storage does not mandate retiring
traditional RDBMSs. In fact, both can be used at the same time, thereby supporting
polyglot persistence, which is an approach of persisting data using different types of
storage technologies within the same solution architecture. This is good for
developing systems requiring structured as well as semi/unstructured data.
• Aggregate-focused – Unlike relational databases that are most effective with fully
normalized data, NoSQL storage devices store de-normalized aggregated data (an
entity containing merged, often nested, data for an object) thereby eliminating the
need for joins and extensive mapping between application objects and the data
stored in the database. One exception, however, is that graph database storage
devices (introduced shortly) are not aggregate-focused.
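
The eventual consistency item in the list above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the class names, the single pending-writes queue and the explicit `sync` call are hypothetical, whereas real NoSQL storage devices propagate writes in the background with their own protocols.

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store that acknowledges a write before all replicas have it."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # The write lands on the first replica and is acknowledged immediately.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read served by a lagging replica may return stale or missing data.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Propagate pending writes so that all replicas converge.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("user:1", "Ann")
stale = store.read("user:1", replica_index=2)  # read before propagation
store.sync()
fresh = store.read("user:1", replica_index=2)  # read after propagation
```

The store stays available for writes throughout, at the cost of a window in which different replicas can return different answers.
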
Rationale
The emergence of NoSQL storage devices can primarily be attributed to the volume,
velocity and variety characteristics of Big Data datasets.
Volume
The storage requirement of ever increasing data volumes commands the use of databases
that are highly scalable while keeping costs down for the business to remain competitive.
NoSQL storage devices fulfill this requirement by providing scale out capability while
using inexpensive commodity servers.
Velocity
The fast influx of data requires databases with fast access data write capability. NoSQL
storage devices enable fast writes by using the schema-on-read rather than the
schema-on-write principle. Being highly available, NoSQL storage devices can ensure that
write latency does not occur because of node or network failure.
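
The difference between the two principles can be sketched in Python. The schema, the function names and the list-based storage below are assumptions made for this example: a schema-on-write store checks every record before accepting it, whereas a schema-on-read store appends the raw data and interprets it only when it is read.

```python
import json

# Hypothetical schema: field name mapped to the expected Python type.
SCHEMA = {"id": int, "temp": float}


def write_schema_on_write(table, record):
    """Validate the record against the schema before storing it."""
    for field, field_type in SCHEMA.items():
        if not isinstance(record.get(field), field_type):
            raise ValueError(f"field {field!r} violates the schema")
    table.append(record)


def write_schema_on_read(log, raw_line):
    """Store the raw data as-is; no validation happens on the write path."""
    log.append(raw_line)


def read_with_schema(log):
    """Apply the schema at read time, tolerating missing fields."""
    for raw_line in log:
        record = json.loads(raw_line)
        yield {
            field: field_type(record[field])
            for field, field_type in SCHEMA.items()
            if field in record
        }


table, log = [], []
write_schema_on_write(table, {"id": 1, "temp": 21.5})
write_schema_on_read(log, '{"id": 2, "temp": "19.0"}')
write_schema_on_read(log, '{"id": 3}')  # incomplete data is still accepted
parsed = list(read_with_schema(log))
```

The write path of the second approach does no checking at all, which keeps writes cheap; the interpretation cost is moved to the reader.
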
Variety
A storage device needs to handle different data formats including documents, emails,
images, videos and incomplete data. NoSQL storage devices can store these different
forms of semi-structured and unstructured data formats. At the same time, NoSQL storage
devices are able to store schema-less data and incomplete data with the added ability of
making schema changes as the data model of the datasets evolves. In other words, NoSQL
databases support schema evolution.
Types
NoSQL storage devices can mainly be divided into four types based on the way they store
data, as shown in Figures 7.7–7.10:
• key-value
• document
• column-family
• graph

Figure 7.7 An example of key-value NoSQL storage.

Figure 7.8 An example of document NoSQL storage.

Figure 7.9 An example of column-family NoSQL storage.

Figure 7.10 An example of graph NoSQL storage.
Key-Value
Key-value storage devices store data as key-value pairs and act like hash tables. The table
is a list of values where each value is identified by a key. The value is opaque to the
database and is typically stored as a BLOB. The value stored can be any aggregate,
ranging from sensor data to videos.
Value look-up can only be performed via the keys as the database is oblivious to the
details of the stored aggregate. Partial updates are not possible. An update is either a delete
or an insert operation.
Key-value storage devices generally do not maintain any indexes, therefore writes are
quite fast. Based on a simple storage model, key-value storage devices are highly scalable.
As keys are the only means of retrieving the data, the key is usually appended with the
type of the value being saved for easy retrieval. An example of this is 123_sensor1.
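
The behavior described above can be mimicked with a dictionary-backed sketch in Python. The class `KeyValueStore` and the helper `make_key` are hypothetical names used only for illustration; the store treats each value as opaque bytes, supports look-up by key only, and replaces the whole value on every update.

```python
class KeyValueStore:
    """Toy key-value store: a hash table of opaque byte values."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # The whole value is replaced; there is no partial update.
        self._table[key] = bytes(value)

    def get(self, key):
        # Look-up is possible only by key, never by the value's contents.
        return self._table.get(key)

    def delete(self, key):
        self._table.pop(key, None)


def make_key(record_id, value_type):
    """Append the value type to the id, as in the 123_sensor1 example."""
    return f"{record_id}_{value_type}"


store = KeyValueStore()
key = make_key(123, "sensor1")
store.put(key, b'{"temp": 21.5}')
store.put(key, b'{"temp": 22.0}')  # an update overwrites the entire value
```
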
To provide some structure to the stored data, most key-value storage devices provide
collections or buckets (like tables) into which key-value pairs can be organized. A single
collection can hold multiple data formats, as shown in Figure 7.11. Some implementations
support compressing values for reducing the storage footprint. However, this introduces
latency at read time, as the data needs to be decompressed first before being returned.
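
A minimal sketch of buckets with optional value compression is shown below, using Python's standard `zlib` module. The class `BucketedStore`, the bucket name and the sample payloads are assumptions for illustration. The extra decompression step in `get` is the source of the read-time latency mentioned above.

```python
import zlib


class BucketedStore:
    """Toy key-value store with named buckets and optional compression."""

    def __init__(self, compress=False):
        self.compress = compress
        self.buckets = {}  # bucket name -> {key: stored bytes}

    def put(self, bucket, key, value):
        stored = zlib.compress(value) if self.compress else value
        self.buckets.setdefault(bucket, {})[key] = stored

    def get(self, bucket, key):
        stored = self.buckets[bucket][key]
        # Compressed values must be decompressed before they are returned.
        return zlib.decompress(stored) if self.compress else stored


store = BucketedStore(compress=True)
payload = b"0" * 1000  # highly repetitive sample value
store.put("readings", "123_sensor1", payload)
store.put("readings", "124_image", b"\x89PNG placeholder bytes")
```

One bucket holds values of different formats here, and the repetitive sensor payload occupies far fewer bytes in storage than its original 1,000.
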

Figure 7.11 An example of data organized into key-value pairs.

A key-value storage device is appropriate when:
• unstructured data storage is required
• high performance read/writes are required
• the value is fully identifiable via the key alone
• the value is a standalone entity that is not dependent on other values
• values have a comparatively simple structure or are binary
• query patterns are simple, involving insert, select and delete operations only
• stored values are manipulated at the application layer
A key-value storage device is inappropriate when:
• applications require searching or filtering data using attributes of the stored value
• relationships exist between different key-value entries
• a group of keys’ values need to be updated in a single transaction
• multiple keys require manipulation in a single operation
• schema consistency across different values is required
• updates to individual attributes of the value are required
Examples of key-value storage devices include Riak, Redis, and Amazon DynamoDB.
Document
Document storage devices also store data as key-value pairs. However, unlike key-value
storage devices, the stored value is a document that can be queried by the database. These
documents can have a complex nested structure, such as an invoice, as shown in Figure
7.12. The documents can be encoded using either a text-based encoding scheme, such as
XML or JSON, or using a binary encoding scheme, such as BSON (Binary JSON).

Figure 7.12 A depiction of JSON data stored in a document storage device.

Like key-value storage devices, most document storage devices provide collections or
buckets (like tables) into which key-value pairs can be organized. The main differences
between document storage devices and key-value storage devices are as follows:
• document storage devices are value-aware
• the stored value is self-describing; the schema can be inferred from the structure of
the value or a reference to the schema for the document is included in the value
• a select operation can reference a field inside the aggregate value
• a select operation can retrieve a part of the aggregate value
• partial updates are supported; therefore a subset of the aggregate can be updated
• indexes that speed up searches are generally supported
Each document can have a different schema; therefore, it is possible to store different
types of documents in the same collection or bucket. Additional fields can be added to a
document after the initial insert, thereby providing flexible schema support.
It should be noted that document storage devices are not limited to storing data that occurs
in the form of actual documents, such as an XML file, but they can also be used to store
any aggregate that consists of a collection of fields having a flat or a nested schema. See
Figure 7.12, which shows JSON documents being stored in a document NoSQL database.
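
The value-aware behavior of a document storage device can be sketched with plain Python dictionaries. The class `DocumentCollection`, its methods and the sample invoices are hypothetical and do not reflect the API of any real product; the sketch shows a select on a field inside the value, retrieval of only part of the value, and a partial update that adds a field after the initial insert.

```python
import copy


class DocumentCollection:
    """Toy document collection keyed by document id."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = copy.deepcopy(document)

    def find(self, field, value, projection=None):
        """Return documents whose top-level `field` equals `value`.

        If `projection` is given, only those fields are returned.
        """
        matches = []
        for doc in self._docs.values():
            if doc.get(field) == value:
                if projection is None:
                    matches.append(copy.deepcopy(doc))
                else:
                    matches.append({f: doc[f] for f in projection if f in doc})
        return matches

    def update(self, doc_id, changes):
        # Partial update: only the given fields are changed or added.
        self._docs[doc_id].update(changes)


invoices = DocumentCollection()
invoices.insert("inv-1", {"customer": "Ann", "total": 40.0,
                          "lines": [{"sku": "A1", "qty": 2}]})
invoices.insert("inv-2", {"customer": "Bob", "total": 15.0})  # different schema
invoices.update("inv-2", {"status": "paid"})  # field added after the insert
totals = invoices.find("customer", "Ann", projection=["total"])
```
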
A document storage device is appropriate when:
• storing semi-structured document-oriented data comprising flat or nested schema
• schema evolution is a requirement as the structure of the document is either
unknown or is likely to change
• applications require a partial update of the aggregate stored as a document
• searches need to be performed on different fields of the documents
• storing domain objects, such as customers, in serialized object form
• query patterns involve insert, select, update and delete operations
A document storage device is inappropriate when:
• multiple documents need to be updated as part of a single transaction
• performing operations that need joins between multiple documents or storing data
that is normalized
• schema enforcement for achieving consistent query design is required as the
document structure may change between successive query runs, which will require
restructuring the query
• the stored value is not self-describing and does not have a reference to a schema
• binary data needs to be stored
Examples of document storage devices include MongoDB, CouchDB, and Terrastore.
Column-Family
Column-family storage devices store data much like a traditional RDBMS but group
related columns together in a row, resulting in column-families (Figure 7.13). Each
column can be a collection of related columns itself, referred to as a super-column.
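
As a rough picture of this layout, a row in a column-family store can be modeled in Python as a nested dictionary: row key, then column-family, then the columns within that family. The row keys, family names and the `read_family` helper below are assumptions for illustration and are not taken from any specific product.

```python
# Hypothetical rows: row key -> column-family -> column -> value.
row_store = {
    "student-1": {
        "personal": {"name": "Ann", "city": "Oslo"},
        "grades": {"math": 90, "physics": 85},
    },
    "student-2": {
        # In this sketch a row may simply omit a column-family.
        "personal": {"name": "Bob"},
    },
}


def read_family(store, row_key, family):
    """Return the columns of one column-family for one row."""
    return store.get(row_key, {}).get(family, {})


ann_grades = read_family(row_store, "student-1", "grades")
bob_grades = read_family(row_store, "student-2", "grades")
```

Reading a single family touches only the related columns grouped under it, rather than the entire row.
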