Hour 1. Introduction of Big Data, NoSQL, and Business Value Proposition


forecasts.

The capability to collect a vast amount of data from different sources enables an organization to gain a competitive advantage. A company can then better position itself or its products and services in a more favorable market (where and how) to reach targeted customers (who) at their most receptive times (when), and then listen to its customers for suggestions (feedback and customer service). More important, a company can ultimately offer something that makes sense to customers (what).

Analytics essentially enables organizations to carry out targeted campaigns, cross-sales recommendations, online advertising, and more. But before you start your journey into the world of Big Data, NoSQL, and business analytics, you need to know the types of analysis an organization generally conducts.

Companies perform three basic types of analysis on collected data (see Figure 1.1):

Diagnostic or descriptive analysis: Organizations seek to understand what happened over a certain period of time and determine what caused it to happen. They might try to gain insight into historical data with reporting, Key Performance Indicators (KPIs), and scorecards. For example, this type of analysis can use clustering or classification techniques for customer segmentation to better understand customers and offer them products based on their needs and requirements.



FIGURE 1.1 Business intelligence and Big Data analysis types.

Predictive analysis: Predictive analysis helps an organization understand what can happen in the future based on identified patterns in the data, using statistical and machine learning techniques. Predictive analysis is also referred to as data mining or machine learning. This type of analysis uses time series, neural networks, and regression algorithms to predict the future. Predictive analysis enables companies to answer these types of questions:

Which stocks should we target as part of our portfolio management?

Did some stocks show haphazard behavior? Which factors are impacting the stock gains the most?

How and why are users of e-commerce platforms, online games, and web applications behaving in a particular way?

How do we optimize the routing of our fleet of vehicles based on weather and traffic patterns?

How do we better predict future outcomes based on an identified pattern?

Prescriptive analysis: Some researchers refer to this analysis as the final phase in business analytics. Organizations can predict the likely outcome of various corrective measures using optimization and simulation techniques. For example, prescriptive analysis can use linear programming, Monte Carlo simulation, or game theory for channel management or portfolio optimization.
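To make the three analysis types concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sales figures, the linear demand model, the unit cost, and the function names (describe, forecast, expected_profit, best_price) are invented, and a real project would use far richer models. The sketch summarizes history (descriptive), extends a least-squares trend (predictive), and uses a Monte Carlo simulation to choose among candidate prices (prescriptive).

```python
import random
import statistics

# Hypothetical monthly sales figures, invented purely for illustration.
monthly_sales = [100.0, 104.0, 109.0, 115.0, 118.0, 125.0]


def describe(values):
    """Descriptive analysis: summarize what already happened."""
    return {"mean": statistics.mean(values), "min": min(values), "max": max(values)}


def fit_trend(values):
    """Least-squares line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.mean(values)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast(values, steps_ahead=1):
    """Predictive analysis: extend the fitted trend into the future."""
    slope, intercept = fit_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


def expected_profit(price, base_demand, unit_cost=6.0, trials=10_000, seed=42):
    """Monte Carlo estimate of profit at a given price.

    Assumed demand model: demand falls by 5 units per unit of price,
    plus Gaussian noise. Both numbers are made up for this sketch.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        demand = max(0.0, base_demand - 5.0 * price + rng.gauss(0.0, 10.0))
        total += (price - unit_cost) * demand
    return total / trials


def best_price(candidates, base_demand):
    """Prescriptive analysis: pick the action with the best simulated outcome."""
    return max(candidates, key=lambda price: expected_profit(price, base_demand))


next_month = forecast(monthly_sales)
recommended = best_price([8, 10, 12, 14, 16, 18], base_demand=next_month)
```

Note how each stage feeds the next: the forecast becomes an input to the simulation, which is one way to read the claim that prescriptive analysis is the final phase.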



Types of Data

Businesses are largely interested in three broad types of data: structured, unstructured, and semi-structured data.



Structured Data

Structured data adheres to a predefined, fixed schema and a strict data model structure; think of a table in a relational database system. A row in the table always has the same number of columns, of the same types, as every other row (although some columns might contain blank or NULL values), per the predefined schema of the table. With structured data, changes to the schema are assumed to be rare and, hence, the data model is rigid.
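As a rough illustration of a fixed schema, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The customer table and its columns are hypothetical. It shows that every row comes back with the same columns, that a missing value is stored as NULL, and that a row carrying a column the schema does not define is rejected. (SQLite is more lenient about column types than most relational systems, so the sketch demonstrates only the fixed set of columns.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared before any data is loaded.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
)
conn.execute(
    "INSERT INTO customer (id, name, email) VALUES (1, 'Ana', 'ana@example.com')"
)
# No email supplied: the column still exists for this row and holds NULL.
conn.execute("INSERT INTO customer (id, name) VALUES (2, 'Ben')")

rows = conn.execute("SELECT id, name, email FROM customer ORDER BY id").fetchall()


def try_insert_unknown_column(connection):
    """Attempt to store an attribute the schema does not define."""
    try:
        connection.execute(
            "INSERT INTO customer (id, name, nickname) VALUES (3, 'Cy', 'C')"
        )
    except sqlite3.OperationalError as exc:
        return str(exc)  # the database refuses the row
    return None


rejection = try_insert_unknown_column(conn)
```

Storing the nickname would require changing the table definition first, which is the rigidity the text describes.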



Unstructured Data

Unlike structured data, unstructured data has no identifiable internal structure. It does not have a predefined, fixed schema, but instead has a free-form structure. Unstructured data includes proprietary documents, bitmap images and objects, text, and other data types that are not part of a database system. Examples include photos and graphic images, audio and video, streaming instrument data, web pages, emails, blog entries, wikis, portable document format (PDF) documents, Word or Excel documents, and PowerPoint presentations. Unstructured data constitutes most enterprise data today.

In Excel documents, for example, the content might contain data in structured tabular format, but the Excel document itself is considered unstructured data. Likewise, email messages are organized on the email server in a structured format in the database system, but the body of the message is free-form, with no fixed structure.



Semi-Structured Data

Semi-structured data is a hybrid between structured and unstructured data. It usually contains data in structured format but has a schema that is not predefined and not rigid. Unlike structured data, semi-structured data lacks the strict data model structure. Examples are Extensible Markup Language (XML) or JavaScript Object Notation (JSON) documents, which contain tags (elements or attributes) to identify specific elements within the data, but without a rigid structure to adhere to.

Unlike in a relational table, in which each row has the same number of columns, each entity in semi-structured data (analogous to a row in a relational table) can have a different number of attributes or even nested entities.
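A short sketch, using Python's standard json module, shows what this looks like in practice. The three customer documents are hypothetical. Each one is tagged with attribute names, yet no two have the same set of attributes, and one contains a nested entity.

```python
import json

# Three hypothetical entities; each carries a different set of attributes.
raw_documents = [
    '{"id": 1, "name": "Ana", "email": "ana@example.com"}',
    '{"id": 2, "name": "Ben", "phones": ["555-0100", "555-0101"]}',
    '{"id": 3, "name": "Cy", "address": {"city": "Oslo", "zip": "0150"}}',
]

entities = [json.loads(document) for document in raw_documents]

# The attributes present differ from entity to entity.
attribute_sets = [sorted(entity.keys()) for entity in entities]

# The union of all attributes is what a fixed relational schema would need.
all_attributes = sorted(set().union(*(entity.keys() for entity in entities)))

# Nested entities are reached by walking the structure.
city = entities[2]["address"]["city"]
```

Forcing these entities into one relational table would mean a column for every attribute in all_attributes, mostly filled with NULLs, plus a separate table for the nested address.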



Note: By the Way

For simplicity, we use “structured and unstructured data” to refer to the collection of structured, semi-structured, and unstructured data. Semi-structured data usually is grouped with unstructured data even though it differs slightly from purely unstructured data.



Big Data

The phrase data explosion refers to the vast amount of data (structured, semi-structured, and unstructured) organizations generate every day, both internally and externally, at a speed that is practically impossible for their current data processing systems to collect and process. Ironically, organizations cannot afford to ignore the data because it provides insight into how they can gain competitive advantages. In some cases, organizations are required to store large amounts of structured and unstructured data (documents, email messages, chat history, audio, video, and other forms of electronic communication) to comply with government regulations. Fortunately, the cost of storage devices has decreased significantly, enabling companies to store Big Data that they previously would have purged regularly.

Big Data is primarily characterized by the three Vs (see Figure 1.2):

Volume

Variety

Velocity



FIGURE 1.2 Big Data characteristics.

Businesses currently cannot capture, manage, and process the three Vs using traditional data processing systems within a tolerable elapsed time.



Volume Characteristics of Big Data

Big Data volumes run to terabytes, petabytes, and beyond. The focus now is not only on human-generated data (mostly structured, and a small percentage of overall data), but also on data generated by machines such as sensors, connected devices, and Radio-Frequency Identification (RFID) devices (mostly unstructured, and a larger percentage overall). (See Figure 1.3.)

FIGURE 1.3 Volume characteristics of Big Data.



Variety Characteristics of Big Data

Variety refers to the management of structured, semi-structured, and unstructured data (see Figure 1.4). Semi-structured and unstructured data includes but is not limited to text, images, legacy documents, audio, video, PDFs, clickstream data, web log data, and data gathered from social media. Most of this unstructured data is generated from sensors, connected devices, clickstream, and web logs, and can constitute up to 80 percent of overall Big Data.

FIGURE 1.4 Variety characteristic of Big Data.



Velocity Characteristics of Big Data

Velocity refers to the pace at which data arrives, usually as a real-time or near-real-time stream (see Figure 1.5). Examples include trading and stock exchange data, and data from sensors attached to production line machinery to continuously monitor status.

FIGURE 1.5 Velocity characteristic of Big Data.

For Big Data, velocity also refers to the required speed of data insight.

Recently, some authors and researchers have added further Vs to the characteristics of Big Data. Variability refers to the many possible interpretations of the same data. Similarly, veracity refers to the uncertainty in collected data: the credibility of the data source might not be verifiable, and hence the suitability of the data for the target audience might be questionable. Nonetheless, the premise of Big Data remains the same as discussed earlier.

Big Data is often treated as synonymous with Hadoop, but the two are not the same. Big Data refers to a very large volume of different types of data that arrives at a very fast pace, that is, data with the characteristics of volume, variety, and velocity. Hadoop, on the other hand, is one of the tools or technologies used to store, manage, and process Big Data.

GO TO We talk in greater detail about Hadoop and its architecture in Hour 2, “Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings.”



What Big Data Is Not

Big Data does not refer to the tools and technologies that manage and process the Big Data itself (as discussed earlier). Several tools and technologies can manage Big Data, and Hadoop is one among them. Hadoop is a mature, fault-tolerant platform that can handle the distributed storage and processing of Big Data.

GO TO We talk about Hadoop in greater detail in Hour 2.

So far, we have talked about Big Data and looked at future trends in analytics on Big Data. Now let’s dive deeper to understand the different tools and technologies used to store, manage, and process Big Data.



Managing Big Data

An organization cannot afford to delete data (especially Big Data) if it wants to outperform its competitors. Tapping into the opportunities Big Data offers makes good business sense for some key reasons.



More Data, More Accurate Models

A substantial number and variety of data sources generate large quantities of data for businesses. These include connected devices, sensors, RFIDs, web clicks, and web logs (see Figure 1.6). Organizations now realize that data is too valuable to delete, so they need to store, manage, and process that data.

FIGURE 1.6 Getting insight into Big Data.



More (and Cheaper) Computing Power and Storage

The dramatic decline in the cost of computing hardware resources (see Figure 1.7), especially the cost of storage devices, is one factor that enables organizations to store every bit of data, or Big Data. It also enables large organizations to cost-effectively retain large amounts of structured and unstructured data longer, to comply with government regulations and guard against future litigation.

FIGURE 1.7 Decreasing hardware prices.



Increased Awareness of the Competition and a Means to Proactively Win Over Competitors

Companies want to leverage all possible means to remain competitive and beat their competitors. Even with the advent of social media, a business needs to analyze data to understand customer sentiment about the organization and its products or services. Companies also want to offer customers what they want through targeted campaigns and seek to understand reasons for customer churn (the rate of attrition in the customer base) so that they can take proactive measures to retain customers. Figure 1.8 shows increased awareness and customer demands.

FIGURE 1.8 Increased awareness, realization, and demand.



Availability of New Tools and Technologies to Process and Manage Big Data

Several new tools and technologies can help companies store, manage, and process Big Data. These include Hadoop, MongoDB, CouchDB, DocumentDB, and Cassandra, among others. We cover Hadoop and its architecture in more detail in Hour 2.



NoSQL Systems

If you are a Structured Query Language (SQL) or Relational Database Management System (RDBMS) expert, you first must know that you don’t need to worry about NoSQL: these two technologies serve very different purposes. NoSQL is not a replacement for the familiar SQL or RDBMS technologies, although, of course, learning these new tools and technologies will give you better perspective and help you think of an organizational problem in a holistic manner. So why do we need NoSQL? The sheer volume, velocity, and variety of Big Data are beyond the capabilities of RDBMS technologies to process in a timely manner. NoSQL tools and technologies are essential for processing Big Data.

NoSQL stands for Not Only SQL and is complementary to the existing SQL or RDBMS technologies. For some problems, storage and processing solutions other than RDBMS are more suitable; both technologies can coexist, and each has its own place. RDBMS still dominates the market, but NoSQL technologies are catching up to manage Big Data and real-time web applications.

In many scenarios, both technologies are being used to provide enterprise-wide business intelligence and business analytics systems. In these integrated systems, NoSQL systems store and manage Big Data (with no schema) and RDBMS stores the processed data in relational format (with schema) for a quicker query response time.



NoSQL Versus RDBMS

RDBMS systems are also called schema-first because an RDBMS requires you to create relations, or table structures that store data in rows and columns (a predefined normalized structure), and then join them using the relationship between a primary key and a foreign key. Data gets stored in these relations/tables. When querying, we then retrieve data either from a single relation or from multiple relations by joining them. An RDBMS system provides a faster query response time, but loading data into it takes longer; a significant amount of time is needed especially when you are developing and defining a schema. The rigid schema requirement makes it inflexible: changing the schema later requires a significant amount of effort and time. As you can see in Figure 1.9, once you have a data model in place, you must store the data in stages, apply cleansing and transformation, and then move the final set of data to the data warehouse for analysis. This overall process of loading data into an RDBMS system is not suitable for Big Data. Figure 1.9 shows the stages in the analysis of structured data in RDBMS (Relational Data Warehouse).

FIGURE 1.9 Stages in the analysis of structured data in RDBMS (Relational Data Warehouse).

In contrast to RDBMS systems, NoSQL systems are called schema-later because they don’t have the strict requirement of defining the schema or structure of data before the actual data load process begins. For example, you can continue to store data in a Hadoop cluster as it arrives, in files and folders in the Hadoop Distributed File System (HDFS; you learn more about it in Hour 3, “Hadoop Distributed File System Versions 1.0 and 2.0”), and then later use Hive to define the schema for querying data from the folders. Likewise, document-oriented NoSQL systems support storing data in documents using the flexible JSON format. This enables the application to store virtually any structure it wants in a data element in a JSON document. A single JSON document might aggregate all the data that a relational database would spread across rows in several tables. Consolidating data in documents this way might duplicate information, but the lower cost of storage makes this practical. As you can see in Figure 1.10, NoSQL lets you continue to store data as it arrives, without worrying about the schema or structure of the data, and then later use an application program to query the data. Figure 1.10 shows the stages of analyzing Big Data in NoSQL systems.

FIGURE 1.10 Stages in analysis of Big Data in NoSQL systems.
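The schema-later idea can be sketched in a few lines of Python. This is not how Hadoop or Hive is implemented; the raw_store list merely stands in for files landing in a folder, and the field names, SCHEMA mapping, and helper functions are hypothetical. The point is the order of events: data is stored untouched as it arrives, and a schema is applied only when the data is read.

```python
import json

raw_store = []  # stands in for files accumulating in a folder


def ingest(line):
    """Load step: store the record exactly as it arrives, with no validation."""
    raw_store.append(line)


ingest('{"user": "ana", "amount": "12.50", "coupon": "X1"}')
ingest('{"user": "ben", "amount": 3}')
ingest('{"user": "cy"}')

# The schema is decided later, at query time: field name -> type to cast to.
SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Query step: project each raw record onto the schema while reading it."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields missing from a record become None; extra fields are ignored.
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(raw_store, SCHEMA))
total_amount = sum(row["amount"] for row in rows if row["amount"] is not None)
```

Loading is trivially fast because nothing is checked, and the cost of interpreting the data moves to every read. That is the mirror image of the schema-first trade-off described earlier.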

Apart from efficiency in the data load process for Big Data, RDBMS systems and NoSQL systems have other differences (see Figure 1.11).

FIGURE 1.11 Differences between RDBMS and NoSQL systems.



Major Types of NoSQL Technologies

Several NoSQL systems are in use. For clarity, we have divided them by the typical usage scenarios (for example, Online Transaction Processing [OLTP] or Online Analytical Processing [OLAP]) we often deal with.

No current NoSQL system fully supports the needs of OLTP; they all lack a couple of important capabilities. This section covers the following four categories of NoSQL systems used with OLTP:

Key-value store databases

Columnar, or column-oriented, or column-store databases

Document-oriented databases

Graph databases

GO TO For more information on supports for OLTP, refer to the “Limitations of NoSQL Systems” section, later in this hour.

Key-Value Store Databases

Key-value store databases store data as a collection of key-value pairs in a way that each possible key appears once, at most, in a collection. This is similar to the hash tables of the programming world, with a unique key and a pointer to a particular item of data. This database stores only pairs of keys and values, and it facilitates retrieving values when a key is known. These mappings are usually accompanied by cache mechanisms, to maximize performance. Key-value stores are probably the simplest type of NoSQL database and are not a fit for every Big Data problem. Key-value store databases are ideal for storing web user profiles, session information, and shopping carts. They are not ideal if a data relationship is critical or a transaction spans keys.

A file system can be considered a key-value store, with the file path/name as the key and the actual file content as the value. Figure 1.12 shows an example of a key-value store.

FIGURE 1.12 Key-value store database storage structure.

In another example, with phone-related data, "PhoneNumber" is considered the key, with associated values such as "(123)111-12345".
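Because the model is so close to a hash table, a toy version fits in a few lines of Python. The KeyValueStore class and the key naming scheme below are hypothetical and keep everything in memory; real products add persistence, replication, and caching. The sketch shows the two defining behaviors: a key appears at most once, and values are retrieved only by key.

```python
class KeyValueStore:
    """A toy in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writing an existing key replaces its value, so each key appears once.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup works only by key; the store knows nothing about the value.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

    def __len__(self):
        return len(self._data)


store = KeyValueStore()
store.put("session:42", {"user": "ana", "logged_in": True})
store.put("cart:42", ["book", "lamp"])
store.put("cart:42", ["book", "lamp", "pen"])  # overwrites the earlier cart

cart = store.get("cart:42")
missing = store.get("cart:99")
```

The store has no way to answer a question such as "which carts contain a lamp?" without scanning every value, which is why the model suits profiles, sessions, and carts but not relationship-heavy data.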

Dozens of key-value store databases are in use, including Amazon Dynamo, Microsoft Azure Table storage, Riak, Redis, and MemCached.

Amazon Dynamo

Amazon Dynamo was developed as an internal technology at Amazon for its e-commerce businesses, to address the need for an incrementally scalable, highly available key-value storage system. It is one of the most prominent key-value store NoSQL databases. Amazon S3 uses Dynamo as its storage mechanism. The technology has been designed to enable users to trade off cost, consistency, durability, and performance while maintaining high availability.



Microsoft Azure Table Storage

Microsoft Azure Table storage is another example of a key-value store that allows for rapid development and fast access to large quantities of data. It offers highly available, massively scalable key-value-based storage so that an application can automatically scale to meet user demand. In Microsoft Azure Table, key-value pairs are called Properties and are useful in filtering and specifying selection criteria; they belong to Entities, which, in turn, are organized into Tables. Microsoft Azure Table features optimistic concurrency and, as with other NoSQL databases, is schema-less. The properties of each entity in a specific table can differ, meaning that two entities in the same table can contain different collections of properties, and those properties can be of different types.
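The two ideas in this description, schema-less entities and optimistic concurrency, can be illustrated with a small in-memory sketch. It does not use the Azure SDK and does not reproduce its API; the EntityTable class, its method names, and the integer version counter are all assumptions made for illustration. Each entity carries a version, and an update succeeds only if the writer presents the version it originally read.

```python
class ConflictError(Exception):
    """Raised when an entity changed after the caller last read it."""


class EntityTable:
    """Toy table of schema-less entities with version-checked updates."""

    def __init__(self):
        self._entities = {}  # key -> (version, properties)

    def insert(self, key, properties):
        if key in self._entities:
            raise KeyError(f"entity {key!r} already exists")
        self._entities[key] = (1, dict(properties))
        return 1

    def read(self, key):
        version, properties = self._entities[key]
        return version, dict(properties)  # copy so callers cannot mutate storage

    def update(self, key, properties, expected_version):
        current_version, _ = self._entities[key]
        if current_version != expected_version:
            # Someone else wrote first: reject instead of silently overwriting.
            raise ConflictError(f"entity {key!r} is now at version {current_version}")
        self._entities[key] = (current_version + 1, dict(properties))
        return current_version + 1


table = EntityTable()
# Two entities in the same table with different properties and types.
table.insert("customer-1", {"name": "Ana", "age": 31})
table.insert("customer-2", {"name": "Ben", "tags": ["vip"], "balance": 10.5})

# Two writers read the same version, then both try to update.
version_a, props_a = table.read("customer-1")
version_b, props_b = table.read("customer-1")

props_a["age"] = 32
new_version = table.update("customer-1", props_a, expected_version=version_a)

props_b["name"] = "Anna"
try:
    table.update("customer-1", props_b, expected_version=version_b)
    second_writer_conflicted = False
except ConflictError:
    second_writer_conflicted = True
```

No locks are held between the read and the write. The second writer simply learns that its copy is stale and must re-read and retry, which is the essence of the optimistic approach.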

Columnar or Column-Oriented or Column-Store Databases

Unlike a row-store database system, which stores the data from all the columns of a row together, a column-oriented database stores the data from a single column together. You might be wondering how a different physical layout representation of the same data (storing the same data in a columnar format instead of the traditional row format) can improve flexibility and performance.

In a column-oriented database, the flexibility comes from the fact that adding a column is both easy and inexpensive, with columns applied on a row-by-row basis. Each row can have a different set of columns, making the table sparse. In addition, because the data from a single column is stored together, adjacent values are highly redundant, so the database achieves a greater degree of compression, improving the overall performance.

Column-oriented or column-store databases are ideal for site searches, blogs, content management systems, and counter analytics. Figure 1.13 shows the difference between a row store and a column store.

FIGURE 1.13 Row-store versus column-store storage structure.
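The compression argument is easier to see with a small example. The sketch below is an illustration only, with hypothetical data and function names, and it assumes every row has the same columns to keep the pivot simple. It rearranges rows into columns and applies run-length encoding, one simple compression scheme that benefits from similar values sitting next to each other.

```python
from itertools import groupby

# Row layout: all the values of one record are kept together.
rows = [
    {"id": 1, "country": "US", "status": "active"},
    {"id": 2, "country": "US", "status": "active"},
    {"id": 3, "country": "US", "status": "active"},
    {"id": 4, "country": "DE", "status": "active"},
    {"id": 5, "country": "DE", "status": "inactive"},
]


def to_columns(records):
    """Column layout: all the values of one column are kept together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse each run of repeated adjacent values into (value, count)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
encoded_country = run_length_encode(columns["country"])
encoded_status = run_length_encode(columns["status"])
```

Five country values shrink to two pairs. In the row layout the same values are interleaved with ids and statuses, so no such runs exist to exploit. A query that needs only the country column also reads just that list rather than every full record.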

Some RDBMS systems, such as SQL Server 2012 and onward, have begun to support storing data in a column-oriented structure. The following NoSQL databases also support


