Chapter 3. Big Data Adoption and Planning Considerations

oversight can turn a best-intentioned project into a science experiment that never delivers
promised results. It is against this backdrop that Chapter 3 addresses Big Data adoption
and planning considerations.
Given the nature of Big Data and its analytic power, there are many issues that need to be
considered and planned for in the beginning. For example, with the adoption of any new
technology, the means to secure it in a way that conforms to existing corporate standards
needs to be addressed. Tracking the provenance of a dataset from its procurement to its
utilization is often a new requirement for organizations. Managing the
privacy of constituents whose data is being handled or whose identity is revealed by
analytic processes must be planned for. Big Data even opens up additional opportunities to
consider moving beyond on-premises environments and into remotely provisioned,
scalable environments that are hosted in a cloud. In fact, all of the above considerations
require an organization to recognize and establish a set of distinct governance processes
and decision frameworks to ensure that responsible parties understand Big Data’s nature,
implications and management requirements.
Organizationally, the adoption of Big Data changes the approach to performing business
analytics. For this reason, a Big Data analytics lifecycle is introduced in this chapter. The
lifecycle begins with the establishment of a business case for the Big Data project and
ends with ensuring that the analytic results are deployed to the organization to generate
maximal value. There are a number of stages in between that organize the steps of
identifying, procuring, filtering, extracting, cleansing and aggregating data. This is all
required before the analysis even occurs. The execution of this lifecycle requires new
competencies to be developed or hired into the organization.
As demonstrated, there are many things to consider and account for when adopting Big
Data. This chapter explains the primary potential issues and considerations.

Organization Prerequisites
Big Data frameworks are not turn-key solutions. In order for data analysis and analytics to
offer value, enterprises need to have data management and Big Data governance
frameworks. Sound processes and sufficient skillsets for those who will be responsible for
implementing, customizing, populating and using Big Data solutions are also necessary.
Additionally, the quality of the data targeted for processing by Big Data solutions needs to
be assessed.
Outdated, invalid, or poorly identified data will result in low-quality input which,
regardless of how good the Big Data solution is, will continue to produce low-quality
results. The longevity of the Big Data environment also needs to be planned for. A
roadmap needs to be defined to ensure that any necessary expansion or augmentation of
the environment is planned out to stay in sync with the requirements of the enterprise.
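
The data quality assessment mentioned above can be illustrated with a minimal sketch. The record layout, the field names (customer_id, amount, updated) and the one-year staleness threshold are assumptions made for illustration and do not come from this chapter; a real assessment would apply the organization's own quality rules.

```python
from datetime import date, timedelta

# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"customer_id": "C1", "amount": 120.0, "updated": date(2024, 5, 1)},
    {"customer_id": None, "amount": 80.0, "updated": date(2024, 4, 12)},   # poorly identified
    {"customer_id": "C3", "amount": -15.0, "updated": date(2024, 3, 30)},  # invalid amount
    {"customer_id": "C4", "amount": 60.0, "updated": date(2019, 1, 7)},    # outdated
]


def assess_quality(records, as_of, max_age=timedelta(days=365)):
    """Count records that are outdated, invalid or poorly identified."""
    report = {"total": len(records), "outdated": 0, "invalid": 0,
              "unidentified": 0, "clean": 0}
    for rec in records:
        problems = []
        if as_of - rec["updated"] > max_age:
            problems.append("outdated")
        if rec["amount"] < 0:
            problems.append("invalid")
        if not rec["customer_id"]:
            problems.append("unidentified")
        for problem in problems:
            report[problem] += 1
        if not problems:
            report["clean"] += 1
    return report


report = assess_quality(RECORDS, as_of=date(2024, 6, 1))
print(report)
```

A report of this kind gives a rough sense of how much of a dataset is fit for processing before it is loaded into a Big Data solution.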

Data Procurement
The acquisition of Big Data solutions themselves can be economical, due to the
availability of open-source platforms and tools and opportunities to leverage commodity
hardware. However, a substantial budget may still be required to obtain external data. The
nature of the business may make external data very valuable. The greater the volume and
variety of data that can be supplied, the higher the chances are of finding hidden insights
from patterns.
External data sources include government data sources and commercial data markets.
Government-provided data, such as geo-spatial data, may be free. However, most
commercially relevant data will need to be purchased and may involve ongoing
subscription costs to ensure the delivery of updates to procured datasets.

Privacy
Performing analytics on datasets can reveal confidential information about organizations
or individuals. Even analyzing separate datasets that contain seemingly benign data can
reveal private information when the datasets are analyzed jointly. This can lead to
intentional or inadvertent breaches of privacy.
Addressing these privacy concerns requires an understanding of the nature of data being
accumulated and relevant data privacy regulations, as well as special techniques for data
tagging and anonymization. For example, telemetry data, such as a car’s GPS log or smart
meter data readings, collected over an extended period of time can reveal an individual’s
location and behavior, as shown in Figure 3.1.

Figure 3.1 Information gathered from running analytics on image files, relational data
and textual data is used to create John’s profile.
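
The risk of analyzing seemingly benign datasets jointly can be made concrete with a small sketch. Both datasets, their field names and the generalization rule (a three-character postcode prefix and the decade of birth) are hypothetical and chosen only for illustration; the sketch shows one simple anonymization idea, not a complete privacy technique.

```python
# Hypothetical datasets: neither one pairs a name with behavioral data.
smart_meter = [
    {"postcode": "90210", "birth_year": 1980, "away_hours": "09-17"},
    {"postcode": "90210", "birth_year": 1984, "away_hours": "22-06"},
]
loyalty_cards = [
    {"name": "John", "postcode": "90210", "birth_year": 1980},
    {"name": "Mary", "postcode": "90210", "birth_year": 1984},
]

KEYS = ["postcode", "birth_year"]


def link(left, right, keys):
    """Join two datasets on shared quasi-identifier fields."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            linked.append({**match, **row})
    return linked


def generalize(rows):
    """Coarsen quasi-identifiers: postcode prefix and decade of birth."""
    return [
        {**row,
         "postcode": row["postcode"][:3] + "**",
         "birth_year": row["birth_year"] // 10 * 10}
        for row in rows
    ]


# Joint analysis of the raw data ties each meter reading to exactly one name.
linked = link(smart_meter, loyalty_cards, KEYS)

# After generalization, every reading matches both people, so it is ambiguous.
ambiguous = link(generalize(smart_meter), generalize(loyalty_cards), KEYS)

print(len(linked), len(ambiguous))
```

In the raw join, each reading links to a single named person; after generalization each reading matches two candidates, which is the kind of effect anonymization aims for.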

Security
Some of the components of Big Data solutions lack the robustness of traditional enterprise
solution environments when it comes to access control and data security. Securing Big
Data involves ensuring that the data networks and repositories are sufficiently secured via
authentication and authorization mechanisms.
Big Data security further involves establishing data access levels for different categories
of users. For example, unlike traditional relational database management systems, NoSQL
databases generally do not provide robust built-in security mechanisms. They instead rely
on simple HTTP-based APIs where data is exchanged in plaintext, making the data prone
to network-based attacks, as shown in Figure 3.2.

Figure 3.2 NoSQL databases can be susceptible to network-based attacks.

Provenance
Provenance refers to information about the source of the data and how it has been
processed. Provenance information helps determine the authenticity and quality of data,
and it can be used for auditing purposes. Maintaining provenance as large volumes of data
are acquired, combined and put through multiple processing stages can be a complex task.
At different stages in the analytics lifecycle, data will be in different states because
it may be in transit, being processed or in storage. These states correspond to the notions
of data-in-motion, data-in-use and data-at-rest. Importantly, whenever Big Data changes
state, it should trigger the capture of provenance information that is recorded as metadata.
As data enters the analytic environment, its provenance record can be initialized with the
recording of information that captures the pedigree of the data. Ultimately, the goal of
capturing provenance is to be able to reason over the generated analytic results with the
knowledge of the origin of the data and what steps or algorithms were used to process the
data that led to the result. Provenance information is essential to being able to realize the
value of the analytic result. As in scientific research, results that cannot be justified and
repeated lack credibility. When provenance information is captured on the way to
generating analytic results as in Figure 3.3, the results can be more easily trusted and
thereby used with confidence.

Figure 3.3 Data may also need to be annotated with source dataset attributes and
processing step details as it passes through the data transformation steps.
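
One way to picture provenance captured as metadata on every state change is the following minimal sketch. The class name, its fields and the example source and steps are hypothetical; only the three state names come from the text above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The three data states named in the text.
STATES = ("data-in-motion", "data-in-use", "data-at-rest")


@dataclass
class ProvenanceRecord:
    """Metadata describing where a dataset came from and how it was handled."""
    source: str
    events: list = field(default_factory=list)

    def record(self, state, step, details=""):
        """Append one provenance entry whenever the data changes state."""
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        self.events.append({
            "state": state,
            "step": step,
            "details": details,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def steps(self):
        """Return the processing steps in the order they were applied."""
        return [event["step"] for event in self.events]


# Initialize the record as data enters the environment, then log each change.
prov = ProvenanceRecord(source="hypothetical-sensor-feed")
prov.record("data-in-motion", "acquire", "pulled from external feed")
prov.record("data-in-use", "cleanse", "dropped rows with missing readings")
prov.record("data-at-rest", "store", "written to analytics store")

print(prov.source, prov.steps())
```

With a record like this attached to a dataset, an analytic result can be traced back to its origin and to the ordered steps that produced it.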

Limited Realtime Support
Dashboards and other applications that require streaming data and alerts often demand
realtime or near-realtime data transmissions. Many open source Big Data solutions and
tools are batch-oriented; however, there is a new generation of realtime-capable open
source tools that have support for streaming data analysis. Many of the realtime data
analysis solutions that do exist are proprietary. Approaches that achieve near-realtime
results often process transactional data as it arrives and combine it with previously
summarized batch-processed data.
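
The near-realtime approach described above, combining arriving transactions with a previously computed batch summary, can be sketched as follows. The class, the region names and the amounts are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical totals produced by an earlier batch job (sales per region).
batch_summary = {"north": 1200, "south": 950}


class NearRealtimeView:
    """Serve totals by merging a batch summary with recent increments."""

    def __init__(self, batch_totals):
        self.batch_totals = dict(batch_totals)
        self.recent = {}

    def ingest(self, region, amount):
        """Process one transaction as it arrives."""
        self.recent[region] = self.recent.get(region, 0) + amount

    def totals(self):
        """Combine the batch summary with transactions seen since then."""
        combined = dict(self.batch_totals)
        for region, amount in self.recent.items():
            combined[region] = combined.get(region, 0) + amount
        return combined

    def absorb_batch(self, new_batch_totals):
        """Swap in a fresh batch summary and discard the old increments."""
        self.batch_totals = dict(new_batch_totals)
        self.recent = {}


view = NearRealtimeView(batch_summary)
view.ingest("north", 50)
view.ingest("east", 20)
print(view.totals())
```

The expensive aggregation stays in the batch job, while the view only has to add the small number of transactions that arrived after the last batch run.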

Distinct Performance Challenges
Due to the volumes of data that some Big Data solutions are required to process,
performance is often a concern. For example, large datasets coupled with complex search
algorithms can lead to long query times. Another performance challenge is related to
network bandwidth. With increasing data volumes, the time to transfer a unit of data can
exceed its actual data processing time, as shown in Figure 3.4.

Figure 3.4 Transferring 1 PB of data via a 1-Gigabit LAN connection at 80%
throughput will take approximately 2,750 hours.
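
The arithmetic behind the figure can be reproduced with a short calculation. The sketch assumes decimal units (1 PB = 10^15 bytes, 1 Gigabit per second = 10^9 bits per second) and reads "80% throughput" as 80% of the nominal link rate being usable; under those assumptions the result is about 2,778 hours, in line with the approximate value in the caption.

```python
def transfer_hours(size_bytes, link_bits_per_second, efficiency):
    """Hours needed to move size_bytes over a link at the given efficiency."""
    effective_rate = link_bits_per_second * efficiency  # usable bits per second
    seconds = size_bytes * 8 / effective_rate
    return seconds / 3600


PETABYTE = 10**15   # bytes, decimal definition (assumption)
GIGABIT = 10**9     # bits per second, decimal definition (assumption)

hours = transfer_hours(PETABYTE, GIGABIT, 0.80)
print(f"{hours:,.0f} hours (about {hours / 24:.0f} days)")
```

At roughly 116 days, the transfer alone can easily take longer than processing the data once it has arrived.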

Distinct Governance Requirements
Big Data solutions access data and generate data, all of which become assets of the
business. A governance framework is required to ensure that the data and the solution
environment itself are regulated, standardized and evolved in a controlled manner.
Examples of what a Big Data governance framework can encompass include:
• standardization of how data is tagged and the metadata used for tagging
• policies that regulate the kind of external data that may be acquired
• policies regarding the management of data privacy and data anonymization
• policies for the archiving of data sources and analysis results
• policies that establish guidelines for data cleansing and filtering

Distinct Methodology
A methodology will be required to control how data flows into and out of Big Data
solutions. It will need to consider how feedback loops can be established to enable the
processed data to undergo repeated refinement, as shown in Figure 3.5. For example, an
iterative approach may be used to enable business personnel to provide IT personnel with
feedback on a periodic basis. Each feedback cycle provides opportunities for system
refinement by modifying data preparation or data analysis steps.

Figure 3.5 Each repetition can help fine-tune processing steps, algorithms and data
models to improve the accuracy of results and deliver greater value to the business.

Clouds
As mentioned in Chapter 2, clouds provide remote environments that can host IT
infrastructure for large-scale storage and processing, among other things. Regardless of
whether an organization is already cloud-enabled, the adoption of a Big Data environment
may necessitate that some or all of that environment be hosted within a cloud. For
example, an enterprise that runs its CRM system in a cloud decides to add a Big Data
solution in the same cloud environment in order to run analytics on its CRM data. This
data can then be shared with its primary Big Data environment that resides within the
enterprise boundaries.
Common justifications for incorporating a cloud environment in support of a Big Data
solution include:
• inadequate in-house hardware resources
• upfront capital investment for system procurement is not available
• the project is to be isolated from the rest of the business so that existing business
processes are not impacted
• the Big Data initiative is a proof of concept
• datasets that need to be processed are already cloud resident
• the limits of available computing and storage resources used by an in-house Big
Data solution are being reached

Big Data Analytics Lifecycle
Big Data analysis differs from traditional data analysis primarily due to the volume,
velocity and variety characteristics of the data being processed. To address the distinct
requirements for performing analysis on Big Data, a step-by-step methodology is needed
to organize the activities and tasks involved with acquiring, processing, analyzing and
repurposing data. The upcoming sections explore a specific data analytics lifecycle that
organizes and manages the tasks and activities associated with the analysis of Big Data.
From a Big Data adoption and planning perspective, it is important that, in addition to the
lifecycle, consideration be given to issues of training, education, tooling and staffing of a
data analytics team.
The Big Data analytics lifecycle can be divided into the following nine stages, as shown in
Figure 3.6:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results

Figure 3.6 The nine stages of the Big Data analytics lifecycle.
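
For planning or tracking purposes, the nine stages can be represented as an ordered enumeration. This is only a sketch: the identifier names and the next_stage helper are hypothetical, and it assumes the stages are worked through strictly in the listed order, which is a simplification.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The nine lifecycle stages, numbered as in the list above."""
    BUSINESS_CASE_EVALUATION = 1
    DATA_IDENTIFICATION = 2
    DATA_ACQUISITION_AND_FILTERING = 3
    DATA_EXTRACTION = 4
    DATA_VALIDATION_AND_CLEANSING = 5
    DATA_AGGREGATION_AND_REPRESENTATION = 6
    DATA_ANALYSIS = 7
    DATA_VISUALIZATION = 8
    UTILIZATION_OF_ANALYSIS_RESULTS = 9


def next_stage(stage):
    """Return the stage that follows, or None after the final stage."""
    if stage < Stage.UTILIZATION_OF_ANALYSIS_RESULTS:
        return Stage(stage + 1)
    return None


for stage in Stage:
    print(stage.value, stage.name.replace("_", " ").title())
```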

Business Case Evaluation
Each Big Data analytics lifecycle must begin with a well-defined business case that
presents a clear understanding of the justification, motivation and goals of carrying out the
analysis. The Business Case Evaluation stage shown in Figure 3.7 requires that a business
case be created, assessed and approved prior to proceeding with the actual hands-on
analysis tasks.

Figure 3.7 Stage 1 of the Big Data analytics lifecycle.

An evaluation of a Big Data analytics business case helps decision-makers understand the
business resources that will need to be utilized and which business challenges the analysis
will tackle. The further identification of KPIs during this stage can help determine
assessment criteria and guidance for the evaluation of the analytic results. If KPIs are not
readily available, efforts should be made to make the goals of the analysis project
SMART, which stands for specific, measurable, attainable, relevant and timely.
Based on business requirements that are documented in the business case, it can be
determined whether the business problems being addressed are really Big Data problems.
In order to qualify as a Big Data problem, a business problem needs to be directly related
to one or more of the Big Data characteristics of volume, velocity, or variety.
Note also that another outcome of this stage is the determination of the underlying budget
required to carry out the analysis project. Any required purchase, such as tools, hardware
and training, must be understood in advance so that the anticipated investment can be
weighed against the expected benefits of achieving the goals. Initial iterations of the Big
Data analytics lifecycle will require more up-front investment in Big Data technologies,
products and training compared to later iterations where these earlier investments can be
repeatedly leveraged.

Data Identification
The Data Identification stage shown in Figure 3.8 is dedicated to identifying the datasets
required for the analysis project and their sources.