Tải bản đầy đủ
Chapter 8. Big Data Analysis Techniques

Chapter 8. Big Data Analysis Techniques

Tải bản đầy đủ

In2003,WilliamAgrestirecognizedtheshifttowardcomputational
approachesandarguedforthecreationofanewcomputationaldiscipline
namedDiscoveryInformatics.Agresti’sviewofthisfieldwasonethat
embracedcomposition.Inotherwords,hebelievedthatdiscoveryinformatics
wasasynthesisofthefollowingfields:patternrecognition(datamining);
artificialintelligence(machinelearning);documentandtextprocessing
(semanticprocessing);databasemanagementandinformationstorageand
retrieval.Agresti’sinsightintotheimportanceandbreadthofcomputational
approachestodataanalysiswasforward-thinkingatthetime,andhis
perspectiveonthematterhasonlybeenreinforcedbythepassageoftimeand
theemergenceofdatascienceasadiscipline.
InanyfastmovingfieldlikeBigData,therearealwaysopportunitiesforinnovation.An
exampleofthisisthequestionofhowtobestblendstatisticalandcomputational
approachesforagivenanalyticalproblem.Statisticaltechniquesarecommonlypreferred
forexploratorydataanalysis,afterwhichcomputationaltechniquesthatleveragethe
insightgleanedfromthestatisticalstudyofadatasetcanbeapplied.Theshiftfrombatch
torealtimepresentsotherchallengesasrealtimetechniquesneedtoleverage
computationally-efficientalgorithms.
Onechallengeconcernsthebestwayofbalancingtheaccuracyofananalyticresult
againsttherun-timeofthealgorithm.Inmanycases,anapproximationmaybesufficient
andaffordable.Fromastorageperspective,multi-tieredstoragesolutionswhichleverage
RAM,solid-statedrivesandhard-diskdriveswillprovidenear-termflexibilityand
realtimeanalyticcapabilitywithlong-term,cost-effectivepersistentstorage.Inthelong
run,anorganizationwilloperateitsBigDataanalysisengineattwospeeds:processing
streamingdataasitarrivesandperformingbatchanalysisofthisdataasitaccumulatesto
lookforpatternsandtrends.(Thesymbolusedtorepresentdataanalysisisshownin
Figure8.1.)

Figure8.1Thesymbolusedtorepresentdataanalysis.
Thischapterbeginswithdescriptionsofthefollowingbasictypesofdataanalysis:
•quantitativeanalysis
•qualitativeanalysis
•datamining
•statisticalanalysis

•machinelearning
•semanticanalysis
•visualanalysis

QuantitativeAnalysis
Quantitativeanalysisisadataanalysistechniquethatfocusesonquantifyingthepatterns
andcorrelationsfoundinthedata.Basedonstatisticalpractices,thistechniqueinvolves
analyzingalargenumberofobservationsfromadataset.Sincethesamplesizeislarge,
theresultscanbeappliedinageneralizedmannertotheentiredataset.Figure8.2depicts
thefactthatquantitativeanalysisproducesnumericalresults.

Figure8.2Theoutputofquantitativeanalysisisnumericalinnature.
Quantitativeanalysisresultsareabsoluteinnatureandcanthereforebeusedfornumerical
comparisons.Forexample,aquantitativeanalysisoficecreamsalesmaydiscoverthata5
degreeincreaseintemperatureincreasesicecreamsalesby15%.

QualitativeAnalysis
Qualitativeanalysisisadataanalysistechniquethatfocusesondescribingvariousdata
qualitiesusingwords.Itinvolvesanalyzingasmallersampleingreaterdepthcomparedto
quantitativedataanalysis.Theseanalysisresultscannotbegeneralizedtoanentiredataset
duetothesmallsamplesize.Theyalsocannotbemeasurednumericallyorusedfor
numericalcomparisons.Forexample,ananalysisoficecreamsalesmayrevealthatMay’s
salesfigureswerenotashighasJune’s.Theanalysisresultsstateonlythatthefigures
were“notashighas,”anddonotprovideanumericaldifference.Theoutputofqualitative
analysisisadescriptionoftherelationshipusingwordsasshowninFigure8.3.

Figure8.3Qualitativeresultsaredescriptiveinnatureandnotgeneralizabletothe
entiredataset.

DataMining
Datamining,alsoknownasdatadiscovery,isaspecializedformofdataanalysisthat
targetslargedatasets.InrelationtoBigDataanalysis,datamininggenerallyrefersto
automated,software-basedtechniquesthatsiftthroughmassivedatasetstoidentify
patternsandtrends.
Specifically,itinvolvesextractinghiddenorunknownpatternsinthedatawiththe
intentionofidentifyingpreviouslyunknownpatterns.Dataminingformsthebasisfor
predictiveanalyticsandbusinessintelligence(BI).Thesymbolusedtorepresentdata
miningisshowninFigure8.4.

Figure8.4Thesymbolusedtorepresentdatamining.

StatisticalAnalysis
Statisticalanalysisusesstatisticalmethodsbasedonmathematicalformulasasameansfor
analyzingdata.Statisticalanalysisismostoftenquantitative,butcanalsobequalitative.
Thistypeofanalysisiscommonlyusedtodescribedatasetsviasummarization,suchas
providingthemean,median,ormodeofstatisticsassociatedwiththedataset.Itcanalso
beusedtoinferpatternsandrelationshipswithinthedataset,suchasregressionand
correlation.
Thissectiondescribesthefollowingtypesofstatisticalanalysis:
•A/BTesting
•Correlation
•Regression

A/BTesting
A/Btesting,alsoknownassplitorbuckettesting,comparestwoversionsofanelementto
determinewhichversionissuperiorbasedonapre-definedmetric.Theelementcanbea
rangeofthings.Forexample,itcanbecontent,suchasaWebpage,oranofferfora
productorservice,suchasdealsonelectronicitems.Thecurrentversionoftheelementis
calledthecontrolversion,whereasthemodifiedversioniscalledthetreatment.Both
versionsaresubjectedtoanexperimentsimultaneously.Theobservationsarerecordedto
determinewhichversionismoresuccessful.
AlthoughA/Btestingcanbeimplementedinalmostanydomain,itismostoftenusedin
marketing.Generally,theobjectiveistogaugehumanbehaviorwiththegoalofincreasing
sales.Forexample,inordertodeterminethebestpossiblelayoutforanicecreamadon
CompanyA’sWebsite,twodifferentversionsoftheadareused.VersionAisanexisting
ad(thecontrol)whileVersionBhashaditslayoutslightlyaltered(thetreatment).Both

versionsarethensimultaneouslyshowntodifferentusers:
•VersionAtoGroupA
•VersionBtoGroupB
TheanalysisoftheresultsrevealsthatVersionBoftheadresultedinmoresalesas
comparedtoVersionA.
Inotherareassuchasthescientificdomains,theobjectivemaysimplybetoobserve
whichversionworksbetterinordertoimproveaprocessorproduct.Figure8.5provides
anexampleofA/Btestingontwodifferentemailversionssentsimultaneously.

Figure8.5Twodifferentemailversionsaresentoutsimultaneouslyaspartofa
marketingcampaigntoseewhichversionbringsinmoreprospectivecustomers.
Samplequestionscaninclude:
•Isthenewversionofadrugbetterthantheoldone?
•Docustomersrespondbettertoadvertisementsdeliveredbyemailorpostalmail?
•IsthenewlydesignedhomepageoftheWebsitegeneratingmoreusertraffic?

Correlation
Correlationisananalysistechniqueusedtodeterminewhethertwovariablesarerelatedto
eachother.Iftheyarefoundtoberelated,thenextstepistodeterminewhattheir
relationshipis.Forexample,thevalueofVariableAincreaseswheneverthevalueof
VariableBincreases.WemaybefurtherinterestedindiscoveringhowcloselyVariablesA
andBarerelated,whichmeanswemayalsowanttoanalyzetheextenttowhichVariable
BincreasesinrelationtoVariableA’sincrease.
Theuseofcorrelationhelpstodevelopanunderstandingofadatasetandfind
relationshipsthatcanassistinexplainingaphenomenon.Correlationistherefore
commonlyusedfordataminingwheretheidentificationofrelationshipsbetween
variablesinadatasetleadstothediscoveryofpatternsandanomalies.Thiscanrevealthe
natureofthedatasetorthecauseofaphenomenon.
Whentwovariablesareconsideredtobecorrelatedtheyarealignedbasedonalinear
relationship.Thismeansthatwhenonevariablechanges,theothervariablealsochanges
proportionallyandconstantly.
Correlationisexpressedasadecimalnumberbetween–1to+1,whichisknownasthe
correlationcoefficient.Thedegreeofrelationshipchangesfrombeingstrongtoweak

whenmovingfrom–1to0or+1to0.
Figure8.6showsacorrelationof+1,whichsuggeststhatthereisastrongpositive
relationshipbetweenthetwovariables.

Figure8.6Whenonevariableincreases,theotheralsoincreasesandviceversa.
Figure8.7showsacorrelationof0,whichsuggeststhatthereisnorelationshipatall
betweenthetwovariables.

Figure8.7Whenonevariableincreases,theothermaystaythesame,orincreaseor
decreasearbitrarily.
InFigure8.8,aslopeof–1suggeststhatthereisastrongnegativerelationshipbetween
thetwovariables.

Figure8.8Whenonevariableincreases,theotherdecreasesandviceversa.
Forexample,managersbelievethaticecreamstoresneedtostockmoreicecreamforhot
days,butdon’tknowhowmuchextratostock.Todetermineifarelationshipactually
existsbetweentemperatureandicecreamsales,theanalystsfirstapplycorrelationtothe
numberoficecreamssoldandtherecordedtemperaturereadings.Avalueof+0.75
suggeststhatthereexistsastrongrelationshipbetweenthetwo.Thisrelationshipindicates
thatastemperatureincreases,moreicecreamsaresold.
Furthersamplequestionsaddressedbycorrelationcaninclude:
•Doesdistancefromtheseaaffectthetemperatureofacity?
•Dostudentswhoperformwellatelementaryschoolperformequallywellathigh
school?
•Towhatextentisobesitylinkedwithovereating?

Regression
Theanalysistechniqueofregressionexploreshowadependentvariableisrelatedtoan
independentvariablewithinadataset.Asasamplescenario,regressioncouldhelp
determinethetypeofrelationshipthatexistsbetweentemperature,theindependent
variable,andcropyield,thedependentvariable.
Applyingthistechniquehelpsdeterminehowthevalueofthedependentvariablechanges
inrelationtochangesinthevalueoftheindependentvariable.Whentheindependent
variableincreases,forexample,doesthedependentvariablealsoincrease?Ifyes,isthe
increaseinalinearornon-linearproportion?
Forexample,inordertodeterminehowmuchextrastockeachicecreamstoreneedsto
have,theanalystsapplyregressionbyfeedinginthevaluesoftemperaturereadings.These
valuesarebasedontheweatherforecastasanindependentvariableandthenumberofice
creamssoldasthedependentvariable.Whattheanalystsdiscoveristhat15%of
additionalstockisrequiredforevery5-degreeincreaseintemperature.

Morethanoneindependentvariablecanbetestedatthesametime.However,insuch
cases,onlyoneindependentvariablemaychange,whileothersarekeptconstant.
Regressioncanhelpenableabetterunderstandingofwhataphenomenonisandwhyit
occurred.Itcanalsobeusedtomakepredictionsaboutthevaluesofthedependent
variable.
Linearregressionrepresentsaconstantrateofchange,asshowninFigure8.9.

Figure8.9Linearregression
Non-linearregressionrepresentsavariablerateofchange,asshowninFigure8.10.

Figure8.10Non-linearregression

Samplequestionscaninclude:
•Whatwillbethetemperatureofacitythatis250milesawayfromthesea?
•Whatwillbethegradesofastudentstudyingatahighschoolbasedontheir
primaryschoolgrades?
•Whatarethechancesthatapersonwillbeobesebasedontheamountoftheirfood
intake?
Regressionandcorrelationhaveanumberofimportantdifferences.Correlationdoesnot
implycausation.Thechangeinthevalueofonevariablemaynotberesponsibleforthe
changeinthevalueofthesecondvariable,althoughbothmaychangeatthesamerate.
Thiscanoccurduetoanunknownthirdvariable,knownastheconfoundingfactor.
Correlationassumesthatbothvariablesareindependent.
Regression,ontheotherhand,isapplicabletovariablesthathavepreviouslybeen
identifiedasdependentandindependentvariablesandimpliesthatthereisadegreeof
causationbetweenthevariables.Thecausationmaybedirectorindirect.
WithinBigData,correlationcanfirstbeappliedtodiscoverifarelationshipexists.
Regressioncanthenbeappliedtofurtherexploretherelationshipandpredictthevaluesof
thedependentvariable,basedontheknownvaluesoftheindependentvariable.

MachineLearning
Humansaregoodatspottingpatternsandrelationshipswithindata.Unfortunately,we
cannotprocesslargeamountsofdataveryquickly.Machines,ontheotherhand,arevery
adeptatprocessinglargeamountsofdataquickly,butonlyiftheyknowhow.
Ifhumanknowledgecanbecombinedwiththeprocessingspeedofmachines,machines
willbeabletoprocesslargeamountsofdatawithoutrequiringmuchhumanintervention.
Thisisthebasicconceptofmachinelearning.
Inthissection,machinelearninganditsrelationshiptodataminingareexploredthrough
coverageofthefollowingtypesofmachinelearningtechniques:
•Classification
•Clustering
•OutlierDetection
•Filtering

Classification(SupervisedMachineLearning)
Classificationisasupervisedlearningtechniquebywhichdataisclassifiedintorelevant,
previouslylearnedcategories.Itconsistsoftwosteps:
1.Thesystemisfedtrainingdatathatisalreadycategorizedorlabeled,sothatitcan
developanunderstandingofthedifferentcategories.
2.Thesystemisfedunknownbutsimilardataforclassificationandbasedonthe
understandingitdevelopedfromthetrainingdata,thealgorithmwillclassifythe

unlabeleddata.
Acommonapplicationofthistechniqueisforthefilteringofemailspam.Notethat
classificationcanbeperformedfortwoormorecategories.Inasimplifiedclassification
process,themachineisfedlabeleddataduringtrainingthatbuildsitsunderstandingofthe
classification,asshowninFigure8.11.Themachineisthenfedunlabeleddata,whichit
classifiesitself.

Figure8.11Machinelearningcanbeusedtoautomaticallyclassifydatasets.
Forexample,abankwantstofindoutwhichofitscustomersislikelytodefaultonloan
payments.Basedonhistoricdata,atrainingdatasetiscompiledthatcontainslabeled
examplesofcustomersthathaveorhavenotpreviouslydefaulted.Thistrainingdataisfed
toaclassificationalgorithmthatisusedtodevelopanunderstandingof“good”and“bad”
customers.Finally,newuntaggedcustomerdataisfedinordertofindoutwhetheragiven
customerbelongstothedefaultingcategory.
Samplequestionscaninclude:
•Shouldanapplicant’screditcardapplicationbeacceptedorrejectedbasedonother
acceptedorrejectedapplications?
•Isatomatoafruitoravegetablebasedontheknownexamplesoffruitand
vegetables?
•Dothemedicaltestresultsforthepatientindicateariskforaheartattack?

Clustering(UnsupervisedMachineLearning)
Clusteringisanunsupervisedlearningtechniquebywhichdataisdividedintodifferent
groupssothatthedataineachgrouphassimilarproperties.Thereisnopriorlearningof
categoriesrequired.Instead,categoriesareimplicitlygeneratedbasedonthedata
groupings.Howthedataisgroupeddependsonthetypeofalgorithmused.Each
algorithmusesadifferenttechniquetoidentifyclusters.
Clusteringisgenerallyusedindataminingtogetanunderstandingofthepropertiesofa
givendataset.Afterdevelopingthisunderstanding,classificationcanbeusedtomake

betterpredictionsaboutsimilarbutneworunseendata.
Clusteringcanbeappliedtothecategorizationofunknowndocumentsandtopersonalized
marketingcampaignsbygroupingtogethercustomerswithsimilarbehavior.Ascatter
graphprovidesavisualrepresentationofclustersinFigure8.12.

Figure8.12Ascattergraphsummarizestheresultsofclustering.
Forexample,abankwantstointroduceitsexistingcustomerstoarangeofnewfinancial
productsbasedonthecustomerprofilesithasonrecord.Theanalystscategorize
customersintomultiplegroupsusingclustering.Eachgroupisthenintroducedtooneor
morefinancialproductsmostsuitabletothecharacteristicsoftheoverallprofileofthe
group.
Samplequestionscaninclude:
•Howmanydifferentspeciesoftreesexistbasedonthesimilaritybetweentrees?
•Howmanygroupsofcustomersexistbaseduponsimilarpurchasehistory?
•Whatarethedifferentgroupsofvirusesbasedontheircharacteristics?

OutlierDetection
Outlierdetectionistheprocessoffindingdatathatissignificantlydifferentfromor
inconsistentwiththerestofthedatawithinagivendataset.Thismachinelearning
techniqueisusedtoidentifyanomalies,abnormalitiesanddeviationsthatcanbe
advantageous,suchasopportunities,orunfavorable,suchasrisks.
Outlierdetectioniscloselyrelatedtotheconceptofclassificationandclustering,although
itsalgorithmsfocusonfindingabnormalvalues.Itcanbebasedoneithersupervisedor
unsupervisedlearning.Applicationsforoutlierdetectionincludefrauddetection,medical
diagnosis,networkdataanalysisandsensordataanalysis.Ascattergraphvisually
highlightsdatapointsthatareoutliers,asshowninFigure8.13.

Figure8.13Ascattergraphhighlightsanoutlier.
Forexample,inordertofindoutwhetherornotatransactionislikelytobefraudulent,the
bank’sITteambuildsasystememployinganoutlierdetectiontechniquethatisbasedon
supervisedlearning.Asetofknownfraudulenttransactionsisfirstfedintotheoutlier
detectionalgorithm.Aftertrainingthesystem,unknowntransactionsarethenfedintothe
outlierdetectionalgorithmtopredictiftheyarefraudulentornot.
Samplequestionscaninclude:
•Isanathleteusingperformanceenhancingdrugs?
•Arethereanywronglyidentifiedfruitsandvegetablesinthetrainingdatasetused
foraclassificationtask?
•Isthereaparticularstrainofvirusthatdoesnotrespondtomedication?

Filtering
Filteringistheautomatedprocessoffindingrelevantitemsfromapoolofitems.Items
canbefilteredeitherbasedonauser’sownbehaviororbymatchingthebehaviorof
multipleusers.Filteringisgenerallyappliedviathefollowingtwoapproaches:
•collaborativefiltering
•content-basedfiltering
Acommonmediumbywhichfilteringisimplementedisviatheuseofarecommender
system.Collaborativefilteringisanitemfilteringtechniquebasedonthecollaboration,or
merging,ofauser’spastbehaviorwiththebehaviorsofothers.Atargetuser’spast
behavior,includingtheirlikes,ratings,purchasehistoryandmore,iscollaboratedwiththe
behaviorofsimilarusers.Basedonthesimilarityoftheusers’behavior,itemsarefiltered
forthetargetuser.
Collaborativefilteringissolelybasedonthesimilaritybetweenusers’behavior.Itrequires
alargeamountofuserbehaviordatainordertoaccuratelyfilteritems.Itisanexampleof