Hack 56. Predict the Outcome of a Baseball Game

During the first couple hours of a baseball game, turn on the [...] identify the team that is at bat. That team has a greater than 50 percent chance of winning that game.

Why It Works

Baseball is a game where the longer you are on offense, the more points you can score. As more batters come to bat in a single inning, the chances of moving runners along the base paths and across home plate increase. Another way to look at it is to imagine the end of an inning that was huge for one [...] considerably more than the minimum of three batters in that inning and, consequently, been at bat a proportionately longer length of time than the other team. Over the course of a game, the team that is at bat longest is more likely to score more (or have more productive innings).

Sampling theory [Hack #19] suggests that a sample is most likely to capture the most common elements of a population. Our population here is all the moments during a game that we could listen to. The most common characteristic in the population (in terms of who is at bat) belongs to the team that is at bat the most.
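The sampling argument can be sketched as a small simulation. This is a toy model, not the author's method: it assumes that every moment of at-bat time is equally likely to be the one you tune in to, and that the eventual winner owns a fixed share of that time. The function name `tune_in_accuracy` and the 0.58 share are illustrative assumptions.

```python
import random


def tune_in_accuracy(winner_share, games, seed=0):
    """Fraction of simulated games in which the team at bat at one
    randomly chosen moment turns out to be the eventual winner."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(games):
        # One uniformly random moment per game, expressed as a fraction
        # of total at-bat time. In this toy model the winner "owns" the
        # first winner_share of that time.
        moment = rng.random()
        if moment < winner_share:
            hits += 1
    return hits / games


if __name__ == "__main__":
    # Assumed share of 0.58, chosen only for illustration.
    print(tune_in_accuracy(0.58, 100_000))
```

Under these assumptions the long-run accuracy of the guess simply equals the winner's share of at-bat time, which is why any share above one half gives better than coin-flip odds.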

Figure 5-4 suggests a possible distribution of at-bat time for a regulation nine-inning game. In this example, the winning team was on offense for 58 percent of the time. In retrospect, a [...] finding the winning team at bat.

Figure 5-4. Time at bat for winning and losing teams

The accuracy of prediction should be above 50 percent over the [...] accurate. This is because the relationship between time at bat and scoring a victory is not a perfect correlation [Hack #11]. Players can score quickly (hit a home run on their first pitch, for example) or they can take their time getting many hits but strand many runners and never score.

Overall, the correlation between the two variables should be positive, however. Even the perhaps unimpressive 58 percent accuracy in my imagined data in Figure 5-4 means that you will be right 16 percent more often than a blind guess. With such an [...] a week.
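One reading of the 16 percent figure is that it is relative to the 50 percent baseline of a blind guess rather than an absolute difference. The short sketch below works through that arithmetic; the function name `relative_improvement` is an illustrative assumption.

```python
def relative_improvement(accuracy, baseline=0.50):
    """Improvement over the baseline, as a fraction of the baseline."""
    return (accuracy - baseline) / baseline


# 58 percent accuracy against a 50 percent blind guess.
gain = relative_improvement(0.58)
print(f"{gain:.0%}")  # relative gain over a blind guess
```

On this reading, the absolute gap is 8 percentage points (58 minus 50), and 8 divided by 50 gives the 16 percent relative gain.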

Proving It Works

To test the accuracy of my claim, you can use the data that appears in your daily newspaper. While most box scores do not [...] is a variable that provides almost the same information. There will almost certainly be a "total at-bats" reported. While this statistic is not the same as time spent at bat, it should correlate pretty highly. Each day, this information is provided for more [...] be enough to test my theory. Gather the total at-bats for each team, including which team won the game.

Real-life researchers often don't have access to the variable they would [...] next best thing available. Scientists call these substitutes proxy variables or surrogate variables.

My hypothesis is that the team with the most at-bats should win the game more than 50 percent of the time. Out of curiosity, I tested this hypothesis myself. I used the Chicago Cubs as an [...] I arbitrarily chose 2003 and the Cubs' first 25 games. An [...] situations where there were ties in at-bats, I could have predicted with 63 percent accuracy.

While the team with the fewest at-bats sometimes did win the Chicago Cubs games, the larger the discrepancy between at-bats, the more likely the team with the most at-bats was to win the game. When the most-at-bats teams won, they averaged 4.14 more at-bats than the loser. When the least-at-bats teams won, they averaged only 2.88 fewer at-bats than the loser.
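A tally like the one described here can be sketched in a few lines. The game list below is made up purely for illustration and is not the Cubs' 2003 box-score data; the names `GAMES` and `summarize` are likewise assumptions. The sketch drops games with tied at-bats, then reports how often the team with more at-bats won and the average at-bat margin in each case.

```python
from statistics import mean

# Hypothetical (winner's at-bats, loser's at-bats) pairs, invented for
# illustration only. These are NOT real box scores.
GAMES = [(38, 33), (35, 31), (34, 36), (40, 34), (33, 33), (36, 32), (31, 34)]


def summarize(games):
    """Score the 'most at-bats wins' rule on a list of games."""
    decided = [(w, l) for w, l in games if w != l]  # drop ties in at-bats
    right = [w - l for w, l in decided if w > l]    # rule predicted the winner
    wrong = [l - w for w, l in decided if w < l]    # rule predicted the loser
    return {
        "accuracy": len(right) / len(decided),
        "mean_margin_when_right": mean(right) if right else 0.0,
        "mean_margin_when_wrong": mean(wrong) if wrong else 0.0,
    }


if __name__ == "__main__":
    for key, value in summarize(GAMES).items():
        print(f"{key}: {value:.2f}")
```

With real box scores substituted for `GAMES`, the same three numbers correspond to the accuracy and the two average margins reported in the text.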

Other Places It Works

Some people have suggested that in the case of my team, the Kansas City Royals, if I want to be right more than half the time, I should always predict a loss. Yes, yes, very funny.

Where It Doesn't Work

The accuracy of this method should be low if you turn on the [...] during the first couple hours of the game. Under the rules of [...] inning, they never come to bat. They win. Game over. As home teams win more often than visiting teams, this means that often the winning team never comes to bat at all in the ninth inning.

This presents an interesting variation of this prediction method that applies only to the ninth inning. Turn on the game in the ninth inning; if your team is batting, things don't look so good.

The data presented for the Chicago Cubs that found the winning team occasionally having fewer at-bats than their opponent can be partly explained by the fact that the winning team sometimes bats in only eight innings.

[...] example, time of possession wouldn't be expected to positively correlate with points scored and, in the case of high-energy, fast-scoring teams, might even negatively correlate. In football, on the other hand, time of possession is considered a key indicator of quality performance and usually correlates with a win.

Hack 57. Plot Histograms in Excel

[...] can have a better understanding of statistics.

There is some truth to the cliché "a picture is worth a thousand words." A picture is often the best way to understand 1,000 numbers. People are visually oriented. We're good at looking at a picture and observing different characteristics; [...]

One of the most powerful tools available for understanding data is the histogram, a picture of the distribution of values. Here is the idea of a histogram. Suppose you have a lot of data: say, the batting averages for all 6,032 baseball players between 1955 and 2004 who averaged 3.1 or more plate appearances per game. Let's also assume you want to know how these values are distributed. What are the lowest and highest values? Are there more low values than high values? Were batting averages totally random numbers between 0 and .400, or was there some pattern?

Batting average can take many different values. Between 1955 [...] there were 1,229 unique values for batting average. You can plot the number of players with each unique batting average (though I can't imagine what this graph would look like). But [...] players with very similar batting averages, say, between .285 and .290.

Let's think of each range as a bucket. Every player-season goes [...] average, so we'll put that season in the .350-.355 bucket. So, here's our plan: we'll put each player-season into a bucket, count the number of player-seasons in each bucket, and draw a graph showing (in ascending order) the number of players in each bucket. This single diagram is a histogram.
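The bucket plan can be sketched outside Excel as well. The snippet below is a minimal illustration, not the procedure this hack goes on to use: it assumes buckets that are .005 wide (matching the .350-.355 example), works in whole thousandths to sidestep floating-point edge cases, and uses made-up hit and at-bat totals. The names `bucket_label`, `histogram`, and `SEASONS` are hypothetical.

```python
from collections import Counter

BUCKET_WIDTH = 5  # bucket width in thousandths, i.e. .005

# Hypothetical (hits, at-bats) totals for four player-seasons,
# invented for illustration only.
SEASONS = [(150, 520), (148, 515), (200, 565), (160, 600)]


def bucket_label(hits, at_bats):
    """Return the bucket, such as '.285-.290', for one player-season."""
    # Truncate the exact average to whole thousandths, then floor to the
    # bucket boundary, so each bucket covers low <= average < low + .005.
    thousandths = (1000 * hits) // at_bats
    low = thousandths // BUCKET_WIDTH * BUCKET_WIDTH
    return f".{low:03d}-.{low + BUCKET_WIDTH:03d}"


def histogram(seasons):
    """Count player-seasons per bucket, in ascending bucket order."""
    counts = Counter(bucket_label(h, ab) for h, ab in seasons)
    return sorted(counts.items())


if __name__ == "__main__":
    # Crude text histogram: one '#' per player-season in the bucket.
    for label, count in histogram(SEASONS):
        print(f"{label} {'#' * count}")
```

The labels are zero-padded to three digits, so sorting them as strings also sorts them numerically for averages below 1.000.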

The Code

In this example, I wanted to look at the distribution of batting average. I used a table containing the total batting statistics for each player in each year (and the list of all teams for which each player played), and I called the table b_and_t. I selected only batters with enough plate appearances to qualify for a league title, and only those players who played between 1955 and 2004:

SELECT b.playerID, M.nameLast, M.nameFirst, b.yearID, b.teamG,
       b.teamIDs, b.AB, b.H,
       b.H / b.AB AS AVG,                   -- batting average
       b.AB + b.BB + b.HBP + b.SF AS PA     -- plate appearances
FROM b_and_t b INNER JOIN Master M
     ON b.playerID = M.playerID
WHERE b.yearID > 1954
  AND b.AB + b.BB + b.HBP + b.SF > b.teamG * 3.1;  -- enough PA to qualify

After running this query, I saved the results to an Excel file named batting_averages.xls.
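For readers who want to check a single row by hand, the query's two computed columns and its qualifying filter can be mirrored in a few lines. This is a sketch with hypothetical function names and made-up season totals. It copies the query's strict "greater than" comparison, although the surrounding prose speaks of 3.1 "or more" plate appearances per game.

```python
def plate_appearances(ab, bb, hbp, sf):
    """PA as the query computes it: AB + BB + HBP + SF."""
    return ab + bb + hbp + sf


def batting_average(hits, ab):
    """AVG as the query computes it: H / AB."""
    return hits / ab


def qualifies(ab, bb, hbp, sf, team_games, per_game=3.1):
    """True if the season passes the query's plate-appearance filter."""
    return plate_appearances(ab, bb, hbp, sf) > team_games * per_game


if __name__ == "__main__":
    # Made-up season totals over an assumed 162-game team schedule.
    print(qualifies(ab=500, bb=50, hbp=3, sf=5, team_games=162))
    print(qualifies(ab=400, bb=40, hbp=2, sf=3, team_games=162))
    print(f"{batting_average(150, 500):.3f}")
```

The first made-up season has 558 plate appearances and the second has 445, against a threshold of 502.2 for 162 team games.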

One way to draw histograms in Excel is to use the Analysis