Chapter 1. A Regular Expression Matcher

Brian Kernighan

Regular expressions are notations for describing patterns of text and, in effect, make up a special-purpose language for pattern matching. Although there are myriad variants, all share the idea that most characters in a pattern match literal occurrences of themselves, but some metacharacters have special meaning, such as * to indicate some kind of repetition or […] to mean any one character from the set within the brackets.

In practice, most searches in programs such as text editors are for literal words, so the regular expressions are often literal strings like print, which will match printf or sprint or printer paper anywhere. In so-called wildcards used to specify filenames in Unix and Windows, a * matches any number of characters, so the pattern *.c matches all filenames that end in .c. There are many, many variants of regular expressions, even in contexts where one would expect them to be the same. Jeffrey Friedl's Mastering Regular Expressions (O'Reilly) is an exhaustive study of the topic.

Stephen Kleene invented regular expressions in the mid-1950s as a notation for finite automata; in fact, they are equivalent to finite automata in what they represent. They first appeared in a program setting in Ken Thompson's version of the QED text editor in the mid-1960s. In 1967, Thompson applied for a patent on a mechanism for rapid text matching based on regular expressions. The patent was granted in 1971, one of the very first software patents [U.S. Patent 3,568,156, Text Matching Algorithm, March 2, 1971].

Regular expressions moved from QED to the Unix editor ed, and then to the quintessential Unix tool grep, which Thompson created by performing radical surgery on ed. These widely used programs helped regular expressions become familiar throughout the early Unix community.

Thompson's original matcher was very fast because it combined two independent ideas. One was to generate machine instructions on the fly during matching so that it ran at machine speed rather than by interpretation. The other was to carry forward all possible matches at each stage, so it did not have to backtrack to look for alternative potential matches. In later text editors that Thompson wrote, such as ed, the matching code used a simpler algorithm that backtracked when necessary. In theory, this is slower, but the patterns found in practice rarely involved backtracking, so the ed and grep algorithm and code were good enough for most purposes.
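
The second idea, carrying forward all possible matches, can be made concrete with a small example. What follows is a minimal sketch and not Thompson's code: it interprets the pattern instead of generating machine instructions, it handles only literal characters, . and *, and the names nfa_match, add_state, and MAXPAT are invented for this illustration. It keeps the set of pattern positions that could still lead to a match, updates that set once per text character, and never backs up in the text.

```c
#include <string.h>

#define MAXPAT 255  /* assumed upper bound on pattern length for this sketch */

/* add_state: mark pattern position i as live; a starred element may be
   skipped, so the position just past its '*' becomes live as well */
static void add_state(const char *pat, char *set, size_t i)
{
    while (!set[i]) {
        set[i] = 1;
        if (pat[i] == '\0' || pat[i+1] != '*')
            break;
        i += 2;     /* zero occurrences of pat[i] */
    }
}

/* nfa_match: return 1 if pat matches anywhere in text, 0 otherwise */
int nfa_match(const char *pat, const char *text)
{
    char cur[MAXPAT+1], next[MAXPAT+1];
    size_t i, n = strlen(pat);

    if (n > MAXPAT)
        return 0;   /* pattern too long for the fixed-size sets */
    memset(cur, 0, sizeof cur);
    for (;;) {
        add_state(pat, cur, 0);     /* a match may start at this character */
        if (cur[n])
            return 1;               /* some path reached the end of pat */
        if (*text == '\0')
            return 0;
        memset(next, 0, sizeof next);
        for (i = 0; i < n; i++) {
            if (!cur[i] || (pat[i] != '.' && pat[i] != *text))
                continue;
            /* a starred element may repeat; a plain one advances */
            add_state(pat, next, pat[i+1] == '*' ? i : i + 1);
        }
        memcpy(cur, next, sizeof cur);
        text++;
    }
}
```

Each text character is examined once, and the work per character is bounded by the pattern length, which is the property the paragraph above attributes to the non-backtracking approach. A call such as nfa_match("ab*c", "xxabbbcyy") reports a match.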

Subsequent regular expression matchers like egrep and fgrep added richer classes of regular expressions, and focused on fast execution no matter what the pattern. Ever-fancier regular expressions became popular and were included not only in C-based libraries, but also as part of the syntax of scripting languages such as Awk and Perl.



1.1. The Practice of Programming

In 1998, Rob Pike and I were writing The Practice of Programming (Addison-Wesley). The last chapter of the book, "Notation," collected a number of examples where good notation led to better programs and better programming. This included the use of simple data specifications (printf, for instance), and the generation of code from tables.

Because of our Unix backgrounds and nearly 30 years of experience with tools based on regular expression notation, we naturally wanted to include a discussion of regular expressions, and it seemed mandatory to include an implementation as well. Given our emphasis on tools, it also seemed best to focus on the class of regular expressions found in grep (rather than, say, those from shell wildcards), since we could also then talk about the design of grep itself.

The problem was that any existing regular expression package was far too big. The local grep was over 500 lines long (about 10 book pages) and encrusted with barnacles. Open source regular expression packages tended to be huge (roughly the size of the entire book) because they were engineered for generality, flexibility, and speed; none were remotely suitable for pedagogy.

I suggested to Rob that we find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and nontrivial class of patterns. Ideally, the code would fit on a single page.

Rob disappeared into his office. As I remember it now, he emerged in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of The Practice of Programming. That code implements a regular expression matcher that handles the following constructs.

Character     Meaning
c             Matches any literal character c.
. (period)    Matches any single character.
^             Matches the beginning of the input string.
$             Matches the end of the input string.
*             Matches zero or more occurrences of the previous character.



This is quite a useful class; in my own experience of using regular expressions on a day-to-day basis, it easily accounts for 95 percent of all instances. In many situations, solving the right problem is a big step toward creating a beautiful program. Rob deserves great credit for choosing a very small yet important, well-defined, and extensible set of features from among a wide set of options.

Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of good notation in making a program easier to use (and perhaps easier to write as well), the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics.








1.2. Implementation

In The Practice of Programming, the regular expression matcher is part of a standalone program that mimics grep, but the regular expression code is completely separable from its surroundings. The main program is not interesting here; like many Unix tools, it reads either its standard input or a sequence of files, and prints those lines that contain a match of the regular expression.
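
For readers who want something to run the matcher against, the line-by-line loop of such a program might look roughly like the following. This is a sketch under stated assumptions, not the book's main program: the names grep, matcher, and literal are chosen for this illustration, and literal is a plain substring search that stands in for the chapter's match function so that the fragment compiles on its own.

```c
#include <stdio.h>
#include <string.h>

/* matcher: any function with the same shape as the chapter's match */
typedef int (*matcher)(char *regexp, char *text);

/* literal: stand-in matcher (assumption); plain substring search */
int literal(char *regexp, char *text)
{
    return strstr(text, regexp) != NULL;
}

/* grep: print the lines of f that m accepts; return how many matched.
   Lines longer than the buffer are processed in pieces. */
int grep(matcher m, char *regexp, FILE *f, const char *name)
{
    char buf[BUFSIZ];
    int nmatch = 0;
    size_t n;

    while (fgets(buf, sizeof buf, f) != NULL) {
        n = strlen(buf);
        if (n > 0 && buf[n-1] == '\n')
            buf[n-1] = '\0';    /* drop the newline so $ can see end of line */
        if (m(regexp, buf)) {
            nmatch++;
            if (name != NULL)
                printf("%s:", name);    /* prefix used when several files are read */
            printf("%s\n", buf);
        }
    }
    return nmatch;
}
```

A main function would call grep on stdin when no filenames are given, or on each file it opens with fopen otherwise, passing match in place of literal.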

This is the matching code:

/* forward declarations so that match can be compiled first */
int matchhere(char *regexp, char *text);
int matchstar(int c, char *regexp, char *text);

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
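
One way to get a feel for the five constructs is to run the matcher over a handful of patterns. The fragment below is a sketch for that purpose: it restates the three functions so that it compiles on its own, with const-qualified parameters and static helpers (both are changes made here, not part of the code above), and adds a small table of checks. The table, the cases array, and check_cases are illustrative names, not part of the original program.

```c
#include <stdio.h>

static int matchhere(const char *regexp, const char *text);
static int matchstar(int c, const char *regexp, const char *text);

/* match: search for regexp anywhere in text */
int match(const char *regexp, const char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
static int matchhere(const char *regexp, const char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
static int matchstar(int c, const char *regexp, const char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* one or two checks per construct; want is the expected result */
static const struct { const char *re, *text; int want; } cases[] = {
    { "print",  "sprintf", 1 },  /* literal characters, found anywhere */
    { "p.int",  "paint",   1 },  /* . matches any single character */
    { "^print", "sprintf", 0 },  /* ^ anchors at the beginning */
    { "print$", "sprint",  1 },  /* $ anchors at the end */
    { "ab*c",   "ac",      1 },  /* * allows zero occurrences */
    { "ab*c",   "abbbc",   1 },  /* and also several */
    { "^a.*z$", "a to y",  0 },  /* anchors combined with .* */
};

/* check_cases: return the number of cases whose result differs from want */
int check_cases(void)
{
    size_t i;
    int bad = 0;

    for (i = 0; i < sizeof cases / sizeof cases[0]; i++)
        if (match(cases[i].re, cases[i].text) != cases[i].want) {
            printf("unexpected: %s on %s\n", cases[i].re, cases[i].text);
            bad++;
        }
    return bad;
}
```

Calling check_cases from a test program and comparing its result with zero gives a quick regression check; new rows can be added to the table as the matcher is extended.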


























