Chapter 11. Pattern Matching with Regular Expressions


11.1. Defining Regular Expressions

In JavaScript, regular expressions are represented by RegExp objects. RegExp objects may be created with the RegExp() constructor, of course, but they are more often created using a special literal syntax. Just as string literals are specified as characters within quotation marks, regular expression literals are specified as characters within a pair of slash (/) characters. Thus, your JavaScript code may contain lines like this:

var pattern = /s$/;



This line creates a new RegExp object and assigns it to the variable pattern. This particular RegExp object matches any string that ends with the letter "s". (I'll get into the grammar for defining patterns shortly.) This regular expression could have equivalently been defined with the RegExp() constructor like this:

var pattern = new RegExp("s$");
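As a minimal sketch of this equivalence, the two forms can be compared side by side. The test() method used here is not introduced in this section, and the sample words are invented for illustration.

```javascript
// Two ways to create the same pattern: a literal and the RegExp() constructor.
var literal = /s$/;
var constructed = new RegExp("s$");

// Sample words chosen for illustration only.
var words = ["cats", "cat", "s", "sun"];
var literalResults = words.map(function (w) { return literal.test(w); });
var constructedResults = words.map(function (w) { return constructed.test(w); });

console.log(literalResults.join());     // true,false,true,false
console.log(constructedResults.join()); // true,false,true,false
console.log(literal.source === constructed.source); // true
```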



Creating a RegExp object, either literally or with the RegExp() constructor, is the easy part. The more difficult task is describing the desired pattern of characters using regular expression syntax. JavaScript adopts a fairly complete subset of the regular-expression syntax used by Perl, so if you are an experienced Perl programmer, you already know how to describe patterns in JavaScript.

Regular-expression pattern specifications consist of a series of characters. Most characters, including all alphanumeric characters, simply describe characters to be matched literally. Thus, the regular expression /java/ matches any string that contains the substring "java". Other characters in regular expressions are not matched literally but have special significance. For example, the regular expression /s$/ contains two characters. The first, "s", matches itself literally. The second, "$", is a special metacharacter that matches the end of a string. Thus, this regular expression matches any string that contains the letter "s" as its last character.
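The difference between a literal character and a metacharacter can be sketched as follows. The test() method and the sample strings are assumptions made for illustration; they do not come from the text above.

```javascript
var containsJava = /java/; // four literal characters
var endsWithS = /s$/;      // a literal "s" followed by the end-of-string metacharacter

var javaFound = containsJava.test("I write javascript"); // true: "java" appears inside
var javaMissing = containsJava.test("I write perl");     // false: no "java" substring
var sAtEnd = endsWithS.test("regular expressions");      // true: last character is "s"
var sNotAtEnd = endsWithS.test("sand");                  // false: "s" is present but not last

console.log(javaFound, javaMissing, sAtEnd, sNotAtEnd);  // true false true false
```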

The following sections describe the various characters and metacharacters used in JavaScript regular expressions. Note, however, that a complete tutorial on regular-expression grammar is beyond the scope of this book. For complete details of the syntax, consult a book on Perl, such as Programming Perl by Larry Wall et al. (O'Reilly). Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly) is another excellent source of information on regular expressions.



11.1.1. Literal Characters

As noted earlier, all alphabetic characters and digits match themselves literally in regular expressions. JavaScript regular-expression syntax also supports certain nonalphabetic characters through escape sequences that begin with a backslash (\). For example, the sequence \n matches a literal newline character in a string. Table 11-1 lists these characters.

Table 11-1. Regular-expression literal characters

Character               Matches
Alphanumeric character  Itself
\0                      The NUL character (\u0000)
\t                      Tab (\u0009)
\n                      Newline (\u000A)
\v                      Vertical tab (\u000B)
\f                      Form feed (\u000C)
\r                      Carriage return (\u000D)
\xnn                    The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\uxxxx                  The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX                     The control character ^X; for example, \cJ is equivalent to the newline character \n
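The equivalences listed in Table 11-1 can be checked with a short sketch. It assumes the test() method, which this section has not yet introduced, and uses string escapes such as "\n" to build the sample input.

```javascript
// Each pattern uses a different escape for a character from Table 11-1.
var hexMatchesNewline = /\x0A/.test("\n");       // \x0A is the same as \n
var controlMatchesNewline = /\cJ/.test("\n");    // \cJ is equivalent to \n
var unicodeMatchesTab = /\u0009/.test("\t");     // \u0009 is the same as \t
var nulMatches = /\0/.test("\u0000");            // \0 matches the NUL character
var verticalTabMatches = /\v/.test("\u000B");    // \v matches vertical tab

console.log(hexMatchesNewline, controlMatchesNewline, unicodeMatchesTab,
            nulMatches, verticalTabMatches);     // true true true true true
```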



A number of punctuation characters have special meanings in regular expressions. They are:

^ $ . * + ? = ! : | \ / ( ) [ ] { }



The meanings of these characters are discussed in the sections that follow. Some of these characters have special meaning only within certain contexts of a regular expression and are treated literally in other contexts. As a general rule, however, if you want to include any of these punctuation characters literally in a regular expression, you must precede them with a \. Other punctuation characters, such as quotation marks and @, do not have special meaning and simply match themselves literally in a regular expression.

If you can't remember exactly which punctuation characters need to be escaped with a backslash, you may safely place a backslash before any punctuation character. On the other hand, note that many letters and numbers have special meaning when preceded by a backslash, so any letters or numbers that you want to match literally should not be escaped with a backslash. To include a backslash character literally in a regular expression, you must escape it with a backslash, of course. For example, the following regular expression matches any string that includes a backslash: /\\/.
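A brief sketch of these escaping rules follows. The sample strings are invented, and test() is assumed from later in the chapter. Note that inside a JavaScript string literal a backslash must itself be doubled, so "C:\\temp" contains a single backslash.

```javascript
// Escaped punctuation is matched literally.
var plusEscaped = /1\+1/.test("1+1");     // true: \+ is a literal plus sign
var plusUnescaped = /1+1/.test("1+1");    // false: 1+ means "one or more 1s", so "11" is required
var dollarEscaped = /\$5/.test("costs $5"); // true: \$ is a literal dollar sign

// Punctuation without special meaning needs no escape.
var atSign = /@/.test("user@example.com"); // true

// Letters change meaning when escaped: \d is a digit class, not the letter d.
var escapedLetter = /\d/.test("d");        // false
var plainLetter = /d/.test("d");           // true

// A literal backslash is written as \\ in the pattern.
var hasBackslash = /\\/.test("C:\\temp");  // true
var noBackslash = /\\/.test("C:/temp");    // false
```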



11.1.2. Character Classes

Individual literal characters can be combined into character classes by placing them within square brackets. A character class matches any one character that is contained within it. Thus, the regular expression /[abc]/ matches any one of the letters a, b, or c. Negated character classes can also be defined; these match any character except those contained within the brackets. A negated character class is specified by placing a caret (^) as the first character inside the left bracket. The regexp /[^abc]/ matches any one character other than a, b, or c. Character classes can use a hyphen to indicate a range of characters. To match any one lowercase character from the Latin alphabet, use /[a-z]/ and to match any letter or digit from the Latin alphabet, use /[a-zA-Z0-9]/.

Because certain character classes are commonly used, the JavaScript regular-expression syntax includes special characters and escape sequences to represent these common classes. For example, \s matches the space character, the tab character, and any other Unicode whitespace character; \S matches any character that is not Unicode whitespace. Table 11-2 lists these characters and summarizes character-class syntax. (Note that several of these character-class escape sequences match only ASCII characters and have not been extended to work with Unicode characters. You can, however, explicitly define your own Unicode character classes; for example, /[\u0400-\u04FF]/ matches any one Cyrillic character.)
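The character classes described above can be exercised with a small sketch. The input strings are illustrative assumptions, non-ASCII test characters are written as \u escapes to avoid encoding ambiguity, and test() is assumed from later in the chapter.

```javascript
var anyOfAbc = /[abc]/.test("xbz");            // true: "b" is in the class
var noneOfAbc = /[^abc]/.test("abc");          // false: every character is a, b, or c
var somethingElse = /[^abc]/.test("abcd");     // true: "d" is outside the class
var lowercase = /[a-z]/.test("ABC");           // false: no lowercase Latin letter
var letterOrDigit = /[a-zA-Z0-9]/.test("---7"); // true: "7" is a digit

// \s is Unicode-aware; \w and \d are ASCII-only.
var nbspIsSpace = /\s/.test("\u00A0");         // true: no-break space is Unicode whitespace
var accentedIsWord = /\w/.test("\u00E9");      // false: e-acute is not an ASCII word character
var arabicDigit = /\d/.test("\u0663");         // false: Arabic-Indic digit three is not ASCII

// An explicit Unicode range, as in the text: any one Cyrillic character.
var cyrillic = /[\u0400-\u04FF]/.test("\u041F\u0440\u0438"); // true
```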

Table 11-2. Regular expression character classes

Character  Matches
[...]      Any one character between the brackets.
[^...]     Any one character not between the brackets.
.          Any character except newline or another Unicode line terminator.
\w         Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W         Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s         Any Unicode whitespace character.
\S         Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.
\d         Any ASCII digit. Equivalent to [0-9].
\D         Any character other than an ASCII digit. Equivalent to [^0-9].
[\b]       A literal backspace (special case).



Note that the special character-class escapes can be used within square brackets. \s matches any whitespace character, and \d matches any digit, so /[\s\d]/ matches any one whitespace character or digit. Note that there is one special case. As you'll see later, the \b escape has a special meaning. When used within a character class, however, it represents the backspace character. Thus, to represent a backspace character literally in a regular expression, use the character class with one element: /[\b]/.
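A short sketch of both points, assuming test() and invented sample strings. In the JavaScript string literal "a\bc", the \b escape is a backspace character (\u0008).

```javascript
// Class escapes combine inside square brackets.
var spaceOrDigit = /[\s\d]/;
var plainLetters = spaceOrDigit.test("abc"); // false: no whitespace, no digit
var withSpace = spaceOrDigit.test("a c");    // true: contains a space
var withDigit = spaceOrDigit.test("a1c");    // true: contains a digit

// Inside a character class, \b means the backspace character.
var backspace = /[\b]/;
var hasBackspace = backspace.test("a\bc");   // true: the string contains \u0008
var noBackspace = backspace.test("a c");     // false
```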



11.1.3. Repetition

With the regular expression syntax you've learned so far, you can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. But you don't have any way to describe, for example, a number that can have any number of digits or a string of three letters followed by an optional digit. These more complex patterns use regular-expression syntax that specifies how many times an element of a regular expression may be repeated.

The characters that specify repetition always follow the pattern to which they are being applied. Because certain types of repetition are quite commonly used, there are special characters to represent these cases. For example, + matches one or more occurrences of the previous pattern. Table 11-3 summarizes the repetition syntax.

Table 11-3. Regular expression repetition characters

Character  Meaning
{n,m}      Match the previous item at least n times but no more than m times.
{n,}       Match the previous item n or more times.
{n}        Match exactly n occurrences of the previous item.
?          Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.
+          Match one or more occurrences of the previous item. Equivalent to {1,}.
*          Match zero or more occurrences of the previous item. Equivalent to {0,}.



The following lines show some examples:



/\d{2,4}/      // Match between two and four digits
/\w{3}\d?/     // Match exactly three word characters and an optional digit
/\s+java\s+/   // Match "java" with one or more spaces before and after
/[^"]*/        // Match zero or more non-quote characters
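To see what each of these patterns actually matches, here is a hedged sketch using the String match() method, which is covered later in the chapter rather than in this section. The input strings are invented for illustration; element 0 of the returned array holds the matched text.

```javascript
// Between two and four digits: greedy, so four are taken from "12345".
var digits = "12345".match(/\d{2,4}/)[0];                  // "1234"

// Three word characters and an optional digit.
var wordAndDigit = "abc9 def".match(/\w{3}\d?/)[0];        // "abc9"

// "java" with one or more whitespace characters on each side.
var spacedJava = "I love  java  a lot".match(/\s+java\s+/)[0]; // "  java  "

// Zero or more non-quote characters: stops at the first double quote.
var beforeQuote = 'say "hi"'.match(/[^"]*/)[0];            // "say "

console.log([digits, wordAndDigit, spacedJava, beforeQuote]);
```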



Be careful when using the * and ? repetition characters. Since these characters may match zero instances of whatever precedes them, they are allowed to match nothing. For example, the regular expression /a*/ actually matches the string "bbbb" because the string contains zero occurrences of the letter a!
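This pitfall can be observed directly. The sketch below assumes the exec() method, introduced later in the chapter, which returns the matched text and its position.

```javascript
// /a*/ succeeds on "bbbb" by matching the empty string at position 0.
var emptyMatch = /a*/.exec("bbbb");
console.log(emptyMatch !== null);   // true: the match succeeds
console.log(emptyMatch[0] === "");  // true: but it matched nothing
console.log(emptyMatch.index);      // 0

// Requiring at least one "a" with + makes the match fail instead.
var plusMatch = /a+/.exec("bbbb");
console.log(plusMatch);             // null
```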



11.1.3.1. Nongreedy repetition

The repetition characters listed in Table 11-3 match as many times as possible while still allowing any following parts of the regular expression to match. We say that this repetition is "greedy." It is also possible (in JavaScript 1.5 and later; this is one of the Perl 5 features not implemented in JavaScript 1.2) to specify that repetition should be done in a nongreedy way. Simply follow the repetition character or characters with a question mark: ??, +?, *?, or even {1,5}?. For example, the regular expression /a+/ matches one or more occurrences of the letter a. When applied to the string "aaa", it matches all three letters. But /a+?/ matches one or more occurrences of the letter a, matching as few characters as necessary. When applied to the same string, this pattern matches only the first letter a.
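A minimal sketch of the greedy and nongreedy forms applied to the same string, assuming the match() method from later in the chapter. The digit example is an added illustration of the {1,5}? form.

```javascript
var greedy = "aaa".match(/a+/)[0];      // "aaa": as many as possible
var nongreedy = "aaa".match(/a+?/)[0];  // "a": as few as necessary

// The same idea with a counted repetition.
var greedyCount = "123456".match(/\d{1,5}/)[0];     // "12345"
var nongreedyCount = "123456".match(/\d{1,5}?/)[0]; // "1"

console.log(greedy, nongreedy, greedyCount, nongreedyCount);
```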



Using nongreedy repetition may not always produce the results you expect. Consider the pattern /a*b/, which matches zero or more letter a's, followed by the letter b. When applied to the string "aaab", it matches the entire string. Now let's use the nongreedy version: /a*?b/. This should match the letter b preceded by the fewest number of a's possible. When applied to the same string "aaab", you might expect it to match only the last letter b. In fact, however, this pattern matches the entire string as well, just like the greedy version of the pattern. This is because regular-expression pattern matching is done by finding the first position in the string at which a match is possible. The nongreedy version of our pattern does match at the first character of the string, so this match is returned; matches at subsequent characters are never even considered.
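The surprising result described above can be confirmed with a short sketch, assuming exec() from later in the chapter. Both versions start matching at index 0, which is why the nongreedy pattern cannot skip ahead to the final b.

```javascript
var greedyResult = /a*b/.exec("aaab");
var nongreedyResult = /a*?b/.exec("aaab");

console.log(greedyResult[0], greedyResult.index);       // aaab 0
console.log(nongreedyResult[0], nongreedyResult.index); // aaab 0
```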



11.1.4. Alternation, Grouping, and References

The regular-expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab" or the string "cd" or the string "ef". And /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are considered left to right until a match is found. If the left alternative matches, the right alternative is ignored, even if it would have produced a "better" match. Thus, when the pattern /a|ab/ is applied to the string "ab", it matches only the first letter.
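The left-to-right rule can be sketched as follows, assuming match() from later in the chapter. Reordering the alternatives (an illustration not taken from the text) changes which one wins.

```javascript
// The left alternative "a" matches first, so "ab" is never tried.
var shortFirst = "ab".match(/a|ab/)[0];  // "a"

// Putting the longer alternative first lets it match.
var longFirst = "ab".match(/ab|a/)[0];   // "ab"

// Three digits or four lowercase letters, whichever is found first in the string.
var digitsWin = "x123".match(/\d{3}|[a-z]{4}/)[0];    // "123"
var lettersWin = "abcd12".match(/\d{3}|[a-z]{4}/)[0]; // "abcd"

console.log(shortFirst, longFirst, digitsWin, lettersWin);
```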

Parentheses have several purposes in regular expressions. One purpose is to group separate items into a single subexpression so that the items can be treated as a single unit by |, *, +, ?, and so on. For example, /java(script)?/ matches "java" followed by the optional "script". And /(ab|cd)+|ef/ matches either the string "ef" or one or more repetitions of either of the strings "ab" or "cd".

Another purpose of parentheses in regular expressions is to define subpatterns within the complete pattern. When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string that matched any particular parenthesized subpattern. (You'll see how these matching substrings are obtained later in the chapter.) For example, suppose you are looking for one or more lowercase letters followed by one or more digits. You might use the pattern /[a-z]+\d+/. But suppose you only really care about the digits at the end of each match. If you put that part of the pattern in parentheses (/[a-z]+(\d+)/), you can extract the digits from any matches you find, as explained later.
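The two uses of parentheses described so far, grouping and capturing, can be sketched together. The exec() and match() methods are assumed from later in the chapter, and the input strings are invented for illustration.

```javascript
// Grouping: ? applies to the whole group "script".
var withScript = "javascript".match(/java(script)?/)[0];  // "javascript"
var withoutScript = "java beans".match(/java(script)?/)[0]; // "java"

// Grouping: + applies to the alternation (ab|cd).
var repeated = "abcdab!".match(/(ab|cd)+|ef/)[0]; // "abcdab"

// Capturing: element 1 of the result holds the text matched by (\d+).
var captured = /[a-z]+(\d+)/.exec("order abc123 shipped");
console.log(captured[0]); // "abc123": the whole match
console.log(captured[1]); // "123": just the digits
```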

A related use of parenthesized subexpressions is to allow you to refer back to a subexpression later in the same regular expression. This is done by following a \ character by a digit or digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers back to the first subexpression, and \3 refers to the third. Note that, because subexpressions can be nested within others, it is the position of the left parenthesis that is counted. In the following regular expression, for example, the nested subexpression ([Ss]cript) is referred to as \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/
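The numbering of nested groups can be observed through the array returned by exec(), a method assumed here from later in the chapter. Group n appears at index n of the result; the sample sentences are invented.

```javascript
var nested = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;

var full = nested.exec("JavaScript is fun");
console.log(full[1]); // "JavaScript": group 1, the outer parentheses
console.log(full[2]); // "Script": group 2, the nested parentheses
console.log(full[3]); // "fun": group 3

// When the optional nested group does not participate, its slot is undefined.
var partial = nested.exec("Java is funny");
console.log(partial[1]); // "Java"
console.log(partial[2]); // undefined
console.log(partial[3]); // "funny"
```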



A reference to a previous subexpression of a regular expression does not refer to the pattern for that subexpression but rather to the text that matched the pattern. Thus, references can be used to enforce a constraint that separate portions of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotes. However, it does not require the opening and closing quotes to match (i.e., both single quotes or both double quotes):

/['"][^'"]*['"]/



To require the quotes to match, use a reference:

/(['"])[^'"]*\1/



The \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quote match the opening quote. This regular expression does not allow single quotes within double-quoted strings or vice versa. A reference does not work within a character class, so you cannot write:

/(['"])[^\1]*\1/
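The difference between the two working quote patterns shown earlier can be sketched as follows. The names loose and strict are hypothetical labels for the pattern without and with the \1 reference, and test() is assumed from later in the chapter.

```javascript
var loose = /['"][^'"]*['"]/;   // any quote, text, any quote
var strict = /(['"])[^'"]*\1/;  // closing quote must equal the opening quote

var mismatched = "'mixed\"";    // opens with ' and closes with "

console.log(loose.test(mismatched));  // true: quotes need not agree
console.log(strict.test(mismatched)); // false: \1 requires the same quote
console.log(strict.test('"ok"'));     // true
console.log(strict.test("'ok'"));     // true
```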



Later in this chapter, you'll see that this kind of reference to a parenthesized subexpression is a powerful feature of regular-expression search-and-replace operations.

In JavaScript 1.5 (but not JavaScript 1.2), it is possible to group items in a regular expression without creating a numbered reference to those items. Instead of simply grouping the items within ( and ), begin the group with (?: and end it with ). Consider the following pattern, for example:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/



Here, the subexpression (?:[Ss]cript) is used simply for grouping, so the ? repetition character can be applied to the group. These modified parentheses do not produce a reference, so in this regular expression, \2 refers to the text matched by (fun\w*).
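The effect of (?:...) on group numbering can be seen in the result of exec(), assumed here from later in the chapter. With the non-capturing group, only two numbered groups remain; the sample sentence is invented.

```javascript
var nonCapturing = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/;
var result = nonCapturing.exec("JavaScript is fun");

console.log(result.length); // 3: the whole match plus two numbered groups
console.log(result[1]);     // "JavaScript"
console.log(result[2]);     // "fun": (fun\w*) is now group 2
```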

Table 11-4 summarizes the regular-expression alternation, grouping, and referencing operators.

Table 11-4. Regular expression alternation, grouping, and reference characters

Character  Meaning
|          Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)      Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.
(?:...)    Grouping only. Group items into a single unit, but do not remember the characters that match this group.
\n         Match the same characters that were matched when group number n was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.



11.1.5. Specifying Match Position

As described earlier, many elements of a regular expression match a single character in a string. For example, \s matches a single character of whitespace. Other regular expression elements match the positions between characters, instead of actual characters. \b, for example, matches a word boundary.


