A. The Evolution of the awk Language
Tải bản đầy đủ - 0trang
AppendixA.TheEvolutionoftheawk
Language
ThisbookdescribestheGNUimplementationofawk,whichfollowsthePOSIX
specification.Manylongtimeawkuserslearnedawkprogrammingwiththeoriginalawk
implementationinVersion7Unix.(ThisimplementationwasthebasisforawkinBerkeley
Unix,through4.3-Reno.SubsequentversionsofBerkeleyUnix,and,forawhile,some
systemsderivedfrom4.4BSD-Lite,usedvariousversionsofgawkfortheirawk.)This
chapterbrieflydescribestheevolutionoftheawklanguage,withcross-referencestoother
partsofthebookwhereyoucanfindmoreinformation.
Tosavespace,wehaveomittedinformationonthehistoryoffeaturesingawkfromthis
edition.Youcanfinditintheonlinedocumentation.
MajorChangesBetweenV7andSVR3.1
TheawklanguageevolvedconsiderablybetweenthereleaseofVersion7Unix(1978)and
thenewversionthatwasfirstmadegenerallyavailableinSystemVRelease3.1(1987).
Thissectionsummarizesthechanges,withcross-referencestofurtherdetails:
Therequirementfor‘;’toseparaterulesonaline(seeawkStatementsVersusLines)
User-definedfunctionsandthereturnstatement(seeUser-DefinedFunctions)
Thedeletestatement(seeThedeleteStatement)
Thedo-whilestatement(seeThedo-whileStatement)
Thebuilt-infunctionsatan2(),cos(),sin(),rand(),andsrand()(seeNumeric
Functions)
Thebuilt-infunctionsgsub(),sub(),andmatch()(seeString-ManipulationFunctions)
Thebuilt-infunctionsclose()andsystem()(seeInput/OutputFunctions)
TheARGC,ARGV,FNR,RLENGTH,RSTART,andSUBSEPpredefinedvariables(see
PredefinedVariables)
Assignable$0(seeChangingtheContentsofaField)
Theconditionalexpressionusingtheternaryoperator‘?:’(seeConditional
Expressions)
Theexpression‘indxinarray’outsideofforstatements(seeReferringtoanArray
Element)
Theexponentiationoperator‘^’(seeArithmeticOperators)anditsassignmentoperator
form‘^=’(seeAssignmentExpressions)
C-compatibleoperatorprecedence,whichbreakssomeoldawkprograms(seeOperator
Precedence(HowOperatorsNest))
RegexpsasthevalueofFS(seeSpecifyingHowFieldsAreSeparated)andasthethird
argumenttothesplit()function(seeString-ManipulationFunctions),ratherthan
usingonlythefirstcharacterofFS
Dynamicregexpsasoperandsofthe‘~’and‘!~’operators(seeUsingDynamic
Regexps)
Theescapesequences‘\b’,‘\f’,and‘\r’(seeEscapeSequences)
Redirectionofinputforthegetlinefunction(seeExplicitInputwithgetline)
MultipleBEGINandENDrules(seeTheBEGINandENDSpecialPatterns)
Multidimensionalarrays(seeMultidimensionalArrays)
ChangesBetweenSVR3.1andSVR4
TheSystemVRelease4(1989)versionofUnixawkaddedthesefeatures(someofwhich
originatedingawk):
TheENVIRONarray(seePredefinedVariables)
Multiple-foptionsonthecommandline(seeCommand-LineOptions)
The-voptionforassigningvariablesbeforeprogramexecutionbegins(seeCommandLineOptions)
The--signalforterminatingcommand-lineoptions
The‘\a’,‘\v’,and‘\x’escapesequences(seeEscapeSequences)
Adefinedreturnvalueforthesrand()built-infunction(seeNumericFunctions)
Thetoupper()andtolower()built-instringfunctionsforcasetranslation(seeStringManipulationFunctions)
Acleanerspecificationforthe‘%c’format-controlletterintheprintffunction(see
Format-ControlLetters)
Theabilitytodynamicallypassthefieldwidthandprecision("%*.*d")intheargument
listofprintfandsprintf()(seeFormat-ControlLetters)
Theuseofregexpconstants,suchas/foo/,asexpressions,wheretheyareequivalent
tousingthematchingoperator,asin‘$0~/foo/’(seeUsingRegularExpression
Constants)
Processingofescapesequencesinsidecommand-linevariableassignments(see
Assigningvariablesonthecommandline)
ChangesBetweenSVR4andPOSIXawk
ThePOSIXCommandLanguageandUtilitiesstandardforawk(1992)introducedthe
followingchangesintothelanguage:
Theuseof-Wforimplementation-specificoptions(seeCommand-LineOptions)
TheuseofCONVFMTforcontrollingtheconversionofnumberstostrings(see
ConversionofStringsandNumbers)
Theconceptofanumericstringandtightercomparisonrulestogowithit(seeVariable
TypingandComparisonExpressions)
Theuseofpredefinedvariablesasfunctionparameternamesisforbidden(seeFunction
DefinitionSyntax)
Morecompletedocumentationofmanyofthepreviouslyundocumentedfeaturesofthe
language
In2012,anumberofextensionsthathadbeencommonlyavailableformanyyearswere
finallyaddedtoPOSIX.Theyare:
Thefflush()built-infunctionforflushingbufferedoutput(seeInput/Output
Functions)
Thenextfilestatement(seeThenextfileStatement)
Theabilitytodeleteallofanarrayatoncewith‘deletearray’(seeThedelete
Statement)
SeeCommonExtensionsSummaryforalistofcommonextensionsnotpermittedbythe
POSIXstandard.
The2008POSIXstandardcanbefoundonlineat
http://www.opengroup.org/onlinepubs/9699919799/.
ExtensionsinBrianKernighan’sawk
BrianKernighanhasmadehisversionavailableviahishomepage(seeOtherFreely
AvailableawkImplementations).
Thissectiondescribescommonextensionsthatoriginallyappearedinhisversionofawk:
The‘**’and‘**=’operators(seeArithmeticOperatorsandAssignmentExpressions)
Theuseoffuncasanabbreviationforfunction(seeFunctionDefinitionSyntax)
Thefflush()built-infunctionforflushingbufferedoutput(seeInput/Output
Functions)
SeeCommonExtensionsSummaryforafulllistoftheextensionsavailableinhisawk.
ExtensionsingawkNotinPOSIXawk
TheGNUimplementation,gawk,addsalargenumberoffeatures.Theycanallbedisabled
witheitherthe--traditionalor--posixoptions(seeCommand-LineOptions).
Anumberoffeatureshavecomeandgoneovertheyears.Thissectionsummarizesthe
additionalfeaturesoverPOSIXawkthatareinthecurrentversionofgawk.
Additionalpredefinedvariables:
TheARGIND,BINMODE,ERRNO,FIELDWIDTHS,FPAT,IGNORECASE,LINT,PROCINFO,RT,and
TEXTDOMAINvariables(seePredefinedVariables)
SpecialfilesinI/Oredirections:
The/dev/stdin,/dev/stdout,/dev/stderr,and/dev/fd/Nspecialfilenames(see
SpecialFilenamesingawk)
The/inet,/inet4,and‘/inet6’specialfilesforTCP/IPnetworkingusing‘|&’to
specifywhichversionoftheIPprotocoltouse(seeUsinggawkforNetwork
Programming)
Changesand/oradditionstothelanguage:
The‘\x’escapesequence(seeEscapeSequences)
FullsupportforbothPOSIXandGNUregexps(seeChapter3)
TheabilityforFSandforthethirdargumenttosplit()tobenullstrings(seeMaking
EachCharacteraSeparateField)
TheabilityforRStobearegexp(seeHowInputIsSplitintoRecords)
Theabilitytouseoctalandhexadecimalconstantsinawkprogramsourcecode(see
Octalandhexadecimalnumbers)
The‘|&’operatorfortwo-wayI/Otoacoprocess(seeTwo-WayCommunicationswith
AnotherProcess)
Indirectfunctioncalls(seeIndirectFunctionCalls)
Directoriesonthecommandlineproduceawarningandareskipped(seeDirectories
ontheCommandLine)
Newkeywords:
TheBEGINFILEandENDFILEspecialpatterns(seeTheBEGINFILEandENDFILE
SpecialPatterns)
Theswitchstatement(seeTheswitchStatement)
Changestostandardawkfunctions:
Theoptionalsecondargumenttoclose()thatallowsclosingoneendofatwo-way
pipetoacoprocess(seeTwo-WayCommunicationswithAnotherProcess)
POSIXcomplianceforgsub()andsub()with--posix
Thelength()functionacceptsanarrayargumentandreturnsthenumberofelements
inthearray(seeString-ManipulationFunctions)
Theoptionalthirdargumenttothematch()functionforcapturingtext-matching
subexpressionswithinaregexp(seeString-ManipulationFunctions)
Positionalspecifiersinprintfformatsformakingtranslationseasier(seeRearranging
printfArguments)
Thesplit()function’sadditionaloptionalfourthargument,whichisanarraytohold
thetextofthefieldseparators(seeString-ManipulationFunctions)
Additionalfunctionsonlyingawk:
Thegensub(),patsplit(),andstrtonum()functionsformorepowerfultext
manipulation(seeString-ManipulationFunctions)
Theasort()andasorti()functionsforsortingarrays(seeControllingArray
TraversalandArraySorting)
Themktime(),systime(),andstrftime()functionsforworkingwithtimestamps
(seeTimeFunctions)
Theand(),compl(),lshift(),or(),rshift(),andxor()functionsforbit
manipulation(seeBit-ManipulationFunctions)
Theisarray()functiontocheckifavariableisanarrayornot(seeGettingType
Information)
Thebindtextdomain(),dcgettext(),anddcngettext()functionsfor
internationalization(seeInternationalizingawkPrograms)
Changesand/oradditionsinthecommand-lineoptions:
TheAWKPATHenvironmentvariableforspecifyingapathsearchforthe-fcommandlineoption(seeCommand-LineOptions)
TheAWKLIBPATHenvironmentvariableforspecifyingapathsearchforthe-l
command-lineoption(seeCommand-LineOptions)
The-b,-c,-C,-d,-D,-e,-E,-g,-h,-i,-l,-L,-M,-n,-N,-o,-O,-p,-P,-r,-S,-t,
and-Vshortoptions.Also,theabilitytouseGNU-stylelong-namedoptionsthatstart
with--;andthe--assign,--bignum,--characters-as-bytes,--copyright,-debug,--dump-variables,--exec,--field-separator,--file,--gen-pot,--help,
--include,--lint,--lint-old,--load,--non-decimal-data,--optimize,--posix,
--pretty-print,--profile,--re-interval,--sandbox,--source,--traditional,
--use-lc-numeric,and--versionlongoptions(seeCommand-LineOptions)
Supportforthefollowingobsoletesystemswasremovedfromthecodeandthe
documentationforgawkversion4.0:
Amiga
Atari
BeOS
Cray
MIPSRiscOS
MS-DOSwiththeMicrosoftCompiler
MS-WindowswiththeMicrosoftCompiler
NeXT
SunOS3.x,Sun386(RoadRunner)
Tandem(non-POSIX)
PrestandardVAXCcompilerforVAX/VMS
GCCforVAXandAlphahasnotbeentestedforawhile.
Supportforthefollowingobsoletesystemwasremovedfromthecodeforgawkversion
4.1:
Ultrix
CommonExtensionsSummary
Thefollowingtablesummarizesthecommonextensionssupportedbygawk,Brian
Kernighan’sawk,andmawk,thethreemostwidelyusedfreelyavailableversionsofawk
(seeOtherFreelyAvailableawkImplementations).
Feature
BWKawk mawk gawk Nowstandard
‘\x’escapesequence
✓
✓
✓
FSasnullstring
✓
✓
✓
/dev/stdinspecialfile
✓
✓
✓
/dev/stdoutspecialfile
✓
✓
✓
/dev/stderrspecialfile
✓
✓
✓
deletewithoutsubscript
✓
✓
✓
✓
fflush()function
✓
✓
✓
✓
length()ofanarray
✓
✓
✓
nextfilestatement
✓
✓
✓
✓
**and**=operators
✓
✓
funckeyword
✓
✓
BINMODEvariable
✓
✓
RSasregexp
✓
✓
Time-relatedfunctions
✓
✓
RegexpRangesandLocales:ALongSadStory
Thissectiondescribestheconfusinghistoryofrangeswithinregularexpressionsandtheir
interactionswithlocales,andhowthisaffecteddifferentversionsofgawk.
TheoriginalUnixtoolsthatworkedwithregularexpressionsdefinedcharacterranges
(suchas‘[a-z]’)tomatchanycharacterbetweenthefirstcharacterintherangeandthe
lastcharacterintherange,inclusive.Orderingwasbasedonthenumericvalueofeach
characterinthemachine’snativecharacterset.Thus,onASCII-basedsystems,‘[a-z]’
matchedallthelowercaseletters,andonlythelowercaseletters,asthenumericvaluesfor
thelettersfrom‘a’through‘z’werecontiguous.(OnanEBCDICsystem,therange‘[az]’includesadditionalnonalphabeticcharactersaswell.)
AlmostallintroductoryUnixliteratureexplainedrangeexpressionsasworkinginthis
fashion,andinparticular,wouldteachthatthe“correct”waytomatchlowercaseletters
waswith‘[a-z]’,andthat‘[A-Z]’wasthe“correct”waytomatchuppercaseletters.And
indeed,thiswastrue.[104]
The1992POSIXstandardintroducedtheideaoflocales(seeWhereYouAreMakesa
Difference).Becausemanylocalesincludeotherlettersbesidestheplain26lettersofthe
Englishalphabet,thePOSIXstandardaddedcharacterclasses(seeUsingBracket
Expressions)asawaytomatchdifferentkindsofcharactersbesidesthetraditionalonesin
theASCIIcharacterset.
However,thestandardchangedtheinterpretationofrangeexpressions.Inthe"C"and
"POSIX"locales,arangeexpressionlike‘[a-dx-z]’isstillequivalentto‘[abcdxyz]’,as
inASCII.Butoutsidethoselocales,theorderingwasdefinedtobebasedoncollation
order.
Whatdoesthatmean?Inmanylocales,‘A’and‘a’arebothlessthan‘B’.Inotherwords,
theselocalessortcharactersindictionaryorder,and‘[a-dx-z]’istypicallynotequivalent
to‘[abcdxyz]’;instead,itmightbeequivalentto‘[ABCXYabcdxyz]’,forexample.
Thispointneedstobeemphasized:muchliteratureteachesthatyoushoulduse‘[a-z]’to
matchalowercasecharacter.Butonsystemswithnon-ASCIIlocales,thisalsomatchesall
oftheuppercasecharactersexcept‘A’or‘Z’!Thiswasacontinuouscauseofconfusion,
evenwellintothetwenty-firstcentury.
Todemonstratetheseissues,thefollowingexampleusesthesub()function,whichdoes
textreplacement(seeString-ManipulationFunctions).Here,theintentistoremove
trailinguppercasecharacters:
$echosomething1234abc|gawk-3.1.8'{sub("[A-Z]*$","");print}'
something1234a
Thisoutputisunexpected,asthe‘bc’attheendof‘something1234abc’shouldnot
normallymatch‘[A-Z]*’.Thisresultisduetothelocalesetting(andthusyoumaynotsee
itonyoursystem).
Similarconsiderationsapplytootherranges.Forexample,‘["-/]’isperfectlyvalidin
ASCII,butisnotvalidinmanyUnicodelocales,suchasen_US.UTF-8.
Earlyversionsofgawkusedregexpmatchingcodethatwasnotlocale-aware,soranges
hadtheirtraditionalinterpretation.
Whengawkswitchedtousinglocale-awareregexpmatchers,theproblemsbegan;
especiallyasbothGNU/LinuxandcommercialUnixvendorsstartedimplementingnonASCIIlocales,andmakingthemthedefault.Perhapsthemostfrequentlyaskedquestion
becamesomethinglike,“Whydoes‘[A-Z]’matchlowercaseletters?!?”
Thissituationexistedforcloseto10years,ifnotmore,andthegawkmaintainergrew
wearyoftryingtoexplainthatgawkwasbeingnicelystandards-compliant,andthatthe
issuewasintheuser’slocale.Duringthedevelopmentofversion4.0,hemodifiedgawkto
alwaystreatrangesintheoriginal,pre-POSIXfashion,unless--posixwasused(see
Command-LineOptions).[105]
Fortunately,shortlybeforethefinalreleaseofgawk4.0,themaintainerlearnedthatthe
2008standardhadchangedthedefinitionofranges,suchthatoutsidethe"C"and"POSIX"
locales,themeaningofrangeexpressionswasundefined.[106]
Byusingthislovelytechnicalterm,thestandardgiveslicensetoimplementorsto
implementrangesinwhateverwaytheychoose.Thegawkmaintainerchosetoapplythe
pre-POSIXmeaningbothwiththedefaultregexpmatchingandwhen--traditionalor-posixareused.InallcasesgawkremainsPOSIX-compliant.