Part III. Moving Beyond Standard awk with gawk

Part III focuses on features specific to gawk. It contains the following chapters:

Chapter 12, Advanced Features of gawk

Chapter 13, Internationalization with gawk

Chapter 14, Debugging awk Programs

Chapter 15, Arithmetic and Arbitrary-Precision Arithmetic with gawk

Chapter 16, Writing Extensions for gawk



Chapter 12. Advanced Features of gawk

Write documentation as if whoever reads it is a violent psychopath who knows where you live.

—Steve English, as quoted by Peter Langston



This chapter discusses advanced features in gawk. It’s a bit of a “grab bag” of items that are otherwise unrelated to each other. First, we look at a command-line option that allows gawk to recognize nondecimal numbers in input data, not just in awk programs. Then, gawk’s special features for sorting arrays are presented. Next, two-way I/O, discussed briefly in earlier parts of this book, is described in full detail, along with the basics of TCP/IP networking. Finally, we see how gawk can profile an awk program, making it possible to tune it for performance.

Additional advanced features are discussed in separate chapters of their own:

Chapter 13, Internationalization with gawk, discusses how to internationalize your awk programs, so that they can speak multiple national languages.

Chapter 14, Debugging awk Programs, describes gawk’s built-in command-line debugger for debugging awk programs.

Chapter 15, Arithmetic and Arbitrary-Precision Arithmetic with gawk, describes how you can use gawk to perform arbitrary-precision arithmetic.

Chapter 16, Writing Extensions for gawk, discusses the ability to dynamically add new built-in functions to gawk.



Allowing Nondecimal Input Data

If you run gawk with the --non-decimal-data option, you can have nondecimal values in your input data:

    $ echo 0123 123 0x123 |
    > gawk --non-decimal-data '{ printf "%d, %d, %d\n", $1, $2, $3 }'
    83, 123, 291



For this feature to work, write your program so that gawk treats your data as numeric:

    $ echo 0123 123 0x123 | gawk '{ print $1, $2, $3 }'
    0123 123 0x123



The print statement treats its expressions as strings. Although the fields can act as numbers when necessary, they are still strings, so print does not try to treat them numerically. You need to add zero to a field to force it to be treated as a number. For example:

    $ echo 0123 123 0x123 | gawk --non-decimal-data '
    > { print $1, $2, $3
    >   print $1 + 0, $2 + 0, $3 + 0 }'
    0123 123 0x123
    83 123 291



Because it is common to have decimal data with leading zeros, and because using this facility could lead to surprising results, the default is to leave it disabled. If you want it, you must explicitly request it.

CAUTION

Use of this option is not recommended. It can break old programs very badly. Instead, use the strtonum() function to convert your data (see String-Manipulation Functions). This makes your programs easier to write and easier to read, and leads to less surprising results.

This option may disappear in a future version of gawk.
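
As a minimal sketch of the approach the caution recommends, the following program converts the same three sample values with strtonum() instead of relying on --non-decimal-data. The array name field, the inline sample string, and the filename in the comment are illustrative choices, not part of the original text.

```awk
# strtonum-demo.awk (hypothetical filename); run with: gawk -f strtonum-demo.awk
BEGIN {
    # Same sample values as above: octal, decimal, and hexadecimal
    n = split("0123 123 0x123", field, " ")
    for (i = 1; i <= n; i++)
        # strtonum() treats a leading 0 as octal and a leading 0x as hexadecimal
        printf "%s%d", (i > 1 ? ", " : ""), strtonum(field[i])
    print ""
}
```

Run under gawk in its default mode, this should print 83, 123, 291, matching the --non-decimal-data example. Only the values passed to strtonum() are reinterpreted, so decimal data with leading zeros elsewhere in the input is left alone.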



Controlling Array Traversal and Array Sorting

gawk lets you control the order in which a ‘for (indx in array)’ loop traverses an array.

In addition, two built-in functions, asort() and asorti(), let you sort arrays based on the array values and indices, respectively. These two functions also provide control over the sorting criteria used to order the elements during sorting.



Controlling Array Traversal

By default, the order in which a ‘for (indx in array)’ loop scans an array is not defined; it is generally based upon the internal implementation of arrays inside awk. Often, though, it is desirable to be able to loop over the elements in a particular order that you, the programmer, choose. gawk lets you do this.

Using Predefined Array Scanning Orders with gawk describes how you can assign special, predefined values to PROCINFO["sorted_in"] in order to control the order in which gawk traverses an array during a for loop.
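
For orientation, here is a small sketch of two of those predefined values in use. The names "@ind_str_asc" and "@val_num_desc" come from gawk's documented set of scanning orders rather than from this section, and the count array and its contents are made up for illustration.

```awk
BEGIN {
    count["pear"] = 30; count["apple"] = 12; count["fig"] = 7

    # Traverse by index, compared as strings, ascending
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (name in count)
        print name, count[name]

    print ""

    # Traverse by value, compared as numbers, descending
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (name in count)
        print name, count[name]
}
```

The first loop should visit apple, fig, pear; the second should visit pear, apple, fig. The setting stays in effect for every later for loop until PROCINFO["sorted_in"] is changed or deleted.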

In addition, the value of PROCINFO["sorted_in"] can be a function name.[78] This lets you traverse an array based on any custom criterion. The array elements are ordered according to the return value of this function. The comparison function should be defined with at least four arguments:

    function comp_func(i1, v1, i2, v2)
    {
        compare elements 1 and 2 in some fashion
        return < 0; 0; or > 0
    }



Here, i1 and i2 are the indices, and v1 and v2 are the corresponding values of the two elements being compared. Either v1 or v2, or both, can be arrays if the array being traversed contains subarrays as values. (See Arrays of Arrays for more information about subarrays.) The three possible return values are interpreted as follows:

comp_func(i1, v1, i2, v2) < 0

    Index i1 comes before index i2 during loop traversal.

comp_func(i1, v1, i2, v2) == 0

    Indices i1 and i2 come together, but the relative order with respect to each other is undefined.

comp_func(i1, v1, i2, v2) > 0

    Index i1 comes after index i2 during loop traversal.

Our first comparison function can be used to scan an array in numerical order of the indices:

    function cmp_num_idx(i1, v1, i2, v2)
    {
        # numerical index comparison, ascending order
        return (i1 - i2)
    }



Our second function traverses an array based on the string order of the element values rather than by indices:

    function cmp_str_val(i1, v1, i2, v2)
    {
        # string value comparison, ascending order
        v1 = v1 ""
        v2 = v2 ""
        if (v1 < v2)
            return -1
        return (v1 != v2)
    }



The third comparison function makes all numbers, and numeric strings without any leading or trailing spaces, come out first during loop traversal:

    function cmp_num_str_val(i1, v1, i2, v2,   n1, n2)
    {
        # numbers before string value comparison, ascending order
        n1 = v1 + 0
        n2 = v2 + 0
        if (n1 == v1)
            return (n2 == v2) ? (n1 - n2) : -1
        else if (n2 == v2)
            return 1
        return (v1 < v2) ? -1 : (v1 != v2)
    }



Here is a main program to demonstrate how gawk behaves using each of the previous functions:

    BEGIN {
        data["one"] = 10
        data["two"] = 20
        data[10] = "one"
        data[100] = 100
        data[20] = "two"

        f[1] = "cmp_num_idx"
        f[2] = "cmp_str_val"
        f[3] = "cmp_num_str_val"
        for (i = 1; i <= 3; i++) {
            printf("Sort function: %s\n", f[i])
            PROCINFO["sorted_in"] = f[i]
            for (j in data)
                printf("\tdata[%s] = %s\n", j, data[j])
            print ""
        }
    }



Here are the results when the program is run (the text after each "<--" is commentary, not program output):

    $ gawk -f compdemo.awk
    Sort function: cmp_num_idx      <-- Sort by numeric index
        data[two] = 20
        data[one] = 10              <-- Both strings are numerically zero
        data[10] = one
        data[20] = two
        data[100] = 100

    Sort function: cmp_str_val      <-- Sort by element values as strings
        data[one] = 10
        data[100] = 100             <-- String 100 is less than string 20
        data[two] = 20
        data[10] = one
        data[20] = two

    Sort function: cmp_num_str_val  <-- Sort all numeric values before all strings
        data[one] = 10
        data[two] = 20
        data[100] = 100
        data[10] = one
        data[20] = two



Consider sorting the entries of a GNU/Linux system password file according to login name. The following program sorts records by a specific field position and can be used for this purpose:

    # passwd-sort.awk --- simple program to sort by field position
    # field position is specified by the global variable POS

    function cmp_field(i1, v1, i2, v2)
    {
        # comparison by value, as string, and ascending order
        return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
    }

    {
        for (i = 1; i <= NF; i++)
            a[NR][i] = $i
    }

    END {
        PROCINFO["sorted_in"] = "cmp_field"
        if (POS < 1 || POS > NF)
            POS = 1
        for (i in a) {
            for (j = 1; j <= NF; j++)
                printf("%s%c", a[i][j], j < NF ? ":" : "")
            print ""
        }
    }



The first field in each entry of the password file is the user’s login name, and the fields are separated by colons. Each record defines a subarray, with each field as an element in the subarray. Running the program produces the following output:

    $ gawk -v POS=1 -F: -f sort.awk /etc/passwd
    adm:x:3:4:adm:/var/adm:/sbin/nologin
    apache:x:48:48:Apache:/var/www:/sbin/nologin
    avahi:x:70:70:Avahi daemon:/:/sbin/nologin





The comparison should normally always return the same value when given a specific pair of array elements as its arguments. If inconsistent results are returned, then the order is undefined. This behavior can be exploited to introduce random order into otherwise seemingly ordered data:

    function cmp_randomize(i1, v1, i2, v2)
    {
        # random order (caution: this may never terminate!)
        return (2 - 4 * rand())
    }



As already mentioned, the order of the indices is arbitrary if two elements compare equal. This is usually not a problem, but letting the tied elements come out in arbitrary order can be an issue, especially when comparing item values. The partial ordering of the equal elements may change the next time the array is traversed, if other elements are added to or removed from the array. One way to resolve ties when comparing elements with otherwise equal values is to include the indices in the comparison rules. Note that doing this may make the loop traversal less efficient, so consider it only if necessary. The following comparison functions force a deterministic order, and are based on the fact that the (string) indices of two elements are never equal:

    function cmp_numeric(i1, v1, i2, v2)
    {
        # numerical value (and index) comparison, descending order
        return (v1 != v2) ? (v2 - v1) : (i2 - i1)
    }

    function cmp_string(i1, v1, i2, v2)
    {
        # string value (and index) comparison, descending order
        v1 = v1 i1
        v2 = v2 i2
        return (v1 > v2) ? -1 : (v1 != v2)
    }



A custom comparison function can often simplify ordered loop traversal, and the sky is really the limit when it comes to designing such a function.

When string comparisons are made during a sort, either for element values where one or both aren’t numbers, or for element indices handled as strings, the value of IGNORECASE (see Predefined Variables) controls whether the comparisons treat corresponding upper- and lowercase letters as equivalent or distinct.

Another point to keep in mind is that in the case of subarrays, the element values can themselves be arrays; a production comparison function should use the isarray() function (see Getting Type Information) to check for this, and choose a defined sorting order for subarrays.
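
One way such a check might look is sketched below. The function names cmp_index and cmp_scalars_first, the policy of placing scalars before subarrays, and the sample data are all assumptions made for illustration; the text does not prescribe any particular order for subarrays.

```awk
# Compare two indices as strings; distinct indices never compare equal.
function cmp_index(i1, i2)
{
    i1 = i1 ""
    i2 = i2 ""
    return (i1 < i2) ? -1 : (i1 != i2)
}

# Scalars come before subarrays; ties and subarray pairs fall back to the index.
function cmp_scalars_first(i1, v1, i2, v2)
{
    if (isarray(v1))
        return isarray(v2) ? cmp_index(i1, i2) : 1
    if (isarray(v2))
        return -1
    if (v1 != v2)
        return (v1 < v2) ? -1 : 1
    return cmp_index(i1, i2)
}

BEGIN {
    data["b"] = 2
    data["x"][1] = "nested"
    data["a"] = 5
    PROCINFO["sorted_in"] = "cmp_scalars_first"
    for (k in data)
        print k, (isarray(data[k]) ? "(subarray)" : data[k])
}
```

With this data the loop should visit b, then a, then x. Checking isarray() before any scalar comparison matters because using an array in a scalar context is a fatal error in gawk.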

All sorting based on PROCINFO["sorted_in"] is disabled in POSIX mode, because the PROCINFO array is not special in that case.

As a side note, sorting the array indices before traversing the array has been reported to add a 15% to 20% overhead to the execution time of awk programs. For this reason, sorted array traversal is not the default.



Sorting Array Values and Indices with gawk

In most awk implementations, sorting an array requires writing a sort() function. This can be educational for exploring different sorting algorithms, but usually that’s not the point of the program. gawk provides the built-in asort() and asorti() functions (see String-Manipulation Functions) for sorting arrays. For example:

    populate the array data
    n = asort(data)
    for (i = 1; i <= n; i++)
        do something with data[i]



After the call to asort(), the array data is indexed from 1 to some number n, the total number of elements in data. (This count is asort()’s return value.) data[1] ≤ data[2] ≤ data[3], and so on. The default comparison is based on the type of the elements (see Variable Typing and Comparison Expressions). All numeric values come before all string values, which in turn come before all subarrays.

An important side effect of calling asort() is that the array’s original indices are irrevocably lost. As this isn’t always desirable, asort() accepts a second argument:

    populate the array source
    n = asort(source, dest)
    for (i = 1; i <= n; i++)
        do something with dest[i]



In this case, gawk copies the source array into the dest array and then sorts dest, destroying its indices. However, the source array is not affected.
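
The pseudocode above can be made concrete as follows. This is only a sketch; the three sample elements are invented for the demonstration.

```awk
BEGIN {
    source["x"] = 30; source["y"] = 10; source["z"] = 20

    # dest receives the sorted values, indexed 1 through n
    n = asort(source, dest)
    for (i = 1; i <= n; i++)
        printf "dest[%d] = %s\n", i, dest[i]

    # source keeps its original indices and values
    printf "source[\"x\"] = %s\n", source["x"]
}
```

The loop should print the values 10, 20, and 30 in that order, and the final line should still report 30 for source["x"].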

Often, what’s needed is to sort on the values of the indices instead of the values of the elements. To do that, use the asorti() function. The interface and behavior are identical to that of asort(), except that the index values are used for sorting and become the values of the result array:

    { source[$0] = some_func($0) }

    END {
        n = asorti(source, dest)
        for (i = 1; i <= n; i++) {
            Work with sorted indices directly:
            do something with dest[i]

            Access original array via sorted indices:
            do something with source[dest[i]]
        }
    }



So far, so good. Now it starts to get interesting. Both asort() and asorti() accept a third string argument to control comparison of array elements. When we introduced asort() and asorti() in String-Manipulation Functions, we ignored this third argument; however, now is the time to describe how this argument affects these two functions.

Basically, the third argument specifies how the array is to be sorted. There are two possibilities. As with PROCINFO["sorted_in"], this argument may be one of the predefined names that gawk provides (see Using Predefined Array Scanning Orders with gawk), or it may be the name of a user-defined function (see Controlling Array Traversal).

In the latter case, the function can compare elements in any way it chooses, taking into account just the indices, just the values, or both. This is extremely powerful.

Once the array is sorted, asort() takes the values in their final order and uses them to fill in the result array, whereas asorti() takes the indices in their final order and uses them to fill in the result array.
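
Both forms of the third argument are sketched below. The predefined name "@val_num_desc" is one of gawk's documented scanning orders; the function by_index_length and the score array are hypothetical and exist only for this example.

```awk
# Shorter indices first; equal lengths are ordered by string comparison.
function by_index_length(i1, v1, i2, v2,    diff)
{
    diff = length(i1) - length(i2)
    if (diff != 0)
        return diff
    return (i1 < i2) ? -1 : (i1 != i2)
}

BEGIN {
    score["kiwi"] = 4; score["fig"] = 9; score["banana"] = 1

    # Predefined order: the values, numerically descending
    n = asort(score, vals, "@val_num_desc")
    for (i = 1; i <= n; i++)
        printf "%s ", vals[i]
    print ""

    # User-defined order: the indices, by length
    n = asorti(score, names, "by_index_length")
    for (i = 1; i <= n; i++)
        printf "%s ", names[i]
    print ""
}
```

The first line of output should list 9 4 1, and the second fig kiwi banana. The comparison function receives indices and values in both cases; only what ends up in the result array differs between asort() and asorti().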

NOTE

Copying array indices and elements isn’t expensive in terms of memory. Internally, gawk maintains reference counts to data. For example, when asort() copies the first array to the second one, there is only one copy of the original array elements’ data, even though both arrays use the values.

Because IGNORECASE affects string comparisons, the value of IGNORECASE also affects sorting for both asort() and asorti(). Note also that the locale’s sorting order does not come into play; comparisons are based on character values only.[79]
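
The effect of IGNORECASE on asort() can be seen in a short sketch like this one. The word list and the array names exact and folded are illustrative assumptions.

```awk
BEGIN {
    n = split("apple Banana cherry", word, " ")

    IGNORECASE = 0
    asort(word, exact)      # character values: uppercase B precedes lowercase a

    IGNORECASE = 1
    asort(word, folded)     # case is ignored when comparing

    for (i = 1; i <= n; i++)
        printf "%-8s %s\n", exact[i], folded[i]
}
```

The left column should read Banana, apple, cherry and the right column apple, Banana, cherry. Because IGNORECASE is a global setting that also affects regexp matching, a program that changes it for one sort may want to restore the previous value afterward.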



Two-Way Communications with Another Process

It is often useful to be able to send data to a separate program for processing and then read the result. This can always be done with temporary files:

    # Write the data for processing
    tempfile = ("mydata." PROCINFO["pid"])
    while (not done with data)
        print data | ("subprogram > " tempfile)
    close("subprogram > " tempfile)

    # Read the results, remove tempfile when done
    while ((getline newdata < tempfile) > 0)
        process newdata appropriately
    close(tempfile)
    system("rm " tempfile)



This works, but not elegantly. Among other things, it requires that the program be run in a directory that cannot be shared among users; for example, /tmp will not do, as another user might happen to be using a temporary file with the same name.[80]

However, with gawk, it is possible to open a two-way pipe to another process. The second process is termed a coprocess, as it runs in parallel with gawk. The two-way connection is created using the ‘|&’ operator (borrowed from the Korn shell, ksh):[81]

    do {
        print data |& "subprogram"
        "subprogram" |& getline results
    } while (data left to process)
    close("subprogram")



The first time an I/O operation is executed using the ‘|&’ operator, gawk creates a two-way pipeline to a child process that runs the other program. Output created with print or printf is written to the program’s standard input, and output from the program’s standard output can be read by the gawk program using getline. As is the case with processes started by ‘|’, the subprogram can be any program, or pipeline of programs, that can be started by the shell.

There are some cautionary items to be aware of:

As the code inside gawk currently stands, the coprocess’s standard error goes to the same place that the parent gawk’s standard error goes. It is not possible to read the child’s standard error separately.

I/O buffering may be a problem. gawk automatically flushes all output down the pipe to the coprocess. However, if the coprocess does not flush its output, gawk may hang when doing a getline in order to read the coprocess’s results. This could lead to a situation known as deadlock, where each process is waiting for the other one to do something.

It is possible to close just one end of the two-way pipe to a coprocess, by supplying a second argument to the close() function of either "to" or "from" (see Closing Input and Output Redirections). These strings tell gawk to close the end of the pipe that sends data to the coprocess or the end that reads from it, respectively.

This is particularly necessary in order to use the system sort utility as part of a coprocess; sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe.



When you have finished writing data to the sort utility, you can close the "to" end of the pipe, and then start reading sorted data via getline. For example:

    BEGIN {
        command = "LC_ALL=C sort"
        n = split("abcdefghijklmnopqrstuvwxyz", a, "")

        for (i = n; i > 0; i--)
            print a[i] |& command
        close(command, "to")

        while ((command |& getline line) > 0)
            print "got", line
        close(command)
    }



This program writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to sort. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits.

As a side note, the assignment ‘LC_ALL=C’ in the sort command ensures traditional Unix (ASCII) sorting from sort. This is not strictly necessary here, but it’s good to know how to do this.

You may also use pseudo-ttys (ptys) for two-way communication instead of pipes, if your system supports them. This is done on a per-command basis, by setting a special element in the PROCINFO array (see Built-in Variables That Convey Information), like so:

    command = "sort -nr"           # command, save in convenience variable
    PROCINFO[command, "pty"] = 1   # update PROCINFO
    print … |& command             # start two-way pipe





Using ptys usually avoids the buffer deadlock issues described earlier, at some loss in performance. If your system does not have ptys, or if all the system’s ptys are in use, gawk automatically falls back to using regular pipes.


