Chapter 5. Internet Tools and Techniques


protocols are centrally means of transmitting texts that conform
to RFC-822, its updates, and associated RFCs. HTTP is firstly a
means of transmitting Hypertext Markup Language (HTML)
messages. Following the popularity of the World Wide Web,
however, a dizzying array of other message types also travel
over HTTP: graphic and sound formats, proprietary multimedia
plug-ins, executable byte-codes (e.g., Java or Jython), and also
more textual formats like XML-RPC and SOAP.

The most widespread text format on the Internet is almost
certainly human-readable and human-composed notes that
follow RFC-822 and friends. The basic form of such a text is a
series of headers, each beginning a line and separated from a
value by a colon; after the headers comes a blank line; and after
that a message body. In the simplest case, a message body is
just free-form text; but MIME headers can be used to nest
structured and diverse contents within a message body. Email
and (Usenet) discussion groups follow this format. Even other
protocols, like HTTP, share a top envelope structure with RFC-822.
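
As a minimal sketch of the header, blank line, and body layout just described, the standard library's email package can split such a text into its parts. The sample message, its addresses, and the variable names are invented for illustration.

```python
from email import message_from_string

# A hypothetical RFC-822 style note: headers, a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch?\n"
    "\n"
    "Are you free at noon?\n"
)

msg = message_from_string(raw)

# Header names map to values; everything after the blank line is the body.
headers = dict(msg.items())
body = msg.get_payload()

print(headers["Subject"])
print(body)
```

For a simple, non-MIME message like this one, get_payload() returns the body as a plain string; nested MIME contents would come back as a list of sub-messages instead.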

A strong second as Internet text formats go is HTML. And in
third place after that is XML, in various dialects. HTML, of
course, is the lingua franca of the Web; XML is a more general
standard for defining custom "applications" or "dialects," of
which HTML is (almost) one. In either case, rather than a
header composed of line-oriented fields followed by a body,
HTML/XML contain hierarchically nested "tags" with each tag
indicated by surrounding angle brackets. Tags like HTML's
[...], [...], and [...] will be familiar already to
most readers of this book. In any case, Python has a strong
collection of tools in its standard library for parsing and
producing HTML and XML text documents. In the case of XML,
some of these tools assist with specific XML dialects, while
lower-level underlying libraries treat XML sui generis. In some
cases, third-party modules fill gaps in the standard library.
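
To make the idea of hierarchically nested tags concrete, here is a small sketch using html.parser from the current standard library. The class and function names are my own, and the sketch assumes every start tag has a matching end tag (void elements such as br would throw the depth count off).

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def tag_outline(markup):
    """Return a list of (depth, tag) pairs for well-formed, paired markup."""
    parser = TagOutline()
    parser.feed(markup)
    parser.close()
    return parser.outline


for depth, tag in tag_outline("<html><body><p>Hi</p><p>There</p></body></html>"):
    print("  " * depth + tag)
```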



Various Python Internet modules are covered in varying depth
in this chapter. Every tool that comes with the Python standard
library is examined at least in summary. Those tools that I feel
are of greatest importance to application programmers (in text
processing applications) are documented in fair detail and
accompanied by usage examples, warnings, and tips.



Appendix B. A Data Compression Primer

Section B.1. Introduction
Section B.2. Lossless and Lossy Compression
Section B.3. A Data Set Example
Section B.4. Whitespace Compression
Section B.5. Run-Length Encoding
Section B.6. Huffman Encoding
Section B.7. Lempel-Ziv Compression
Section B.8. Solving the Right Problem
Section B.9. A Custom Text Compressor
Section B.10. References



B.1 Introduction

See Section 2.2.5 for details on compression capabilities
included in the Python standard library. This appendix is
intended to provide readers who are unfamiliar with data
compression a basic background on its techniques and theory.
The final section of this appendix provides a practical
example accompanied by some demonstration code of a
Huffman-inspired custom encoding.

Data compression is widely used in a variety of programming
contexts. All popular operating systems and programming
languages have numerous tools and libraries for dealing with
data compression of various sorts. The right choice of
compression tools and libraries for a particular application
depends on the characteristics of the data and application in
question: streaming versus file; expected patterns and
regularities in the data; relative importance of CPU usage,
memory usage, channel demands, and storage requirements;
and other factors.

Just what is data compression, anyway? The short answer is
that data compression removes redundancy from data; in
information-theoretic terms, compression increases the entropy
of the compressed text. But those statements are essentially
just true by definition. Redundancy can come in a lot of
different forms. Repeated bit sequences (11111111) are one
type. Repeated byte sequences are another (XXXXXXXX). But
more often redundancies tend to come on a larger scale, either
regularities of the data set taken as a whole, or sequences of
varying lengths that are relatively common. Basically, what data
compression aims at is finding algorithmic transformations of
data representations that will produce more compact
representations given "typical" data sets. If this description
seems a bit complex to unpack, read on to find some more
practical illustrations.
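
One rough way to see "removing redundancy" and "increasing entropy" at work is to measure the byte-level Shannon entropy of some redundant text before and after running it through zlib. This is only a sketch: the data set is a hypothetical, deterministically generated stand-in, and per-byte entropy is a crude proxy that ignores longer-range patterns.

```python
import math
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte."""
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical, deterministic stand-in for a redundant data set:
# a few prefixes repeated often, followed by varied four-digit suffixes.
prefixes = ("772", "772", "773", "774")
lines = [f"{prefixes[i % 4]}-{(i * 7919) % 10000:04d}" for i in range(2000)]
original = "\n".join(lines).encode("ascii")

compressed = zlib.compress(original)

print(len(original), round(byte_entropy(original), 2))
print(len(compressed), round(byte_entropy(compressed), 2))
```

The compressed form should be shorter while each of its bytes carries more information, which is the sense in which compression raises the entropy of the text.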



B.2 Lossless and Lossy Compression

There are actually two fundamentally different "styles" of data
compression: lossless and lossy. This appendix is generally
about lossless compression techniques, but the reader would be
well served to understand the distinction first. Lossless compression
involves a transformation of the representation of a data set
such that it is possible to reproduce exactly the original data set
by performing a decompression transformation. Lossy
compression is a representation that allows you to reproduce
something "pretty much like" the original data set. As a plus for
the lossy techniques, they can frequently produce far more
compact data representations than lossless compression
techniques can. Most often lossy compression techniques are
used for images, sound files, and video. Lossy compression may
be appropriate in these areas insofar as human observers do
not perceive the literal bit-pattern of a digital image/sound, but
rather more general "gestalt" features of the underlying
image/sound.

From the point of view of "normal" data, lossy compression is
not an option. We do not want a program that does "about the
same" thing as the one we wrote. We do not want a database
that contains "about the same" kind of information as what we
put into it. At least not for most purposes (and I know of few
practical uses of lossy compression outside of what are already
approximate mimetic representations of the real world, like
images and sounds).
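
The distinction can be sketched in a few lines. The lossless half uses zlib from the standard library; the lossy half is a deliberately crude quantizer of my own invention (the function names and the step parameter are illustrative assumptions, not any real codec).

```python
import zlib


def lossless_roundtrip(data: bytes) -> bytes:
    """Compress and decompress; the result is byte-for-byte the input."""
    return zlib.decompress(zlib.compress(data))


def lossy_roundtrip(samples, step=16):
    """Keep only the coarse level of each 0-255 sample value.

    Each sample is replaced by the midpoint of its bucket, so the result
    is "pretty much like" the input but generally not identical to it.
    """
    return [(s // step) * step + step // 2 for s in samples]


print(lossless_roundtrip(b"772-7628") == b"772-7628")
print(lossy_roundtrip([0, 100, 255]))
```

With the default step, every reconstructed sample lies within 8 of the original value, which may be acceptable for a pixel or an audio sample but never for a telephone number.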



B.3 A Data Set Example

For purposes of this appendix, let us start with a specific
hypothetical data representation. Here is an easy-to-understand
example. In the town of Greenfield, MA, the telephone prefixes
are 772-, 773-, and 774-. (For non-USA readers: In the USA,
local telephone numbers are seven digits and are conventionally
represented in the form ###-####; prefixes are assigned in
geographic blocks.) Suppose also that the first prefix is the
most widely assigned of the three. The suffix portions might
be any other digits, in fairly equal distribution. The data set we
are interested in is "the list of all the telephone numbers
currently in active use." One can imagine various reasons why
this might be interesting for programmatic purposes, but we
need not specify that herein.

Initially, the data set we are interested in comes in a particular
data representation: a multicolumn report (perhaps generated
as output of some query or compilation process). The first few
lines of this report might look like:

=============================================================

772-7628     772-8601     772-0113     773-3429     774-9833
773-4319     774-3920     772-0893     772-9934     773-8923
773-1134     772-4930     772-9390     774-9992     772-2314
[...]



B.4 Whitespace Compression

Whitespace compression can be characterized most generally as
"removing what we are not interested in." Even though this
technique is technically a lossy-compression technique, it is still
useful for many types of data representations we find in the real
world. For example, even though HTML is far more readable in a
text editor if indentation and vertical spacing are added, none of
this "whitespace" should make any difference to how the HTML
document is rendered by a Web browser. If you happen to know
that an HTML document is destined only for a Web browser (or
for a robot/spider), then it might be a good idea to take out all
the whitespace to make it transmit faster and occupy less space
in storage. What we remove in whitespace compression never
really had any functional purpose to start with.

In the case of our example in this article, it is possible to
remove quite a bit from the described report. The row of "="
across the top adds nothing functional, nor do the "-" within
numbers, nor the spaces between them. These are all useful for
a person reading the original report, but do not matter once we
think of it as data. What we remove is not precisely whitespace
in traditional terms, but the intent is the same.
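
Applied to the telephone report, this kind of "whitespace" compression might be sketched as below. The function name is my own, and the exact column spacing in the sample string is an assumption; only the digits matter to the result.

```python
# Sample rows from the report; the spacing between columns is assumed.
REPORT = """\
=============================================================
772-7628     772-8601     772-0113     773-3429     774-9833
773-4319     774-3920     772-0893     772-9934     773-8923
773-1134     772-4930     772-9390     774-9992     772-2314
"""


def squeeze_report(report: str) -> str:
    """Drop rules, dashes, spaces, and newlines; keep only the digits."""
    return "".join(ch for ch in report if ch.isdigit())


digits = squeeze_report(REPORT)
print(len(REPORT), len(digits))
```

Fifteen seven-digit numbers leave 105 characters of data out of a noticeably longer report.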

Whitespace compression is extremely "cheap" to perform. It is
just a matter of reading a stream of data and excluding a few
specific values from the output stream. In many cases, no
"decompression" step is involved at all. But even where we
would wish to re-create something close to the original
somewhere down the data stream, it should require little in
terms of CPU or memory. What we reproduce may or may not
be exactly what we started with, depending on just what rules
and constraints were involved in the original. An HTML page
typed by a human in a text editor will probably have spacing
that is idiosyncratic. Then again, automated tools often produce
"reasonable" indentation and spacing of HTML. In the case of
the rigid report format in our example, there is no reason that
the original representation could not be precisely produced by a
"decompressing formatter" down the data stream.

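A "decompressing formatter" for the rigid report could be as simple as the following sketch. The function name and its parameters (columns per row, the gap between columns, and the width of the rule) are assumptions about the report layout rather than details taken from the original.

```python
def format_report(digits: str, columns: int = 5, gap: str = "     ",
                  rule_width: int = 61) -> str:
    """Rebuild a multicolumn report from a string of 7-digit numbers."""
    # Re-insert the dash between the 3-digit prefix and 4-digit suffix.
    numbers = [
        f"{digits[i:i + 3]}-{digits[i + 3:i + 7]}"
        for i in range(0, len(digits), 7)
    ]
    lines = ["=" * rule_width]
    # Lay the numbers out in fixed-width rows.
    for i in range(0, len(numbers), columns):
        lines.append(gap.join(numbers[i:i + columns]))
    return "\n".join(lines) + "\n"


print(format_report("772762877286017720113", columns=3))
```

Because the layout rules are fixed, the same digits always yield the same report, so nothing about the original is lost as long as the formatter's assumptions match the program that produced it.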


B.5 Run-Length Encoding

Run-length encoding (RLE) is the simplest widely used lossless
compression technique. Like whitespace compression, it is
"cheap," especially to decode. The idea behind it is that many
data representations consist largely of strings of repeated
bytes. Our example report is one such data representation. It
begins with a string of repeated "=", and has strings of spaces
scattered through it. Rather than represent each character with
its own byte, RLE will (sometimes or always) have an iteration
count followed by the character to be repeated.

If repeated bytes are predominant within the expected data
representation, it might be adequate and efficient to always
have the algorithm specify one or more bytes of iteration count,
followed by one character. However, if one-length character
strings occur, these strings will require two (or more) bytes to
encode them; that is, 00000001 01011000 might be the output
bit stream required for just one ASCII "X" of the input stream.
Then again, a hundred "X" in a row would be output as
01100100 01011000, which is quite good.
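
The "always emit a count" scheme can be sketched as follows. The function names are mine, and I assume a single count byte, so runs longer than 255 are split into several pairs.

```python
from itertools import groupby


def rle_encode(data: bytes) -> bytes:
    """Encode every run as a (count, value) byte pair."""
    out = bytearray()
    for value, run in groupby(data):
        remaining = len(list(run))
        while remaining > 0:
            chunk = min(remaining, 255)  # one count byte holds at most 255
            out += bytes([chunk, value])
            remaining -= chunk
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    out = bytearray()
    for i in range(0, len(encoded), 2):
        count, value = encoded[i], encoded[i + 1]
        out += bytes([value]) * count
    return bytes(out)


print(rle_encode(b"X").hex(), rle_encode(b"X" * 100).hex())
```

A lone "X" doubles in size to the two bytes 01 58 (hex), while a hundred of them shrink to the two bytes 64 58, matching the bit patterns given above.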

What is frequently done in RLE variants is to selectively use
bytes to indicate iterator counts and otherwise just have bytes
represent themselves. At least one byte-value has to be
reserved to do this, but that can be escaped in the output, if
needed. For example, in our example telephone-number report,
we know that everything in the input stream is plain ASCII
characters. Specifically, they all have the first (high-order) bit
of their ASCII value as 0. We could use this first ASCII bit to
indicate that an iterator count was being represented rather than
representing a regular character. The next seven bits of the
iterator byte could be used for the iterator count, and the next
byte could represent the character to be repeated. So, for
example, we could represent the string "YXXXXXXXX" as:

"Y"       Iter(8)   "X"
01011001  10001000  01011000

This example does not show how to escape iterator byte-values,
nor does it allow iteration of more than 127 occurrences of a
character. Variations on RLE deal with issues such as these, if
needed.
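
A sketch of this high-bit variant follows. The names are my own, and two choices go beyond what is described above: runs longer than 127 are split into several count/character pairs, and non-ASCII input is simply rejected rather than escaped.

```python
from itertools import groupby


def rle7_encode(text: bytes) -> bytes:
    """RLE for 7-bit ASCII: a byte with the high bit set is a run count."""
    out = bytearray()
    for value, run in groupby(text):
        if value > 127:
            raise ValueError("input must be plain 7-bit ASCII")
        remaining = len(list(run))
        while remaining > 0:
            chunk = min(remaining, 127)  # seven bits of count
            if chunk == 1:
                out.append(value)  # a lone character represents itself
            else:
                out += bytes([0x80 | chunk, value])
            remaining -= chunk
    return bytes(out)


def rle7_decode(encoded: bytes) -> bytes:
    """Invert rle7_encode."""
    out = bytearray()
    stream = iter(encoded)
    for byte in stream:
        if byte & 0x80:
            # Count byte: the following byte is the character to repeat.
            out += bytes([next(stream)]) * (byte & 0x7F)
        else:
            out.append(byte)
    return bytes(out)


print(rle7_encode(b"YXXXXXXXX").hex())
```

The nine-character string "YXXXXXXXX" comes out as three bytes: the literal "Y", a count byte with the high bit set and the value 8 in its low seven bits, and the "X" to repeat.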



B.6 Huffman Encoding

Huffman encoding looks at the symbol table of a whole data
set. The compression is achieved by finding the "weights" of
each symbol in the data set. Some symbols occur more
frequently than others, so Huffman encoding suggests that the
frequent symbols need not be encoded using as many bits as
the less-frequent symbols. There are variations on Huffman-style
encoding, but the original (and frequent) variation involves
looking for the most common symbol and encoding it using just
one bit, say 1. If you encounter a 0, you know you're on the
way to encoding a longer variable length symbol.
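
For readers who want to see where such codes come from, here is a sketch of the textbook Huffman construction using a heap. It is a generic illustration with names of my own choosing, and the sample text is invented; note that the classic algorithm gives the most common symbol a one-bit code only when that symbol is frequent enough, as it is here.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict:
    """Build a prefix-free {symbol: bit string} table from symbol weights."""
    counts = Counter(text)
    # Heap entries are (weight, tiebreak, partial code table); the integer
    # tiebreak keeps the dictionaries from ever being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Invented sample in which "7" outweighs all other digits combined.
sample = "7" * 50 + "2" * 10 + "3" * 10 + "0145689" * 2
table = huffman_code(sample)
for symbol in sorted(table, key=lambda s: (len(table[s]), s)):
    print(symbol, table[symbol])
```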

Let's imagine we apply Huffman encoding to our local phone-book
example (assume we have already whitespace-compressed the
report). We might get:

Encoding  Symbol
1         7
010       2
011       3
00000     4
00001     5
00010     6
00011     8
00100     9
00101     0
00111     1
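
The table can be exercised directly. In this sketch the function and variable names are mine, and bits are kept as a string of "0" and "1" characters for readability rather than packed into bytes.

```python
# The encoding table given above, keyed by symbol.
CODES = {
    "7": "1", "2": "010", "3": "011", "4": "00000", "5": "00001",
    "6": "00010", "8": "00011", "9": "00100", "0": "00101", "1": "00111",
}
DECODE = {bits: symbol for symbol, bits in CODES.items()}


def encode(digits: str) -> str:
    """Concatenate the variable-length code of each digit."""
    return "".join(CODES[d] for d in digits)


def decode(bits: str) -> str:
    """Read bits until they match a code; no code is a prefix of another."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


bits = encode("7727628")
print(bits, len(bits))
```

The seven digits of 772-7628 come to 19 bits, and because no code in the table is a prefix of another, the decoder needs no separators between symbols.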

Our initial symbol set of digits could already be
straightforwardly encoded (with no compression) as 4-bit
sequences (nibbles). The Huffman encoding given will use up to
5 bits for the worst-case symbols, which is obviously worse
than the nibble encoding. However, our best case will use only 1
bit, and we know that our best case is also the most frequent
case, by having scanned the data set. So we might encode a


