Chapter 16. Spell Checking, Word Counting, and Textual Analysis


Alcuin

ditroff

printerr

LaserWriter

PostScript

TranScript

Only one word in this list is actually misspelled.

On many Unix systems, you can supply a local dictionary file so that spell recognizes special words and terms specific to your site or application. After you have run spell and looked through the word list, you can create a file containing the words that were not actual misspellings. The spell command will check this list after it has gone through its own dictionary. On certain systems, your word list file must be sorted (Section 22.1).

If you added the special terms in a file named dict, you could specify that file on the command line using the + option:



$ spell +dict sample
printerr

The output is reduced to the single misspelling.
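The effect of a local word list can be approximated with ordinary filters, which may help on systems where spell has no + option. What follows is a minimal sketch, not a description of how spell itself works: filter_known is a hypothetical helper name, and dict is the local word list from the example above.

```shell
# filter_known DICT: read candidate words on standard input, one per line,
# and print only those that do not appear as a whole line in DICT.
# (Hypothetical helper; it only approximates what "spell +dict" achieves.)
filter_known() {
    # -v invert the match, -x match whole lines only,
    # -F treat patterns as fixed strings, -f read patterns from a file
    grep -v -x -F -f "$1"
}

# Example usage, assuming a spell command is installed:
#   spell sample | filter_known dict
```

Because grep compares whole lines literally, the word list does not need to be sorted for this sketch, although keeping it sorted does no harm.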

The spell command will make some errors based on incorrect derivation of spellings from the root words contained in its dictionary. If you understand how spell works (Section 16.4), you may be less surprised by some of these errors.

As stated at the beginning, spell isn't on all Unix systems, e.g., Darwin and FreeBSD. In these other environments, check for an alternative spell checker, such as ispell (Section 16.2). Or you can download and install the GNU version of spell at http://www.gnu.org/directory/spell.html.

-- DD and SP



16.2 Check Spelling Interactively with ispell



The original Unix spell-checking program, spell (Section 16.1), is fine for quick checks of spelling in a short document, but it makes you cry out for a real spellchecker, which not only shows you the misspelled words in context, but offers to change them for you.

Go to http://examples.oreilly.com/upt3 for more information on: ispell

ispell, a very useful program that's been ported to Unix and enhanced over the years, does all this and more. Either it will be preinstalled or you'll need to install it for your Unix version.

Here's the basic usage: just as with spell, you spellcheck a document by giving ispell a filename. But there the similarities cease. ispell takes over your screen or window, printing two lines of context at the bottom of the screen. If your terminal can do reverse video, the offending word is highlighted. Several alternate possibilities are presented in the upper-left corner of the screen: any word in ispell's dictionary that differs by only one letter, has a missing or extra letter, or has transposed letters.

Faced with a highlighted word, you have eight choices:

SPACE
    Press the spacebar to accept the current spelling.

A
    Type A to accept the current spelling, now and for the rest of this input file.

I
    Type I to accept the current spelling now and for the rest of this input file and also to instruct ispell to add the word to your private dictionary. By default, the private dictionary is the file .ispell_words in your home directory, but it can be changed with the -p option or by setting the environment variable (Section 35.3) WORDLIST to the name of some other file. If you work with computers, this option will come in handy since we use so much jargon in this business! It makes a lot more sense to "teach" all those words to ispell than to keep being offered them for possible correction. (One gotcha: when specifying an alternate file, you must use an absolute pathname (Section 1.14), or ispell will look for the file in your home directory.)

0-9
    Type the digit corresponding to one of ispell's alternative suggestions to use that spelling instead. For example, if you've typed "hnadle," as I did when writing this article, ispell will offer 0: handle in the upper-left corner of your screen. Typing 0 makes the change and moves on to the next misspelling, if any.

R
    Type R if none of ispell's offerings do the trick and you want to be prompted for a replacement. Type in the new word, and the replacement is made.

L
    Type L if ispell didn't make any helpful suggestions and you're at a loss as to how to spell the word correctly. ispell will prompt you for a lookup string. You can use * as a wildcard character (it appears to substitute for zero or one characters); ispell will print a list of matching words from its dictionary.

Q
    Type Q to quit, writing any changes made so far, but ignoring any misspellings later in the input file.

X
    Type X to quit without writing any changes.

But that's not all! ispell also saves a copy of your original file with a .bak extension, just in case you regret any of your changes. If you don't want ispell making .bak files, invoke it with the -x option.

How about this: ispell knows about capitalization. It already knows about proper names and a lot of common acronyms, and it can even handle words like "TeX" that have oddball capitalization. Speaking of TeX, ispell has special modes in which it recognizes TeX constructions.

If ispell isn't on your system by default, you should be able to find an installation of it packaged in your system's own unique software-installation packaging, discussed in Chapter 40.

In addition, you can also look for a newer spell-checking utility, aspell, based on ispell but with improved processing. Though aspell is being considered a replacement for ispell, the latter is still the most commonly found and used of the two.

-- TOR



16.3 How Do I Spell That Word?

Are you writing a document and want to check the spelling of a word before you finish (if you aren't using a word processor with automatic spelling correction, that is)? A Unix system gives you several ways to do this.

Because this is Unix, you can use any of these approaches when you write a script of your own.

1. If you aren't sure which of two possible spellings is right, you can use the spell command with no arguments to find out. Type the name of the command, followed by a RETURN, then type the alternative spellings you are considering. Press CTRL-d (on a line by itself) to end the list. The spell command will echo back the word(s) in the list that it considers to be in error:



$ spell
misspelling
mispelling
CTRL-d
mispelling

If you're using ispell (Section 16.2) or the newer aspell, you need to add the -a option. The purpose of this option is to let the speller interact with other programs; there are details in the programs' documentation. But, like most Unix filters, you can also let these programs read a word from standard input and write their response on standard output; it will either tell you that the spelling is right or give you a list of suggestions. aspell and ispell will use their local dictionaries and improved spelling rules.

As an example, let's check the spelling of outragous and whut with both ispell and aspell:



$ ispell -a
@(#) International Ispell Version 3.1.20 10/10/95
outragous whut
& outragous 1 0: outrageous
& whut 5 10: hut, shut, what, whet, whit

CTRL-d
$ aspell -a
@(#) International Ispell Version 3.1.20 (but really As
outragous whut
& outragous 3 0: outrageous, outrages, outrage's
& whut 5 10: what, whet, whit, hut, shut
CTRL-d
$

When these spellers start, they print a version message and wait for input. I type the words I want to check and press RETURN. The speller returns one result line for each word:

    A result of * means the word is spelled correctly.

    A line starting with & means the speller has suggestions. Then it repeats the word, the number of suggestions it has for that word, the character position that the word had on the input line, and finally the suggestions.

    So ispell suggested that outragous might be outrageous. aspell also came up with outrages and outrage's. (I'd say that outrage's is barely a word. Be careful with aspell's suggestions.) Both spellers had five suggestions for whut; the differences are interesting...

    A result of # means there were no suggestions.

After processing a line, the spellers both print an empty line. Press CTRL-d to end input.
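Since the -a output is meant for other programs, a script can turn it into something easier to scan. The sketch below assumes only the three result markers described above (*, &, and #) plus the version banner; summarize_spell_a is a hypothetical helper name, and real spellers may emit other line types that this filter silently ignores.

```shell
# summarize_spell_a: read "ispell -a" style result lines on standard input
# and print one summary line per result. (Hypothetical helper name.)
summarize_spell_a() {
    awk '
        /^@/  { next }                       # version banner
        /^$/  { next }                       # empty line printed after each input line
        /^\*/ { print "ok"; next }           # word found; this line does not repeat the word
        /^#/  { print $2 ": no suggestions"; next }
        /^&/  {
            word = $2                        # second field is the word as typed
            sub(/^[^:]*: */, "")             # drop everything up to and including the colon
            print word ": " $0               # what remains is the suggestion list
        }
    '
}

# Example usage, assuming ispell is installed:
#   echo "outragous whut" | ispell -a | summarize_spell_a
```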

Another way to do the same thing is with look (Section 13.14). With just one argument, look searches the system word file, /usr/dict/words, for words starting with the characters in that one argument. That's a good way to check spelling or find a related word:



% look help
help
helpful
helpmate

look uses its -df options automatically when it searches the word list. -d ignores any character that isn't a letter, number, space or tab; -f treats upper- and lowercase letters the same.
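On a system without look, or with a word list in an unusual place, a similar prefix search can be sketched with awk. This is an approximation under stated assumptions: prefix_lookup is a hypothetical helper name, it folds case like -f but does not emulate -d, and it scans the file line by line, so it does not need a sorted word list.

```shell
# prefix_lookup PREFIX FILE: print the lines of FILE that begin with PREFIX,
# ignoring upper/lowercase differences. (Hypothetical helper name.)
prefix_lookup() {
    awk -v p="$1" '
        BEGIN { p = tolower(p) }
        # index() returns 1 when the folded line starts with the folded prefix
        index(tolower($0), p) == 1
    ' "$2"
}

# Example usage; the word-file location varies between systems:
#   prefix_lookup help /usr/dict/words
```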

-- JP and DD



16.4 Inside spell



[If you have ispell (Section 16.2), there's not a whole lot of reason for using spell any more. Not only is ispell more powerful, it's a heck of a lot easier to update its spelling dictionaries. Nonetheless, we decided to include this article, because it clarifies the kinds of rules that spellcheckers go through to expand on the words in their dictionaries. -- TOR]

On many Unix systems, the directory /usr/lib/spell contains the main program invoked by the spell command along with auxiliary programs and data files.

On some systems, the spell command is a shell script that pipes its input through deroff -w and sort -u (Section 22.6) to remove formatting codes and prepare a sorted word list, one word per line. On other systems, it is a standalone program that does these steps internally. Two separate spelling lists are maintained, one for American usage and one for British usage (invoked with the -b option to spell). These lists, hlista and hlistb, cannot be read or updated directly. They are compressed files, compiled from a list of words represented as nine-digit hash codes. (Hash coding is a special technique used to search for information quickly.)
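The word-list preparation step can be imitated with standard tools. The sketch below is an illustration only, under several assumptions: it uses tr in place of deroff -w (so it does not strip formatting codes), a plain sorted text file in place of the hashed lists, and an exact, case-sensitive comparison. unknown_words is a hypothetical helper name.

```shell
# unknown_words TEXT DICT: print the words of TEXT that are not lines of DICT.
# (Hypothetical helper; a plain-text stand-in for the hashed lookup.)
unknown_words() {
    dict_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted the same way, so pin the locale
    LC_ALL=C sort -u "$2" > "$dict_sorted"
    # tr -cs turns every run of non-letters into one newline: one word per line
    tr -cs 'A-Za-z' '\n' < "$1" |
        sed '/^$/d' |                         # drop the empty line a leading non-letter leaves
        LC_ALL=C sort -u |                    # sorted, duplicates removed
        LC_ALL=C comm -23 - "$dict_sorted"    # keep lines found only in the text
    rm -f "$dict_sorted"
}
```

Sorting both lists first is what lets comm do the comparison in a single pass over each file.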

The main program invoked by spell is spellprog. It loads the list of hash codes from either hlista or hlistb into a table, and it looks for the hash code corresponding to each word on the sorted word list. This eliminates all words (or hash codes) actually found in the spelling list. For the remaining words, spellprog tries to derive a recognizable word by performing various operations on the word stem based on suffix and prefix rules. A few of these manipulations follow:

-y+iness    +ness    -y+i+less    +less    -y+ies    -t+ce    -t+cy

The new words created as a result of these manipulations will be checked once more against the spell table. However, before the stem-derivative rules are applied, the remaining words are checked against a table of hash codes built from the file hstop. The stop list contains typical misspellings that stem-derivative operations might allow to pass. For instance, the misspelled word thier would be converted into thy using the suffix rule -y+ier. The hstop file accounts for as many cases of this type of error as possible.
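To see why a stop list is needed, it helps to undo a few suffix rules by hand. The following is a rough sketch that assumes a rule such as -y+iness means "drop y from the stem, then add iness", so reversing it strips iness and restores y. It covers only the rules quoted in this article, and candidate_stems is a hypothetical helper name, not part of spell.

```shell
# candidate_stems WORD: print possible stems of WORD, one per line, by
# reversing a handful of suffix rules. (Hypothetical helper name.)
candidate_stems() {
    w=$1
    case $w in *iness) echo "${w%iness}y" ;; esac   # -y+iness
    case $w in *ness)  echo "${w%ness}"   ;; esac   # +ness
    case $w in *iless) echo "${w%iless}y" ;; esac   # -y+i+less
    case $w in *less)  echo "${w%less}"   ;; esac   # +less
    case $w in *ies)   echo "${w%ies}y"   ;; esac   # -y+ies
    case $w in *ier)   echo "${w%ier}y"   ;; esac   # -y+ier
    case $w in *ce)    echo "${w%ce}t"    ;; esac   # -t+ce
    case $w in *cy)    echo "${w%cy}t"    ;; esac   # -t+cy
}

# candidate_stems thier prints "thy", a real word, which is how the
# misspelling would slip through if it were not on the stop list.
```

A word can match more than one rule; for example, happiness yields both happy and happi, and only the first would be found in a dictionary.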

The final output consists of words not found in the spell list (even after the program tried to search for their stems) and words that were found in the stop list.

You can get a better sense of these rules in action by using the -v or -x option.

The -v option eliminates the last look-up in the table and produces a list of words that are not actually in the spelling list, along with possible derivatives. It allows you to see which words were found as a result of stem-derivative operations and prints the rule used. (Refer to the sample file in Section 16.1.)



% spell -v sample
Alcuin
ditroff
LaserWriter
PostScript
printerr
TranScript
+out    output
+s      uses

The -x option makes spell begin at the stem-derivative stage and prints the various attempts it makes to find the stem of each word.



% spell -x sample
...
=into
=LaserWriter
=LaserWrite
=LaserWrit
=laserWriter
=laserWrite
=laserWrit
=output
=put
...
LaserWriter
...

The stem is preceded by an equals sign (=). At the end of the output are the words whose stem does not appear in the spell list.

One other file you should know about is spellhist. On some systems, each time you run spell, the output is appended through tee (Section 43.8) into spellhist, in effect creating a list of all the misspelled or unrecognized words for your site. The spellhist file is something of a "garbage" file that keeps on growing: you will want to reduce it or remove it periodically. To extract useful information from this spellhist, you might use the sort and uniq -c (Section 21.20) commands to compile a list of misspelled words or special terms that occur most frequently. It is possible to add these words back into the basic spelling dictionary, but this is too complex a process to describe here. It's probably easier just to use a local spelling dictionary (Section 16.1). Even better, use ispell; not only is it a more powerful spelling program, it is much easier to update the word lists it uses (Section 16.5).
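The sort and uniq -c combination mentioned above might look like the following sketch. It assumes spellhist holds one word per line; top_terms is a hypothetical helper name, and the spellhist location shown in the usage comment is an assumption that varies between systems.

```shell
# top_terms FILE [N]: print the N most frequent lines of FILE (default 10)
# as "count word", most frequent first. (Hypothetical helper name.)
top_terms() {
    sort "$1" |                    # group identical words together
        uniq -c |                  # collapse each group to "count word"
        sort -rn |                 # highest counts first
        head -n "${2:-10}" |
        awk '{ print $1, $2 }'     # normalize the padding uniq -c adds
}

# Example usage (path is an assumption):
#   top_terms /usr/lib/spell/spellhist 20
```

Words that come up again and again in this list are good candidates for a local dictionary.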

-- DD



16.5 Adding Words to ispell's Dictionary

ispell (Section 16.2) uses two lists for spelling verification: a master word list and a supplemental personal word list.

The master word list for ispell is normally the file /usr/local/lib/ispell/ispell.hash, though the location of the file can vary on your system. This is a "hashed" dictionary file. That is, it has been converted to a condensed, program-readable form using the buildhash program (which comes with ispell) to speed the spell-checking process.

The personal word list is normally a file called .ispell_english or .ispell_words in your home directory. (You can override this default with either the -p command-line option or the WORDLIST environment variable (Section 35.3).) This file is simply a list of words, one per line, so you can readily edit it to add, alter, or remove entries. The personal word list is normally used in addition to the master word list, so if a word usage is permitted by either list it is not flagged by ispell.

Custom personal word lists are particularly useful for checking documents that use jargon or special technical words that are not in the master word list, and for personal needs such as holding the names of your correspondents. You may choose to keep more than one custom word list to meet various special requirements.

You can add to your personal word list any time you use ispell: simply use the I command to tell ispell that the word it offered as a misspelling is actually correct, and should be added to the dictionary. You can also add a list of words from a file using the ispell -a (Section 16.3) option. The words must be one to a line, but need not be sorted. Each word to be added must be preceded with an asterisk. (Why? Because ispell -a has other functions as well.) So, for example, we could have added a list of Unix utility names to our personal dictionaries all at once, rather than one-by-one as they were encountered during spell checking.
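Preparing such a file by hand is tedious, so a one-line filter can add the asterisks. This is a sketch under stated assumptions: ispell_add_commands is a hypothetical helper name and unix_terms a hypothetical file. The trailing # line in the usage comment is not covered in this article; it comes from the ispell manual page, where it is described as the command that saves the personal dictionary, so check your own version before relying on it.

```shell
# ispell_add_commands FILE: turn a plain word list (one word per line) into
# the "*word" form that ispell -a accepts, skipping blank lines.
# (Hypothetical helper name.)
ispell_add_commands() {
    sed -e '/^[[:space:]]*$/d' -e 's/^/*/' "$1"
}

# Example usage, assuming ispell is installed (not run here):
#   { ispell_add_commands unix_terms; echo '#'; } | ispell -a
```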

Obviously, though, in an environment where many people are working with the same set of technical terms, it doesn't make sense for each individual to add the same word list to his own private .ispell_words file. It would make far more sense for a group to agree on a common dictionary for specialized terms and always to set WORDLIST to point to that common dictionary.

If the private word list gets too long, you can create a "munched" word list. The munchlist script that comes with ispell reduces the words in a word list to a set of word roots and permitted suffixes according to rules described in the ispell(4) reference page that will be installed with ispell from the CD-ROM [see http://examples.oreilly.com/upt3]. This creates a more compact but still editable word list.

Another option is to provide an alternative master spelling list using the -d option. This has two problems, though:

1. The master spelling list should include spellings that are always valid, regardless of context. You do not want to overload your master word list with terms that might be misspellings in a different context. For example, perl is a powerful programming language, but in other contexts, perl might be a misspelling of pearl. You may want to place perl in a supplemental word list when documenting Unix utilities, but you probably wouldn't want it in the master word list unless you were documenting Unix utilities most of the time that you use ispell.

2. The -d option must point to a hashed dictionary file. What's more, you cannot edit a hashed dictionary; you will have to edit a master word list and use (or have the system administrator use) buildhash to hash the new dictionary to optimize spellchecker performance.

To build a new hashed word list, provide buildhash with a complete list of the words you want included, one per line. (The buildhash utility can only process a raw word list, not a munched word list.) The standard system word list, /usr/dict/words on many systems, can provide a good starting point. This file is writable only by the system administrator and probably shouldn't be changed in any case. So make a copy of this file, and edit or add to the copy. After processing the file with buildhash, you can either replace the default ispell.hash file or point to your new hashed file with the -d option.
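The copy-and-extend step can be scripted. In the sketch below, merge_wordlists is a hypothetical helper name, and the file names site_terms, mywords, and english.aff are placeholders. The buildhash arguments shown in the comment (word list, affix file, output hash file) follow the ispell documentation as commonly distributed; treat them as an assumption and check the manual page on your system.

```shell
# merge_wordlists BASE EXTRA: print the union of two word lists, one word
# per line, sorted with duplicates and blank lines removed.
# (Hypothetical helper name.)
merge_wordlists() {
    cat "$1" "$2" | sed '/^$/d' | LC_ALL=C sort -u
}

# Example usage (not run here; file names are placeholders):
#   merge_wordlists /usr/dict/words site_terms > mywords
#   buildhash mywords english.aff mywords.hash
#   ispell -d "$PWD/mywords.hash" report.txt
```

Working on a merged copy leaves the system word list untouched, as the text recommends.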

-- TOR and LK



16.6 Counting Lines, Words, and Characters: wc

The wc (word count) command counts the number of lines, words, and characters in the files you specify. (Like most Unix utilities, wc reads from its standard input if you don't specify a filename.) For example, the file letter has 120 lines, 734 words, and 4,297 characters:



% wc letter
    120    734   4297 letter

You can restrict what is counted by specifying the options -l (count lines only), -w (count words only), and -c (count characters only). For example, you can count the number of lines in a file:



% wc -l letter
    120 letter

or you can count the number of files in a directory:
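A minimal sketch of that idea, assuming ls prints one name per line when its output goes to a pipe (count_files is a hypothetical helper name):

```shell
# count_files DIR: print the number of entries that ls lists in DIR.
# Hidden files are not counted, and a file name containing a newline
# would be counted more than once.
count_files() {
    # wc reads standard input here because no filename is given;
    # tr removes the leading padding some versions of wc print
    ls "$1" | wc -l | tr -d ' '
}

# Example usage:
#   count_files /usr/lib/spell
```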


