Hack 19. Spider the Yahoo! Catalog


Yahoo! is not one site but a collection of thematic sites, each with its own document layout, naming conventions, and peculiarities in page design and URL patterns. For example, if you check links to the same directory section on different Yahoo! sites, you will find that some of them begin with http://www.yahoo.com/r, some begin with http://uk.yahoo.com/r/hp/dr, and others begin with http://kr.yahoo.com.

If you try to look for patterns, you will soon find yourself writing long if/elsif/else sections that are hard to maintain and need to be rewritten every time Yahoo! makes a small change to one of its sites. If you follow that route, you will soon discover that you need to write hundreds of lines of code to describe every kind of behavior you want to build into your spider.

This is particularly frustrating to programmers who expect to write code that uses elegant algorithms and nicely structured data. The hard truth about portals is that you cannot expect elegance and ease of spidering. Instead, prepare yourself for a lot of detective work and writing (and throwing away) chunks of code in a hit-and-miss fashion. Portal spiders are written in an organic, unstructured way, and the only rule you should follow is to keep things simple and add specific functionality only once you have the general behavior working.

Okay, with taxonomy and general advice behind us, we can get to the gist of the matter. The spider in this hack is a relatively simple tool for crawling Yahoo! sites. It makes no assumptions about the layout of the sites; in fact, it makes almost no assumptions whatsoever and can easily be adapted to other portals or even groups of portals. You can use it as a framework for writing specialized spiders.



1.20.1. The Code

Save the following code to a file called yspider.pl:



































#!/usr/bin/perl -w
#
# yspider.pl
#
# Yahoo! Spider -- crawls Yahoo! sites, collects links from each downloaded
# page, searches each downloaded page and prints a list of results when done.
# http://www.artymiak.com/software/ or contact jacek@artymiak.com
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict;
use Getopt::Std;      # parse command-line options.
use LWP::UserAgent;   # download data from the Net.
use HTML::LinkExtor;  # get links inside an HTML document.
use URI::URL;         # turn relative URLs into absolute ones.

my $help = <<"EOH";
----------------------------------------------------------------------------
Yahoo! Spider.

Options: -s    list of sites you want to crawl,
               e.g. -s 'us china denmark'
         -h    print this help

Allowed values of -s are:

   argentina, asia, australia, brazil, canada,
   catalan, china, denmark, france, germany, hongkong,
   india, ireland, italy, japan, korea, mexico,
   newzealand, norway, singapore, spain, sweden, taiwan,
   uk, us, us_chinese, us_spanish

Please, use this code responsibly.  Flooding any site
with excessive queries is bad net citizenship.
----------------------------------------------------------------------------
EOH

# define our arguments and
# show the help if asked.
my %args; getopts("s:h", \%args);
die $help if exists $args{h};

# The list of code names, and
# URLs, for various Yahoo! sites.
my %ys = (
   argentina => "http://ar.yahoo.com",  asia       => "http://asia.yahoo.com",
   australia => "http://au.yahoo.com",  newzealand => "http://nz.yahoo.com",
   brazil    => "http://br.yahoo.com",  canada     => "http://ca.yahoo.com",
   catalan   => "http://ct.yahoo.com",  china      => "http://cn.yahoo.com",
   denmark   => "http://dk.yahoo.com",  france     => "http://fr.yahoo.com",
   germany   => "http://de.yahoo.com",  hongkong   => "http://hk.yahoo.com",
   india     => "http://in.yahoo.com",  italy      => "http://it.yahoo.com",
   korea     => "http://kr.yahoo.com",  mexico     => "http://mx.yahoo.com",
   norway    => "http://no.yahoo.com",  singapore  => "http://sg.yahoo.com",
   spain     => "http://es.yahoo.com",  sweden     => "http://se.yahoo.com",
   taiwan    => "http://tw.yahoo.com",  uk         => "http://uk.yahoo.com",
   ireland   => "http://uk.yahoo.com",  us         => "http://www.yahoo.com",
   japan      => "http://www.yahoo.co.jp",
   us_chinese => "http://chinese.yahoo.com",
   us_spanish => "http://espanol.yahoo.com"
);

# if the -s option was used, check to make
# sure it matches one of our existing codes
# above. if not, or no -s was passed, help.
my @sites; # which locales to spider.
if (exists $args{'s'}) {
    @sites = split(/ /, lc($args{'s'}));
    foreach my $site (@sites) {
        die "UNKNOWN: $site\n\n$help" unless $ys{$site};
    }
} else { die $help; }

# Defines global and local profiles for URLs extracted from the
# downloaded pages. These profiles are used to determine whether the
# URLs extracted from each new document should be placed on the
# TODO list (%todo) or rejected (%rejects). Profiles are
# made of chunks of text, which are matched against found URLs.
# Any special characters, like slash (/) or dot (.), must be
# escaped. Remember that globals have precedence over locals.
my %rules = (
   global     => { allow => [], deny => [ 'search', '\*' ] },
   argentina  => { allow => [ 'http:\/\/ar\.' ],        deny => [] },
   asia       => { allow => [ 'http:\/\/(aa|asia)\.' ], deny => [] },
   australia  => { allow => [ 'http:\/\/au\.' ],        deny => [] },
   brazil     => { allow => [ 'http:\/\/br\.' ],        deny => [] },
   canada     => { allow => [ 'http:\/\/ca\.' ],        deny => [] },
   catalan    => { allow => [ 'http:\/\/ct\.' ],        deny => [] },
   china      => { allow => [ 'http:\/\/cn\.' ],        deny => [] },
   denmark    => { allow => [ 'http:\/\/dk\.' ],        deny => [] },
   france     => { allow => [ 'http:\/\/fr\.' ],        deny => [] },
   germany    => { allow => [ 'http:\/\/de\.' ],        deny => [] },
   hongkong   => { allow => [ 'http:\/\/hk\.' ],        deny => [] },
   india      => { allow => [ 'http:\/\/in\.' ],        deny => [] },
   ireland    => { allow => [ 'http:\/\/uk\.' ],        deny => [] },
   italy      => { allow => [ 'http:\/\/it\.' ],        deny => [] },
   japan      => { allow => [ 'yahoo\.co\.jp' ],        deny => [] },
   korea      => { allow => [ 'http:\/\/kr\.' ],        deny => [] },
   mexico     => { allow => [ 'http:\/\/mx\.' ],        deny => [] },
   newzealand => { allow => [ 'http:\/\/nz\.' ],        deny => [] },
   norway     => { allow => [ 'http:\/\/no\.' ],        deny => [] },
   singapore  => { allow => [ 'http:\/\/sg\.' ],        deny => [] },
   spain      => { allow => [ 'http:\/\/es\.' ],        deny => [] },
   sweden     => { allow => [ 'http:\/\/se\.' ],        deny => [] },
   taiwan     => { allow => [ 'http:\/\/tw\.' ],        deny => [] },
   uk         => { allow => [ 'http:\/\/uk\.' ],        deny => [] },
   us         => { allow => [ 'http:\/\/(dir|www)\.' ], deny => [] },
   us_chinese => { allow => [ 'http:\/\/chinese\.' ],   deny => [] },
   us_spanish => { allow => [ 'http:\/\/espanol\.' ],   deny => [] },
);

my %todo    = ( );  # URLs to parse
my %done    = ( );  # parsed/finished URLs
my %errors  = ( );  # broken URLs with errors.
my %rejects = ( );  # URLs rejected by the script

# print out a "we're off!" line, then
# begin walking the sites we've been told to.
print "=" x 80 . "\nStarted Yahoo! spider...\n" . "=" x 80 . "\n";
our $site; foreach $site (@sites) {
    # for each of the sites that have been passed on the
    # command line, we make a title for them, add them to
    # the TODO list for downloading, then call walksite(),
    # which downloads the URL, looks for more URLs, etc.
    my $title = "Yahoo! " . ucfirst($site) . " front page";
    $todo{$ys{$site}} = $title; walksite( ); # process.
}

# once we're all done with all the URLs, we print a
# report about all the information we've gone through.
print "=" x 80 . "\nURLs downloaded and parsed:\n" . "=" x 80 . "\n";
foreach my $url (keys %done)    { print "$url => $done{$url}\n"; }
print "=" x 80 . "\nURLs that couldn't be downloaded:\n" . "=" x 80 . "\n";
foreach my $url (keys %errors)  { print "$url => $errors{$url}\n"; }
print "=" x 80 . "\nURLs that got rejected:\n" . "=" x 80 . "\n";
foreach my $url (keys %rejects) { print "$url => $rejects{$url}\n"; }

# this routine grabs the first entry in our todo
# list, downloads the content, and looks for more URLs.
# we stay in walksite until there are no more URLs
# in our todo list, which could be a good long time.
sub walksite {
    do {
        # get first URL to do.
        my $url = (keys %todo)[0];

        # download this URL.
        print "-> trying $url ...\n";
        my $browser = LWP::UserAgent->new;
        my $resp = $browser->get( $url, 'User-Agent' => 'yspider.pl' );

        # check the results.
        if ($resp->is_success) {
            my $base = $resp->base || '';
            print "-> base URL: $base\n";
            my $data = $resp->content; # get the data.
            print "-> downloaded: " . length($data) . " bytes of $url\n";

            # find URLs using a link extorter. relevant ones
            # will be added to our todo list of downloadables.
            # this passes all the found links to findurls()
            # below, which determines if we should add a link
            # to our todo list, or ignore it due to filtering.
            HTML::LinkExtor->new(\&findurls, $base)->parse($data);

            ############################################################
            # add your own processing here. perhaps you'd like to add
            # a keyword search for the downloaded content in $data?
            ############################################################
        } else {
            $errors{$url} = $resp->message( );
            print "-> error: couldn't download URL: $url\n";
            delete $todo{$url};
        }

        # we're finished with this URL, so move it from
        # the todo list to the done list, and print a report.
        $done{$url} = $todo{$url}; delete $todo{$url};
        print "-> processed legal URLs: " . (scalar keys %done) . "\n";
        print "-> remaining URLs: " . (scalar keys %todo) . "\n";
        print "-" x 80 . "\n";
    } until ((scalar keys %todo) == 0);
}

# callback routine for HTML::LinkExtor. for every
# link we find in our downloaded content, we check
# to see if we've processed it before, then run it
# through a bevy of regexp rules (see the top of
# this script) to see if it belongs in the todo.
sub findurls {
    my ($tag, %links) = @_;
    return if $tag ne 'a';
    return unless $links{href};
    print "-> found URL: $links{href}\n";

    # already seen this URL, so move on.
    if (exists $done{$links{href}}   ||
        exists $errors{$links{href}} ||
        exists $rejects{$links{href}}) {
        print "--> I've seen this before: $links{href}\n"; return;
    }

    # now, run through our filters.
    unless (exists($todo{$links{href}})) {
        my ($ga, $gd, $la, $ld); # counters.
        foreach (@{$rules{global}{'allow'}}) { $ga++ if $links{href} =~ /$_/i; }
        foreach (@{$rules{global}{'deny'}})  { $gd++ if $links{href} =~ /$_/i; }
        foreach (@{$rules{$site}{'allow'}})  { $la++ if $links{href} =~ /$_/i; }
        foreach (@{$rules{$site}{'deny'}})   { $ld++ if $links{href} =~ /$_/i; }

        # if there were denials or NO allowances, we move on.
        if ($gd or $ld) {
            print "-> rejected URL: $links{href}\n";
            $rejects{$links{href}} = $links{href}; return;
        }
        unless ($ga or $la) {
            print "-> rejected URL: $links{href}\n";
            $rejects{$links{href}} = $links{href}; return;
        }

        # we passed our filters, so add it on the barby.
        print "-> added $links{href} to my TODO list\n";
        $todo{$links{href}} = $links{href};
    }
}



1.20.2. Running the Hack

Before sending the spider off, you'll need to make a decision regarding which part of the Yahoo! directory you want to crawl. If you're mainly interested in the United States and United Kingdom, inform the spider of that by using the -s option on the command line, like so:































% perl yspider.pl -s "us uk"
================================================================================
Started Yahoo! spider...
================================================================================
-> trying http://www.yahoo.com ...
-> base URL: http://www.yahoo.com/
-> downloaded: 28376 bytes of http://www.yahoo.com
-> found URL: http://www.yahoo.com/s/92802
-> added http://www.yahoo.com/s/92802 to my TODO list
-> found URL: http://www.yahoo.com/s/92803
...etc...
-> added http://www.yahoo.com/r/pv to my TODO list
-> processed legal URLs: 1
-> remaining URLs: 244
-> trying http://www.yahoo.com/r/fr ...
-> base URL: http://fr.yahoo.com/r/
-> downloaded: 32619 bytes of http://www.yahoo.com/r/fr
-> found URL: http://fr.yahoo.com/r/t/mu00
-> rejected URL: http://fr.yahoo.com/r/t/mu00
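A full crawl of even a single locale prints thousands of lines, so you may want to capture the report in a file as well as watch it scroll by. This is just an optional convenience, not part of the hack itself, and yspider-run.txt is an arbitrary filename:

% perl yspider.pl -s "us uk" | tee yspider-run.txt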





You can see a full list of locations available to you by asking for help:









% perl yspider.pl -h

Allowed values of -s are:

   argentina, asia, australia, brazil, canada, catalan, china,
   denmark, france, germany, hongkong, india, ireland, italy, japan, korea,
   mexico, newzealand, norway, singapore, spain, sweden,
   taiwan, uk, us, us_chinese, us_spanish



1.20.3. Hacking the Hack

The section you'll want to modify most contains the filters that determine how far the spider will go; by tweaking the allow and deny rules at the beginning of the script, you'll be able to better grab just the content you're interested in. If you want to make this spider even more generic, consider rewriting the configuration code so that it'll instead read a plain-text list of code names, start URLs, and allow and deny patterns. This can turn a Yahoo! spider into a general Internet spider.
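Here is a minimal sketch of what that plain-text configuration might look like; the filename sites.conf, its pipe-separated layout, and the demo script itself are assumptions made for this example rather than part of the original hack:

#!/usr/bin/perl -w
# sketch only: load spider targets from a plain-text file instead of
# hard-coding %ys and %rules. Each non-comment line of sites.conf holds
# four |-separated fields (patterns containing | would need a different
# delimiter):
#
#   code|start URL|allow patterns (space-separated)|deny patterns
#   uk|http://uk.yahoo.com|http:\/\/uk\.|search
use strict;

my (%ys, %rules);
open my $conf, '<', 'sites.conf' or die "can't read sites.conf: $!";
while (my $line = <$conf>) {
    chomp $line;
    next if $line =~ /^\s*(#|$)/;             # skip comments and blank lines.
    my ($code, $url, $allow, $deny) = split /\|/, $line, 4;
    $ys{$code}    = $url;
    $rules{$code} = { allow => [ split ' ', ($allow || '') ],
                      deny  => [ split ' ', ($deny  || '') ] };
}
close $conf;

# quick sanity check of what was loaded.
foreach my $code (sort keys %ys) {
    print "$code => $ys{$code} (allow: @{ $rules{$code}{allow} })\n";
}

Dropping something like this in place of the hard-coded %ys and %rules hashes means that adding a new portal becomes a one-line edit to a text file instead of a change to the script.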

Whenever you want to add code that extends the functionality of this spider (such as searching for keywords in a document, adding the downloaded content to a database, or otherwise repurposing it for your needs), include your own logic where specified by the hashed-out comment block.
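As a purely illustrative example, a keyword check dropped into that block might look like the following; the $keyword variable and the placeholder term 'weblog' are assumptions of this sketch, not part of the original script:

# inside walksite(), where the hashed-out comment block sits, $data
# holds the downloaded page and $url its address. $keyword is a
# hypothetical variable you would declare yourself.
my $keyword = 'weblog';                  # placeholder search term.
if ($data =~ /\Q$keyword\E/i) {          # case-insensitive literal match.
    print "-> MATCH: '$keyword' found in $url\n";
}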



1.20.4. See Also

If you're spidering Yahoo! because you want to start your own directory, you might want to consider the Open Directory Project (http://dmoz.org/about.html). Downloading the project's freely available directory data, all several hundred megs of it, will give you plenty of information to play with.

Jacek Artymiak







Hack 20. Browse the Yahoo! Directory



When you don't know exactly what you're looking for, the Yahoo! Directory might be able to help you find it.

Yahoo! started in 1994 as Jerry Yang and David Filo's organized list of favorite sites they'd found on the Web. Yahoo! has grown into much, much more, and many people think of Yahoo! as strictly a search company. Searching is great when you have a fairly good idea of what you're looking for, but the Yahoo! Directory is a great place when you'd rather browse.



1.21.1. Searching Versus Browsing

There are two different kinds of shoppers, and they illustrate the difference between searching and browsing. Some shoppers know exactly what they're after and want to find a store that carries that item, locate it in the store, and purchase it as quickly as possible. As with a web search, it helps to know a bit about what you're looking for if this is your style. Other shoppers want to explore a particular store, see what the store offers, and choose an item if the right one comes along. This style of browsing is suited to people who want to get a larger survey of items in a particular category before they necessarily decide what they're looking for.

Search forms are obviously built for searching. Directories are built for browsing. Unlike Yahoo! Search results, the Yahoo! Directory doesn't try to include every page it can find from across the Web. Instead, the sites listed in the directory are handpicked and reviewed by paid Yahoo! editors.


