6.2 Turning Spider Results into an Inventory


Figure 6-3. WebScarab spider results folder


After spidering the application, choose to save your work from within WebScarab.

Choose File → Save and enter a folder name. WebScarab will create a folder with the

name you give and populate it with a lot of information from your scan. Figure 6-3

shows a small example of what that hierarchy looks like.

Browse to that folder using Explorer (Windows), Finder (Mac OS), or a command line.

The file we are interested in is the conversationlog. It contains information on every

conversation the proxy had with the application during the spider process, including

result codes. Use grep to search the file for just the lines that begin with URL. Under

Unix, Linux, or Mac OS, the command is egrep '^URL' conversationlog. Under Windows, install either Cygwin (in which case you use the Unix command) or WinGrep, and

use it to search the file the same way. The result is one URL per line, as shown in

Example 6-1.

Example 6-1. Output from grep of the conversationlog file

URL: http://www.nova.org:80/

URL: http://www.nova.org:80/lib/exe/css.php
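To see the extraction in action, here is a minimal, self-contained sketch. The conversationlog contents below are invented for illustration (WebScarab's real file records many more fields per conversation), but the egrep invocation is the one from the text:

```shell
# Hypothetical, abbreviated conversationlog in the spirit of WebScarab's
# output; the real file records many more fields per conversation.
cat > conversationlog <<'EOF'
### Conversation : 1
METHOD: GET
URL: http://www.nova.org:80/
STATUS: 200 OK
### Conversation : 2
METHOD: GET
URL: http://www.nova.org:80/lib/exe/css.php
STATUS: 200 OK
EOF

# Keep only the lines that begin with URL, one URL per line
egrep '^URL' conversationlog
```

The caret (^) anchors the match to the beginning of the line, so a URL mentioned mid-line elsewhere in the file would not slip into the inventory.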

This data can be loaded into a spreadsheet, script, or other test automation framework.


You can use various tools to pare down the list of URLs and make them unique or

otherwise eliminate URLs that are not important for testing. For example, the cascading

style sheets (.css) files are not dynamic, nor are image files (URLs typically ending

in .jpg, .gif, and .png). Example 6-2 shows an extra egrep command, which eliminates

these static URLs from our inventory, and some of the resulting output.

Example 6-2. Eliminating static content with egrep

% egrep '^URL: ' conversationlog | egrep -v '\.(css|jpg|gif|png|txt)'

URL: http://www.nova.org:80/

URL: http://www.nova.org:80/lib/exe/css.php

URL: http://www.nova.org:80/lib/exe/css.php?print=1

URL: http://www.nova.org:80/lib/exe/js.php?edit=0&write=0

URL: http://www.nova.org:80/lib/exe/indexer.php?id=welcome&1188529927

URL: http://www.nova.org:80/public:support:index

URL: http://www.nova.org:80/?idx=public:support
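A slight refinement of the Example 6-2 filter is worth sketching. The URL list below is hypothetical, and the -i flag and the (\?|$) anchor are our own additions, not part of the example: -i also catches upper-case extensions such as .CSS, and the anchor requires the extension to fall at the end of the path (or just before a query string), so a dynamic page whose name merely contains one of the extensions is not dropped by accident:

```shell
# Hypothetical URL list in the format produced by egrep '^URL: '
cat > all-urls.txt <<'EOF'
URL: http://www.nova.org:80/
URL: http://www.nova.org:80/lib/images/logo.png
URL: http://www.nova.org:80/lib/styles/site.CSS
URL: http://www.nova.org:80/lib/exe/export.css.php
EOF

# -i makes the match case-insensitive; (\?|$) anchors the extension to the
# end of the path so export.css.php survives while site.CSS is filtered
egrep -iv '\.(css|jpe?g|gif|png|txt)(\?|$)' all-urls.txt
```

The unanchored pattern from Example 6-2 would have kept site.CSS (case mismatch) and wrongly discarded export.css.php (it contains ".css" mid-name); the anchored, case-insensitive variant handles both.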

We’re getting a lot of mileage out of regular expressions with egrep. They

allow us to describe complex patterns to match, such as “a file whose

name ends in txt.” A full treatment of regular expressions is beyond the

scope of this book, but there are several good references. Two worth having are Tony Stubblebine's Regular Expression
Pocket Reference and Jeffrey E. F. Friedl's Mastering Regular Expressions (both O'Reilly). If you have Perl installed, you probably have Perl's
documentation installed as well. Run perldoc perlre to see Perl's built-in manual on regular expressions.

Another useful technique for paring down the list is to eliminate requests that differ only in

their parameters. This means looking for requests that have a question mark (?) in the

URL and eliminating duplicates there. We clearly need to record them in our test plan

as pages that need extra scrutiny (i.e., we will need test cases that address all the different

parameters). At this level, however, we’re just trying to identify all the different pages

in our application.


6.3 Reducing the URLs to Test


If your web application is relatively small, the work in Example 6-2 may have cut the

list down to a manageable size. If your application is large, however, you may find that

you want a list that does not include duplicates that differ only in parameters.


Example 6-2 shows two pages, css.php and the root URL itself http://www.nova.org:80/, that appear twice, differing only in parameter composition. There are a couple good ways to strip the list down further. Our favorite is the Unix command cut because it can be very flexible about splitting input based on delimiters.

Let's assume we have saved the output from Example 6-2 into a file named URLs.txt. To start with, we might use a command like cut -d " " -f 2 URLs.txt to get rid of the URL: at the beginning of every line. The -d " " is how you tell cut that your delimiter is a space character, and -f 2 says you want the second field. Since there's just one space on the line, this works well. That's the first step.

We need to do the same trick, but using the question mark as the delimiter (again, there will be at most one per line). Using either a Windows or Unix command line, we can pipe the output of the first cut command into another; it's more efficient than creating a bunch of temporary files. The output of this command will yield a lot of duplicates: http://www.nova.org:80?idx=public:support will become http://www.nova.org:80, which is already in our list. We will eliminate all the duplicates we create this way, leaving just the URLs up to the question mark and those that had no question mark at all. Example 6-3 shows the two cut commands with sort and uniq, two more Unix commands. You have to sort the output for uniq to do its work (eliminating duplicates).

Example 6-3. Eliminating duplicate URLs
cut -d " " -f 2 URLs.txt | cut -d '?' -f 1 | sort | uniq > uniqURLs.txt
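The pipeline from Example 6-3 can be exercised end to end on a small, hypothetical URLs.txt; note that sort -u, used below, is a common shorthand that folds sort | uniq into one command:

```shell
# Hypothetical URLs.txt in the format produced by Example 6-2
cat > URLs.txt <<'EOF'
URL: http://www.nova.org:80/
URL: http://www.nova.org:80/lib/exe/css.php
URL: http://www.nova.org:80/lib/exe/css.php?print=1
URL: http://www.nova.org:80/?idx=public:support
EOF

# Drop the "URL:" field, truncate at the question mark, then deduplicate;
# sort -u folds sort | uniq into one step
cut -d ' ' -f 2 URLs.txt | cut -d '?' -f 1 | sort -u > uniqURLs.txt
cat uniqURLs.txt
```

The four input lines collapse to two unique pages, since both parameterized requests reduce to URLs already in the list.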

6.4 Using a Spreadsheet to Pare Down the List


Perhaps you don't have Cygwin installed on Windows, or you want a more Windows-oriented

way to process the URLs.



You could load the file in Microsoft Excel and tell it that your text file is a delimited

file. Import into Excel using the question mark character and space characters as your

delimiters. You will have one entire column that is just “URL:”, one column that is

your unique URLs, and one column that has all the parameters, if there were any. You

can easily eliminate the two undesired columns, giving you just a list of pages. Excel

has a Data → Filter function that will copy rows from one column that has duplicates

into another column, only copying unique entries.
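For readers who do have a Unix-style shell after all, the three-column split that Excel performs (on the space and question mark delimiters) can be mimicked with awk; the input line is taken from Example 6-2, and the column labels are our own:

```shell
# Splitting on space or '?' yields the same three columns Excel produces:
# the "URL:" label, the page, and the parameters (if any)
echo 'URL: http://www.nova.org:80/lib/exe/css.php?print=1' |
awk -F'[ ?]' '{ print "label: " $1; print "page: " $2; print "params: " $3 }'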


If you’re more familiar with Excel than Unix-style “pipelines,” this is probably faster

for you. If you’re already organizing test cases in Excel, it might be more conducive to

your workflow.

This exercise, following our example spidering of http://www.nova.org/, reduced an

initial list of 77 unique URLs (including static pictures and style sheets) down to 27

dynamically generated pages that contained some amount of business logic. That list

of 27 unique URLs would be a first pass at an estimate of 100% coverage. Investigation

might determine that some of the pages are, in fact, static and do not need to be tested.

Others might actually be duplicates (for example http://www.example.com/ and http://

www.example.com/index.html are typically exactly the same). In the end, we produce

a good starting point for full coverage tests.

6.5 Mirroring a Website with LWP


You don't just want to know where the pages are; you also want to store a copy of the

contents of the pages themselves. You will actually download the web pages (whether

static or programmatically generated) and store them on your hard disk. We call this

mirroring, as opposed to spidering.* Although there are a number of web mirroring

programs—some commercial, some free—we are going to provide a single Perl script

as an example.

* There is no official or widely accepted term here. Some spiders also make local copies (mirrors) of the pages

they traverse. We are making the distinction on whether or not the program intentionally creates a copy of

what it spiders. Spiders traverse a website looking at all links and following as many as they can. Mirroring

programs do all that, and then they save copies of what they saw on local disk for offline perusal. WebScarab

spiders without mirroring. The lwp-rget command mirrors.
