6.2 Turning Spider Results into an Inventory
Figure 6-3. WebScarab spider results folder (image omitted: a folder listing showing paired conversation files 1-request, 1-response, 2-request, 2-response, and so on)
After spidering the application, save your work from within WebScarab: choose File → Save and enter a folder name. WebScarab will create a folder with the
name you give and populate it with a lot of information from your scan. Figure 6-3
shows a small example of what that hierarchy looks like.
Browse to that folder using Explorer (Windows), Finder (Mac OS), or a command line.
The file we are interested in is the conversationlog. It contains information on every
conversation the proxy had with the application during the spider process, including
result codes. Use grep to search the file for just the lines that begin with URL. Under
Unix, Linux, or Mac OS, the command is egrep '^URL' conversationlog. Under Windows, install either Cygwin (in which case you use the same Unix command) or WinGrep, and
use it to search the file the same way. The result is one URL per line, as shown in Example 6-1.

Example 6-1. Output from grep of the conversationlog file
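Hypothetical output, patterned on the http://www.nova.org site used as the running example (its css.php page and idx=public:support parameter appear later in this chapter; the other paths are invented for illustration):

URL: http://www.nova.org:80/
URL: http://www.nova.org:80?idx=public:support
URL: http://www.nova.org:80/css.php
URL: http://www.nova.org:80/images/banner.gif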
This data can be loaded into a spreadsheet, script, or other test automation framework.
You can use various tools to pare down the list of URLs and make them unique or
otherwise eliminate URLs that are not important for testing. For example, cascading style sheet (.css) files are not dynamic, nor are image files (URLs typically ending
in .jpg, .gif, and .png). Example 6-2 shows an extra egrep command, which eliminates
these static URLs from our inventory, and some of the resulting output.
Example 6-2. Eliminating static content with egrep
% egrep '^URL: ' conversationlog | egrep -v '\.(css|jpg|gif|png|txt)'
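Hypothetical output again (the duplicate css.php and root-URL entries anticipate the discussion in the next section; the exact parameters are invented):

URL: http://www.nova.org:80/
URL: http://www.nova.org:80?idx=public:support
URL: http://www.nova.org:80/css.php
URL: http://www.nova.org:80/css.php?pagestyle=default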
We’re getting a lot of mileage out of regular expressions with egrep. They allow us to describe complex patterns to match, such as “a file whose name ends in txt.” A full treatment of regular expressions is beyond the scope of this book, but two good references are Tony Stubblebine’s Regular Expression Pocket Reference and Jeffrey E. F. Friedl’s Mastering Regular Expressions (both O’Reilly). If you have Perl installed, you probably have Perl’s documentation installed as well; run perldoc perlre to see Perl’s built-in manual on regular expressions.
Another useful way to pare down the list is to eliminate requests that differ only in their parameters. This means looking for requests that have a question mark (?) in the URL and eliminating the duplicates among them. We clearly need to record such pages in our test plan as pages that need extra scrutiny (i.e., we will need test cases that address all the different parameters). At this level, however, we’re just trying to identify all the different pages in our application.
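One way to flag them, a sketch of ours rather than a command from the original text, is a pipeline that counts how many variants of each parameterized URL the spider recorded in the conversationlog:

% egrep '^URL: .*\?' conversationlog | cut -d '?' -f 1 | sort | uniq -c

Each output line is a count followed by a page, so appending | sort -rn floats the pages with the most parameter variations, the ones deserving the most test cases, to the top.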
6.3 Reducing the URLs to Test
If your web application is relatively small, the work in Example 6-2 may have cut the list down to a manageable size. If your application is large, however, you may find that you want a list that does not include duplicates that differ only in parameters. Example 6-2 shows two pages, css.php and the root URL itself http://www.nova.org:80/, that appear twice, differing only in parameter composition. There are a couple of good ways to strip the list down further. Our favorite is the Unix command cut because it can be very flexible about splitting input based on delimiters.
Let’s assume we have saved the output from Example 6-2 into a file named URLs.txt. To start, we might use a command like cut -d " " -f 2 URLs.txt to get rid of the URL: at the beginning of every line. The -d " " tells cut that your delimiter is a space character, and -f 2 says you want the second field. Since there’s just one space on each line, this works well. That’s the first step.
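To see the effect, run the first cut by itself. With the hypothetical lines shown earlier for Example 6-2, the URL: prefix drops away:

% cut -d " " -f 2 URLs.txt | head -3
http://www.nova.org:80/
http://www.nova.org:80?idx=public:support
http://www.nova.org:80/css.php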
We need to do the same trick a second time, using the question mark as the delimiter (again, there will be at most one question mark per line). Using either a Windows or Unix command line, we can pipe the output of the first cut command into a second one rather than creating a bunch of
temporary files. The output of this command will yield a lot of duplicates: http://
www.nova.org:80?idx=public:support will become http://www.nova.org:80, which is
already in our list. We will eliminate all the duplicates we create this way, leaving just
the URLs up to the question mark and those that had no question mark at all. Example 6-3 shows the two cut commands with sort and uniq, two more Unix commands.
You have to sort the output for uniq to do its work (eliminating duplicates).
Example 6-3. Eliminating duplicate URLs
cut -d " " -f 2 URLs.txt | cut -d '?' -f 1 | sort | uniq > uniqURLs.txt
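Most modern Unix systems also accept sort -u, which folds the sort and uniq steps into one. The following variant (an equivalent alternative, not something the pipeline requires) produces the same uniqURLs.txt:

cut -d " " -f 2 URLs.txt | cut -d '?' -f 1 | sort -u > uniqURLs.txt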
6.4 Using a Spreadsheet to Pare Down the List
Perhaps you don’t have Cygwin installed on Windows, or you want a more Windows-oriented way to process the URLs.
You could load the file in Microsoft Excel and tell it that your text file is a delimited
file. Import into Excel using the question mark character and space characters as your
delimiters. You will have one entire column that is just “URL:”, one column containing the URLs themselves, and one column holding the parameters, if there were any. You can easily eliminate the two undesired columns, giving you just a list of pages. Excel’s Data → Filter → Advanced Filter can then copy that column to another location while keeping only the unique entries.
If you’re more familiar with Excel than Unix-style “pipelines,” this is probably faster for you. If you’re already organizing test cases in Excel, it may also fit your overall workflow better.
This exercise, following our example spidering of http://www.nova.org/, reduced an
initial list of 77 unique URLs (including static pictures and style sheets) down to 27
dynamically generated pages that contained some amount of business logic. That list
of 27 unique URLs would be a first pass at an estimate of 100% coverage. Investigation might determine that some of the pages are, in fact, static and do not need to be tested. Others might actually be duplicates (for example, http://www.example.com/ and http://www.example.com/index.html are typically exactly the same page). In the end, we produce a good starting point for full-coverage tests.
6.5 Mirroring a Website with LWP
You don’t just want to know where the pages are; you also want to store a copy of the pages’ contents. You will actually download the web pages (whether static or programmatically generated) and store them on your hard disk. We call this mirroring, as opposed to spidering.* Although there are a number of web mirroring programs, some commercial and some free, we are going to provide a single Perl script as an example.
* There is no official or widely accepted term here. Some spiders also make local copies (mirrors) of the pages
they traverse. We are making the distinction on whether or not the program intentionally creates a copy of
what it spiders. Spiders traverse a website looking at all links and following as many as they can. Mirroring
programs do all that, and then they save copies of what they saw on local disk for offline perusal. WebScarab
spiders without mirroring. The lwp-rget command mirrors.
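As an illustration of the core idea, here is a minimal sketch of ours (not the script this section goes on to provide; the starting URL and output filename are placeholders). It fetches one page with Perl's LWP, saves it to disk, and prints the links a real mirror would follow next:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;

# Placeholder starting point; substitute the site you are testing.
my $base = 'http://www.example.com/';

# Fetch the page; get() returns undef on failure.
my $content = get($base);
die "Could not fetch $base\n" unless defined $content;

# Save a copy to local disk: the step that makes this mirroring
# rather than mere spidering.
open my $fh, '>', 'index.html' or die "Cannot write index.html: $!\n";
print $fh $content;
close $fh;

# Extract links. Passing $base makes relative links come back as
# absolute URLs; a real mirror would fetch each of these in turn.
my @links;
my $extor = HTML::LinkExtor->new(
    sub {
        my ($tag, %attrs) = @_;
        push @links, $attrs{href} if $tag eq 'a' && $attrs{href};
    },
    $base,
);
$extor->parse($content);
$extor->eof;
print "$_\n" for @links;

A real mirroring script adds a queue of discovered links, depth limits, and a check that each link stays on the site being tested.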