4.6 Working with HTML Entity Data
The easiest choice for this kind of encoding and decoding is CAL9000. It is
straightforward to use, so we won’t repeat the detailed instructions here;
see Recipe 4.5.
To encode special characters, you enter the special characters in the box labeled “Plain
Text” and choose your encoding. You will want to enter a semicolon (;) in the “Trailing
Characters” box in CAL9000.
Decoding HTML Entity-encoded characters is the same process in reverse. Type or
paste the entity-encoded characters into the “encoded text box” and then click on the
“HTML Entity” entry under “Select Decoding Type.”
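If you prefer a script to CAL9000, the same round trip can be sketched in Python. The choice of Python and its standard html module is ours, not something this recipe depends on, and the sample string is made up for illustration.

```python
import html

# Hypothetical test input containing characters that are special in HTML.
plain = '<a href="x">Tom & Jerry</a>'

# html.escape() replaces &, < and >, plus both quote characters by default.
encoded = html.escape(plain)

# html.unescape() reverses the process for named and numeric entities.
decoded = html.unescape(encoded)

print(encoded)           # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
print(decoded == plain)  # True
```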
HTML entity encoding is an area rich with potential mistakes. The authors have seen
many web applications perform multiple rounds of entity encoding (e.g., the ampersand is encoded as &amp;amp;) in one part of the display and perform no entity encoding
in other parts of the display. Not only is it important to do correctly, but because
there are so many variations on HTML entity encoding, it is also very challenging to
write a web application that handles encoding correctly.
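One way to spot multiple rounds of encoding in a test is to decode a value repeatedly and count how many passes it takes to stop changing. The following is a minimal Python sketch of that idea; the helper name decode_rounds and the sample value are our own, and a value that legitimately contains entity-like text would be counted as well.

```python
import html

def decode_rounds(text, limit=5):
    """Decode entities repeatedly; return (rounds needed, fully decoded text)."""
    rounds = 0
    while rounds < limit:
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
        rounds += 1
    return rounds, text

once = html.escape("AT&T")   # encoded one time
twice = html.escape(once)    # encoded a second time by mistake

print(once, decode_rounds(once))    # AT&amp;T (1, 'AT&T')
print(twice, decode_rounds(twice))  # AT&amp;amp;T (2, 'AT&T')
```

A result of more than one round on data you supplied suggests that some layer encoded output that was already encoded.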
Variations on a theme
There are at least five or six legitimate, relatively well-known methods to encode the
same character using HTML entity encoding. Table 4-1 shows a few possible encodings
for a single character: the less-than sign (<).
Table 4-1. Variations on entity encoding
Encoding variation                         Encoded character
Decimal value (ASCII or ISO-8859-1)        &#60;
Hexadecimal value (ASCII or ISO-8859-1)    &#x3c;
Hexadecimal value (long integer)           &#x003c;
Hexadecimal value (64-bit integer)         &#x0000003c;
There are even a few more encoding methods that are specific to Internet Explorer.
Clearly, from a testing point of view, if you have boundary values or special values you
want to test, you have at least six to eight permutations of them: two or three URL-encoded versions and four or five entity-encoded versions.
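Generating these permutations by hand gets tedious, so a small helper is useful. The sketch below is in Python, and the function names and the particular zero-padding widths are our own assumptions for illustration; it produces variations of the kind listed in Table 4-1 plus two URL-encoded forms, and confirms that a standards-following decoder treats them all as the same character.

```python
import html
from urllib.parse import unquote

def entity_variants(ch):
    """Return several numeric entity encodings of one character."""
    code = ord(ch)
    return [
        f"&#{code};",       # decimal
        f"&#x{code:x};",    # hexadecimal
        f"&#x{code:04x};",  # hexadecimal, zero-padded to 4 digits (assumed width)
        f"&#x{code:08x};",  # hexadecimal, zero-padded to 8 digits (assumed width)
    ]

def url_variants(ch):
    """Return lowercase and uppercase URL-encoded forms of an ASCII character."""
    code = ord(ch)
    return [f"%{code:02x}", f"%{code:02X}"]

for variant in entity_variants("<"):
    print(variant, "->", html.unescape(variant))

for variant in url_variants("<"):
    print(variant, "->", unquote(variant))
```

Feeding each variant to the same input field tells you whether the application normalizes all of them or only the one form the developer thought of.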
64 | Chapter 4: Web-Oriented Data Encoding
The Devil Is in the Details
Handling encodings is so difficult for an application programmer partly because there are so many different places where encoding and decoding must
occur, and partly because so many unrelated components perform encoding and
decoding functions. Consider the most common, basic GET request. The web browser
takes a first pass at encoding data that it thinks needs encoding, but web browsers differ
in a few corner cases. Then the web server itself (e.g., IIS or Apache) may perform some
encoding on inbound data that the web browser left unencoded. Next, any platform
that the code runs on may try to interpret, encode, or decode some of the data stream.
For instance, .NET and Java web environments implicitly handle most kinds of URL
and entity encodings. Finally, the application software itself may encode or decode data
that is stored in a database, file, or other permanent storage. Trying to ensure that data
remains encoded in the correct form throughout this entire call sequence (from the
browser all the way into the application) is very difficult, to say the least. Root-cause
analysis when there is a problem is equally difficult.
4.7 Calculating Hashes
When your application uses hashes, checksums, or other integrity checks over its data,
you need to recognize them and possibly calculate them on test data. If you are unfamiliar with hashes, see the upcoming sidebar “What Are Hashes?”
As with other encoding tasks, you have at least three good choices: OpenSSL, CAL9000,
and Perl.
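MD5, using OpenSSL on Unix and on Windows:
% echo -n "my data" | openssl md5
c:\> type myfile.txt | openssl md5
SHA-1, using Perl:
use Digest::SHA1 qw(sha1);
# sha1() returns the raw 20-byte binary digest
$data = "my data";
$digest = sha1($data);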
What Are Hashes?
Hashes are one-way mathematical functions. Given any amount of input, they produce
exactly the same size output. Cryptographically strong hashes, the kind that are used
in our most important security functions, have several important properties:
• Preimage resistance: given a hash value, it should be hard to find a document or
input data that would produce that hash.
• Collision resistance: it should be hard to find two different documents or inputs
that have the same hash value. In particular, given a document or some input, it
should be hard to find another document or input with the same hash value.
In both of those properties, we say that something should be “hard to find.” We mean
that, even if it’s theoretically possible, it should be so time-consuming or so unlikely
that an attacker can’t use the property of the hash in a practical attack.
The MD5 case is shown using OpenSSL on Unix or on Windows. OpenSSL has an
equivalent sha1 command. Note that the -n is required on the Unix echo command to
prevent the newline character from being added on the end of your data. Although
Windows has an echo command, you can’t use it the same way because it has no
equivalent option to suppress the carriage-return/linefeed characters on the end of
the message you echo.
The SHA-1 case is shown as a Perl script, using the Digest::SHA1 module. There is an
equivalent Digest::MD5 module that works the same way for MD5 hashes.
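The effect of a stray newline is easy to demonstrate. The sketch below uses Python’s standard hashlib module (our choice of language, not one of the three tools named in this recipe) to hash the same sample data with and without trailing line-ending characters.

```python
import hashlib

data = b"my data"

# What "echo -n" pipes to openssl: the data and nothing else.
without_newline = hashlib.md5(data).hexdigest()

# The same data followed by a Unix newline, and by a carriage-return/linefeed pair.
with_lf = hashlib.md5(data + b"\n").hexdigest()
with_crlf = hashlib.md5(data + b"\r\n").hexdigest()

print("no newline:", without_newline)
print("with LF:   ", with_lf)
print("with CRLF: ", with_crlf)

# SHA-1 works the same way through hashlib.sha1().
print("sha1:      ", hashlib.sha1(data).hexdigest())
```

All three MD5 values differ, so if a hash you calculate does not match the one the application produces, an extra line ending is one of the first things to check.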
Note that there is no way to decode a hash. Hashes are mathematical digests that are
one-way. No matter how much data goes in, the hash produces exactly the same size
output.
MD5 hashes produce exactly 128 bits (16 bytes) of data. You might see this represented
in a few different ways:
• 32 hexadecimal characters.
• 24 Base-64 characters, such as PlnPFeQx5Jj+uwRfh//RSw==. You will see it this way
if the application takes the binary output of MD5 (the raw 128 binary bits) and then
Base-64 encodes it.
SHA-1 is a hash that always produces exactly 160 bits (20 bytes) of data. Like MD5,
you might see this represented in a few ways:
• 40 hexadecimal characters.
• 28 Base-64 characters.
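These lengths are what make hashes recognizable in test data. As a rough Python sketch (the helper name representations is ours), the following computes both hashes over sample data, prints each representation with its length, and checks that the Base-64 sample shown earlier decodes to 16 bytes, the size of an MD5 digest.

```python
import base64
import hashlib

def representations(algorithm, data):
    """Return the raw digest plus its hexadecimal and Base-64 text forms."""
    raw = hashlib.new(algorithm, data).digest()
    return {
        "bits": len(raw) * 8,
        "hex": raw.hex(),
        "base64": base64.b64encode(raw).decode("ascii"),
    }

for algorithm in ("md5", "sha1"):
    reps = representations(algorithm, b"my data")
    print(algorithm, reps["bits"], "bits")
    print("  hex    :", len(reps["hex"]), "characters:", reps["hex"])
    print("  base-64:", len(reps["base64"]), "characters:", reps["base64"])

# The Base-64 sample from the text decodes to 16 raw bytes.
sample = "PlnPFeQx5Jj+uwRfh//RSw=="
print(len(base64.b64decode(sample)), "bytes")
```

A length match is only a hint. A 32-character hexadecimal string is consistent with MD5, but any other 128-bit value looks the same.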
Hashes and Security
A common security mistake in application development is to store or transmit hashed
versions of passwords and consider them safe. Other common uses of hashes are to
hash credit card numbers, Social Security numbers, or other private information. The problem
with this approach, from a security point of view, is that hashes can be replayed just
like the passwords they represent. If the authenticator for an application is a user ID
and a SHA-1 hash of the password, the application may still be insecure. Capturing
and replaying the hash (though the actual password remains unknown to an attacker)
may be sufficient to authenticate. Be skeptical when you see hashed passwords or
hashes of other sensitive information. Often an attacker need not know the plain-text
information at all, because a captured and replayed hash will be accepted as authentic.
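To make the replay problem concrete, here is a deliberately flawed authenticator sketched in Python. Everything in it (the user ID, the password, the STORED table, and the authenticate function) is hypothetical; it stands in for any design where the client sends a hash and the server compares it to a stored hash.

```python
import hashlib

# Hypothetical server-side store: user ID -> SHA-1 hash of the password.
STORED = {"alice": hashlib.sha1(b"correct horse").hexdigest()}

def authenticate(user_id, password_hash):
    """Flawed design: the hash itself is the credential."""
    return STORED.get(user_id) == password_hash

# The legitimate client hashes the password before sending it.
legit_request = ("alice", hashlib.sha1(b"correct horse").hexdigest())

# An attacker who captures the request never learns the password,
# but replaying the captured values is enough to log in.
captured_user, captured_hash = legit_request
print(authenticate(captured_user, captured_hash))  # True
```

In a test, the equivalent step is to capture the hashed value from one session and submit it unchanged in another.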
4.8 Recognizing Time Formats
You are likely to see time represented in a lot of different ways. Recognizing a representation of time for what it is will help you build better test cases. Knowing
that a value is a time, and also knowing what the programmer’s fundamental assumptions might
have been when the code was written, makes it easier to write targeted test cases.
Obvious time formats encode the year, month, and day in familiar arrangements, providing either two or four digits for the year. Some include hours, minutes, and seconds,
possibly with a decimal and milliseconds. Table 4-2 shows several representations of
June 1, 2008, 5:32:11 p.m. and 844 milliseconds. Some of the formats do not represent
certain parts of the date or time. The unrepresentable parts are omitted as appropriate.
Table 4-2. Various representations of time
Encoding                                   Example
Unix time (seconds since Jan 1, 1970)      1212341531 (if the time is taken as UTC)
POSIX in “C” locale                        Sun Jun 1 17:32:11 2008
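Unix time is the format most easily mistaken for an arbitrary number. One rough way to recognize it is to check whether a numeric value, read as seconds or as milliseconds, lands close to a date you care about. The Python sketch below does that; the function name guess_unix_time, the ten-year window, and the treatment of the June 1, 2008 example as UTC are all our assumptions.

```python
from datetime import datetime, timedelta, timezone

def guess_unix_time(value, reference, window_days=3650):
    """Return (unit, datetime) if value looks like Unix time near reference, else None."""
    if not value.isdigit():
        return None
    number = int(value)
    window = timedelta(days=window_days)
    for unit, divisor in (("seconds", 1), ("milliseconds", 1000)):
        try:
            moment = datetime.fromtimestamp(number / divisor, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            continue  # not a representable date in this unit
        if abs(moment - reference) <= window:
            return unit, moment
    return None

# June 1, 2008, 5:32:11 p.m., assumed here to be UTC.
reference = datetime(2008, 6, 1, 17, 32, 11, tzinfo=timezone.utc)

print(guess_unix_time("1212341531", reference))     # seconds
print(guess_unix_time("1212341531844", reference))  # milliseconds
print(guess_unix_time("42", reference))             # None
```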
You may think that recognizing time is pretty obvious and not important to someone
testing web applications, especially for security. We would argue that it is actually very
important. The authors have seen many applications where time was considered to be
unpredictable by the developers. Time has been used in session IDs, temporary filenames, temporary passwords, and account numbers. As a simulated attacker, you
know that time is not unpredictable. As you plan “interesting” test cases for a given input
field, you can narrow down the set of possible test values dramatically if you know it
corresponds to a time value from the recent past or near future.
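To see how small the search space becomes, suppose (purely as an assumption for illustration) that an application builds a temporary filename or session ID from the current time formatted as YYYYMMDDHHMMSS. The Python sketch below enumerates every candidate value within a few seconds of a moment you observed; the function name candidate_tokens is ours.

```python
from datetime import datetime, timedelta, timezone

def candidate_tokens(observed, seconds_before=5, seconds_after=5):
    """Yield every second-resolution timestamp string in a window around observed."""
    start = observed - timedelta(seconds=seconds_before)
    for offset in range(seconds_before + seconds_after + 1):
        moment = start + timedelta(seconds=offset)
        yield moment.strftime("%Y%m%d%H%M%S")

# The moment at which we saw the application issue a value (assumed UTC).
observed = datetime(2008, 6, 1, 17, 32, 11, tzinfo=timezone.utc)
tokens = list(candidate_tokens(observed))

print(len(tokens))            # 11
print(tokens[0], tokens[-1])  # 20080601173206 20080601173216
```

Eleven guesses cover a ten-second window, which is a far cry from the unpredictability the developer was counting on.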
Milliseconds and Unpredictability
Never let anyone persuade you that millisecond values are unpredictable. Intuitively
one would expect that no one knows when a web request is going to be made. Thus,
if the software reads the clock and extracts just the millisecond value, each of the thousand millisecond values (0 to 999) should be equally probable, right? Your intuition
might say yes, but the true answer is no. It turns out that some values are much more
likely than others. Various factors (granularity of time-slicing in the operating system
kernel, whether Unix or Windows; clock granularity; interrupts; and more) make the
clock a very bad source of randomness. Read Chapter 10 in Viega and McGraw’s book
Building Secure Software (Addison-Wesley) for a more thorough discussion of this topic.
As a tester, you should strongly suspect any software system that is relying on some
time-based element to introduce unpredictability. If you discover such an element in
your software, you should immediately begin considering questions like “what if that
is actually guessable?” or “what if two supposedly random values come out the same?”
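A deliberately simplified model shows why clock granularity alone shrinks the space of millisecond values. The Python sketch below assumes a hypothetical clock that advances only in fixed ticks aligned to the start of each second; real clocks are messier, and the 10-millisecond tick is an illustrative figure rather than a measurement.

```python
def reportable_milliseconds(tick_ms):
    """Millisecond values (0-999) that a clock advancing every tick_ms can report."""
    return list(range(0, 1000, tick_ms))

for tick in (1, 10):
    values = reportable_milliseconds(tick)
    print(f"{tick} ms tick: {len(values)} possible values")
# 1 ms tick: 1000 possible values
# 10 ms tick: 100 possible values
```

Under this model, two requests that arrive within the same tick also read the same value, which is one way that two supposedly random values can come out the same.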
4.9 Encoding Time Values Programmatically
You have determined that your application uses time in some interesting way, and now
you want to generate specific values in specific formats.
Perl is a great tool for this job. You will need the Time::Local module for some manipulations of Unix time and the POSIX module for strftime. Both are standard modules.
The code in Example 4-3 shows you four different formats and how to calculate them.
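For readers who work in Python rather than Perl, roughly the same job can be done with the standard calendar and time modules, which play the roles of Time::Local and POSIX strftime. This is a sketch under our own assumptions: the four formats are chosen for illustration and are not necessarily the ones in Example 4-3, and the sample time is treated as UTC.

```python
import calendar
import time

# June 1, 2008, 5:32:11 p.m., treated as UTC (an assumption).
# Only the first six fields matter to calendar.timegm().
fields = (2008, 6, 1, 17, 32, 11, 0, 0, 0)

unix_seconds = calendar.timegm(fields)  # seconds since Jan 1, 1970
moment = time.gmtime(unix_seconds)      # back to a broken-down UTC time

formats = {
    "unix": str(unix_seconds),
    "iso": time.strftime("%Y-%m-%dT%H:%M:%SZ", moment),
    "compact": time.strftime("%Y%m%d%H%M%S", moment),
    "posix_c": time.asctime(moment),
}

print(formats["unix"])     # 1212341531
print(formats["iso"])      # 2008-06-01T17:32:11Z
print(formats["compact"])  # 20080601173211
print(formats["posix_c"])
```

To generate a boundary value instead, change the fields tuple (for example, to one second before midnight on the last day of a month) and reuse the same format strings.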