4.6 Working with HTML Entity Data


The easiest choice for this kind of encoding and decoding is CAL9000. We won’t repeat the detailed instructions here because the tool is straightforward to use; see Recipe 4.5 for them.

To encode special characters, you enter the special characters in the box labeled “Plain

Text” and choose your encoding. You will want to enter a semicolon (;) in the “Trailing

Characters” box in CAL9000.

Decoding HTML entity-encoded characters is the same process in reverse. Type or paste the entity-encoded characters into the “Encoded Text” box and then click on the “HTML Entity” entry under “Select Decoding Type.”


HTML entity encoding is an area rich with potential mistakes. The authors have seen many web applications perform multiple rounds of entity encoding (e.g., the ampersand is encoded as &amp;amp;) in one part of the display and perform no entity encoding in other parts of the display. Doing it correctly is important, and because there are so many variations on HTML entity encoding, it is very challenging to write a web application that handles encoding correctly.
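The double-encoding mistake is easy to reproduce. The following is a minimal Python sketch, not the book's own tooling (which is CAL9000 and Perl). It assumes the standard library's html.escape and html.unescape stand in for whatever encoder the application under test actually uses, and the sample string is made up for illustration.

```python
import html

raw = "Tom & Jerry <b>"          # hypothetical user-supplied text

once = html.escape(raw)          # correct: one round of entity encoding
twice = html.escape(once)        # the bug: already-encoded text is encoded again

print(once)
print(twice)

# One round of decoding recovers the original only from the singly encoded form.
restored_from_once = html.unescape(once)
restored_from_twice = html.unescape(twice)
print(restored_from_once == raw)
print(restored_from_twice == raw)
```

A test case built this way can check each part of a page separately: a field that displays the doubly encoded form was encoded twice, and a field that displays live markup was not encoded at all.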

Variations on a theme

There are at least five or six legitimate, relatively well-known methods to encode the

same character using HTML entity encoding. Table 4-1 shows a few possible encodings

for a single character: the less-than sign (<).

Table 4-1. Variations on entity encoding

Encoding variation                         Encoded character
Named entity                               &lt;
Decimal value (ASCII or ISO-8859-1)        &#60;
Hexadecimal value (ASCII or ISO-8859-1)    &#x3c;
Hexadecimal value (long integer)           &#x003c;
Hexadecimal value (64-bit integer)         &#x0000003c;
There are even a few more encoding methods that are specific to Internet Explorer.

Clearly, from a testing point of view, if you have boundary values or special values you want to test, you have at least six to eight permutations of them: two or three URL-encoded versions and four or five entity-encoded versions.
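To see that such permutations really are equivalent, the sketch below decodes several encodings of the less-than sign and confirms that each yields the same character. It is a Python illustration under stated assumptions: the zero-padded widths in the list are examples chosen here, and html.unescape and urllib.parse.unquote stand in for the decoders in a real browser or server, which may be stricter or more lenient.

```python
import html
from urllib.parse import unquote

# Entity-encoded forms of "<". The zero-padded widths are illustrative assumptions.
entity_variants = ["&lt;", "&#60;", "&#x3c;", "&#x003c;", "&#x0000003c;"]
# URL-encoded forms of the same character (upper- and lowercase hex digits).
url_variants = ["%3C", "%3c"]

decoded_entities = [html.unescape(v) for v in entity_variants]
decoded_urls = [unquote(v) for v in url_variants]

for variant, plain in zip(entity_variants + url_variants,
                          decoded_entities + decoded_urls):
    print(f"{variant!r} -> {plain!r}")
```

Each entry in these lists is a separate test input: an input filter that blocks one form may still pass another.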

64 | Chapter 4: Web-Oriented Data Encoding

The Devil Is in the Details

Handling encodings is difficult for an application programmer for two reasons: encoding and decoding must occur in many different places, and many unrelated components perform those functions. Consider the most common, basic GET request. The web browser

takes a first pass at encoding data that it thinks needs encoding, but web browsers differ

in a few corner cases. Then the web server itself (e.g., IIS or Apache) may perform some

encoding on inbound data that the web browser left unencoded. Next, any platform

that the code runs on may try to interpret, encode, or decode some of the data stream.

For instance, .NET and Java web environments implicitly handle most kinds of URL

and entity encodings. Finally, the application software itself may encode or decode data

that is stored in a database, file, or other permanent storage. Trying to ensure that data

remains encoded in the correct form throughout this entire call sequence (from the

browser all the way into the application) is very difficult, to say the least. Root-cause

analysis when there is a problem is equally difficult.

4.7 Calculating Hashes


When your application uses hashes, checksums, or other integrity checks over its data, you need to recognize them and possibly calculate them on test data. If you are unfamiliar with hashes, see the upcoming sidebar “What Are Hashes?”


As with other encoding tasks, you have at least three good choices: OpenSSL, CAL9000,

and Perl.


% echo -n "my data" | openssl md5

c:\> type myfile.txt | openssl md5



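use strict;
use warnings;
use Digest::SHA1 qw(sha1_hex);

my $data = "my data";
# sha1_hex() returns 40 hexadecimal characters; sha1() would return 20 raw bytes
my $digest = sha1_hex($data);
print "$digest\n";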


What Are Hashes?

Hashes are one-way mathematical functions. Given any amount of input, they produce

exactly the same size output. Cryptographically strong hashes, the kind that are used

in our most important security functions, have several important properties:

• Preimage resistance: given a hash value, it should be hard to find a document or input data that would produce that hash.

• Collision resistance: it should be hard to find two different documents or inputs that have the same hash value.

In both of those properties, we say that something should be “hard to find.” We mean

that, even if it’s theoretically possible, it should be so time-consuming or so unlikely

that an attacker can’t use the property of the hash in a practical attack.


The MD5 case is shown using OpenSSL on Unix or on Windows. OpenSSL has an

equivalent sha1 command. Note that the -n is required on the Unix echo command to prevent a newline character from being added to the end of your data. Although Windows has an echo command, you can’t use it the same way because it has no option to suppress the carriage-return/linefeed pair at the end of the message you give it.
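The effect of a stray newline is easy to demonstrate. The following Python sketch uses the standard hashlib module rather than OpenSSL or Perl, purely as an illustration. It hashes the same seven bytes with and without a trailing newline, which corresponds to running echo with and without -n on Unix.

```python
import hashlib

data = b"my data"

# Same bytes that `echo -n "my data"` sends to openssl.
md5_hex = hashlib.md5(data).hexdigest()
# Same bytes that a plain Unix `echo "my data"` sends: one extra newline.
md5_with_newline = hashlib.md5(data + b"\n").hexdigest()
# SHA-1 of the data without the newline.
sha1_hex = hashlib.sha1(data).hexdigest()

print(md5_hex)
print(md5_with_newline)
print(sha1_hex)
```

The two MD5 values have nothing visibly in common, so a hash computed over data with an accidental newline will never match the value the application expects.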

The SHA-1 case is shown as a Perl script, using the Digest::SHA1 module. There is an

equivalent Digest::MD5 module that works the same way for MD5 hashes.

Note that there is no way to decode a hash. Hashes are mathematical digests that are one-way. No matter how much data goes in, the hash produces exactly the same size output.


MD5 hashes

MD5 hashes produce exactly 128 bits (16 bytes) of data. You might see this represented

in a few different ways:

32 hexadecimal characters


24 Base-64 characters

PlnPFeQx5Jj+uwRfh//RSw==. You will see it this way if the application takes the binary output of MD5 (the raw 128 bits) and then Base-64 encodes it.

SHA-1 hashes

SHA-1 is a hash that always produces exactly 160 bits (20 bytes) of data. Like MD5,

you might see this represented in a few ways:


40 hexadecimal characters


28 Base-64 characters
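The character counts above follow directly from the digest sizes, and the two text forms are interchangeable. The Python sketch below is an illustration only: the helper name representations is hypothetical, and the input "my data" is reused from the earlier examples. It also converts the Base-64 MD5 value quoted in the text back to raw bytes and then to hexadecimal.

```python
import base64
import hashlib

def representations(digest):
    """Return the (hexadecimal, Base-64) text forms of a raw digest."""
    return digest.hex(), base64.b64encode(digest).decode("ascii")

md5_hex, md5_b64 = representations(hashlib.md5(b"my data").digest())
sha1_hex, sha1_b64 = representations(hashlib.sha1(b"my data").digest())
print(len(md5_hex), len(md5_b64))      # character counts for MD5
print(len(sha1_hex), len(sha1_b64))    # character counts for SHA-1

# Going the other way: a 24-character Base-64 value decodes to 16 raw bytes,
# which can be re-rendered as 32 hexadecimal characters.
raw = base64.b64decode("PlnPFeQx5Jj+uwRfh//RSw==")
print(len(raw), raw.hex())
```

When a test input or response contains a 24- or 28-character Base-64 string, decoding it and checking for 16 or 20 bytes is a quick way to decide whether it might be an MD5 or SHA-1 value.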


Hashes and Security

A common security mistake in application development is to store or transmit hashed

versions of passwords and consider them safe. Other common uses of hashes are to

hash credit cards, Social Security numbers, or other private information. The problem

with this approach, from a security point of view, is that hashes can be replayed just

like the passwords they represent. If the authenticator for an application is a user ID

and a SHA-1 hash of the password, the application may still be insecure. Capturing

and replaying the hash (though the actual password remains unknown to an attacker)

may be sufficient to authenticate. Be skeptical when you see hashed passwords or hashes of other sensitive information. Often an attacker need not know the plain-text information if a captured and replayed hash is accepted as authentic.
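The replay problem can be sketched in a few lines. Everything in the following Python example is hypothetical: the STORED table, the login_with_hash function, the user ID, and the password are invented to illustrate the flawed design described above, not taken from any real application.

```python
import hashlib
import hmac

# Hypothetical server-side store: user ID -> SHA-1 hex digest of the password.
STORED = {"alice": hashlib.sha1(b"correct horse").hexdigest()}

def login_with_hash(user_id, password_hash):
    """Flawed design: the client sends the hash, so the hash is the credential."""
    expected = STORED.get(user_id)
    return expected is not None and hmac.compare_digest(expected, password_hash)

# The legitimate client hashes the password before sending it.
sent_by_client = hashlib.sha1(b"correct horse").hexdigest()

# An attacker who captures that value never learns the password,
# yet replaying the captured hash authenticates just as well.
captured = sent_by_client
replay_accepted = login_with_hash("alice", captured)
guess_accepted = login_with_hash("alice", "0" * 40)
print(replay_accepted, guess_accepted)
```

In this design, hashing protects the password text but not the account, because the server cannot tell a replayed hash from a freshly computed one.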

4.8 Recognizing Time Formats


You are likely to see time represented in many different ways. Recognizing a representation of time for what it is will help you build better test cases. Knowing not only that a value is a time, but also what the programmer’s fundamental assumptions might have been when the code was written, makes it easier to write targeted test cases.


Obvious time formats encode the year, month, and day in familiar arrangements, providing either two or four digits for the year. Some include hours, minutes, and seconds,

possibly with a decimal and milliseconds. Table 4-2 shows several representations of

June 1, 2008, 5:32:11 p.m. and 844 milliseconds. Some of the formats do not represent

certain parts of the date or time. The unrepresentable parts are omitted as appropriate.

Table 4-2. Various representations of time


Example output





Unix time (Seconds since Jan 1, 1970)


POSIX in “C” locale

Sun Jun 1 17:32:11 2008
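To get a feel for how one moment looks in different formats, the following Python sketch renders June 1, 2008, 5:32:11 p.m. and 844 milliseconds in four ways. It is an illustration under stated assumptions: the moment is treated as UTC (a different time zone shifts the Unix time value), and the two compact layouts are examples of “familiar arrangements” chosen here.

```python
import calendar
from datetime import datetime, timezone

# June 1, 2008, 5:32:11.844 p.m., assumed here to be UTC.
moment = datetime(2008, 6, 1, 17, 32, 11, 844000, tzinfo=timezone.utc)

# Four-digit year down to seconds, then a decimal point and milliseconds.
compact = moment.strftime("%Y%m%d%H%M%S") + f".{moment.microsecond // 1000:03d}"
# Two-digit year, no seconds: the unrepresentable parts are simply dropped.
short = moment.strftime("%y%m%d%H%M")
# Whole seconds since Jan 1, 1970 (UTC).
unix_seconds = calendar.timegm(moment.utctimetuple())
# C-style text form with fixed English day and month names.
posix_c = moment.ctime()

print(compact)
print(short)
print(unix_seconds)
print(posix_c)
```

Note that the C-style form pads a single-digit day with an extra space, which matters if you compare strings exactly.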



You may think that recognizing time is pretty obvious and not important to someone

testing web applications, especially for security. We would argue that it is actually very

important. The authors have seen many applications where time was considered to be

unpredictable by the developers. Time has been used in session IDs, temporary filenames, temporary passwords, and account numbers. As a simulated attacker, you

know that time is not unpredictable. As we plan “interesting” test cases on a given input

field, we can narrow down the set of possible test values dramatically if we know it

corresponds to a time value from the recent past or recent future.
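How far the search space shrinks can be shown with a small sketch. The Python example below is hypothetical throughout: the function name candidate_timestamps, the observed value, and the 30-second window are assumptions made for illustration. It lists every whole-second Unix time near an observed moment as candidate test values.

```python
def candidate_timestamps(observed, window_seconds):
    """Return every whole-second Unix time within +/- window_seconds of observed."""
    return list(range(observed - window_seconds, observed + window_seconds + 1))

# Hypothetical: a response arrived at about this Unix time, and a field is
# suspected of embedding the server's clock reading as a decimal number.
observed = 1212341531
candidates = [str(t) for t in candidate_timestamps(observed, 30)]

print(len(candidates))                 # number of values to try
print(candidates[0], candidates[-1])   # first and last candidate
```

A field that looks like a ten-digit random number has billions of possible values, but if it is really a recent timestamp, a minute-wide window leaves only a few dozen to test.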

Milliseconds and Unpredictability

Never let anyone persuade you that millisecond values are unpredictable. Intuitively

one would expect that no one knows when a web request is going to be made. Thus,

if the software reads the clock and extracts just the millisecond value, each of the thousand millisecond values (0 to 999) should be equally probable, right? Your intuition

might say yes, but the true answer is no. It turns out that some values are much more

likely than others. Various factors (granularity of time-slicing in the operating system kernel, whether Unix or Windows; clock granularity; interrupts; and more) make the clock a very bad source of randomness. Read Chapter 10 in Viega and McGraw’s book Building Secure Software (Addison-Wesley) for a more thorough discussion of this topic.

As a tester, you should strongly suspect any software system that is relying on some

time-based element to introduce unpredictability. If you discover such an element in

your software, you should immediately begin considering questions like “what if that

is actually guessable?” or “what if two supposedly random values come out the same?”

4.9 Encoding Time Values Programmatically


You have determined that your application uses time in some interesting way, and now

you want to generate specific values in specific formats.


Perl is a great tool for this job. You will need the Time::Local module for some manipulations of Unix time and the POSIX module for strftime. Both are standard modules.

The code in Example 4-3 shows you four different formats and how to calculate them.
