Appendix

G

Problem Characters and Sample Test Input

This appendix contains sample input that is likely to cause misbehavior in many types of applications. Exact usage varies by application: some are sensitive to these cases in a URL, others in a text input field, and others tolerate the data and behave correctly. Many applications will also have their own sets of problematic input, which may include these cases along with others unique to the application.

Characters from the Single-Byte Character Sets

Control Characters

The control characters in Table G.1 are often left off code page charts because these first 32 code points are common to all code pages and are nonprintable.

Table G.1 Control Characters

Unicode Point / Abbreviation / Keystroke / Name / Comments
[U+0000] / NUL / Ctrl+@ / NULL / This needs to be tested in every place where data can be input or stored; many systems will crash or fail when this is encountered because they are not expecting this; code needs to handle these situations gracefully.
[U+0001] / SOH / Ctrl+A / START OF HEADING
[U+0002] / STX / Ctrl+B / START OF TEXT
[U+0003] / ETX / Ctrl+C / END OF TEXT
[U+0004] / EOT / Ctrl+D / END OF TRANSMISSION
[U+0005] / ENQ / Ctrl+E / ENQUIRY
[U+0006] / ACK / Ctrl+F / ACKNOWLEDGE
[U+0007] / BEL / Ctrl+G / BELL / (Beep)—caused teletype machines to ring a bell; will cause many common terminal/term emulation programs to beep.
[U+0008] / BS / Ctrl+H / BACKSPACE
[U+0009] / HT / Ctrl+I / HORIZONTAL TAB
[U+000A] / LF / Ctrl+J / LINE FEED
[U+000B] / VT / Ctrl+K / VERTICAL TAB
[U+000C] / FF / Ctrl+L / FORM FEED
[U+000D] / CR / Ctrl+M / CARRIAGE RETURN
[U+000E] / SO / Ctrl+N / SHIFT OUT / Switches output device to alternate character set.
[U+000F] / SI / Ctrl+O / SHIFT IN / Switches output device to default character set.
[U+0010] / DLE / Ctrl+P / DATA LINK ESCAPE
[U+0011] / DC1 / Ctrl+Q / DEVICE CONTROL 1 / Also the XON command for a modem soft handshake.
[U+0012] / DC2 / Ctrl+R / DEVICE CONTROL 2
[U+0013] / DC3 / Ctrl+S / DEVICE CONTROL 3 / Also the XOFF command for the modem soft handshake.
[U+0014] / DC4 / Ctrl+T / DEVICE CONTROL 4
[U+0015] / NAK / Ctrl+U / NEGATIVE ACKNOWLEDGE
[U+0016] / SYN / Ctrl+V / SYNCHRONOUS IDLE
[U+0017] / ETB / Ctrl+W / END OF TRANSMISSION BLOCK
[U+0018] / CAN / Ctrl+X / CANCEL
[U+0019] / EM / Ctrl+Y / END OF MEDIUM
[U+001A] / SUB / Ctrl+Z / SUBSTITUTE
[U+001B] / ESC / Ctrl+[ / ESCAPE
[U+001C] / FS / Ctrl+\ / FILE SEPARATOR
[U+001D] / GS / Ctrl+] / GROUP SEPARATOR
[U+001E] / RS / Ctrl+^ / RECORD SEPARATOR
[U+001F] / US / Ctrl+_ / UNIT SEPARATOR
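One way to exercise the code points in Table G.1 is to place each one at the start, middle, and end of an otherwise ordinary string, and also to submit it alone. The Python sketch below assumes a hypothetical base value of "test"; the name `control_char_cases` is illustrative and not part of any library.

```python
def control_char_cases(base="test"):
    """Yield (code_point, position, test_string) for U+0000 through U+001F."""
    mid = len(base) // 2
    for code_point in range(0x20):
        ch = chr(code_point)
        yield code_point, "first", ch + base
        yield code_point, "middle", base[:mid] + ch + base[mid:]
        yield code_point, "last", base + ch
        yield code_point, "only", ch


# 32 control characters, four placements each.
cases = list(control_char_cases())
```

Each generated string can then be fed to whatever input surface is under test (a form field, a URL parameter, a file name, and so on).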

IBM PC Keyboard Scan Codes

For special key combinations (for example, Alt+S, F5, and so on), a special two-character escape sequence is used. Depending on the language, the escape character can be either Escape [U+001B] or NUL [U+0000]. I will assume that NUL is being used in Table G.2. Having these codes can be very useful for automation or other places where you need to send particular keys.

Table G.2 IBM PC Keyboard Scan Codes

Key Combination / Escape Sequence
Alt+A / [U+0000][U+001E]
Alt+B / [U+0000][U+0030]
Alt+C / [U+0000][U+002E]
Alt+D / [U+0000][U+0020]
Alt+E / [U+0000][U+0012]
Alt+F / [U+0000][U+0021]
Alt+G / [U+0000][U+0022]
Alt+H / [U+0000][U+0023]
Alt+I / [U+0000][U+0017]
Alt+J / [U+0000][U+0024]
Alt+K / [U+0000][U+0025]
Alt+L / [U+0000][U+0026]
Alt+M / [U+0000][U+0032]
Alt+N / [U+0000][U+0031]
Alt+O / [U+0000][U+0018]
Alt+P / [U+0000][U+0019]
Alt+Q / [U+0000][U+0010]
Alt+R / [U+0000][U+0013]
Alt+S / [U+0000][U+001F]
Alt+T / [U+0000][U+0014]
Alt+U / [U+0000][U+0016]
Alt+V / [U+0000][U+002F]
Alt+W / [U+0000][U+0011]
Alt+X / [U+0000][U+002D]
Alt+Y / [U+0000][U+0015]
Alt+Z / [U+0000][U+002C]
PGUP / [U+0000][U+0049]
PGDN / [U+0000][U+0051]
HOME / [U+0000][U+0047]
END / [U+0000][U+004F]
UPARRW / [U+0000][U+0048]
DNARRW / [U+0000][U+0050]
LFTARRW / [U+0000][U+004B]
RTARRW / [U+0000][U+004D]
F1 / [U+0000][U+003B]
F2 / [U+0000][U+003C]
F3 / [U+0000][U+003D]
F4 / [U+0000][U+003E]
F5 / [U+0000][U+003F]
F6 / [U+0000][U+0040]
F7 / [U+0000][U+0041]
F8 / [U+0000][U+0042]
F9 / [U+0000][U+0043]
F10 / [U+0000][U+0044]
F11 / [U+0000][U+0085]
F12 / [U+0000][U+0086]
Alt+F1 / [U+0000][U+0068]
Alt+F2 / [U+0000][U+0069]
Alt+F3 / [U+0000][U+006A]
Alt+F4 / [U+0000][U+006B]
Alt+F5 / [U+0000][U+006C]
Alt+F6 / [U+0000][U+006D]
Alt+F7 / [U+0000][U+006E]
Alt+F8 / [U+0000][U+006F]
Alt+F9 / [U+0000][U+0070]
Alt+F10 / [U+0000][U+0071]
Alt+F11 / [U+0000][U+008B]
Alt+F12 / [U+0000][U+008C]
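For automation, the two-character sequences in Table G.2 can be built from a lookup table. The sketch below copies only a few entries from the table; `SCAN_CODES` and `key_sequence` are hypothetical names, and the prefix defaults to NUL as assumed above.

```python
# A few entries copied from Table G.2; extend as needed.
SCAN_CODES = {
    "Alt+A": 0x1E,
    "Alt+X": 0x2D,
    "F1": 0x3B,
    "F10": 0x44,
    "HOME": 0x47,
    "END": 0x4F,
}


def key_sequence(key, prefix=0x00):
    """Return the two-byte escape sequence for a key, prefix byte first."""
    return bytes([prefix, SCAN_CODES[key]])
```

Passing `prefix=0x1B` produces the Escape-prefixed variant for environments that use Escape instead of NUL.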

Character Combinations

Testing each of the control characters listed earlier in this appendix on its own is one type of test case. However, an application can handle a character correctly by itself and still treat it specially in certain combinations. The following combination of control characters should be tested.

[U+000D][U+000A] (CRLF, a carriage return followed by a line feed) can mean several things, such as the end of a packet segment. Two of these in a row also need to be tested as input or within a stream of input, because many protocols treat two in a row as the end of a transmission.
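The effect of an embedded double CRLF can be illustrated with a toy parser. The sketch below is not any real protocol implementation; `split_message` and the header names are made up for the example, and the parser simply splits at the first blank line the way many text protocols do.

```python
CRLF = b"\r\n"


def split_message(raw):
    """Split at the first blank line (CRLF CRLF) into header lines and body."""
    head, _sep, body = raw.partition(CRLF + CRLF)
    return head.split(CRLF), body


# Hypothetical user-supplied value containing an embedded CRLF CRLF.
user_value = b"abc" + CRLF + CRLF + b"injected"
raw = (b"X-Name: " + user_value + CRLF +
       b"X-Other: 1" + CRLF + CRLF +
       b"real body")

headers, body = split_message(raw)
```

Here the parser sees only one header line, and everything after the injected blank line, including the second header, is treated as body.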

Lower ASCII

Table G.3 provides some information about each potentially problematic lower ASCII character. Depending on the usage and context, these characters can mean very different things. The notations are just suggestions about how a character could be a sensitive or unwise character.

Table G.3 Lower ASCII Problematic Characters

Character / Code page point / Unicode point / Name / Comment
(space) / 0x20 / [U+0020] / Space / Also a C reserved char; very useful for turning up problems if it is the first, last, or only char entered; problematic in a URL
! / 0x21 / [U+0021] / Exclamation mark / Problematic in a URL
" / 0x22 / [U+0022] / Double quotes / A C reserved char and delimiter; problematic in a URL
# / 0x23 / [U+0023] / Number sign / May be a delimiter; problematic in a URL
$ / 0x24 / [U+0024] / Dollar sign / A reserved character in a query component
% / 0x25 / [U+0025] / Percent / A C reserved char or a delimiter
& / 0x26 / [U+0026] / Ampersand / Character in a query component; problematic in a URL
' / 0x27 / [U+0027] / Apostrophe / A C reserved char and unwise to leave unescaped; problematic in a URL
( / 0x28 / [U+0028] / Left parenthesis / Problematic in a URL
) / 0x29 / [U+0029] / Right parenthesis / Problematic in a URL
* / 0x2A / [U+002A] / Asterisk
+ / 0x2B / [U+002B] / Plus sign / Character in a query component; problematic in a URL
, / 0x2C / [U+002C] / Comma / Character in a query component; problematic in a URL
- / 0x2D / [U+002D] / Hyphen-minus
. / 0x2E / [U+002E] / Full stop (period) / Especially as last char of a file name
/ / 0x2F / [U+002F] / Solidus (slash) / Especially as last char of a file name; also a C reserved char or reserved in a query component; problematic in a URL
: / 0x3A / [U+003A] / Colon / A reserved character in a query component; problematic in a URL
; / 0x3B / [U+003B] / Semicolon / A valid char in a URL, however can be problematic; may want to escape anyway; reserved within a query component, can be a parameter delimiter.
< / 0x3C / [U+003C] / Less-than sign / Can be a delimiter or part of HTML or script; problematic in a URL
= / 0x3D / [U+003D] / Equals sign / Reserved character in a query component; problematic in a URL
> / 0x3E / [U+003E] / Greater-than sign / Can be a delimiter or part of HTML or script; problematic in a URL
? / 0x3F / [U+003F] / Question mark / Reserved character in a query component; problematic in a URL
@ / 0x40 / [U+0040] / Commercial At (at sign) / Reserved character in a query component; problematic in a URL unless part of the authentication
[ / 0x5B / [U+005B] / Left square bracket / An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
\ / 0x5C / [U+005C] / Reverse solidus (backslash) / Especially as last char of a file name; an unwise character to leave unescaped; problematic in a URL
] / 0x5D / [U+005D] / Right square bracket / An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
^ / 0x5E / [U+005E] / Circumflex accent / An unwise character to leave unescaped; problematic in a URL
_ / 0x5F / [U+005F] / Low line / An unwise character to leave unescaped; problematic in a URL
` / 0x60 / [U+0060] / Grave accent / An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
{ / 0x7B / [U+007B] / Left curly brace / An unwise character to leave unescaped; problematic in a URL
| / 0x7C / [U+007C] / Vertical line (pipe) / An unwise character to leave unescaped; problematic in a URL; also problematic in RTL
} / 0x7D / [U+007D] / Right curly brace
~ / 0x7E / [U+007E] / Tilde
(DEL) / 0x7F / [U+007F] / Delete
« / 0xAB / [U+00AB] / Left-pointing double angle
_ / 0x1C / [U+001C] / File Separator
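Many of the characters flagged as "problematic in a URL" in Table G.3 become harmless once they are percent-encoded. As a minimal sketch, Python's standard `urllib.parse.quote` with `safe=""` encodes everything except letters, digits, and `_.-~`; the helper name `encode_component` and the sample string are assumptions made for illustration.

```python
from urllib.parse import quote, unquote


def encode_component(value):
    """Percent-encode everything except unreserved characters."""
    return quote(value, safe="")


sample = "a b&c=d/e?f#g"
encoded = encode_component(sample)   # 'a%20b%26c%3Dd%2Fe%3Ff%23g'
round_trip = unquote(encoded)
```

A useful test is to submit both the raw and the encoded form and confirm that the application treats them consistently.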

Extended Range Problem Characters

Table G.4 contains potentially problematic extended range characters from the single-byte code pages.

Table G.4 Extended Range Problem Characters

Character / Unicode point / Name / Comment
ö / [U+00F6] / Latin Small Letter O with Diaeresis / Can be a problem in filenames on DBCS systems.
§ / [U+00A7] / Section Sign
ß / [U+00DF] / Latin Small Letter Sharp S
å / [U+00E5] / Latin Small Letter A with Ring Above / DOS delete marker. Mostly significant if first char in a string; essentially this is a Ctrl+z.
€ / [U+20AC] / Euro Currency Symbol
ª / [U+00AA] / Feminine Ordinal Indicator / This can sometimes be interpreted by Novell’s NetWare as a disconnect signal or other similar low-level command. If your software will be used with NetWare, you will want to plan your tests to include these.
® / [U+00AE] / Registered Sign / This can sometimes be interpreted by Novell’s NetWare as a disconnect signal or other similar low-level command. If your software will be used with NetWare, you will want to plan your tests to include these.
¿ / [U+00BF] / Inverted Question Mark / This can sometimes be interpreted by Novell’s NetWare as a disconnect signal or other similar low-level command. If your software will be used with NetWare, you will want to plan your tests to include these.
İ / [U+0130]; 0xDD on 1254 code page / Latin Capital Letter I with Dot Above / Only found in Turkish on the 1254 code page; a system that does not handle it properly may visibly convert it to a different character.
ı / [U+0131]; 0xFD on 1254 code page / Latin Small Dotless Letter I / Only found in Turkish on the 1254 code page; a system that does not handle it properly may visibly convert it to a different character.
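The conversions mentioned for the two Turkish letters in Table G.4 are easy to observe. The sketch below uses Python's built-in, locale-independent case mappings and its `cp1254` codec; `naive_equal` is a hypothetical helper standing in for the kind of case-insensitive comparison found in much existing code.

```python
dotted_capital = "\u0130"   # Latin capital letter I with dot above
dotless_small = "\u0131"    # Latin small letter dotless i

# Byte values on code page 1254, as listed in Table G.4.
cp1254_bytes = (dotted_capital.encode("cp1254"), dotless_small.encode("cp1254"))

# Default case mappings, with no Turkish locale rules applied.
lowered = dotted_capital.lower()   # "i" followed by U+0307 COMBINING DOT ABOVE
uppered = dotless_small.upper()    # plain ASCII "I"


def naive_equal(a, b):
    """Case-insensitive compare via upper(), as much existing code does."""
    return a.upper() == b.upper()
```

Because the dotless i uppercases to an ordinary "I", a string containing it can compare equal to an ASCII-only identifier, which is worth testing wherever names are matched case-insensitively.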

Problem Character Combinations

Table G.5 contains problem character combinations from the lower ASCII, the extended range (or upper ASCII), and then combinations of the two.

Table G.5 Problem Character Combinations

Characters / Unicode points / Names / Comment
:: / [U+003A][U+003A] / Two colons
~1: / [U+007E][U+0031][U+003A] / A tilde, a number (any number), and a colon
.. / [U+002E][U+002E] / Two periods / This can present security problems by allowing access to files otherwise not accessible.
$$ / [U+0024][U+0024] / Two dollar signs
:€? / [U+003A][U+20AC][U+FFFD] / Colon, Euro symbol, and [U+FFFD] / Although FFFD is not a “real” character, this can present problems.
++ / [U+002B][U+002B] / Two pluses
%0 / [U+0025][U+0030] / Percent sign, number zero / Can cause problems in Perl scripts.
\n / [U+005C][U+006E] / Backslash, letter n / Escape sequence for new line in JavaScript.
\b / [U+005C][U+0062] / Backslash, letter b / Escape sequence for backspace in JavaScript.
%20 / [U+0025][U+0032][U+0030] / Percent sign, number two, number zero / URL encoded sequence for a space.
00:\ / [U+0030][U+0030][U+003A][U+005C] / Two number zeros, colon, backslash
& / [U+0026] / Ampersand
< / [U+003C] / Less-than sign
> / [U+003E] / Greater-than sign
= / [U+003D] / Equals sign
Ü¢£ / [U+00DC][U+00A2][U+00A3] / Letter U with diaeresis, cent sign, pound (currency) sign / High literals
FFFFFFFF / [U+0046][U+0046][U+0046][U+0046][U+0046][U+0046][U+0046][U+0046] / Eight letter F / Input as a value, especially a regkey.
::$DATA / [U+003A][U+003A][U+0024][U+0044][U+0041][U+0054][U+0041] / Two colons, dollar sign, letters D, A, T, A / Indicates data stream.
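The ".." row in Table G.5 is the classic path traversal case. A minimal sketch of a lexical containment check follows; it uses `posixpath` so the behavior is the same on any platform, the function name `resolves_inside` and the directory `/srv/files` are hypothetical, and it does not resolve symbolic links.

```python
import posixpath


def resolves_inside(base, user_path):
    """Return True if base/user_path stays under base once '..' is collapsed."""
    base = posixpath.normpath(base)
    candidate = posixpath.normpath(posixpath.join(base, user_path))
    return posixpath.commonpath([base, candidate]) == base


ok = resolves_inside("/srv/files", "docs/a.txt")
escaped = resolves_inside("/srv/files", "../../etc/passwd")
```

A check like this only helps if it runs after every decoding step, which is why the encoded variants of ".." in the next table matter.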

Lower ASCII Character Combination Verification Cases

Table G.6 contains test cases for verifying that your application properly handles various lower ASCII characters. Whereas the previous set of character combinations was chosen for its potential to break an application, these are chosen for their ability to prove that the application properly handles valid lower ASCII input.

Table G.6 Character Combination Verification Cases

Characters / Unicode point / Comment
aAzZ / [U+0061][U+0041][U+007A][U+005A] / Tests that basic alphabetic characters are accepted.
1234 / [U+0031][U+0032][U+0033][U+0034] / Tests that common numbers are accepted.
12aZ / [U+0031][U+0032][U+0061][U+005A] / Tests that numbers and letters are accepted, starting with numbers.
aZ12 / [U+0061][U+005A][U+0031][U+0032] / Tests that letters and numbers are accepted, ending with numbers.
~!;:?/* / [U+007E][U+0021][U+003B][U+003A][U+003F][U+002F][U+002A] / Tests that common symbols are accepted.
/../ / [U+002F][U+002E][U+002E][U+002F] / Tests symbols, but in an arrangement that can be interpreted as a file path.
..%255c.. / [U+002E][U+002E][U+0025][U+0032][U+0035][U+0035][U+0063][U+002E][U+002E] / Test case for URL canonicalization.
..%%35%63.. / [U+002E][U+002E][U+0025][U+0025][U+0033][U+0035][U+0025][U+0036][U+0033][U+002E][U+002E] / Test case for URL canonicalization.
..%%35c.. / [U+002E][U+002E][U+0025][U+0025][U+0033][U+0035][U+0063][U+002E][U+002E] / Test case for URL canonicalization.
..%25%35%63.. / [U+002E][U+002E][U+0025][U+0032][U+0035][U+0025][U+0033][U+0035][U+0025][U+0036][U+0033][U+002E][U+002E] / Test case for URL canonicalization.
..%252f.. / [U+002E][U+002E][U+0025][U+0032][U+0035][U+0032][U+0066][U+002E][U+002E] / Test case for URL canonicalization.
..%c0%2f.. / [U+002E][U+002E][U+0025][U+0063][U+0030][U+0025][U+0032][U+0066][U+002E][U+002E] / Test case for URL canonicalization.
..%c0%af.. / [U+002E][U+002E][U+0025][U+0063][U+0030][U+0025][U+0061][U+0066][U+002E][U+002E] / Test case for URL canonicalization.
..%c1%1c.. / [U+002E][U+002E][U+0025][U+0063][U+0031][U+0025][U+0031][U+0063][U+002E][U+002E] / Test case for URL canonicalization.
..%c1%9c.. / [U+002E][U+002E][U+0025][U+0063][U+0031][U+0025][U+0039][U+0063][U+002E][U+002E] / Test case for URL canonicalization.
/À®./ / [U+002F][U+00C0][U+00AE][U+002E][U+002F] / Used with the previous test, specifically to test parsers—if the previous input is not an allowed sequence, then this should probably not be an allowed sequence.
\\?\C:\foo.txt / [U+005C][U+005C][U+003F][U+005C][U+0043][U+003A][U+005C][U+0066][U+006F][U+006F][U+002E][U+0074][U+0078][U+0074] / Tests the assumption that the local file location has the second character of a colon; NT specific.
\\127.0.0.1\C$\ / [U+005C][U+005C][U+0031][U+0032][U+0037][U+002E][U+0030][U+002E][U+0030][U+002E][U+0031][U+005C][U+0043][U+0024][U+005C] / Tests the assumption that the local file location has the second character of a colon; refers to the UNC localhost.
&lt; / [U+0026][U+006C][U+0074][U+003B] / HTML sequence for the less-than sign.
&nbsp; / [U+0026][U+006E][U+0062][U+0073][U+0070][U+003B] / HTML sequence for a non-breaking space.
<br> / [U+003C][U+0062][U+0072][U+003E] / HTML tag for a break.
&#65; / [U+0026][U+0023][U+0036][U+0035][U+003B] / Decimal HTML sequence for the letter A.
&#x0041; / [U+0026][U+0023][U+0078][U+0030][U+0030][U+0034][U+0031][U+003B] / Similar to previous example, but this is the hexadecimal HTML sequence for the letter A.
0xf / [U+0030][U+0078][U+0066] / May be assumed to be the hexadecimal reference to a number, in this case it would be 15.
0xa / [U+0030][U+0078][U+0061] / May be assumed that this is the hexadecimal reference to another number, in this case it would be converted to 10.
%UFF3C / [U+0025][U+0055][U+0046][U+0046][U+0033][U+0043] / URL encoded DBCS backslash.
Iiİı / [U+0049][U+0069][U+0130][U+0131] / Tests the two Latin letter I's and the two extra Turkish I's.
<script>alert('Hello')</script> / [U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Script will pop up a Hello alert box if it is executed—should not be executed.
'><script>alert('Hello')</script> / [U+0027][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to the previous example, except this will attempt to close a tag before the script.
"><script>alert('Hello')</script> / [U+0022][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to the previous example; this will attempt to close a tag before the script.
<Script>alert('Hello')</Script> / [U+003C][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Using mixed case in the script, testing for an exact string match.
<sCript>alert('Hello')</sCript> / [U+003C][U+0073][U+0043][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0043][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to the previous example, using mixed case in the script, testing for an exact string match.
<SCRIPT>alert('Hello')</SCRIPT> / [U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+003E] / Similar to the previous example, using all capitals in the script, testing for an exact string match.
&#60;script&#62;alert('Hello')&#60;&#47;script&#62; / [U+0026][U+0023][U+0036][U+0030][U+003B][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036][U+0032][U+003B][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+0026][U+0023][U+0036][U+0030][U+003B][U+0026][U+0023][U+0034][U+0037][U+003B][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+0026][U+0023][U+0036][U+0032][U+003B] / Similar to the original script example, except this string has the symbols in their decimal HTML reference.
%22<script%20for=window %20event=%22onload()%22> document.write(%22Hello%22);document.close();</script> Hello%22);document.close();</script>.write(%22Hello%22) ;document.close();</script> / [U+0025][U+0032][U+0032][U+003E][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+0025][U+0032][U+0030][U+0066][U+006F][U+0072][U+003D][U+0077][U+0069][U+006E][U+0064][U+006F][U+0077][U+0020][U+0025][U+0032][U+0030][U+0065][U+0076][U+0065][U+006E][U+0074][U+003D][U+0025][U+0032][U+0032][U+006F][U+006E][U+006C][U+006F][U+0061][U+0064][U+0028][U+0029][U+0025][U+0032][U+0032][U+003E][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0077][U+0072][U+0069][U+0074][U+0065][U+0028][U+0025][U+0032][U+0032][U+0048][U+0065][U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0048][U+0065][U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+002E][U+0077][U+0072][U+0069][U+0074][U+0065][U+0028][U+0025][U+0032][U+0032][U+0048][U+0065][U+006C][U+006C][U+006F][U+0025][U+0032][U+0032][U+0029][U+003B][U+0064][U+006F][U+0063][U+0075][U+006D][U+0065][U+006E][U+0074][U+002E][U+0063][U+006C][U+006F][U+0073][U+0065][U+0028][U+0029][U+003B][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to the previous example, except this has all quotes and spaces URL escaped.
<script>(unencode("<script>alert('Hello')</script>"))</script> / [U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to previous examples, except this attempts to use the unencode function to get script to execute.
blah<script>(unencode("<script>alert('Hello')</script>"))</script> / [U+0062][U+006C][U+0061][U+0068][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to above examples, except this attempts to use the unencode function to get script to execute.
blah'<script>(unencode("<script>alert('Hello')</script>"))</script> / [U+0062][U+006C][U+0061][U+0068][U+0027][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to previous examples, except this attempts to use the unencode function to get script to execute and a single quote.
blah"<script>(unencode("<script>alert('Hello')</script>"))</script> / [U+0062][U+006C][U+0061][U+0068][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0028][U+0075][U+006E][U+0065][U+006E][U+0063][U+006F][U+0064][U+0065][U+0028][U+0022][U+003C][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0027][U+0048][U+0065][U+006C][U+006C][U+006F][U+0027][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E][U+0022][U+0029][U+0029][U+003C][U+002F][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+003E] / Similar to previous examples, except this attempts to use the unencode function to get script to execute and a double quote.
<SCRIPT LANGUAGE="VBScript"> MsgBox "Hello!" </SCRIPT> / [U+003C][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+0020][U+004C][U+0041][U+004E][U+0047][U+0055][U+0041][U+0047][U+0045][U+003D][U+0022][U+0056][U+0042][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+0022][U+003E][U+0020][U+004D][U+0073][U+0067][U+0042][U+006F][U+0078][U+0020][U+0022][U+0048][U+0065][U+006C][U+006C][U+006F][U+0021][U+0022][U+0020][U+003C][U+002F][U+0053][U+0043][U+0052][U+0049][U+0050][U+0054][U+003E] / VBScript of the previous example—alert box will pop up if it is executed.
<a href="JavaScript:alert()">link</a> / [U+003C][U+0061][U+0020][U+0068][U+0072][U+0065][U+0066][U+003D][U+0022][U+004A][U+0061][U+0076][U+0061][U+0053][U+0063][U+0072][U+0069][U+0070][U+0074][U+003A][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+0029][U+0022][U+003E][U+006C][U+0069][U+006E][U+006B][U+003C][U+002F][U+0061][U+003E]
‹script›alert(‘Hello‘)‹⁄script› / [U+2039][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+203A][U+0061][U+006C][U+0065][U+0072][U+0074][U+0028][U+2018][U+0048][U+0065][U+006C][U+006C][U+006F][U+2018][U+0029][U+2039][U+2044][U+0073][U+0063][U+0072][U+0069][U+0070][U+0074][U+203A] / Symbols have been replaced with their high-bit counterparts.
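The URL canonicalization rows in Table G.6 rely on a component decoding more than once, or decoding bytes that a strict decoder would reject. As a rough sketch using Python's standard `urllib.parse` (the helper name `decode_until_stable` is an assumption for illustration), repeated decoding shows how many passes each input needs before the dangerous character appears:

```python
from urllib.parse import unquote, unquote_to_bytes


def decode_until_stable(value, max_rounds=8):
    """Percent-decode repeatedly; return (final_value, rounds_needed)."""
    rounds = 0
    while rounds < max_rounds:
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
        rounds += 1
    return value, rounds


double_encoded = ["..%255c..", "..%%35%63..", "..%%35c..",
                  "..%25%35%63..", "..%252f.."]
results = {case: decode_until_stable(case) for case in double_encoded}

# %c0%af is an overlong (invalid) UTF-8 encoding of "/".
# A strict UTF-8 decoder rejects these bytes; a lenient one may yield "/".
overlong = unquote_to_bytes("..%c0%af..")
```

With this decoder, each of the double-encoded inputs needs two passes to reach `..\..` or `../..`. An application that validates after one pass and decodes again later is the situation these cases are meant to expose.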

HTML tags can include script where it may not be anticipated. Because these tags, and others, can carry script in their attributes, they cannot be considered safe. The following lines show how script can appear in HTML tags that look safe.

<img src="JavaScript:alert()">img src</img>

<bgsound src="JavaScript:alert()">bgsound src</bgsound>

<iframe src="JavaScript:alert()">iframe src</iframe>

<table background="JavaScript:alert()">table background</table>

<object data="JavaScript:alert()">object data</object>

<frameset onload="JavaScript:alert()">frameset onload</frameset>

<body onload="JavaScript:alert()">body onload</body>

<body background="JavaScript:alert()">body background</body>
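The mixed-case script rows and the attribute-based examples above both defeat filters that look for one exact string. The sketch below contrasts a deliberately weak, hypothetical filter (`naive_filter_allows`) with output encoding through Python's standard `html.escape`; it is an illustration of the test idea, not a recommended sanitizer.

```python
import html


def naive_filter_allows(text):
    """A deliberately weak filter: exact, case-sensitive match on '<script>'."""
    return "<script>" not in text


payloads = [
    "<script>alert('Hello')</script>",
    "<Script>alert('Hello')</Script>",
    "<SCRIPT>alert('Hello')</SCRIPT>",
    '<img src="JavaScript:alert()">',
]

# Everything except the all-lowercase payload slips past the weak filter.
slipped_through = [p for p in payloads if naive_filter_allows(p)]

# Encoding on output neutralizes the markup regardless of case or tag.
escaped = [html.escape(p) for p in payloads]
```

When testing, the interesting question is which of the two behaviors the application exhibits for each row of the table.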

Upper ASCII Character Combinations

In Table G.7 you will find upper ASCII (extended range) character combinations for use in verifying that your application can handle various valid upper ASCII input.

Table G.7 Upper ASCII Character Combinations

Characters / Unicode point / Comment
öÜß / [U+00F6][U+00DC][U+00DF] / High literals
Ü¢£ / [U+00DC][U+00A2][U+00A3] / High literals
©® / [U+00A0][U+00A9][U+00AE] / Problem literals
¿¾Õ / [U+00BF][U+00BE][U+00D5] / Regional literals
&><" / [U+0026][U+003E][U+003C][U+0022] / Named entities
©®¾¿Õ / [U+00A0][U+00A9][U+00AE][U+00BE][U+00BF][U+00D5] / Literals
åE5å / [U+00E5][U+0045][U+0035][U+00E5] / Can be mistaken for the DOS delete mark
€\$\ / [U+20AC][U+005C][U+0024][U+005C]
â€™ / [U+00E2][U+20AC][U+2122]

Diacritics

Table G.8 contains combining marks that can cause serious problems and have no ANSI equivalent. They are typed in combination with another character to alter it (for example, typed with c [U+0063] to create ç).

Table G.8 Diacritics

Unicode point / Name
[U+0333] / Combining double lowline
[U+033F] / Combining double overline
[U+0327] / Combining cedilla
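Combining marks change the length and the byte sequence of a string without changing how many characters a user thinks were typed. The sketch below uses Python's standard `unicodedata` module to show that the cedilla pair collapses to a single precomposed letter under NFC normalization while the other two marks from Table G.8 stay separate; the base letter "a" is an arbitrary choice for the example.

```python
import unicodedata

cedilla = "c\u0327"        # c + COMBINING CEDILLA
double_low = "a\u0333"     # a + COMBINING DOUBLE LOW LINE
double_over = "a\u033f"    # a + COMBINING DOUBLE OVERLINE

# NFC composes c + cedilla into the single code point U+00E7.
composed = unicodedata.normalize("NFC", cedilla)

# Length after NFC for each sample string.
lengths = {s: len(unicodedata.normalize("NFC", s))
           for s in (cedilla, double_low, double_over)}
```

Length limits, comparisons, and storage should be tested with both the composed and the decomposed forms.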

High-Bit Characters

The characters listed in Table G.9 are different from their low-bit counterparts and often end up converted to those counterparts when the software cannot handle them. For instance, try taking script and substituting the corresponding high-bit characters to see whether a filter allows them through and another component then downgrades them, with the end result that script is executed. These characters can also be problematic on their own as input.

Table G.9 High-Bit Characters

Characters / Unicode point / Name
[U+00AD] / Soft hyphen (SHY)
‘ / [U+2018] / Single opening quote
’ / [U+2019] / Single closing quote
“ / [U+201C] / Double opening quote
” / [U+201D] / Double closing quote
´ / [U+00B4] / Acute accent
¸ / [U+00B8] / Cedilla
[U+00A0] / Non-Breaking Space (NBSP)
© / [U+00A9] / Copyright
® / [U+00AE] / Registered Mark
™ / [U+2122] / Trademark
– / [U+2013] / En-dash
— / [U+2014] / Em-dash
… / [U+2026] / Ellipsis
⁄ / [U+2044] / Fraction Slash
‹ / [U+2039] / Single Left-Pointing Angle
› / [U+203A] / Single Right-Pointing Angle
′ / [U+2032] / Prime
″ / [U+2033] / Double Prime
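The filter-then-downgrade scenario described for Table G.9 can be simulated in a few lines. In the sketch below, `DOWNGRADE` is a hypothetical best-fit table written for the example (real conversions depend on the platform and code page), and `filter_allows` is a stand-in for an input filter that only looks for ASCII angle brackets.

```python
# Hypothetical best-fit table; real conversions depend on platform and code page.
DOWNGRADE = str.maketrans({
    "\u2039": "<", "\u203a": ">",
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2044": "/",
})


def filter_allows(text):
    """Stand-in input filter that only looks for ASCII angle brackets."""
    return "<" not in text and ">" not in text


# The high-bit script string from Table G.6.
payload = "\u2039script\u203aalert(\u2018Hello\u2018)\u2039\u2044script\u203a"

passed_filter = filter_allows(payload)
downgraded = payload.translate(DOWNGRADE)
```

The payload passes the filter untouched, and the later conversion turns it into ordinary script markup, which is the sequence of events the test is looking for.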

Characters from Multibyte Character Sets

The rest of the tables in this appendix deal with double-byte characters and single-byte characters from the multibyte code pages.

Boundary Cases

Table G.10 contains characters for testing the first and last characters of the various multibyte code page ranges.

Table G.10 Boundary Cases for the Multibyte Code Pages

Character / Unicode point; code page bytes / Comment
(ideographic space) / [U+3000]; [81/40] in 932, [A1/A1] in 949 and 936, [A1/40] in 950 / Ideographic space; beginning of first DBCS range on 932 code page
滌 / [U+6EEC]; [9F/FC] in 932 / End of first DBCS range on 932 code page
｡ / [U+FF61]; [A1] in 932 / Beginning of Kana (single-byte range) on 932 code page
ﾟ / [U+FF9F]; [DF] in 932 / End of Kana
漾 / [U+6F3E]; [E0/40] in 932 / Beginning of second DBCS range on 932 code page
黑 / [U+9ED1]; [FC/4B] in 932 / End of second DBCS range on 932 code page
(private-use character) / [U+E4C6]; [A1/40] in 936 code page / Beginning of CHS 936 code page
(private-use character) / [U+E4C5]; [FE/FE] in 936 code page / End of CHS 936 code page
(private-use character) / [U+EEB8]; [81/40] in 950 code page / Beginning of CHT 950 code page
(private-use character) / [U+E310]; [FE/FE] in 950 code page / End of CHT 950 code page
갂 / [U+AC02]; [81/41] in 949 code page / Beginning of Korean 949 code page
詰 / [U+8A70]; [FD/FE] in 949 code page / End of Korean 949 code page

Testing Individual Bytes that Make up the Double-Byte Character

Since the double-byte characters consist of 2 bytes read in individually, either one of the bytes could be mistaken for a special lower ASCII character. Because of this, you need to look at the special meaning of the lower ASCII characters and take the code point that they occupy to identify double-byte characters that have that code point as either a leading byte or a trailing byte (see Tables G.11 through G.16).

Table G.11 Lead Byte Is 81

Character / Unicode code point / Code point
ー / [U+30FC] / [81/5B] on 932 code page
‐ / [U+2010] / [81/5D] on 932 code page
＼ / [U+FF3C] / [81/5F] on 932 code page
＋ / [U+FF0B] / [81/7B] on 932 code page
－ / [U+FF0D] / [81/7C] on 932 code page
± / [U+00B1] / [81/7D] on 932 code page
× / [U+00D7] / [81/7E] on 932 code page
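The trailing bytes in Table G.11 all fall on lower ASCII punctuation, which is what makes these characters useful test input. As a sketch, assuming Python's built-in `cp932` codec matches the code page used here, the hypothetical helper `risky_bytes` below flags double-byte characters whose second byte is one of the ASCII characters `[ \ ] ^ _ { | } ~` or the grave accent.

```python
# Byte values of ASCII characters that often have special meaning.
ASCII_SPECIALS = set(b"[\\]^_`{|}~")


def risky_bytes(text, encoding="cp932"):
    """Return (char, encoded_bytes) pairs whose trailing byte is an ASCII special."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) == 2 and encoded[1] in ASCII_SPECIALS:
            hits.append((ch, encoded))
    return hits


# Characters from Table G.11 plus an ordinary hiragana letter for contrast.
sample = "\u30fc\u2010\uff3c\uff0b\uff0d\u00b1\u00d7\u3042"
hits = risky_bytes(sample)
```

Code that scans a byte string for "[" or "_" without tracking lead bytes will find a match inside these characters, so they should be tried wherever the application parses delimiters.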

Table G.12 Trailing Byte is 5C (ANSI Backslash Character—Need to Use as First, Middle, and Last Character in a String)