File Format - Wikipedia, the Free Encyclopedia

Magic numbers and File Formats

In computer programming, the term magic number has multiple meanings. It could refer to:

· a constant used to identify a file format or protocol;

· an unnamed and/or ill-documented numerical constant; or

· distinctive debug values or GUIDs, etc.

Magic number origin

This type of magic number, the file format indicator, first appeared in early Seventh Edition source code of the Unix operating system. Although it has lost its original meaning, the term magic number has become part of the computer industry lexicon.

When Unix was ported to one of the first DEC PDP-11/20s it did not have memory protection and, therefore, early versions of Unix used the relocatable memory reference model.[1] Thus, pre-Sixth Edition Unix versions read an executable file into memory and jumped to the first low memory address of the program, relative address zero. With the development of paged versions of Unix, a header was created to describe the executable image components. Also, a branch instruction was inserted as the first word of the header to skip the header and start the program. In this way a program could be run in the older relocatable memory reference (regular) mode or in paged mode. As more executable formats were developed, new constants were added by incrementing the branch offset.[2]

In the Sixth Edition source code of the Unix program loader, the exec() function read the executable (binary) image from the file system. The first 8 bytes of the file were a header containing the sizes of the program (text) and initialized (global) data areas. Also, the first 16-bit word of the header was compared to two constants to determine if the executable image contained relocatable memory references (normal), the newly implemented paged read-only executable image, or the separated instruction and data paged image.[3] There was no mention of the dual role of the header constant, but the high order byte of the constant was, in fact, the operation code for the PDP-11 branch instruction (000407 or 0x0107). Adding seven to the program counter showed that if this constant was executed, it would branch the Unix exec() service over the executable image eight byte header and start the program.

Since the Sixth and Seventh Editions of Unix employed paging code, the dual role of the header constant was hidden. That is, the exec() service read the executable file header (meta) data into a kernel space buffer, but read the executable image into user space, thereby not using the constant's branching feature. Magic number creation was implemented in the Unix linker and loader and magic number branching was probably still used in the suite of stand-alone diagnostic programs that came with the Sixth and Seventh Editions. Thus, the header constant did provide an illusion and met the criteria for magic.

In Version Seven Unix, the header constant was not tested directly, but assigned to a variable labeled ux_mag[4] and subsequently referred to as the magic number. Given that there were approximately 10,000 lines of code and many constants employed in these early Unix versions, this indeed was a curious name for a constant, almost as curious as the "You are not expected to understand this."[1] comment used in the context switching section of the Version Six program manager. Probably because of its uniqueness, the term magic number came to mean executable format type, then expanded to mean file system type, and expanded again to mean any strongly typed file.

Magic numbers in files

Magic numbers are common in programs across many operating systems. Magic numbers implement strongly typed data and are a form of in-band signaling to the controlling program that reads the data type(s) at program run-time. Many files have such constants that identify the contained data. Detecting such constants in files is a simple and effective way of distinguishing between many file formats and can yield further run-time information.

Some examples:

· Compiled Java class files (bytecode) start with 0xCAFEBABE.

· GIF image files have the ASCII code for 'GIF89a' (0x474946383961) or 'GIF87a' (0x474946383761).

· JPEG image files begin with 0xFFD8 and end with 0xFFD9. JPEG/JFIF files contain the ASCII code for 'JFIF' (0x4A464946) as a null terminated string. JPEG/Exif files contain the ASCII code for 'Exif' (0x45786966) also as a null terminated string, followed by more metadata about the file.

· PNG image files begin with an 8-byte signature, \211 P N G \r \n \032 \n (0x89504E470D0A1A0A), which identifies the file as a PNG file and allows immediate detection of common file transfer problems. The signature contains several newline characters so that unwarranted automated newline conversion is detected, for example when the file is transferred over FTP in "ASCII" transfer mode instead of "binary" mode.

· Standard MIDI music files have the ASCII code for 'MThd' (0x4D546864) followed by more metadata.

· Unix script files usually start with a shebang, '#!' (0x2321) followed by the path to an interpreter.

· PostScript files and programs start with '%!' (0x2521).

· PDF files start with '%PDF'.

· Old MS-DOS .exe files and the newer Microsoft Windows PE (Portable Executable) .exe files start with the ASCII string 'MZ' (0x4D5A), the initials of the designer of the file format, Mark Zbikowski. The definition allows 'ZM' as well but it is quite uncommon.

· The Berkeley Fast File System superblock format is identified as either 0x19540119 or 0x011954 depending on version; both represent the birthday of author Marshall Kirk McKusick.

· Executables for the Game Boy and Game Boy Advance handheld video game systems have a 48-byte or 156-byte magic number, respectively, at a fixed spot in the header. This magic number encodes a bitmap of the Nintendo logo.

· Zip files begin with 'PK', the initials of Phil Katz, author of DOS compression utility PKZIP.

· Old Fat binaries (containing code for both 68K processors and PowerPC processors) on Classic Mac OS contained the ASCII code for 'Joy!' (0x4A6F7921) as a prefix.

· TIFF files begin with either "II" or "MM" depending on the byte order (II for Intel, or little endian, MM for Motorola, or big endian), followed by 0x2A00 or 0x002A (decimal 42 as a 2-byte integer in Intel or Motorola byte ordering).

· Unicode text files encoded in UTF-16 often start with the Byte Order Mark to detect endianness (0xFEFF for big endian and 0xFFFE for little endian). UTF-8 text files often start with the UTF-8 encoding of the same character, 0xEFBBBF.
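The transfer-problem detection built into the PNG signature, mentioned in the list above, can be sketched in Python. This is a minimal illustration rather than part of any PNG library: the function name diagnose_png_signature and its return strings are hypothetical, and the corrupted forms checked are only the ones a simple newline conversion or a 7-bit channel would be expected to produce.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # \211 P N G \r \n \032 \n


def diagnose_png_signature(head: bytes) -> str:
    """Classify the first bytes of a file that is supposed to be a PNG.

    Hypothetical helper; the return strings are for illustration only.
    """
    if head.startswith(PNG_SIGNATURE):
        return "ok"
    # A transfer that rewrote CR LF as LF collapses "\r\n" into "\n".
    if head.startswith(b"\x89PNG\n\x1a\n"):
        return "CRLF converted to LF"
    # A transfer that rewrote LF as CR LF expands the final "\n"; a naive
    # converter also turns the embedded "\r\n" into "\r\r\n".
    if head.startswith((b"\x89PNG\r\n\x1a\r\n", b"\x89PNG\r\r\n\x1a\r\n")):
        return "LF converted to CRLF"
    # A 7-bit channel clears the top bit of the first byte: 0x89 becomes 0x09.
    if head.startswith(b"\x09PNG"):
        return "high bit stripped"
    return "not a PNG"
```

A reader of PNG files only needs the first comparison; the other branches show why the signature was built from these particular bytes.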

The Unix utility program file can read and interpret magic numbers from files; the database file it uses to recognize them is itself called magic. The Windows utility TrID has a similar purpose.
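The idea behind such utilities can be sketched in a few lines of Python, using only signatures from the examples above. This is a much-simplified illustration and not how the file utility is implemented: the names MAGIC_PREFIXES, sniff and sniff_file are hypothetical, and only prefixes at offset zero are checked.

```python
# (magic prefix, description) pairs taken from the examples in this section.
MAGIC_PREFIXES = [
    (b"\xca\xfe\xba\xbe", "Java class file"),
    (b"GIF89a", "GIF image"),
    (b"GIF87a", "GIF image"),
    (b"\xff\xd8", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"MThd", "Standard MIDI file"),
    (b"%PDF", "PDF document"),
    (b"%!", "PostScript"),
    (b"#!", "Unix script"),
    (b"MZ", "DOS/Windows executable"),
    (b"PK", "Zip archive"),
    (b"II\x2a\x00", "TIFF image (little endian)"),
    (b"MM\x00\x2a", "TIFF image (big endian)"),
]


def sniff(data: bytes) -> str:
    """Return a description for the first matching magic prefix."""
    for magic, description in MAGIC_PREFIXES:
        if data.startswith(magic):
            return description
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file to compare against the table."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))
```

Short signatures such as 'MZ' or 'PK' can match unrelated data by chance, which is one reason real tools examine more than a two-byte prefix.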

Magic numbers in protocols

· In the RFB protocol used by VNC, each side opens the handshake with a protocol version message: the server sends "RFB" (0x524642, for "Remote Frame Buffer") followed by its protocol version number, and the client replies in the same form.

· In the SMB protocol used by Microsoft Windows, each SMB request or server reply begins with 0xFF534D42, or "\xffSMB".

· In the MSRPC protocol used by Microsoft Windows, each TCP-based request begins with the byte 0x05 (representing Microsoft DCE/RPC Version 5), followed immediately by 0x00 or 0x01 for the minor version. In UDP-based MSRPC requests the first byte is always 0x04.

· DCOM object instantiation requests carried over MSRPC have a huge container object called ORPC This. This contains smaller object reference structures, called OBJREFs, which are always prefixed with the byte sequence "MEOW". Debugging extensions (used for DCOM channel hooking) are prefaced with the byte sequence "MARB".

· Unencrypted BitTorrent peer connections begin with a single byte, 0x13 (decimal 19), giving the length of the protocol name, followed immediately by the phrase "BitTorrent protocol" at byte position 1.

· eDonkey/eMule traffic begins with a single byte representing the client version. Currently 0xe3 represents an eDonkey client, 0xc5 represents eMule, and 0xd4 represents compressed eMule.

· SSL transactions always begin with a "client hello" message. The record encapsulation scheme used to prefix all SSL packets consists of two- and three-byte header forms. Typically an SSL version 2 client hello message is prefixed with a 0x80 and SSLv3 server response to a client hello begins with 0x16 (though this may vary).

· DHCP packets use a "magic cookie" value of 0x63825363 at the start of the options section of the packet.
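Several of the protocol signatures above can be checked with simple byte comparisons. The following Python sketch is illustrative only: the function names are hypothetical, and the offset of 236 bytes for the DHCP cookie is an assumption taken from the fixed BOOTP field layout in RFC 2131, not from the text above.

```python
import struct

DHCP_MAGIC_COOKIE = 0x63825363
# Assumption: the options section follows 236 bytes of fixed BOOTP fields.
BOOTP_FIXED_FIELDS_LEN = 236


def has_dhcp_cookie(packet: bytes) -> bool:
    """Check for the magic cookie at the start of the options section."""
    if len(packet) < BOOTP_FIXED_FIELDS_LEN + 4:
        return False
    (cookie,) = struct.unpack_from("!I", packet, BOOTP_FIXED_FIELDS_LEN)
    return cookie == DHCP_MAGIC_COOKIE


def classify_payload(payload: bytes) -> str:
    """Guess the protocol from the first bytes of a stream or message."""
    if payload.startswith(b"\xffSMB"):
        return "SMB"
    if payload.startswith(b"RFB"):
        return "RFB"
    # 0x13 is 19, the length of the string "BitTorrent protocol".
    if payload.startswith(b"\x13BitTorrent protocol"):
        return "BitTorrent"
    return "unknown"
```

As with file signatures, a match is only a strong hint; a payload can begin with these bytes by coincidence.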

File format

A file format is a particular way to encode information for storage in a computer file.

Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.

Generality

Some file formats are designed to store very particular sorts of data: the JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for storage of several different types of data: the GIF format supports storage of both still images and simple animations, and the QuickTime format can act as a container for many different types of multimedia. A text file is simply one that stores any text, in a format such as ASCII or UTF-8, with few if any control characters. Some file formats, such as HTML, or the source code of some particular programming language, are in fact also text files, but adhere to more specific rules which allow them to be used for specific purposes.

It is sometimes possible to cause a program to read a file encoded in one format as if it were encoded in another format. For example, one can play a Microsoft Word document as if it were a song by using a music-playing program that deals in "headerless" audio files. The result does not sound very musical, however. This is so because a sensible arrangement of bits in one format is almost always nonsensical in another.
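The experiment described above can be approximated in Python by wrapping arbitrary bytes in an audio header, so that an ordinary player treats each byte as one sound sample. This is a sketch under stated assumptions: the function name bytes_as_wav is hypothetical, and the choice of 8-bit mono samples at 8000 Hz is arbitrary.

```python
import io
import wave


def bytes_as_wav(raw: bytes, sample_rate: int = 8000) -> bytes:
    """Wrap arbitrary bytes in a WAV header as 8-bit mono PCM samples."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)   # mono
        writer.setsampwidth(1)   # one byte per sample
        writer.setframerate(sample_rate)
        writer.writeframes(raw)  # the bytes are stored unchanged
    return buffer.getvalue()
```

The payload is not altered at all; only the added header changes how a program interprets it, which is why the result is noise rather than music.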

Specifications

Many file formats, including some of the best-known ones, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular program treats a particular file format correctly. There are, however, two reasons why this is not always the case. First, some file format developers view their specification documents as trade secrets, and therefore do not release them to the public. Second, some file format developers never spend time writing a separate specification document; rather, the format is defined only implicitly, through the program(s) that manipulate data in the format.

Identifying the type of a file

Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within the filesystem—an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages.

Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.

Filename extension

One popular method in use by several operating systems, including Mac OS X, CP/M, DOS, VMS, VM/CMS, and Windows, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm), and GIF images by .gif. In the original FAT filesystem, filenames were limited to an eight-character identifier and a three-character extension, known as an 8.3 filename. Many formats thus still use three-character extensions, even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse the operating system and consequently users.
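Extracting the extension and checking the 8.3 length limits can be sketched in Python as follows. The function names are hypothetical, and is_8_3 checks only the lengths described above; it ignores the restrictions on permitted characters that real FAT implementations also apply.

```python
import os.path


def extension(filename: str) -> str:
    """Return the part of the name after the final period, lowercased."""
    _, ext = os.path.splitext(filename)
    return ext[1:].lower()


def is_8_3(filename: str) -> bool:
    """True if the name fits in eight characters plus a three-character extension."""
    base, dot, ext = filename.rpartition(".")
    if not dot:  # no period at all: the whole name is the identifier
        base, ext = filename, ""
    return 1 <= len(base) <= 8 and len(ext) <= 3 and "." not in base
```

For example, extension("index.html") gives "html", a four-character extension that would not have fitted the original FAT limit, which is why the shorter .htm form exists.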

One feature of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it—an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly. This led more recent operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. This separates the user from the complete filename, preventing the accidental changing of a file type, while allowing expert users to still retain the original functionality through enabling the displaying of file extensions.

Magic number

An alternative method, often associated with Unix and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier, which begins with <?xml. The files could also begin with any random text or several empty lines, but still be usable HTML.
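The difficulty with plain-text formats can be seen in a small Python sketch of the HTML checks just described. The names HTML_MARKERS and looks_like_html are hypothetical, and the 1024-byte window is an arbitrary choice; the heuristic also accepts any XML file, since it cannot tell XHTML from other XML by the prefix alone.

```python
HTML_MARKERS = (b"<html", b"<!doctype", b"<?xml")
UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_html(data: bytes) -> bool:
    """Heuristically test whether data starts like an HTML document."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    # Skip leading blank lines and compare case-insensitively.
    head = data[:1024].lstrip().lower()
    return head.startswith(HTML_MARKERS)
```

A file that begins with ordinary text and only later contains markup is rejected by this check even though a browser would render it, which illustrates why such files are harder to spot than formats with a fixed signature.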