A Java-Based Intelligent Compressor of Heterogeneous Hybrid Multimedia Documents

Improving Communication in e-Commerce Applications

MAJED AL-MASHARI* AND YOSOUF AL-OHALI

College of Computer & Information Sciences,

King Saud University

P.O. Box 51178, Riyadh 11543

SAUDI ARABIA

* (Contact author)

Abstract: - Electronic commerce is emerging as the new Internet business model. In electronic commerce, most net traffic is multimedia-based, which means that there is a real need for an efficient hybrid multimedia compression technique that improves communication channel performance. Most document files consist of a combination of text, graphic, and picture segments and consequently require different compression algorithms for their efficient coding. This combination of data usually comes in different file formats. Therefore, a technique is needed to recognize each of these formats before the optimal compression algorithms can be applied. This paper proposes a framework for improving the performance of electronic commerce through an efficient and intelligent hybrid compression technique which is able to automatically separate images from text data and recognize different image file formats in a data stream being transmitted.

Key-Words: - e-commerce, communication, compression techniques

1 Introduction

With multimedia being the major medium of communication in electronic commerce, there is an increased need to improve effective bandwidth, since these applications are currently severely straining the available bandwidth. Whilst in the past most net traffic was text based, with file sizes measured in hundreds of kilobytes, the use of the Hypertext Markup Language (HTML) to serve multimedia documents has increased file sizes to megabytes. This encroachment on bandwidth results in long delays in accessing documents, limiting information availability to fewer users in a given time. A solution would be to format the transmitted data in such a way that it takes up less space - a technique known as compression [1].

The problem with compressing multimedia documents is that the files to be sent can contain a mixture of text and image data, and the image data can be in different formats (e.g. Portable Grey Map (PGM), Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG), etc.). A general-purpose compression utility would have to detect image data, recognize its format, perhaps compress it, and then send or store the data.

This paper addresses the problems mentioned above by proposing a technique for developing an intelligent compression utility system. While the proposed technique is beneficial in satellite-based broadcasting of product catalogues for distributed information systems, its development and implementation will have much wider applicability.

2 Improving Communication in Electronic Commerce

Compression is the process of deriving more efficient (i.e., smaller) representations of data. There are two main strategies for compression: one is the reduction of redundancies (repetitive information), and the other exploits the characteristics of the human visual system. In terms of the accuracy of the reconstructed data, the lossless method is 100% accurate, while the lossy method introduces errors into the reconstructed data which may or may not be acceptable [2]. Lossless image compression techniques are necessary for some applications, e.g. medical imaging, where not only high quality is demanded but exact archiving is a crucial requirement. These techniques are also important for data files used by applications software where loss of information is not acceptable, e.g. text documents and program executables. Various lossless algorithms and techniques are currently in use; run-length encoding (RLE), Huffman coding, and arithmetic coding are among the most frequently implemented. RLE takes advantage of the fact that, in many data streams, consecutive single tokens are often the same. RLE checks the stream for this fact and inserts a special token each time a chain of more than two equal input tokens is found. RLE is easily implemented and fast, but its compression ability is very limited [3].
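
To make the idea concrete, the following is a minimal, byte-oriented RLE sketch in Java; the class name, marker byte, and run-length limit are illustrative choices rather than part of any standard. Runs of three or more equal bytes are replaced by a marker, a count, and the repeated value, while shorter runs are copied literally.

import java.io.ByteArrayOutputStream;

// Minimal byte-oriented RLE sketch: runs of three or more equal bytes are
// replaced by MARKER, run length, value. (A real coder must also escape
// occurrences of MARKER in the input; omitted here for brevity.)
public class RleSketch {
    private static final int MARKER = 0x90;   // illustrative choice
    private static final int MAX_RUN = 255;

    public static byte[] encode(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            int run = 1;
            while (i + run < in.length && in[i + run] == in[i] && run < MAX_RUN) {
                run++;
            }
            if (run > 2) {                    // worth encoding as a run
                out.write(MARKER);
                out.write(run);
                out.write(in[i]);
            } else {                          // copy short runs literally
                for (int k = 0; k < run; k++) out.write(in[i]);
            }
            i += run;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "aaaaabbbcde".getBytes();
        System.out.println("compressed length: " + encode(data).length);
    }
}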

Huffman coding is a probabilistic technique that uses special binary trees to construct minimum-redundancy variable-length codewords for symbols. Its strategy is to analyze the data stream to be compressed and assign codes on a symbol-frequency basis. Once the data has been analyzed, a binary tree is constructed to represent the coded symbols. The individual symbols are leaves in the tree; their codes are given by the path to each leaf from the root of the tree. Tracing the path from the root of the tree to a specific leaf yields the code for the symbol found at that leaf, and the number of branches taken equals the length of the code. Longer codes indicate a character found at a deeper level of the tree, whilst shorter codes indicate a shallower level. Huffman coding becomes more efficient when applied to correlated data. That is why it gives a good compression ratio on images, and a higher ratio on colored ones [12].
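
The sketch below illustrates this construction in Java under the usual textbook formulation: the two least frequent nodes are repeatedly merged into an internal node until a single tree remains, and codewords are read off as root-to-leaf paths (0 for left, 1 for right). The class and method names are illustrative.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Minimal sketch of Huffman code construction: symbol frequencies are
// combined bottom-up into a binary tree; the path from the root to each
// leaf (0 = left, 1 = right) becomes that symbol's codeword.
public class HuffmanSketch {

    static class Node implements Comparable<Node> {
        int symbol = -1;          // -1 for internal nodes
        long freq;
        Node left, right;
        Node(int symbol, long freq) { this.symbol = symbol; this.freq = freq; }
        Node(Node l, Node r) { this.freq = l.freq + r.freq; left = l; right = r; }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    public static Map<Integer, String> buildCodes(long[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<Node>();
        for (int s = 0; s < freq.length; s++)
            if (freq[s] > 0) pq.add(new Node(s, freq[s]));
        while (pq.size() > 1)                         // merge the two rarest nodes
            pq.add(new Node(pq.poll(), pq.poll()));
        Map<Integer, String> codes = new HashMap<Integer, String>();
        assign(pq.poll(), "", codes);
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Integer, String> codes) {
        if (n == null) return;
        if (n.symbol >= 0) { codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        long[] freq = new long[256];
        for (byte b : "abracadabra".getBytes()) freq[b & 0xFF]++;
        System.out.println(buildCodes(freq));   // frequent symbols get shorter codes
    }
}

Running it on a short string such as "abracadabra" shows the frequent symbol 'a' receiving a one-bit codeword while the rare symbols receive three-bit codewords.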

Arithmetic coding works by first defining an interval. Assuming a simple binary case, for each successive symbol in the message the current interval is subdivided into subintervals, one for each possible next symbol. The size of each subinterval is proportional to the probability that the corresponding symbol is the next symbol in the message, according to the source model. Then, depending on which symbol actually occurs next in the message, one of the new subintervals is selected as the current interval. After all the symbols in the message have been processed in this way, a codeword is emitted which uniquely identifies the final current interval apart from all other possible final intervals [4].
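
As a worked illustration, the Java fragment below narrows the interval for a four-symbol binary message under a fixed model with P(0) = 0.8 and P(1) = 0.2; the probabilities and message are made up for the example, and doubles are used only for readability (practical coders use integer arithmetic with renormalization).

// Illustrative (floating-point) sketch of arithmetic-coding interval
// narrowing for a binary source with P(0) = 0.8 and P(1) = 0.2.
// Real coders use integer arithmetic with renormalization to avoid the
// precision loss that doubles suffer from; this only shows the idea.
public class ArithmeticSketch {
    public static void main(String[] args) {
        double p0 = 0.8;                 // model: probability of symbol '0'
        int[] message = {0, 0, 1, 0};
        double low = 0.0, high = 1.0;
        for (int sym : message) {
            double split = low + (high - low) * p0;
            if (sym == 0) high = split;  // '0' takes the lower sub-interval
            else          low  = split;  // '1' takes the upper sub-interval
            System.out.println("after " + sym + ": [" + low + ", " + high + ")");
        }
        // Any number inside [low, high) - e.g. its midpoint - identifies the
        // whole message, given the same model and the message length.
        System.out.println("codeword: " + (low + high) / 2);
    }
}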

With lossy image compression techniques, the reconstructed image will contain some error (i.e. it will differ from the original image). This error should be small enough that it does not distort the image quality. This loss of information allows greater compression ratios to be achieved, and the amount of information lost can normally be controlled by a quality factor provided by the user. The Discrete Cosine Transform (DCT) and wavelet transforms are the most widely used transforms for lossy compression. This scheme involves subdividing an NxN image into smaller nxn blocks and performing the transform on each subimage. The goal of the transform is to decorrelate the original signal, which results in the signal energy being redistributed among only a small set of transform coefficients. In this way, many coefficients may be discarded after quantization. Quantization refers to the process of removing transform-based information that is irrelevant to the human visual system.
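
The sketch below computes the transform step directly from the 2-D DCT definition for an n x n block. It is far too slow for a real codec, which would use a fast factorization, but it shows how a block of similar pixel values collapses into essentially a single low-frequency coefficient.

// Naive (O(n^4)) 2-D DCT of an n x n block, straight from the textbook
// definition. It illustrates how the transform concentrates the signal
// energy of a smooth block into a few low-frequency coefficients.
public class DctSketch {
    public static double[][] dct2(double[][] block) {
        int n = block.length;
        double[][] out = new double[n][n];
        for (int u = 0; u < n; u++) {
            for (int v = 0; v < n; v++) {
                double sum = 0.0;
                for (int x = 0; x < n; x++)
                    for (int y = 0; y < n; y++)
                        sum += block[x][y]
                             * Math.cos((2 * x + 1) * u * Math.PI / (2.0 * n))
                             * Math.cos((2 * y + 1) * v * Math.PI / (2.0 * n));
                double cu = (u == 0) ? Math.sqrt(0.5) : 1.0;
                double cv = (v == 0) ? Math.sqrt(0.5) : 1.0;
                out[u][v] = 2.0 / n * cu * cv * sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] flat = new double[8][8];
        for (double[] row : flat) java.util.Arrays.fill(row, 100.0);
        // A constant block produces one large DC coefficient; all AC terms are ~0.
        System.out.println("DC coefficient: " + dct2(flat)[0][0]);
    }
}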

3 Current Techniques

Applying compression to a file shrinks it to a fraction of its original size. The smaller size means that the file can be sent over the World Wide Web much faster, which results in faster browsing of web sites. Lossy JPEG compression and lossless GIF compression are the most common image formats on the WWW; both have gained wide support, and most graphical Web browsers now support them.

The most popular image-format compression schemes, such as the RLE technique and the International Telegraph and Telephone Consultative Committee (CCITT) standard, are lossless compression methods. An image compressed using a lossless method is guaranteed to be identical to the original image when uncompressed [5]. JPEG is primarily a lossy method of compression. It was designed specifically to discard information that the human eye cannot easily see. The amount of compression achieved depends upon the content of the image data. Higher compression ratios will result in image files that differ noticeably from the original image, but that still have an overall good appearance. Like other forms of lossy compression, JPEG takes the image data and performs a regularizing process that makes it contain more repeating patterns than it did originally. The data is first transformed using a mathematical function called the DCT, which analyzes the image's spatial frequency components both horizontally and vertically [6]. The lossy part of the process is quantization, which is done using an 8x8 quantization matrix of step sizes (quanta) - one element for each DCT coefficient. Step sizes are small in the upper left (low frequencies) and large in the lower right (high frequencies). The quantizer divides each DCT coefficient by its corresponding quantum and then rounds to the nearest integer. Large quanta drive small coefficients down to zero; as a result, many high-frequency coefficients become zero and are therefore easier to code [7]. The resulting values are then reordered in the zig-zag sequence to increase the length of runs, and then compressed losslessly, first with RLE and then Huffman encoding. An end user can adjust the quality of a JPEG encoder using a parameter called the quality factor (Qfactor). The optimal Qfactor depends on the image content and is, therefore, different for every image. The decompression process simply reverses the entire process: the data is expanded using Huffman decoding, the resulting values are multiplied by the quantization step sizes, and then an inverse DCT is applied.
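
The quantization and zig-zag steps can be sketched as follows; the quantization matrix is taken as a parameter rather than reproducing a particular standard table, and the class name is illustrative.

// Sketch of the JPEG quantization and zig-zag reordering steps for one
// 8x8 block of DCT coefficients. The quantization matrix q is supplied by
// the caller; larger entries (used for the higher frequencies towards the
// lower right) force more coefficients to zero.
public class QuantizeSketch {

    // Build the 8x8 zig-zag scan order by walking the anti-diagonals,
    // alternating direction, as the JPEG scan does.
    static int[][] zigzagOrder() {
        int[][] order = new int[64][2];
        int k = 0;
        for (int s = 0; s <= 14; s++) {
            if (s % 2 == 1) {                            // odd diagonal: top-right to bottom-left
                for (int r = Math.max(0, s - 7); r <= Math.min(7, s); r++)
                    order[k++] = new int[] { r, s - r };
            } else {                                     // even diagonal: bottom-left to top-right
                for (int r = Math.min(7, s); r >= Math.max(0, s - 7); r--)
                    order[k++] = new int[] { r, s - r };
            }
        }
        return order;
    }

    // Divide each coefficient by its quantum, round to the nearest integer,
    // and emit the 64 quantized values in zig-zag order (ready for run-length
    // and Huffman coding of the trailing zeros).
    static int[] quantizeAndScan(double[][] dct, int[][] q) {
        int[][] order = zigzagOrder();
        int[] out = new int[64];
        for (int k = 0; k < 64; k++) {
            int r = order[k][0], c = order[k][1];
            out[k] = (int) Math.round(dct[r][c] / q[r][c]);
        }
        return out;
    }
}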

Until the advent of JPEG, GIF was the dominant graphics format. CompuServe devised the GIF format to facilitate transmission of graphical images over phone lines via modems (the underlying LZW algorithm was patented by the Unisys Corp.). Because CompuServe is accessed by users of a wide variety of different computer systems, the format was designed for portability. GIF uses a form of Lempel-Ziv-Welch (LZW) coding which, modified slightly for images, yields the high compression rates that the GIF format achieves. GIF is limited to storing 8-bit (256-color) images [8].
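
A minimal LZW encoder, the dictionary scheme GIF builds on, can be sketched as follows; GIF applies it to pixel indices and packs the codes at variable bit widths, which is omitted here.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal LZW encoder sketch. The dictionary starts with all single bytes
// and grows by one entry for every new string encountered, so repeated
// patterns are replaced by progressively longer dictionary references.
public class LzwSketch {
    public static List<Integer> encode(byte[] in) {
        Map<String, Integer> dict = new HashMap<String, Integer>();
        for (int i = 0; i < 256; i++) dict.put("" + (char) i, i);
        int nextCode = 256;
        List<Integer> out = new ArrayList<Integer>();
        String w = "";
        for (byte b : in) {
            String wc = w + (char) (b & 0xFF);
            if (dict.containsKey(wc)) {
                w = wc;                       // keep extending the current match
            } else {
                out.add(dict.get(w));         // emit code for the longest match
                dict.put(wc, nextCode++);     // remember the new string
                w = "" + (char) (b & 0xFF);
            }
        }
        if (!w.isEmpty()) out.add(dict.get(w));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode("TOBEORNOTTOBEORTOBEORNOT".getBytes()));
    }
}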

Other file formats include PGM, PostScript (PS), and the Microsoft Windows Bitmap (BMP). All three are uncompressed formats. The PS language was designed as a typesetting language for both text and graphics. The BMP format was introduced by Microsoft and is used by the Windows Paintbrush program [8]. PGM is capable only of handling gray-scale images.

Binary Tree Predictive Coding (BTPC) is a compressed format designed by John Robinson within the University of Waterloo's Department of Systems Design Engineering. The method is applicable to static images and, in contrast to JPEG and GIF, is designed to perform both lossless and lossy compression and to be effective for both photographs and graphics. It is also suitable for compressing multimedia images, which integrate two or more types of visual material; typical examples are document pages containing text, graphics, and annotated photographs. The method is general-purpose in the sense that it must compress photographs, graphics, and text, with or without loss [11]. The Eicon Loadable Module (ELM) is a compression project currently being developed by Eicon Technologies. Its aim is to provide an intelligent mechanism for choosing the optimal compression algorithm based on the data being transferred (text, graphics, audio, binary data, etc.). This scheme may be implemented either as a separate sub-network layer residing above X.25 or as a service within the X.25 protocol itself [1].

4 A Proposed Framework

Today, Internet usage is growing at an extraordinary rate, roughly doubling every year. Many computers now have direct Internet connectivity and exchange document files using a standard protocol, the HyperText Transfer Protocol (HTTP). Document files are structured according to specific formats, depending on the applications used in their creation. These formats fall into several groups, but they are all concerned with storing, organizing, and retrieving data in an efficient and logical way. For applications software to be able to recognize and process files, it is necessary that they be created according to their predefined standards and rules; even slight differences result in an inability to read these files.

Currently, a lot of widely-used applications software produces different file formats for hybrid multimedia data (e.g. mixtures of text and images). Transmitting such files, which may be huge, in uncompressed form increases the time needed to deliver them to the client. Compressing the overall hybrid document using lossy techniques, e.g. JPEG, results in losing some important text data, which leaves the receiver unable to process the files with the applications that originally produced them. On the other hand, using lossless techniques results in a smaller compression ratio for the images. The solution is to make a trade-off between the two methods and apply them together on the same hybrid multimedia document being compressed.

As explained above, text compression follows different algorithms from those used for lossy image compression. Therefore, a mixed-media compression system should be designed to efficiently handle the internal contents of the hybrid file and then produce a compressed file that can be sent, possibly along with its decompressor, over the net to the client side. Once the compressed file is received by the client, it should automatically be decompressed back into its original form. The reconstructed file will differ from the original one, but the differences will be only in the image parts, and not in the text parts, which might contain headers or control characters. [If non-standard techniques are used, it may be necessary to transmit the decompressor algorithm along with the data.]

Many hybrid document files embed images in a certain format, with the text arranged within the documents in different ways. This requires separating images from text in a given hybrid document, and also recognizing the format in which the images are stored. It would then be possible to apply the optimal algorithms, in terms of quality and compression ratio, to both the text and the images.

The implementation of the proposed technique starts with the problem of recognizing the format of an image that is embedded in a data file. To simplify the problem, a format recognizer is needed to detect the format type of a given unknown file, which implies recognizing a unique characteristic of each image format. One common approach is to produce a histogram of a file sample in order to look for features common to each format. An uncompressed image displays strong spatial similarities between adjacent pixels, but significant discontinuities at the end of each row and the start of the next; a compressed file would not be expected to have such a feature. For an uncompressed file (e.g. PGM), if the differences between adjacent pixels in a file sample are calculated and plotted as a histogram, a peak is visible near zero, tailing off towards another peak around the discontinuity value.
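
A simple version of this test can be sketched in Java as follows; the sample size, the header skip, the width of the "near zero" window, and the decision threshold mentioned in the comments are illustrative choices rather than values taken from the experiments described here.

import java.io.FileInputStream;
import java.io.IOException;

// Sketch of the histogram test described above: take a sample from a file,
// histogram the absolute differences between adjacent bytes, and report how
// concentrated those differences are around zero. Uncompressed raster data
// (e.g. PGM) gives a sharp peak near zero, while already-compressed data
// (e.g. JPEG, GIF) looks close to uniform.
public class DifferenceHistogram {

    public static double nearZeroFraction(String path, int sampleSize) throws IOException {
        byte[] sample = new byte[sampleSize];
        FileInputStream fis = new FileInputStream(path);
        int read;
        try {
            fis.skip(1024);                    // skip past most headers (illustrative offset)
            read = fis.read(sample);
        } finally {
            fis.close();
        }
        if (read < 2) return 0.0;
        int[] hist = new int[256];
        for (int i = 1; i < read; i++) {
            int diff = Math.abs((sample[i] & 0xFF) - (sample[i - 1] & 0xFF));
            hist[diff]++;
        }
        int nearZero = 0;
        for (int d = 0; d <= 8; d++) nearZero += hist[d];   // differences within +/-8
        return (double) nearZero / (read - 1);
    }

    public static void main(String[] args) throws IOException {
        // Fractions close to 1 suggest uncompressed image data; values near
        // the uniform baseline suggest compressed (or encrypted) data.
        System.out.println(nearZeroFraction(args[0], 4096));
    }
}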

Experiments demonstrate that moving the file sampling points forward and backward produces noticeable changes in the frequencies, especially for samples near the beginning of the file where the header is located. When charts of the actual pixel values themselves are drawn, it is noticeable that each file format has its own signature. In the case of a PGM file, it is possible to determine the width of the image by finding the period of the (almost) cyclic data. Several experiments carried out on files of different sizes all give almost the same results; in particular, the header parts are almost identical. In fact, there are certain features of an image header which are characteristic of that format. Examining a wide range of JPEG images, for instance, shows that the almost-linear slope part of the way along is a standard feature (this corresponds to the table of Huffman codes). Likewise, a GIF file seems to have characteristic features in its header. The implementation of the proposed technique should seek the development of a number of these feature recognizers, which scan a data stream looking for these characteristic header features. Once a header is recognized, more detailed, format-specific analysis should be able to identify the various parts of the header and enable full image detection. Once detected, the image can be conceptually reconstructed and, if appropriate, compressed using a lossy technique.
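
The first, easy stage of such a recognizer can be sketched using the fixed byte signatures ("magic numbers") that JPEG, GIF, and PGM headers are known to begin with; the statistical header features discussed above, such as the Huffman-table slope in JPEG, would then be layered on top of checks like these to confirm the format and to locate images embedded mid-stream.

// Simple sketch of a format recognizer based on fixed header signatures
// ("magic numbers"). This covers only the first, easy stage of recognition;
// format-specific statistical analysis of the header would follow.
public class FormatRecognizer {
    public enum Format { JPEG, GIF, PGM, UNKNOWN }

    public static Format recognize(byte[] header) {
        if (header.length >= 2
                && (header[0] & 0xFF) == 0xFF && (header[1] & 0xFF) == 0xD8)
            return Format.JPEG;                       // JPEG start-of-image marker FF D8
        if (header.length >= 6 && new String(header, 0, 6).startsWith("GIF8"))
            return Format.GIF;                        // "GIF87a" or "GIF89a"
        if (header.length >= 2 && header[0] == 'P'
                && (header[1] == '2' || header[1] == '5'))
            return Format.PGM;                        // ASCII (P2) or binary (P5) PGM
        return Format.UNKNOWN;
    }
}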

Clearly, the programming language chosen to implement and demonstrate this system should provide certain features:

Support by World Wide Web (WWW) browsers, so the decoder can be embedded into the HTML document and deliver the reconstructed document using client-side functionality.

Platform independence and portability, so the system can run on different machines (Macs, PCs, and Unix) attached to the Internet.

These features are most closely matched by Java, a modern object-oriented language. Java has been predicted to become the implementation language of choice for Internet-based applications. With Java, one can write two kinds of programs: applets, designed to be transported over the Internet and executed in WWW browsers like Netscape Navigator, and stand-alone applications, designed to execute on the local machine [9]. It is possible to store the executable files of conventional programs, written in C or C++, on the server along with the HTML page, and write the HTML in a way that allows the Web browser to automatically load the program from the server and execute it on the client's computer. However, several problems arise when attempting to do that. Firstly, the programmer would have to write his/her own functions to perform even the most basic operations, such as loading an image file from the server to be displayed on the client machine. Web programming in a conventional, non-Web language would also complicate Web development, with programmers reinventing the wheel by solving the same problems in different ways. In addition, conventional programs generate rather large executables, which means loading them over the communication line would take a long time.
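
As a sketch of how the client-side decoder might be packaged, the applet below relies only on the standard java.applet and java.awt classes; the Decompressor class and its decode() call are hypothetical placeholders for the hybrid decoder itself.

import java.applet.Applet;
import java.awt.Graphics;
import java.awt.Image;

// Minimal sketch of how the client-side decoder could be packaged as an
// applet embedded in the HTML page alongside the compressed document.
public class DecoderApplet extends Applet {
    private Image reconstructed;

    public void init() {
        // The compressed hybrid document would be fetched from the server
        // (e.g. relative to getDocumentBase()) and decoded here.
        // Hypothetical decoder call, not a real API:
        // reconstructed = new Decompressor().decode(getDocumentBase(), getParameter("doc"));
    }

    public void paint(Graphics g) {
        if (reconstructed != null) g.drawImage(reconstructed, 0, 0, this);
    }
}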