Lecture Notes #13

CMPT 408/765 – Nov 4, 2009 & Nov 6, 2009

Taken by: Yvonne Cheng

Compression

There are 2 types of compression:

Lossless

  • The original file can be recovered exactly after decompression
  • For example: zip

Lossy

  • The original file cannot be recovered exactly after decompression; some information is discarded
  • For example: mpeg

Compressibility

Compressibility of a string is measured by entropy.

A string with high entropy cannot be compressed well. For example, a uniformly random string has the highest entropy and cannot be compressed at all.
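As a rough illustration (a sketch, not from the lecture), the empirical Shannon entropy of a string can be computed from its character frequencies:

```python
from collections import Counter
from math import log2

def entropy(s: str) -> float:
    """Empirical Shannon entropy of a string, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A string using many symbols equally often has high entropy;
# a heavily skewed string has low entropy and compresses well.
print(entropy("abcdefgh"))   # → 3.0 bits/char (8 equally likely symbols)
print(entropy("aaaaaaab"))   # ≈ 0.54 bits/char (heavily skewed)
```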

Factors

Factors we need to consider for compression:

Efficiency

Size (the compressed output should be as small as possible)

Idea

For most languages, some characters are used more often than others. For example, English text contains a disproportionate number of occurrences of “e”.

We can represent the most frequent letters with the shortest codes, and the less frequent letters with longer codes.

Variable Length Coding

Variable length code: each character gets a variable length bit representation.

If the text is highly biased toward some characters, the encoded text can be much shorter. On the other hand, if every character occurs equally often, the result will not be good.

Difficulty: how do we know where one letter’s code ends and the next begins? We need to parse the code and convert it back into readable letters.

Example:

e represented by “0”

a represented by “00”

If we see “000”, what is it?

The answer could be “eee”, “ea”, or “ae”.

Ambiguous parsing

Problem: the code for “e” is a prefix of the code for “a”.

We need the code to be “prefix-free”: no character’s code is a prefix of another’s.
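The prefix-free property can be checked mechanically; a small sketch (the helper name is our own):

```python
def is_prefix_free(codes):
    """Check that no codeword is a prefix of another.

    After sorting, any codeword that is a prefix of another
    appears immediately before some extension of itself.
    """
    words = sorted(codes)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(["0", "00"]))         # → False: "0" is a prefix of "00"
print(is_prefix_free(["0", "10", "110"]))  # → True
```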

Huffman Code

A Huffman code has the following properties:

Variable length representation

Binary coding

Based on character frequencies (note: this dependence is also its weakness)

Our assumption: We know the frequency of all the letters.

Alphabet Σ where |Σ| = d

What we can do: Build a Huffman tree which is going to be our dictionary (i.e. mapping of characters and code).

In this way, the dictionary is shared implicitly, without being transmitted. The receiver can rebuild it from the letter frequencies (which are explicit knowledge under our assumption).

Algorithm:

  1. Start with one node per character, weighted by the character’s frequency.
  2. Combine the 2 uncombined nodes with the lowest frequencies into 1 node whose frequency is the sum of the 2 nodes.
  3. Repeat Step 2 until only one node (the root) is left; each character’s code is then read off the tree, one bit per branch.
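The merging procedure above can be sketched as follows. The demo frequencies are made up for illustration; tie-breaking (and hence the exact 0/1 labels) may differ from the lecture’s tree, but the code lengths come out the same as Example #1 below:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes from a {char: frequency} map.

    Repeatedly merges the two lowest-frequency nodes; codes are
    built by prefixing 0 (left branch) or 1 (right branch).
    """
    tie = count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(tie), {ch: ""}) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Skewed (hypothetical) frequencies give a skewed tree:
# b gets 1 bit, c 2 bits, d 3 bits, e and f 4 bits each.
print(huffman_codes({"b": 16, "c": 8, "d": 4, "e": 2, "f": 1}))
```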

Example #1:

b: 0

c: 10

d: 110

e: 1110

f: 1111


Note: All codes are prefix-free!

Example #2:

a: 010

b: 0110

c: 0111

d: 10

e: 11

f: 00

Analysis:

Under what conditions do we get a complete (perfectly balanced) tree?

We get one when all characters have equal frequency/probability.

Bad compression

A more balanced tree means worse compression.

Assume we know nothing about structure: we consider only character frequencies here, not substrings. For example, “abababab” has structure and is easy to compress if we code by substrings, but viewed letter by letter it is a uniform string and cannot be compressed this way.

LZ77 (Lempel-Ziv 77)

This is a lossless compression with following properties:

Variable length encoding (not only is the code for each character variable in length; the length of the substring being coded is also variable)

No prior knowledge required (Knowledge about character frequency)

No explicit dictionary (We only need the coded string)

One pass algorithm (For both encoding and decoding)

LZ77 uses a sliding-window algorithm.

Once the sliding window has passed, there is no further access to earlier parts of the string.

Sliding Window

Search buffer:

  • Already seen, already indexed/coded
  • Can be considered an implicit/adaptive dictionary

Look-ahead buffer:

  • Not yet coded; holds the next characters to be encoded

The sizes of the two buffers are parameters.

The assumption is that patterns will reoccur within the range of the search buffer; we look for matches only in the search buffer window.

Encoding Algorithm

  1. Set the position to the beginning of the input string.
  2. Find, in the search buffer, the longest match for a prefix of the look-ahead buffer.
  3. Output the tuple (X,Y,Z) with the following meaning:
       • X is the offset, in number of characters to the left
       • Y is the length of the longest match found in the search buffer
       • Z is the character following the longest match
  4. Slide the window to the right until the substring just coded is at the very end of the search buffer. If the look-ahead buffer is not empty, repeat from Step 2.
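The steps above can be sketched as a simplified encoder (the `window` size is an assumed parameter; real LZ77 also bounds the look-ahead buffer):

```python
def lz77_encode(text, window=16):
    """Simplified LZ77 encoder: emits (offset, length, next_char) tuples."""
    out, pos, n = [], 0, len(text)
    while pos < n:
        best_off, best_len = 0, 0
        start = max(0, pos - window)
        for off in range(1, pos - start + 1):   # offsets back into the search buffer
            length = 0
            # the match may run into the look-ahead buffer (overlap is allowed);
            # stop one short of the end so a following character always exists
            while pos + length < n - 1 and text[pos - off + length] == text[pos + length]:
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len, text[pos + best_len]))
        pos += best_len + 1
    return out

print(lz77_encode("abracadabrad"))
# → [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'r'), (3, 1, 'c'), (2, 1, 'd'), (7, 4, 'd')]
```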

Example

Search buffer / Look-ahead buffer / Output
(empty) / a b r a c a d a b r a d / (0,0,a)
a / b r a c a d a b r a d / (0,0,b)
a b / r a c a d a b r a d / (0,0,r)
a b r / a c a d a b r a d / (3,1,c)
a b r a c / a d a b r a d / (2,1,d)
a b r a c a d / a b r a d / (7,4,d)

(0,0,a) means “a” has not been seen before: there is no earlier “a” to point to, so the offset is 0, the match length is 0, and the following character is “a”.

(3,1,c) means: go back 3 characters to find a match (“a”) of maximum length 1; the character following the match is “c”.

The output string is (0,0,a) (0,0,b) (0,0,r) (3,1,c) (2,1,d) (7,4,d)

Decoding Algorithm

  1. Read a tuple (X,Y,Z) from the input.
  2. Output the sequence specified by X and Y (copied from the same search-buffer part of the sliding window), followed by the character Z.
  3. Repeat Steps 1 and 2 until there are no more tuples.
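A minimal decoder sketch for the tuple stream:

```python
def lz77_decode(tuples):
    """Rebuild the string from (offset, length, next_char) tuples."""
    out = []
    for off, length, ch in tuples:
        for _ in range(length):
            out.append(out[-off])   # copy from `off` characters back (handles overlap)
        out.append(ch)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "r"),
                   (3, 1, "c"), (2, 1, "d"), (7, 4, "d")]))
# → abracadabrad
```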

Example

Search buffer / Input
a / (0,0,a)
a b / (0,0,b)
a b r / (0,0,r)
a b r a c / (3,1,c)
a b r a c a d / (2,1,d)
a b r a c a d a b r a d / (7,4,d)

Decoding is also a one-pass algorithm, using the same sliding window.

Observation

The tuple (X,Y,Z):

  • X: offset in terms of number of characters to the left
  • Y: length of match in search window
  • Z: the character following the longest match

This algorithm uses a dynamic (adaptive) dictionary.

LZW Compression

This is a variant in the Lempel-Ziv family (strictly, of LZ78, a successor to LZ77). It is also a lossless compression.

Encoding

Our assumption is that our alphabet contains only “a” and “b”.

Σ = { a, b }

For each character, we assign a code number, as shown in the following code table:

a: 0
b: 1

The code table is known before we start encoding and decoding.

(Note: For LZ77, we implicitly keep the code table in the search buffer window.)

The encoding algorithm is as follows:

  1. Go from left to right of the input string.
  2. Find the longest prefix p of the remaining input that is already in the code table.
  3. Output p’s matching code, pcode.
  4. If p is followed by a character c, insert pc into the code table, then continue from c.
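The steps above can be sketched as follows, assuming the initial code table a → 0, b → 1:

```python
def lzw_encode(text, alphabet=("a", "b")):
    """LZW encoder over a small fixed alphabet (assumed {a, b} here)."""
    table = {ch: i for i, ch in enumerate(sorted(alphabet))}   # a -> 0, b -> 1
    out, pos = [], 0
    while pos < len(text):
        # greedily extend p while p + next char is already in the table
        p = text[pos]
        while pos + len(p) < len(text) and p + text[pos + len(p)] in table:
            p += text[pos + len(p)]
        out.append(table[p])
        if pos + len(p) < len(text):           # insert pc with the next free code
            table[p + text[pos + len(p)]] = len(table)
        pos += len(p)
    return out

print(lzw_encode("abababbabaabbabbaabba"))
# → [0, 1, 2, 2, 3, 3, 5, 8, 8]
```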

Note:

The extended code table is regenerated from the original code table during decoding, so it never needs to be transmitted.

We get better and better compression along the string.

Example

Input: a b ab ab ba ba abb abba abba   (the string “abababbabaabbabbaabba”, shown split into the prefixes coded at each step)

Code: 0 1 2 2 3 3 5 8 8
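Assuming the same initial table (a → 0, b → 1), a decoder sketch that rebuilds the extended table from the codes alone. The one subtlety: a code can refer to an entry that is only about to be defined, in which case the entry is the previous phrase plus its own first character:

```python
def lzw_decode(codes, alphabet=("a", "b")):
    """Invert LZW, rebuilding the extended code table on the fly."""
    table = {i: ch for i, ch in enumerate(sorted(alphabet))}
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            cur = table[code]
        else:                        # code defined by this very step: prev + prev[0]
            cur = prev + prev[0]
        table[len(table)] = prev + cur[0]   # same insertion the encoder made
        out.append(cur)
        prev = cur
    return "".join(out)

print(lzw_decode([0, 1, 2, 2, 3, 3, 5, 8, 8]))
# → abababbabaabbabbaabba
```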