TLT-5400/5406 DIGITAL TRANSMISSION, Exercise 1, Spring 2012
Problem 1.
In general, the entropy H(X) of a discrete random variable X is defined as
H(X) = – Σi PX(xi) log2 PX(xi)  [bits].
So now we can write the solution directly as
H(X) = –( (1/2) log2(1/2) + (1/4) log2(1/4) + (1/8) log2(1/8) + (1/8) log2(1/8) ) = 1/2 + 2/4 + 3/8 + 3/8 = 1.75 bits,
where we have used the facts that logy(y^x) = x and logy(1/x) = –logy(x).
Then, the rate of the source is given by
The following encoding "rule" will achieve the entropy limit (and is also feasible to decode, since no codeword is a prefix of any other codeword):
Outcome / Probability / Codeword / Codeword length
a1 / 1/2 / 0 / 1 bit
a2 / 1/4 / 10 / 2 bits
a3 / 1/8 / 110 / 3 bits
a4 / 1/8 / 111 / 3 bits
The average number of bits produced by the coder is now
n̄ = (1/2)*1 + (1/4)*2 + (1/8)*3 + (1/8)*3 = 1.75 bits = H(X),
meaning that in this example we indeed achieved the entropy limit. It is, however, not necessarily obvious when it is possible to reach the entropy limit in general. This issue will be considered in Problem 2.
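As a quick numerical sanity check, the short Python snippet below recomputes the source entropy and the average codeword length directly from the table above (the variable names are just illustrative):

from math import log2

# Source symbol probabilities and codeword lengths from the table above
probs = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]

H = -sum(p * log2(p) for p in probs)                    # entropy H(X)
avg_len = sum(p * l for p, l in zip(probs, lengths))    # average codeword length

print(H, avg_len)   # both print 1.75 (bits)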
Problem 2.
Kraft's inequality for code word lengths is given by
Σi M^(–li) ≤ 1,
where M is the number of code symbols (=2 for binary codes), K is the number of source words (and so also the number of code words) and li is the length of the i-th code word. It can be shown that every code (prefix or not) has to satisfy the above inequality to be uniquely decodable.
In general, the average code word length is then defined as (this is the natural definition that we already used in Problem 1)
n̄ = Σi Pi li,
where Pi is the probability of the i-th source word. The target is then to minimize n̄ subject to Kraft's inequality being fulfilled (and, of course, to the li's being positive integers).
Clearly, if we are able to choose
li = –logM(Pi)
(verify that these codeword lengths li fulfill Kraft's inequality!), it follows that the average code word length is given by
n̄ = Σi Pi li = –Σi Pi logM(Pi) = H(X) (the entropy in M-ary units).
So it really is possible to obtain an average codeword length equal to the source word entropy. Of course, in practice, this is only possible if the source word probabilities are of a certain kind (i.e., –logM(Pi) is an integer for all i).
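As a small illustration of this condition (the probabilities below are just example values, not from the exercise), the ideal lengths –log2(Pi) are integers for a dyadic source but not in general:

from math import log2

def ideal_lengths(probs):
    # Ideal codeword lengths -log2(Pi); the entropy limit is reachable iff all are integers
    return [-log2(p) for p in probs]

print(ideal_lengths([1/2, 1/4, 1/8, 1/8]))   # [1.0, 2.0, 3.0, 3.0] -> integers, limit reachable
print(ideal_lengths([0.6, 0.4]))             # [0.736..., 1.321...] -> not integers, limit not reachable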
More generally, we can "round up the above idea" by choosing
li = ⌈–logM(Pi)⌉,
where ⌈x⌉ stands for the smallest integer ≥ x. Then, it follows that also these lengths li satisfy the Kraft inequality, since
Σi M^(–li) = Σi M^(–⌈–logM(Pi)⌉) ≤ Σi M^(logM(Pi)) = Σi Pi = 1.
Based on this, we really can choose the code word lengths in this way (that is, based on Kraft's inequality, there really is a code with these code word lengths li), and we can write
–logM(Pi) ≤ ⌈–logM(Pi)⌉ < –logM(Pi) + 1  (by definition of ⌈x⌉),
or simply
–logM(Pi) ≤ li < –logM(Pi) + 1.
Then, multiplying by Pi ( ≥ 0 ) and summing over i, we get
H(X) ≤ n̄ < H(X) + 1.  (q.e.d.)
Once more, the lower bound can be reached iff we are able to set
li = –logM(Pi) exactly, i.e., iff Pi = M^(–li) with integer li for all i.
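As a numerical check of this bound (a sketch only, here using the source word probabilities of Problem 3 below as test data):

from math import ceil, log2

def shannon_lengths_check(probs, M=2):
    # "Rounded-up" lengths li = ceil(-logM(Pi)), the Kraft sum, the entropy in M-ary units,
    # and the resulting average code word length
    lengths = [ceil(-log2(p) / log2(M)) for p in probs]
    kraft = sum(M ** (-l) for l in lengths)
    H = -sum(p * log2(p) / log2(M) for p in probs)
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    return kraft, H, avg_len

kraft, H, avg_len = shannon_lengths_check([0.62, 0.23, 0.10, 0.02, 0.02, 0.01])
print(kraft <= 1, H <= avg_len < H + 1)   # True True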
Problem 3.
Source word probabilities 0.62, 0.23, 0.10, 0.02, 0.02, 0.01.
a) binary Huffman code, M = 2, code symbols 0 and 1:
0.62 / 0.62 / 0.62 / 0.62 / 0.62 / 1.0
0.23 / 0.23 / 0.23 / 0.23 / 0.38
0.10 / 0.10 / 0.10 / 0.15
0.02 / 0.03 / 0.05
0.02 / 0.02
0.01
So the code words are: w1 = 0, w2 = 10, w3 = 110, w4 = 1111, w5 = 11100, w6 = 11101. The average code length is then n̄ = 0.62*1 + 0.23*2 + 0.10*3 + 0.02*4 + 0.02*5 + 0.01*5 = 1.61 bits. The source word entropy is in turn H(X) ≈ 1.54 bits < n̄ (of course).
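The codeword lengths (and hence the average length) can also be verified programmatically. The sketch below builds the binary Huffman code by repeatedly merging the two least probable nodes and tracks only the resulting codeword lengths (the actual 0/1 labelling of the branches is arbitrary):

import heapq

def huffman_lengths(probs):
    # Each heap entry: (subtree probability, unique id, indices of the leaves in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            lengths[leaf] += 1          # every merge adds one code symbol to all leaves below
        heapq.heappush(heap, (p1 + p2, next_id, leaves1 + leaves2))
        next_id += 1
    return lengths

probs = [0.62, 0.23, 0.10, 0.02, 0.02, 0.01]
lengths = huffman_lengths(probs)
print(sorted(lengths))                              # [1, 2, 3, 4, 5, 5]
print(sum(p * l for p, l in zip(probs, lengths)))   # ≈ 1.61 bits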
b) ternary Huffman code, M = 3: with six source words, only s = 2 symbols are combined in the first merging step; code symbols a, b and c (as an example):
0.62 / 0.62 / 0.62 / 1.0
0.23 / 0.23 / 0.23
0.10 / 0.10 / 0.15
0.02 / 0.03
0.02 / 0.02
0.01
So the code words are: w1 = c, w2 = b, w3 = ac, w4 = aa, w5 = abb, w6 = aba. The average code length is now n̄ = 0.62*1 + 0.23*1 + 0.10*2 + 0.02*2 + 0.02*3 + 0.01*3 = 1.18 ternary units. The source word entropy is H(X) ≈ 0.97 ternary units < n̄ (of course; notice the base-3 logarithm due to the ternary code symbols).
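Again, a quick check of the numbers in b), with the probabilities and codeword lengths listed above:

from math import log

probs = [0.62, 0.23, 0.10, 0.02, 0.02, 0.01]
lengths = [1, 1, 2, 2, 3, 3]                      # lengths of c, b, ac, aa, abb, aba

avg_len = sum(p * l for p, l in zip(probs, lengths))
H3 = -sum(p * log(p, 3) for p in probs)           # entropy in ternary units (base-3 logarithm)

print(round(avg_len, 2), round(H3, 2))            # 1.18 0.97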
Problem 4.
The following discrete memoryless channel is considered:
Let's denote the input probabilities as PX(x1) = q and PX(x2) = 1 – q (why?). Then, by definition, we can write the conditional entropy of Y given X as
H(Y|X) = – Σj PX(xj) Σi PY|X(yi|xj) log2 PY|X(yi|xj),
which is evaluated using the conditional probabilities given in the figure!
Then, the actual mutual information is given by
I(X,Y) = H(Y) – H(Y|X).
In general, in order to evaluate H(Y), we need to find the probabilities of the outputs yi. According to the total probability formula, we can first write, for each output yi,
PY(yi) = Σj PY|X(yi|xj) PX(xj),
with the conditional probabilities again taken from the channel model in the figure.
Then, the entropy H(Y) is given by
H(Y) = – Σi PY(yi) log2 PY(yi).
This can be simplified further to
Then we can substitute this into the expression for I(X,Y) which yields
The actual channel capacity CS is in turn obtained when the above mutual information I(X,Y) is maximized over the input distribution. So we should choose the value of q in such a way that the expression for I(X,Y) above is maximized. In this case, the maximization is trivial: what is left is the binary entropy function of q, which is maximized when q = 1/2, and substituting q = 1/2 then gives the channel capacity CS.
To once more summarize the main results (for this problem):
· the conditional entropy H(Y|X),
· the mutual information I(X,Y), and
· the channel capacity CS obtained above (e.g., if p = 0 => CS = 1 and if p = 1 => CS = 0, which conforms with our intuition; see the channel model).
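The channel figure is not reproduced here, so purely as an illustrative sketch, the snippet below assumes a binary erasure channel with erasure probability p (an assumption only, but consistent with the limiting cases CS = 1 for p = 0 and CS = 0 for p = 1 quoted above) and maximizes I(X,Y) over q by brute force:

from math import log2

def mutual_information(q, p):
    # Assumed channel (illustration only): binary erasure channel, erasure probability p,
    # inputs x1/x2 with PX(x1) = q, outputs y1, y2 and the erasure symbol y3
    py = [q * (1 - p), (1 - q) * (1 - p), p]                 # total probability formula
    hy = -sum(v * log2(v) for v in py if v > 0)              # H(Y)
    hyx = -(p * log2(p) + (1 - p) * log2(1 - p)) if 0 < p < 1 else 0.0   # H(Y|X), independent of q
    return hy - hyx                                          # I(X,Y) = H(Y) - H(Y|X)

p = 0.3
qs = [k / 1000 for k in range(1, 1000)]
best_q = max(qs, key=lambda q: mutual_information(q, p))
print(best_q, round(mutual_information(best_q, p), 2))       # 0.5 0.7, i.e. CS = 1 - p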
Problem 5.
In order to determine the channel capacity, we usually find an expression for the average mutual information and maximize it with respect to the source probabilities. However, there is an easier way to determine the channel capacity if the channel has certain symmetry (this usually makes the actual calculations much simpler to carry out!). To check the symmetry condition, it is convenient to write the conditional probabilities into a conditional probability matrix P = {pij} = {PY|X(yi|xj)}. Then, if each row (column) of P is just a permutation of any other row (column), equally probable input symbols will maximize the mutual information! "Physically" this means that under such a symmetry condition, H(Y|X) turns out to be independent of the source probabilities, and equally probable input symbols also make the output values equally probable, so that H(Y) is clearly maximized.
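As an illustration of this symmetry check (the matrix below is a made-up example, not the one of this problem):

def is_symmetric_channel(P):
    # P[i][j] = PY|X(yi|xj): rows correspond to outputs, columns to inputs
    rows = [sorted(row) for row in P]
    cols = [sorted(col) for col in zip(*P)]
    # every row a permutation of every other row, and likewise for the columns
    return all(r == rows[0] for r in rows) and all(c == cols[0] for c in cols)

# Hypothetical example: binary symmetric channel with crossover probability 0.1
P_example = [[0.9, 0.1],
             [0.1, 0.9]]
print(is_symmetric_channel(P_example))   # True -> equiprobable inputs achieve capacity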
For our channel, the conditional probability matrix is given by
and the symmetry conditions are fulfilled. As a consequence, PX(x1) = PX(x2) = 1/2 will maximize the average mutual information.
The output symbol probabilities are then (the total probability formula again; it is actually quite an important result from basic probability theory)
PY(yi) = Σj PY|X(yi|xj) PX(xj) = (1/2) Σj PY|X(yi|xj)
for all i. So the mutual information (and, in this case, the capacity) is then given by CS = I(X,Y) = H(Y) – H(Y|X), evaluated with these equiprobable inputs.
In this example, the capacity is quite small (when compared to the "maximum capacity" of 1 bit / channel use that could in general be achieved with binary inputs). Can you explain that intuitively?
In the figure below, the capacity derived above is depicted as a function of p1. Notice that p1 + p2 = 0.5, i.e., p2 = 0.5 – p1 (why?), so there really is only one "free" variable describing the channel quality in this case.
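Since neither the transition matrix nor the figure is reproduced here, the sketch below uses a hypothetical binary-input channel whose rows are permutations of (p1, p2) with p2 = 0.5 – p1 (so that each column sums to one), simply to illustrate how such a capacity curve can be computed numerically with equiprobable inputs:

from math import log2

def entropy(dist):
    return -sum(v * log2(v) for v in dist if v > 0)

def capacity_equiprobable(P):
    # P[i][j] = PY|X(yi|xj); under the symmetry condition PX(x1) = PX(x2) = 1/2 is optimal
    py = [(row[0] + row[1]) / 2 for row in P]        # total probability with equiprobable inputs
    h_y_given_x = entropy([row[0] for row in P])     # H(Y|X), the same for both inputs
    return entropy(py) - h_y_given_x

# Hypothetical channel: four outputs, every row a permutation of (p1, p2), p2 = 0.5 - p1
for p1 in (0.05, 0.15, 0.25, 0.35, 0.45):
    p2 = 0.5 - p1
    P = [[p1, p2], [p2, p1], [p1, p2], [p2, p1]]
    print(p1, round(capacity_equiprobable(P), 3))    # capacity drops to 0 at p1 = p2 = 0.25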