Chapter Four: Zero-Error Data Compression. In this chapter, we give an answer to the question: why does the entropy of X measure the amount of information in a random variable X? We take a first look at data compression by studying a class of codes called prefix codes, and we will discuss how to construct optimal prefix codes called Huffman codes. In Section 4.1, we discuss why entropy is a fundamental bound for data compression.

Definition 4.1. A D-ary source code C for a source random variable X is a mapping from the alphabet of X to D*, the set of all finite-length sequences of symbols taken from a D-ary code alphabet. In other words, a source code C maps an outcome of the source random variable X to a D-ary sequence of finite length.

Definition 4.2. A code is uniquely decodable if for any finite source sequence, the sequence of code symbols corresponding to this source sequence is different from the sequence of code symbols corresponding to any other finite source sequence. In other words, a uniquely decodable code does not map two different finite source sequences to the same code sequence.

Here is an example. Let the source alphabet be A, B, C, and D, and consider the binary code defined as follows: the source symbol A is mapped to 0, B is mapped to 1, C is mapped to 01, and D is mapped to 10. Then AAD is mapped to 0010, because A maps to 0, A maps to 0, and D maps to 10. Likewise, ACA is mapped to 0010, and AABA is also mapped to 0010. Therefore, this code is not uniquely decodable, because it maps three different source sequences to the same code sequence. In other words, from the code sequence it may not be possible to recover the source sequence; a short sketch after Example 4.5 below confirms this collision.

Theorem 4.4 is the Kraft Inequality, which is a fundamental characterization of uniquely decodable codes. Let C be a D-ary source code, and let l_1, l_2, up to l_m be the lengths of the codewords. If C is uniquely decodable, then the summation of D to the power minus l_k, for k from 1 up to m, is less than or equal to 1.

Example 4.5. In this example, we illustrate a technique to be used in the proof of Theorem 4.4. Let l_1 and l_2 be 1, and l_3 and l_4 be 2. These are the lengths of the codewords in C. Now consider the polynomial given by the summation of 2 to the power minus l_k for k from 1 up to 4 (here D is equal to 2 and there are four codewords), that is, 2 to the power minus 1 plus 2 to the power minus 1 plus 2 to the power minus 2 plus 2 to the power minus 2. We consider raising this polynomial to the N-th power. For N equals 2, we multiply the polynomial by itself and obtain 4 times 2 to the power minus 2, plus 8 times 2 to the power minus 3, plus 4 times 2 to the power minus 4, where the coefficient 4 of 2 to the power minus 2 comes from the four ways of multiplying one of the two 2 to the power minus 1 terms by one of the two 2 to the power minus 1 terms. We call these coefficients A_2, A_3, and A_4, respectively; namely, A_2 is equal to 4, A_3 is equal to 8, and A_4 is equal to 4. Then A_2 equals 4 is the total number of sequences of 2 codewords with a total length of 2 code symbols. Specifically, the 4 sequences are 00 for the source sequence AA, 01 for the source sequence AB, 10 for the source sequence BA, and 11 for the source sequence BB. This is so because when we multiply 2 to the power minus 1 by 2 to the power minus 1, we are concatenating a codeword of length 1 with a codeword of length 1 to form a code sequence of length 2. Similarly, A_3 equals 8 and A_4 equals 4 are the total numbers of sequences of 2 codewords with a total length of 3 and 4 code symbols, respectively. As an exercise, verify that A_3 is equal to 8 and list the 8 sequences of 2 codewords with a total length of 3 code symbols.
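Before going on to the proof, here is the computational check of the earlier example promised above. This is a minimal Python sketch: the code table is the one given in the example, while the helper function encode and the rest of the script are purely illustrative.

    # The binary code from the example: not uniquely decodable.
    code = {"A": "0", "B": "1", "C": "01", "D": "10"}

    def encode(source_sequence):
        # Concatenate the codewords of the source symbols.
        return "".join(code[symbol] for symbol in source_sequence)

    for s in ["AAD", "ACA", "AABA"]:
        print(s, "->", encode(s))
    # All three source sequences are mapped to 0010, so the code
    # sequence 0010 cannot be traced back to a unique source sequence.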
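The coefficients in Example 4.5 can be checked in the same spirit by brute force: enumerate all ordered pairs of codeword lengths and count how many pairs give each total length. The sketch below does this; the lengths are the ones from the example, and the choice of N is only for illustration.

    from itertools import product
    from collections import Counter

    lengths = [1, 1, 2, 2]   # codeword lengths from Example 4.5
    N = 2                    # number of codewords in a sequence

    # A[i] = number of sequences of N codewords with total length i.
    A = Counter(sum(combo) for combo in product(lengths, repeat=N))
    print(dict(A))           # gives {2: 4, 3: 8, 4: 4}

    # Consistency check: sum_i A_i * 2^(-i) equals the polynomial
    # (2^-1 + 2^-1 + 2^-2 + 2^-2) raised to the N-th power.
    poly = sum(2.0 ** -l for l in lengths)
    print(sum(a * 2.0 ** -i for i, a in A.items()), poly ** N)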
We now prove the Kraft Inequality. Without loss of generality, assume that the codeword lengths are indexed in ascending order. Let N be an arbitrary positive integer; keep in mind that N is the length of the source sequence. Consider the polynomial given by the summation of D to the power minus l_k, for k from 1 up to m, raised to the N-th power. Writing this out, it becomes the summation over k_1 from 1 up to m, the summation over k_2 from 1 up to m, all the way to the summation over k_N from 1 up to m, where k_1, k_2, up to k_N are dummy variables, of D to the power minus the quantity l_{k_1} plus l_{k_2} plus all the way to l_{k_N}. By collecting terms of the same degree, we write this as the summation over i from 1 up to N times l_m of A_i times D to the power minus i, where l_m is the largest codeword length and A_i is the coefficient of D to the power minus i in the expansion.

Now observe that A_i gives the total number of sequences of N codewords with a total length of i code symbols; this has been discussed in Example 4.5 a moment ago. Since the code is uniquely decodable, these code sequences must be distinct, and therefore A_i is less than or equal to D to the power i, because there are only D to the power i distinct sequences of i code symbols. Substituting this bound into the expansion, we have the N-th power of the polynomial less than or equal to the summation over i from 1 up to N times l_m of D to the power i times D to the power minus i. Each term in this summation is equal to 1, so the summation is equal to N times l_m. Taking the N-th root on both sides, we obtain the polynomial less than or equal to the quantity N times l_m raised to the power 1 over N. Since this inequality holds for any N, recall that N is the length of the source sequence, upon letting N go to infinity, the right hand side tends to 1, and we obtain the Kraft Inequality.

We now discuss the expected length of a code. Let the distribution of the source random variable X be p_1, p_2, up to p_m. The expected length of a code C is denoted by L and is defined as the summation over all i of p_i times l_i. Intuitively, for a uniquely decodable code C, the entropy of X with the base of the logarithm taken to be D is at most L, because the random variable X can be recovered from its codeword, and each D-ary code symbol can carry at most 1 D-it of information. This will be made precise in the next theorem.

Theorem 4.6 is the Entropy Bound. Let C be a D-ary uniquely decodable code for a source random variable X with entropy H_D of X. Then the expected length of C is lower bounded by H_D of X; that is, L is greater than or equal to H_D of X. This lower bound is tight if and only if l_i is equal to minus log base D of p_i for all i.
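Before the proof, here is a small numerical illustration of the theorem. The distribution and the codeword lengths below are made up for the purpose of the sketch (the lengths correspond to the binary prefix code A to 0, B to 10, C to 110, D to 111); the point is only to see the quantities L and H_D of X side by side.

    from math import log2

    # Illustrative source distribution and binary codeword lengths.
    p = [0.5, 0.25, 0.125, 0.125]
    l = [1, 2, 3, 3]

    kraft_sum = sum(2.0 ** -lk for lk in l)
    L = sum(pi * li for pi, li in zip(p, l))
    H = -sum(pi * log2(pi) for pi in p)

    print("Kraft sum =", kraft_sum)   # 1.0, so the lengths are admissible
    print("expected length L =", L)   # 1.75
    print("entropy H_2(X) =", H)      # 1.75, so L >= H_2(X), with equality
    # Equality holds here because l_i = -log2(p_i) for every i,
    # which is exactly the tightness condition in Theorem 4.6.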
Let us now prove the Entropy Bound. First, since C is uniquely decodable, the lengths of its codewords satisfy the Kraft Inequality. Write L as the summation over i of p_i times l_i, and write l_i as log base D of D to the power l_i; taking the logarithm of D to the power l_i in the base D gives back l_i. Recall that H_D of X is equal to minus the summation over i of p_i times log base D of p_i. Then L minus H_D of X is equal to the summation over i of p_i times the quantity log base D of p_i plus log base D of D to the power l_i.

Now log base D of p_i plus log base D of D to the power l_i becomes log base D of p_i times D to the power l_i. Change the logarithm in the base D to the natural logarithm by multiplying by the quantity natural log of D to the power minus 1. By the Fundamental Inequality, natural log of a is greater than or equal to 1 minus 1 over a; with a equal to p_i times D to the power l_i, we have natural log of p_i times D to the power l_i greater than or equal to 1 minus 1 over p_i times D to the power l_i. Multiplying by p_i, the factor p_i in the second term cancels, and we get p_i minus D to the power minus l_i. Then we can write the summation as the summation over i of p_i minus the summation over i of D to the power minus l_i, where the summation over i of p_i is equal to 1, and the summation over i of D to the power minus l_i, by the Kraft Inequality, is less than or equal to 1. So we obtain L minus H_D of X greater than or equal to the quantity natural log of D to the power minus 1 times the quantity 1 minus 1, which is equal to 0. This proves the Entropy Bound.

In order for this bound to be tight, the inequality coming from the Fundamental Inequality and the inequality coming from the Kraft Inequality have to be tight simultaneously. The former is tight if and only if p_i times D to the power l_i is equal to 1, or equivalently l_i is equal to minus log base D of p_i, for all i. If this holds, we have the summation over i of D to the power minus l_i equal to the summation over i of D to the power log base D of p_i, because l_i is equal to minus log base D of p_i; since D to the power log base D of p_i is equal to p_i, this is the summation over i of p_i, which is equal to 1. That is, the inequality coming from the Kraft Inequality is also tight. This completes the proof of the theorem.

As a corollary of the Entropy Bound, we now give an alternative proof of Theorem 2.43, which says that the entropy of X is less than or equal to the log of the size of the alphabet of X. Let the alphabet of X be 0, 1, up to the size of the alphabet minus 1, and let C be the identity code, that is, the code that maps a source symbol to itself. Evidently, C is a uniquely decodable code whose code alphabet has the same size as the alphabet of X, and its expected length is equal to 1. By the Entropy Bound, we have L, which is equal to 1, greater than or equal to the entropy of X with the base of the logarithm taken to be the size of the alphabet of X. Leaving the base of the logarithm unspecified, we have the entropy of X less than or equal to the log of the size of the alphabet, recovering Theorem 2.43.

We end this section with a definition. The redundancy R of a D-ary uniquely decodable code is the difference between the expected length of the code and the entropy of the source. By the Entropy Bound, we immediately have R equals L minus H_D of X greater than or equal to 0; that is, the redundancy of a code is always non-negative.
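To close, here is one more small sketch. The distribution and the codeword lengths are again made up for illustration: the lengths satisfy the Kraft Inequality but are not matched to the distribution, so the redundancy comes out strictly positive, and the entropy stays below the log of the alphabet size as in the corollary.

    from math import log2

    # Illustrative distribution and admissible binary codeword lengths.
    p = [0.7, 0.15, 0.1, 0.05]
    l = [1, 2, 3, 3]          # Kraft sum: 0.5 + 0.25 + 0.125 + 0.125 = 1

    L = sum(pi * li for pi, li in zip(p, l))
    H = -sum(pi * log2(pi) for pi in p)
    R = L - H
    print("L =", L, "H_2(X) =", H, "R =", R)   # R is about 0.13, and R >= 0

    # Corollary: the entropy is at most the log of the alphabet size,
    # here H_2(X) <= log2(4) = 2 bits.
    print(H <= log2(len(p)))                   # True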