Introduction
There is a surprising result in Imre Csiszár's Coding Theorems for Discrete Memoryless Systems which states the following: for any discrete memoryless channel, as long as the codewords all belong to some type class, a bound on the rate of the code can be established that depends only on the type. This result is significant for a variety of research areas, and here I attempt to demonstrate it by simplifying the proofs in the book. Note that a basic understanding of information theory is required.
Definitions
For any discrete set $\mathcal{X}$, and for $x_i \in \mathcal{X}$, a sequence of length $n$ is defined as $\mathbf{x} = (x_1, \dots, x_n) \in \mathcal{X}^n$. The type $P_{\mathbf{x}}$, which is the empirical distribution of a sequence $\mathbf{x}$, is defined as $P_{\mathbf{x}}(a) = \frac{1}{n}\left|\{i : x_i = a\}\right|$ for every $a \in \mathcal{X}$. For any type $P$, the type class $T_P^n$ is defined as the set of all sequences $\mathbf{x} \in \mathcal{X}^n$ such that $P_{\mathbf{x}} = P$. $I(P, W)$ is the mutual information between $X$ and $Y$ when $X \sim P$ and $Y \mid X \sim W$, where $W$ is a conditional distribution on $Y$ given $X$. A message is defined as $m \in \mathcal{M}$, with the cardinality of the message set denoted $M = |\mathcal{M}|$. Note that there is an inconsistency in the way I define the message: unlike $\mathbf{x}$, exceptionally, a lowercase unbold letter $m$ is used to represent a message, and the set $\mathcal{M}$ is used to represent the message set of the length-$n$ block code. Notice that we use base 2 for $\log$ in this blog.
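To make these definitions concrete, here is a minimal Python sketch (my own illustration, not from the book) that computes the type of a sequence and tests membership in a type class:

```python
from collections import Counter
from fractions import Fraction

def type_of(seq):
    """Empirical distribution (type) of a sequence, with exact rational entries."""
    n = len(seq)
    return {a: Fraction(c, n) for a, c in Counter(seq).items()}

def in_type_class(seq, P):
    """True iff the sequence belongs to the type class T_P^n."""
    t = type_of(seq)
    alphabet = set(t) | set(P)
    return all(t.get(a, Fraction(0)) == P.get(a, Fraction(0)) for a in alphabet)

x = (0, 1, 1, 0, 1, 1, 0, 1)                  # n = 8
P = {0: Fraction(3, 8), 1: Fraction(5, 8)}
print(type_of(x))           # {0: Fraction(3, 8), 1: Fraction(5, 8)}
print(in_type_class(x, P))  # True
```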
Rate of Constant Composition Code
We attempt to show the following:
For arbitrary $0 < \epsilon < 1$ and $\delta > 0$, if $(f, \varphi)$ is a maximal $(n, \epsilon)$-code for some discrete memoryless channel $W$ such that all codewords belong to $T_P^n$, then for $n$ sufficiently large the rate of the code is bounded as:

$$\frac{1}{n} \log M \geq I(P, W) - \delta.$$
The outline of the proof is the following: fixing some type $P$ such that all codewords belong to the type class, we consider a maximal code and find a lower bound on the cardinality of the message set by considering the conditional typicality of the received sequences. Thus a bound on the rate is obtained as a function of the mutual information determined by the type $P$.
Proof:
Let $\mathcal{X}$ and $\mathcal{Y}$ be two finite input and output alphabets. Define a stochastic matrix $W : \mathcal{X} \to \mathcal{Y}$, whose entries $W(y|x)$ should be interpreted as the conditional probability of every state of $\mathcal{Y}$ given every state of $\mathcal{X}$. Thus, a single channel use can be written as $P(Y = y \mid X = x) = W(y|x)$. Given a channel $W$, a code for the channel can be uniquely specified by an encoder/decoder pair $(f, \varphi)$, with $f : \mathcal{M} \to \mathcal{X}^n$ and $\varphi : \mathcal{Y}^n \to \mathcal{M}$, such that we can define the equivalent channel as:

$$T(m' \mid m) = W^n\big(\varphi^{-1}(m') \,\big|\, f(m)\big).$$
The probability of erroneous decoding of message $m$ can now be defined as:

$$e_m = W^n\big(\mathcal{Y}^n \setminus \varphi^{-1}(m) \,\big|\, f(m)\big) = 1 - T(m \mid m). \tag{1}$$
The channel coding problem is essentially the task of making $M$ as large as possible while keeping the maximum probability of error as low as possible. We will focus solely on the discrete memoryless channel, defined as

$$W^n(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} W(y_i \mid x_i).$$

Thus, an $n$-length block code with maximum probability of error bounded as $\max_m e_m \leq \epsilon$ will be called an $(n, \epsilon)$-code.
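As a quick sanity check on this product form, the following sketch (the channel and sequences are my own toy choices) represents $W$ as a stochastic matrix and evaluates $W^n(\mathbf{y} \mid \mathbf{x})$ letter by letter:

```python
import numpy as np

# Example W: a binary symmetric channel with crossover probability 0.1.
# Rows are inputs, columns are outputs; each row sums to 1.
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])

def Wn(y, x, W):
    """Memoryless channel: W^n(y|x) = prod_i W(y_i|x_i)."""
    return float(np.prod([W[xi, yi] for xi, yi in zip(x, y)]))

x = (0, 1, 1, 0)
y = (0, 1, 0, 0)
print(Wn(y, x, W))  # 0.9 * 0.9 * 0.1 * 0.9 = 0.0729
```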
A set $B \subseteq \mathcal{Y}^n$ is an $\eta$-image ($0 < \eta \leq 1$) of a set $A \subseteq \mathcal{X}^n$ over $W$ if $W^n(B \mid \mathbf{x}) \geq \eta$ for every $\mathbf{x} \in A$, which is to say, for every element in $A$, the probability that the channel output falls in $B$ is at least $\eta$ conditioned on every possible $\mathbf{x} \in A$. Define the minimum cardinality of such an $\eta$-image as:

$$g_W(A, \eta) = \min\big\{ |B| : B \text{ is an } \eta\text{-image of } A \big\}.$$
Let $(f, \varphi)$ be an $(n, \epsilon)$-code satisfying $f(m) \in T_P^n$ for every $m \in \mathcal{M}$ (all codewords generated by the encoder belong to the same type class of type $P$) and $\varphi^{-1}(m) \subseteq T_{[W]}(f(m))$ (the set of outputs decoded into $m$ belongs to the conditionally typical set given $f(m)$, i.e., the set of output sequences whose conditional type given $f(m)$ is within $\delta$ of $W$).
Thus, we can define $B \triangleq \bigcup_{m \in \mathcal{M}} \varphi^{-1}(m)$, which is to say the set $B$ is the union of all possible received sequences such that the decoding outcome is some message in $\mathcal{M}$. In other words, using a graphical representation, we say that $B$ is a union of Hamming spheres around all possible codewords. Now we further restrict $(f, \varphi)$ to be a maximal code, meaning that no extension of the code is possible. With this definition, note that if there were some $\mathbf{x} \in T_P^n$ with $W^n(T_{[W]}(\mathbf{x}) \setminus B \mid \mathbf{x}) \geq 1 - \epsilon$, then the code would no longer be maximal, as essentially we could add a new message with codeword $\mathbf{x}$ whose Hamming sphere exists outside of $B$. Thus, it is necessary for a maximal code that

$$W^n\big(T_{[W]}(\mathbf{x}) \setminus B \,\big|\, \mathbf{x}\big) < 1 - \epsilon \quad \text{for every } \mathbf{x} \in T_P^n.$$
Since $W^n(T_{[W]}(\mathbf{x}) \mid \mathbf{x}) \geq 1 - \epsilon'$ (with $\epsilon' < \epsilon$) provided $n \geq n_0(\epsilon', \delta)$ for all $\mathbf{x} \in T_P^n$ (which is just a fancy way of saying $n$ is sufficiently large depending on $\epsilon'$ and $\delta$), it follows by the definition of typicality and the maximality condition that

$$W^n(B \mid \mathbf{x}) \geq W^n\big(T_{[W]}(\mathbf{x}) \cap B \,\big|\, \mathbf{x}\big) > (1 - \epsilon') - (1 - \epsilon) = \epsilon - \epsilon' \triangleq \eta,$$

and therefore by definition $B$ is an $\eta$-image of $T_P^n$. On the other hand, by the asymptotic equipartition property applied to the conditional type, we have $|T_{[W]}(\mathbf{x})| \leq 2^{n(H(W|P) + \delta)}$, where $H(W|P) = \sum_a P(a) H(W(\cdot|a))$. The following two results are thus immediate:

$$g_W(T_P^n, \eta) \leq |B|, \qquad |B| \leq \sum_{m \in \mathcal{M}} \big|T_{[W]}(f(m))\big| \leq M \cdot 2^{n(H(W|P) + \delta)}.$$
Now, let $A$ be an $\eta$-image of $T_P^n$ for which $|A| = g_W(T_P^n, \eta)$. Since $A$ is an $\eta$-image, we have $W^n(A \mid \mathbf{x}) \geq \eta$ for every $\mathbf{x} \in T_P^n$. We also know that if $\mathbf{X}$ is uniformly distributed on $T_P^n$, then $\Pr(\mathbf{X} = \mathbf{x}) \leq (n+1)^{|\mathcal{X}|} P^n(\mathbf{x})$, and we can apply this result to average the $\eta$-image property over $T_P^n$ and conclude $(PW)^n(A) \geq (n+1)^{-|\mathcal{X}|}\eta$. Since every $\mathbf{y} \in T^n_{[PW]}$ satisfies $(PW)^n(\mathbf{y}) \leq 2^{-n(H(PW) - \delta)}$, it follows that

$$|A| \geq \Big((PW)^n(A) - (PW)^n\big((T^n_{[PW]})^c\big)\Big)\, 2^{n(H(PW) - \delta)} \geq \kappa_n\, 2^{n(H(PW) - \delta)},$$

where $\kappa_n = (n+1)^{-|\mathcal{X}|}\eta - (PW)^n\big((T^n_{[PW]})^c\big)$. Since $\kappa_n$ decays only polynomially in $n$ (the second term vanishes), we have for $n$ sufficiently large

$$g_W(T_P^n, \eta) = |A| \geq 2^{n(H(PW) - 2\delta)}.$$
Finally, note that since

$$2^{n(H(PW) - 2\delta)} \leq g_W(T_P^n, \eta) \leq |B| \leq M \cdot 2^{n(H(W|P) + \delta)}$$

and $I(P, W) = H(PW) - H(W|P)$, we finally arrive at the result

$$\frac{1}{n} \log M \geq I(P, W) - 3\delta,$$

where note that the left-hand side of the inequality is the rate; rescaling $\delta$ gives the claimed bound.
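To put a number on this bound, here is a small sketch (the channel and type are toy choices of mine) that evaluates $I(P, W) = H(PW) - H(W|P)$:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def I(P, W):
    """Mutual information I(P, W) = H(PW) - H(W|P) in bits."""
    P = np.asarray(P, dtype=float)
    HWgP = float(np.sum(P * np.array([H(row) for row in W])))  # H(W|P)
    return H(P @ W) - HWgP

P = np.array([0.375, 0.625])               # the type from the earlier example
W = np.array([[0.9, 0.1], [0.1, 0.9]])     # BSC(0.1)
print(I(P, W))  # ~0.50 bits: the rate of a maximal code on T_P^n is close to this
```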
Error Exponent of Constant Composition Code
We know that for rates in the region described by the above result, the error tends to zero for sufficiently large $n$; however, it can be shown that even the speed of this convergence depends only on the type. We provide both lower and upper bounds on the error, and hence on the error exponent. Note that an upper bound on the error is a lower bound on the error exponent, since the exponent involves a negative sign.
Lower Bound on the Error
The following statement is made in Csiszár's Coding Theorems for Discrete Memoryless Systems. For every $R > 0$, $\delta > 0$, and every type $P$ of sequences in $\mathcal{X}^n$, every $n$-length block code of rate

$$\frac{1}{n} \log M \geq R$$

such that all codewords are of type $P$ has error bounded, for $n$ sufficiently large, as:

$$e \geq 2^{-n(E_{sp}(R, P) + \delta)}, \quad \text{where} \quad E_{sp}(R, P) = \min_{V : I(P, V) \leq R} D(V\|W|P).$$

Here, $E_{sp}(R, P)$ is known as the sphere packing error exponent of the channel $W$ with input distribution of type $P$, $D(V\|W|P) = \sum_a P(a) D(V(\cdot|a)\|W(\cdot|a))$ is the conditional divergence, and $V$ ranges over all channels $\mathcal{X} \to \mathcal{Y}$.
In my opinion, Csiszár's proof is rather obscure, and therefore I will restate the main results using Gallager's proof.
Assume we have a fixed composition code of composition $P$, and for a noise composition (conditional type) $V$, define the typical set $T_V(\mathbf{x})$ to be the shell of outputs whose conditional type given $\mathbf{x}$ is exactly $V$. In other words,

$$T_V(\mathbf{x}) = \big\{\mathbf{y} \in \mathcal{Y}^n : P_{\mathbf{x}\mathbf{y}}(a, b) = P(a)\, V(b|a) \text{ for all } (a, b)\big\}$$

for $\mathbf{x} \in T_P^n$.
for . The probability of decoding error for the -th codeword in a code is defined using maximum likelihood decoder as
We know that a code can be expurgated into a code of size $M/2$ by removing the worse-performing half of the codewords, with a resulting maximum probability of error at most twice the average error of the original code. Hence, the distinction between using the maximum or the average error probability is immaterial on the exponential scale. If we define $P_{\mathbf{x}\mathbf{y}}$ to be the joint type of $(\mathbf{x}, \mathbf{y})$, then we have

$$P_{\mathbf{x}\mathbf{y}}(a, b) = P_{\mathbf{x}}(a)\, V(b|a),$$

where $V$ is the conditional type of $\mathbf{y}$ given $\mathbf{x}$.
For a given input $\mathbf{x}$ of type $P$, every output $\mathbf{y}$ has some conditional type $V$ such that $\mathbf{y} \in T_V(\mathbf{x})$. Now assume that $\mathbf{y} \in T_V(\mathbf{x})$ and $\mathbf{y}' \in T_V(\mathbf{x})$; the permutation of coordinates that maps $\mathbf{y}$ into $\mathbf{y}'$ can be chosen so that it permutes $\mathbf{x}$ into itself, and since $W^n$ is invariant under such a joint permutation, it can be seen that $W^n(\mathbf{y} \mid \mathbf{x}) = W^n(\mathbf{y}' \mid \mathbf{x})$. Hence we can conclude

$$W^n(\mathbf{y} \mid \mathbf{x}) = W^n(\mathbf{y}' \mid \mathbf{x}) \quad \text{whenever } \mathbf{y}, \mathbf{y}' \in T_V(\mathbf{x}).$$
Note that what matters here is the conditional type: the equality is not true when $\mathbf{y}$ and $\mathbf{y}'$ merely share the same marginal type, because $\mathbf{x}$ and $\mathbf{y}$ are correlated through the channel. The immediate result is

$$(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\, 2^{nH(V|P)} \leq |T_V(\mathbf{x})| \leq 2^{nH(V|P)},$$

which isn't anything new. But this gives us an operational interpretation of the elements of $T_V(\mathbf{x})$ as the set of output sequences arising from $\mathbf{x}$ under a given noise composition $V$.
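Both facts are easy to check numerically. The sketch below (toy channel and input, my own illustration) enumerates all outputs for a fixed $\mathbf{x}$, groups them into $V$-shells by their joint type with $\mathbf{x}$, and verifies that every $\mathbf{y}$ in a shell is equiprobable given $\mathbf{x}$:

```python
import numpy as np
from collections import defaultdict
from itertools import product

W = np.array([[0.9, 0.1], [0.1, 0.9]])
x = (0, 0, 1, 1)   # an input of type (1/2, 1/2)

def joint_counts(x, y):
    """Counts of (x_i, y_i) pairs; these determine the conditional type V."""
    counts = defaultdict(int)
    for xi, yi in zip(x, y):
        counts[(xi, yi)] += 1
    return tuple(sorted(counts.items()))

shells = defaultdict(list)
for y in product([0, 1], repeat=len(x)):
    p = float(np.prod([W[xi, yi] for xi, yi in zip(x, y)]))
    shells[joint_counts(x, y)].append(p)

for shell, probs in shells.items():
    # Within each V-shell every output has the same conditional probability,
    # so W^n(T_V(x)|x) is just |T_V(x)| times that common value.
    assert np.isclose(min(probs), max(probs))
    print(len(probs), probs[0])
```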
Now let's introduce the discrete memoryless channel defined by some transition matrix $W$; the following relation can be established for $\mathbf{y} \in T_V(\mathbf{x})$:

$$P(\mathbf{y} \mid \mathbf{x}, W) = \prod_{a, b} W(b|a)^{n P(a) V(b|a)} = 2^{-n(D(V\|W|P) + H(V|P))}. \tag{2}$$
Note that we use $P(\cdot)$ to mean a general probability, and the left-hand side of the above equation means the probability of receiving $\mathbf{y}$ given that $\mathbf{x}$ is transmitted over a channel with transition matrix $W$. Thus,

$$W^n\big(T_V(\mathbf{x}) \,\big|\, \mathbf{x}\big) = |T_V(\mathbf{x})| \cdot 2^{-n(D(V\|W|P) + H(V|P))} \geq (n+1)^{-|\mathcal{X}||\mathcal{Y}|}\, 2^{-nD(V\|W|P)}.$$
This can be equivalently written as:

$$W^n\big(T_V(\mathbf{x}) \,\big|\, \mathbf{x}\big) \geq 2^{-n(D(V\|W|P) + \delta_n)}, \tag{3}$$

where $\delta_n = |\mathcal{X}||\mathcal{Y}| \frac{\log(n+1)}{n} \to 0$, and we have dropped the subscripts on probabilities for clarity.
The next step of the argument requires a new definition: define the output composition $PV(b) = \sum_a P(a) V(b|a)$; note that this is different from the definition of the joint type, as the summation runs over the input alphabet. Its operational meaning is the composition of the output when the input composition is $P$ and the noise composition is $V$. Assume we have some noise composition $V$ for which $I(P, V) \leq R - \delta$ (it will become clear shortly why this assumption is made). Since the cardinality of the output composition class is $|T_{PV}| \leq 2^{nH(PV)}$, and there are a total of $M = 2^{nR}$ codewords, the average message has at most $2^{n(H(PV) - R)}$ corresponding outputs in $T_{PV}$. This is equivalent to saying that there exists an $m$ for which at most $2^{n(H(PV) - R)}$ outputs of $T_{PV}$ are decoded into $m$. Now it becomes clear why the assumption was made: since $T_V(\mathbf{x}_m) \subseteq T_{PV}$ and $|T_V(\mathbf{x}_m)| \geq (n+1)^{-|\mathcal{X}||\mathcal{Y}|} 2^{nH(V|P)}$, the assumption $I(P, V) \leq R - \delta$ gives $2^{n(H(PV) - R)} \leq 2^{n(H(V|P) - \delta)} \leq \frac{1}{2}|T_V(\mathbf{x}_m)|$ for $n$ sufficiently large, so there exists some $m$ such that half of $T_V(\mathbf{x}_m)$ is not decoded into $m$. Recall that the error is defined as the maximum over all codewords; since for this $m$ at least half of the shell (whose elements are all equiprobable given $\mathbf{x}_m$) decodes incorrectly, we have:

$$e \geq \frac{1}{2} W^n\big(T_V(\mathbf{x}_m) \,\big|\, \mathbf{x}_m\big). \tag{4}$$
The graphical understanding of this result is that if the output spheres are too large, they cannot be packed disjointly into the output composition class $T_{PV}$, which forces decoding errors.
This is the reason why this lower bound on the error is called the sphere packing bound, and here we present a bit of a digression to explore this interpretation. The information-theoretic understanding of this packing can be captured by the following formula:

$$\frac{|T_{PV}|}{|T_V(\mathbf{x})|} \approx \frac{2^{nH(PV)}}{2^{nH(V|P)}} = 2^{n(H(PV) - H(V|P))},$$

which counts how many disjoint spheres $T_V(\mathbf{x})$ can fit inside $T_{PV}$.
Now, the term in the exponent should be familiar if we expand the mutual information between $X$ and $Y$, given that they are of composition $P$ and $V$:

$$I(P, V) = H(PV) - H(V|P),$$

from which it should be obvious that

$$\frac{|T_{PV}|}{|T_V(\mathbf{x})|} \approx 2^{nI(P, V)}.$$

Thus the assumption we made earlier can be written as

$$M = 2^{nR} \geq 2^{n(I(P, V) + \delta)} \approx 2^{n\delta} \cdot \frac{|T_{PV}|}{|T_V(\mathbf{x})|},$$

that is, there are exponentially more codewords than there is room for disjoint spheres.
Now back to the main argument: combining equations (3) and (4), we see that for every noise composition $V$ with $I(P, V) \leq R - \delta$, the error is bounded as:

$$e \geq \frac{1}{2}\, 2^{-n(D(V\|W|P) + \delta_n)} \geq 2^{-n\left(E_{sp}(R - \delta,\, P) + \delta_n + \frac{1}{n}\right)},$$

where the sphere packing error exponent is

$$E_{sp}(R, P) = \min_{V : I(P, V) \leq R} D(V\|W|P).$$
The condition $I(P, V) \leq R$ seems counterintuitive, but note that $V$ is not the channel transition matrix: it is a hypothetical noise composition over which we optimize. It can be seen that if $R \geq I(P, W)$, then $E_{sp}(R, P)$ is zero, because we can choose $V = W$, which satisfies the constraint and gives $D(W\|W|P) = 0$. Thus $R < I(P, W)$ is required for the error to decay at all; violating this condition means the error never decreases, i.e., the exponent is zero.
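To get a feel for $E_{sp}$, here is a hedged numerical sketch for a binary symmetric channel (the channel, rates, and grid resolution are toy choices of mine); it brute-forces the minimization over $2 \times 2$ test channels $V$:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def I(P, V):
    """I(P, V) = H(PV) - H(V|P) in bits."""
    return H(P @ V) - float(np.sum(P * np.array([H(row) for row in V])))

def D_cond(V, W, P):
    """Conditional divergence D(V || W | P) = sum_a P(a) D(V(.|a) || W(.|a))."""
    return float(sum(P[a] * np.sum(V[a] * np.log2(V[a] / W[a]))
                     for a in range(len(P))))

def E_sp(R, P, W, grid=201):
    """Sphere packing exponent by grid search over test channels V."""
    best = np.inf
    for a in np.linspace(1e-4, 1 - 1e-4, grid):
        for b in np.linspace(1e-4, 1 - 1e-4, grid):
            V = np.array([[1 - a, a], [b, 1 - b]])
            if I(P, V) <= R:
                best = min(best, D_cond(V, W, P))
    return best

P = np.array([0.5, 0.5])
W = np.array([[0.9, 0.1], [0.1, 0.9]])
print(E_sp(0.3, P, W))  # positive: R = 0.3 is below I(P, W) ~ 0.53 bits
print(E_sp(0.6, P, W))  # ~0: above I(P, W) we may pick V = W, so the exponent vanishes
```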
Upper Bound on the Error
Next, we try to establish the upper bound on the error of constant composition codes, that is, a lower bound on the error exponent. Now we define a new typical set $T_{V'}(\mathbf{y})$, which is the set of all input sequences of composition $P$ whose joint type with $\mathbf{y}$ corresponds to the noise composition $V'$; these are the inputs that could be confused with the true codeword in the presence of noise composition $V'$. For some message $m$, codeword $\mathbf{x}_m$, and received output $\mathbf{y} \in T_V(\mathbf{x}_m)$, there is an error only if there is some other codeword $\mathbf{x}_{m'} \in T_{V'}(\mathbf{y})$ such that $W^n(\mathbf{y} \mid \mathbf{x}_{m'}) \geq W^n(\mathbf{y} \mid \mathbf{x}_m)$, which by (2) is to say

$$D(V'\|W|P) + H(V'|P) \leq D(V\|W|P) + H(V|P). \tag{5}$$
Define $\mathcal{V}(V)$ as the set of noise compositions $V'$ such that (5) is satisfied; the error probability is then

$$e_m \leq \sum_{V} W^n\big(T_V(\mathbf{x}_m) \,\big|\, \mathbf{x}_m\big)\, \Pr\Big(\exists\, m' \neq m : \mathbf{x}_{m'} \in \bigcup_{V' \in \mathcal{V}(V)} T_{V'}(\mathbf{y})\Big).$$
Given that $\mathbf{x}_{m'}$ is randomly selected out of $T_P^n$, we have

$$\Pr\big(\mathbf{x}_{m'} \in T_{V'}(\mathbf{y})\big) = \frac{|T_{V'}(\mathbf{y}) \cap T_P^n|}{|T_P^n|} \leq 2^{-n(I(P, V') - \delta_n)}.$$
By the type counting lemma, there are fewer than $(n+1)^{|\mathcal{X}||\mathcal{Y}|}$ joint compositions of $(\mathbf{x}, \mathbf{y})$, and there are fewer than $M = 2^{nR}$ choices of $m'$, so by the union bound we bound the error as:

$$e \leq \sum_{V} W^n\big(T_V(\mathbf{x}_m) \,\big|\, \mathbf{x}_m\big)\, \min\Big\{1,\; (n+1)^{|\mathcal{X}||\mathcal{Y}|}\, 2^{nR}\, 2^{-n \min_{V' \in \mathcal{V}(V)} I(P, V')}\Big\}.$$
Define $\delta_n' = 2|\mathcal{X}||\mathcal{Y}| \frac{\log(n+1)}{n}$, which goes to zero as $n \to \infty$; using $W^n(T_V(\mathbf{x}_m) \mid \mathbf{x}_m) \leq 2^{-nD(V\|W|P)}$ and $\min\{1, 2^{-nt}\} = 2^{-n|t|^+}$, we have

$$e \leq \sum_{V} 2^{-n\left(D(V\|W|P) + \left|\min_{V' \in \mathcal{V}(V)} I(P, V') - R\right|^+ - \delta_n'\right)}.$$
Taking the maximum term outside the sum, and recalling that the number of conditional types is only polynomial in $n$ while condition (5) collapses the double minimization over $(V, V')$ into a single minimization over $V$, we finally have

$$e \leq 2^{-n(E_r(R, P) - \delta_n'')},$$
where $\delta_n'' \to 0$ and

$$E_r(R, P) = \min_{V} \Big( D(V\|W|P) + \big| I(P, V) - R \big|^+ \Big)$$

is the random coding error exponent, which provides an upper bound on the error of constant composition codes. Once again, an upper bound on the error is a lower bound on the error exponent, since the exponent involves a negative sign. Hence, denoting the optimal error exponent of constant composition codes of type $P$ by $E(R, P)$, we have the relation

$$E_r(R, P) \leq E(R, P) \leq E_{sp}(R, P).$$
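For comparison, $E_r$ can be brute-forced the same way (this snippet reuses H, I, D_cond, and E_sp from the sphere packing sketch above; again a toy illustration, not an optimized implementation):

```python
import numpy as np

def E_r(R, P, W, grid=201):
    """Random coding exponent: min over V of D(V||W|P) + |I(P,V) - R|^+."""
    best = np.inf
    for a in np.linspace(1e-4, 1 - 1e-4, grid):
        for b in np.linspace(1e-4, 1 - 1e-4, grid):
            V = np.array([[1 - a, a], [b, 1 - b]])
            best = min(best, D_cond(V, W, P) + max(I(P, V) - R, 0.0))
    return best

P = np.array([0.5, 0.5])
W = np.array([[0.9, 0.1], [0.1, 0.9]])
for R in (0.1, 0.3, 0.5):
    print(R, E_r(R, P, W), E_sp(R, P, W))  # E_r <= E_sp, equal above the critical rate
```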
Conclusion
This result is useful in many information-theoretic proofs, as the error exponent of constant composition codes is also extensively discussed in the literature. More recently, this result has proven useful in joint sensing and communication, as the sensing capacity is also heavily influenced by the type of the codeword.