Type is all you need


Introduction

There is a surprising result from Imre Csiszár's Coding Theorems for Discrete Memoryless Systems which states the following: for any discrete memoryless channel, as long as the codewords all belong to some type class, an upper bound on the rate of the code can be established that depends only on the type. This result is significant for a variety of research directions, and here I attempt to demonstrate it by simplifying the proofs in the book. Note that a basic understanding of information theory is required.

Definitions

For any discrete set $\mathcal{X}$ and for $n \in \mathbb{N}$, a sequence of length $n$ is defined as $\mathbf{x} \triangleq (x_1,\dots,x_n) \in \mathcal{X}^n$. The type of a sequence, which is its empirical distribution, is defined as $\hat{p}_{\mathbf{x}}(x) = \frac{N(x|\mathbf{x})}{n} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{x_i = x\}$. For any type $P$, the type class $T_P^n$ is defined as the set of all sequences $\mathbf{x} \in \mathcal{X}^n$ such that $\hat{p}_{\mathbf{x}} = P$. $I(P_X,W_{Y|X})$ is the mutual information between $X$ and $Y$ when $X \sim P_X$ and $Y \sim \sum_x P_X(x)W_{Y|X}(\cdot|x)$, where $W_{Y|X}$ is a conditional distribution on $y \in \mathcal{Y}$ given $x \in \mathcal{X}$. A message is defined as $m \in \{1,\dots,|\mathcal{M}|\}$, with $|\mathcal{M}|$ the cardinality of the message set. Note that there is an inconsistency in the way I denote messages: unlike $\mathbf{x},\mathbf{y}$, a lowercase unbolded letter $m$ is exceptionally used for a message, and the set $\mathcal{M}$ is used to represent the message set. Note that we use the natural logarithm (base $e$) throughout this post.
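As a quick illustration of these definitions, here is a minimal Python sketch (the alphabet and sequences are made up for the example) that computes the empirical type $\hat{p}_{\mathbf{x}}$ of a sequence and checks membership in a type class $T_P^n$.

```python
from collections import Counter
from fractions import Fraction

def empirical_type(x, alphabet):
    """Return the type (empirical distribution) of sequence x as exact fractions."""
    n = len(x)
    counts = Counter(x)
    return {a: Fraction(counts.get(a, 0), n) for a in alphabet}

def in_type_class(x, P, alphabet):
    """Check whether sequence x belongs to the type class T_P^n."""
    return empirical_type(x, alphabet) == P

alphabet = [0, 1]
x = (0, 1, 1, 0, 1, 1)                           # n = 6
print(empirical_type(x, alphabet))               # {0: Fraction(1, 3), 1: Fraction(2, 3)}

P = {0: Fraction(1, 3), 1: Fraction(2, 3)}       # a valid type for n = 6
print(in_type_class(x, P, alphabet))             # True
print(in_type_class((0, 0, 0, 1, 1, 1), P, alphabet))  # False
```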

Rate of Constant Composition Code

We attempt to show the following:

For arbitrary $\epsilon,\tau \in (0,1)$, if $(f,\psi)$ is an $(n,\epsilon)$-code for some discrete memoryless channel $W$ such that all codewords belong to $T_P^n$, then the rate of the code is bounded as:

$$R < I(P,W) + 2\tau. \tag{1}$$
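To make the bound concrete, here is a minimal sketch (the channel, type, and $\tau$ are arbitrary choices for illustration) that evaluates $I(P,W)$ and the right-hand side of (1) in nats, consistent with the natural-log convention of this post.

```python
import numpy as np

def mutual_information(P, W):
    """I(P, W) in nats: P is the input type (vector), W[x, y] = W(y|x)."""
    P, W = np.asarray(P, dtype=float), np.asarray(W, dtype=float)
    PY = P @ W                       # output distribution: sum_x P(x) W(y|x)
    return sum(P[x] * W[x, y] * np.log(W[x, y] / PY[y])
               for x in range(W.shape[0]) for y in range(W.shape[1])
               if P[x] > 0 and W[x, y] > 0)

P = [0.5, 0.5]                       # codeword type
W = [[0.9, 0.1],                     # binary symmetric channel, crossover 0.1
     [0.1, 0.9]]
tau = 0.01
I = mutual_information(P, W)
print(f"I(P, W)    = {I:.4f} nats/use")
print(f"rate bound < {I + 2 * tau:.4f} nats/use")
```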

The outline of the proof is the following: by fixing some type $P$ such that all codewords belong to the type class $T_P^n$, we construct a maximal code and find a lower bound on the cardinality of the message set by considering the conditional typicality of the received sequences. An upper bound on the rate then follows as a function of the mutual information evaluated at the type $P$.

Proof:

Let $\mathcal{X}$ and $\mathcal{Y}$ be the finite input and output alphabets, so that $\mathcal{X}^n$ and $\mathcal{Y}^n$ are the sets of input and output sequences. Define a stochastic matrix $W: \mathcal{X} \to \mathcal{Y}$, whose entries $W(y|x)$ should be interpreted as the conditional probability of output $y$ given input $x$. Thus, $n$ channel uses can be written as $W^n: \mathcal{X}^n \to \mathcal{Y}^n$. Given a channel $W$, a code for the channel is uniquely specified by an encoder/decoder pair $(f: \mathcal{M} \to \mathcal{X}^n,\ \psi: \mathcal{Y}^n \to \mathcal{M})$, so that we can define the equivalent channel as:

$$T(m'|m) \triangleq W^n\!\left(\mathbf{y} \in \psi^{-1}(m')\,\middle|\,\mathbf{x} = f(m)\right).$$

The probability of erroneous decoding of message $m$ can now be defined as:

$$e_m = e_m(W,f,\psi) \triangleq \sum_{m' \neq m} T(m'|m) = 1 - T(m|m).$$

The channel coding problem is essentially the task of making $|\mathcal{M}|$ as large as possible while keeping the maximum probability of error $e = \max_{m \in \mathcal{M}} e_m$ as low as possible. We will focus solely on the discrete memoryless channel, defined by $W^n(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^n W(y_i|x_i)$. An $n$-length block code with maximum probability of error bounded as $e(W^n,f,\psi) \le \epsilon$ will be called an $(n,\epsilon)$-code.
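As a sanity check on these definitions, the following sketch (a toy example of my own, not from the book) builds the equivalent channel $T(m'|m)$ for a length-3 repetition code over a binary symmetric channel with majority-vote decoding and reports the maximum error probability $e$.

```python
import itertools
import numpy as np

W = np.array([[0.9, 0.1],            # W[x, y] = W(y|x), a BSC with crossover 0.1
              [0.1, 0.9]])
n = 3

f = {0: (0, 0, 0), 1: (1, 1, 1)}     # encoder: repetition code
psi = lambda y: int(sum(y) > n / 2)  # decoder: majority vote

def Wn(y, x):
    """Memoryless extension W^n(y|x) = prod_i W(y_i|x_i)."""
    return np.prod([W[xi, yi] for xi, yi in zip(x, y)])

# Equivalent channel T(m'|m): probability that message m is decoded as m'
M = list(f.keys())
T = np.zeros((len(M), len(M)))
for y in itertools.product([0, 1], repeat=n):
    for m in M:
        T[m, psi(y)] += Wn(y, f[m])

e = max(1.0 - T[m, m] for m in M)    # e = max_m e_m, with e_m = 1 - T(m|m)
print("T =\n", T)
print("max error e =", e)            # 3 * 0.1^2 * 0.9 + 0.1^3 = 0.028
```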

A set $B \subset \mathcal{Y}^n$ is an $\eta$-image ($\eta \in (0,1]$) of a set $A \subset \mathcal{X}^n$ over $W^n$ if $W^n(B\,|\,\mathbf{x}) \ge \eta$ for all $\mathbf{x} \in A$, which is to say that, conditioned on any input $\mathbf{x} \in A$, the output falls in $B$ with probability at least $\eta$. Define the minimum cardinality of such an $\eta$-image as:

$$g_{W^n}(A,\eta) \triangleq \min_{B:\,B\text{ is an }\eta\text{-image of }A} |B|.$$
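The minimum-cardinality $\eta$-image has no closed form in general, but for tiny alphabets and block lengths it can be found by brute force; the sketch below (all values made up) enumerates subsets of $\mathcal{Y}^n$ to compute $g_{W^n}(A,\eta)$ exactly.

```python
import itertools
import numpy as np

W = np.array([[0.8, 0.2],            # W(y|x) for a binary channel
              [0.2, 0.8]])
n, eta = 2, 0.7

def Wn(y, x):
    return np.prod([W[xi, yi] for xi, yi in zip(x, y)])

A = [(0, 0), (1, 1)]                 # a small set of input sequences
Yn = list(itertools.product([0, 1], repeat=n))

def is_eta_image(B):
    """B is an eta-image of A if W^n(B|x) >= eta for every x in A."""
    return all(sum(Wn(y, x) for y in B) >= eta for x in A)

# g_{W^n}(A, eta): smallest cardinality over all eta-images of A
g = min(len(B)
        for r in range(len(Yn) + 1)
        for B in itertools.combinations(Yn, r)
        if is_eta_image(B))
print("g_{W^n}(A, eta) =", g)        # 3 for this toy example
```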

Let $(f,\psi)$ be an $(n,\epsilon)$-code satisfying $f(m) \in A \subset T_P^n$ (all codewords produced by the encoder $f$ belong to the same type class of type $P$) and $\psi^{-1}(m) \subset T_W^n(f(m))$ (the set of outputs decoded to message $m$ lies inside the conditional type class of $W$ given the codeword $f(m)$).

Thus, we can define $B \triangleq \bigcup_{m \in \mathcal{M}} \psi^{-1}(m)$, which is to say that $B$ is the set of all received sequences whose decoding outcome is some $m \in \mathcal{M}$. In other words, using a graphical picture, $B$ is the union of the Hamming spheres of all codewords. Now we further restrict $(f,\psi)$ to be an $(n,\epsilon)$ maximal code, meaning that no extension of the code is possible. With this definition, note that if $W^n(T_W^n(\mathbf{x}) \setminus B\,|\,\mathbf{x}) \ge 1-\epsilon$ for some $\mathbf{x} \in T_P^n$, then the code is not maximal, as we could add a new message $m'$ with codeword $f(m') = \mathbf{x}$ whose Hamming sphere lies outside of $B$. Thus, it is necessary for a maximal code that

$$W^n(T_W^n(\mathbf{x}) \setminus B\,|\,\mathbf{x}) < 1-\epsilon \qquad \forall\,\mathbf{x} \in T_P^n.$$

Provided that $n \ge n_0(|\mathcal{X}|,|\mathcal{Y}|,\tau)$ (which is just a fancy way of saying that $n$ is sufficiently large depending on $|\mathcal{X}|,|\mathcal{Y}|,\tau$), the definition of typicality gives $W^n(T_W^n(\mathbf{x})\,|\,\mathbf{x}) \ge 1-\tau$ for all $\mathbf{x} \in A$. It follows that $W^n(B\,|\,\mathbf{x}) > \epsilon-\tau$ for all $\mathbf{x} \in A$ (since $W^n(B\,|\,\mathbf{x}) \ge W^n(T_W^n(\mathbf{x})\,|\,\mathbf{x}) - W^n(T_W^n(\mathbf{x}) \setminus B\,|\,\mathbf{x}) > (1-\tau)-(1-\epsilon)$), and therefore, by definition, $|B| \ge g_{W^n}(A,\epsilon-\tau)$. On the other hand, since each decoding set lies inside a conditional type class, the asymptotic equipartition property for conditional types gives $|B| \le |\mathcal{M}|\,e^{n(H(W|P)+\tau)}$. The following two results are thus immediate:

$$|\mathcal{M}| \ge g_{W^n}(A,\epsilon-\tau)\,e^{-n(H(W|P)+\tau)}, \qquad \frac{1}{n}\log g_{W^n}(T_P^n,\epsilon-\tau) \ge H(Y)-\tau,$$

where $H(Y)$ denotes the entropy of the output distribution $\sum_x P(x)W(\cdot\,|\,x)$.

Now, let $B' \subset \mathcal{Y}^n$ be an $(\epsilon+\tau)$-image of $A$ for which $|B'| = g_{W^n}(A,\epsilon+\tau)$. Since $e_m = 1 - W^n(\psi^{-1}(m)\,|\,f(m)) \le \epsilon$ and $W^n(B'\,|\,f(m)) \ge \epsilon+\tau$, we have $W^n(B' \cap \psi^{-1}(m)\,|\,f(m)) \ge \tau$ for every $m$ with $f(m) \in A$. We also know that if $\mathcal{A} \subset \mathcal{X}^n$ satisfies $P^n(\mathcal{A}) \ge \eta$, then $\frac{1}{n}\log|\mathcal{A}| \ge H(P)-\epsilon_n$; applying the conditional version of this result to $B' \cap \psi^{-1}(m)$, we can conclude $|B' \cap \psi^{-1}(m)| \ge e^{n(H(W|P)-\tau)}$, and since the decoding sets are disjoint it follows that

$$g_{W^n}(A,\epsilon+\tau) = |B'| \ge \sum_{m:\,f(m) \in A} |B' \cap \psi^{-1}(m)| \ge |\mathcal{M}|\,e^{n(H(W|P)-\tau)},$$

where we used the fact that $\mathcal{M} = \{m : f(m) \in A\}$, i.e., every codeword belongs to $A$. Taking logarithms, we have

$$\frac{1}{n}\log|\mathcal{M}| \le \frac{1}{n}\log g_{W^n}(A,\epsilon+\tau) - H(W|P) + \tau.$$

Finally, note that since $A \subset T_P^n$,

$$|B'| = g_{W^n}(A,\epsilon+\tau) \le g_{W^n}(T_P^n,\epsilon+\tau) \le e^{n(H(Y)+\tau)},$$

we finally arrive at the result below, where the left-hand side is the rate of the code.

$$\frac{1}{n}\log|\mathcal{M}| < I(P,W) + 2\tau$$

Error Exponent of Constant Composition Code

We know that for the rate region described by the above result, the error tends to zero for sufficiently large $n$; however, it can be shown that even the rate of this convergence depends only on the type. We provide both lower and upper bounds on the error and the error exponent. Note that an upper bound on the error is a lower bound on the error exponent, since the exponent carries a negative sign.

Lower Bound

The following statement is made in Csiszár's Coding Theorems for Discrete Memoryless Systems. For every $R > \delta > 0$ and every type $P$ of sequences in $\mathcal{X}^n$, there exists an $n$-length block code $(f,\psi)$ of rate

$$\frac{1}{n}\log|\mathcal{M}| \ge R - \delta,$$

such that all codewords $f(m),\ m \in \mathcal{M}$, are of type $P$ and the error can be bounded as:

$$e(W^n,f,\psi) \le \exp\Big\{-n\Big(\min_V \big[D(V\|W|P) + |I(P,V)-R|^+\big] - \delta\Big)\Big\}. \tag{2}$$

Here, $E_r(R,P,W) = \min_V \big[D(V\|W|P) + |I(P,V)-R|^+\big]$ is known as the random coding error exponent of channel $W$ with input distribution of type $P$, where $V: \mathcal{X} \to \mathcal{Y}$ ranges over all channels.

In my opinion, Csiszár's proof is rather obscure, so I will restate the main results following Gallager's proof.

Assume we have a fixed composition code of composition $Q$, and define $T_{Q_X}^n \subset \mathcal{X}^n$ as the set of all sequences of composition $Q$. In other words,

$$|T_{Q_X}^n| = \exp\{n[H(Q)-\delta_n]\}$$

for $\delta_n \to 0$ as $n \to \infty$. The probability of decoding error for the $m$-th codeword in a code $\mathcal{C}$ under maximum likelihood decoding is denoted $P_{e,m}$, and the maximum error probability is

$$P_e = \max_m P_{e,m}.$$

We know that a code of size $|\mathcal{C}| = M$ can be expurgated into a code of size $M/2$ by removing the worst-performing half of the codewords, with a resulting maximum probability of error at most twice the average error probability of the original code. Hence, the distinction between maximum and average error probability is minimal. If we define $T_{Q_X\hat P_{Y|X}}^n$ to be the set of sequence pairs with joint type $Q_X\hat P_{Y|X}$, then we have

$$|T_{Q_X\hat P_{Y|X}}^n| = \exp\{n[H(Q,\hat P)-\delta_n]\}.$$

For a given input $\mathbf{x}$ of type $Q$, denote by $T_{\hat P_{Y|Q}}^n(\mathbf{x})$ the conditional type class, i.e., the set of all $\mathbf{y}$ such that $(\mathbf{x},\mathbf{y}) \in T_{Q_X\hat P_{Y|X}}^n$. Now take any $\mathbf{x},\mathbf{x}' \in T_{Q_X}^n$: the same permutation that maps $\mathbf{x}$ into $\mathbf{x}'$ also permutes every $\mathbf{y} \in T_{\hat P_{Y|Q}}^n(\mathbf{x})$ into some $\mathbf{y}' \in T_{\hat P_{Y|Q}}^n(\mathbf{x}')$, so it can be seen that $|T_{\hat P_{Y|Q}}^n(\mathbf{x})| = |T_{\hat P_{Y|Q}}^n(\mathbf{x}')|$ for all $\mathbf{x},\mathbf{x}' \in T_{Q_X}^n$. Hence we can conclude

$$|T_{\hat P_{Y|Q}}^n(\mathbf{x})| = \frac{|T_{Q_X\hat P_{Y|X}}^n|}{|T_{Q_X}^n|}.$$

Note that the left-hand side is a conditional type class; the analogous identity for the marginal type class, $|T_{\hat P_Y}^n| = \frac{|T_{Q_X\hat P_{Y|X}}^n|}{|T_{Q_X}^n|}$, is not true when $X$ and $Y$ are correlated. The immediate result is

$$|T_{\hat P_{Y|Q}}^n(\mathbf{x})| = \exp\{n[H(Q,\hat P)-H(Q)-\delta_n]\},$$

which isn't anything new, since $H(Q,\hat P)-H(Q) = H(\hat P|Q)$. But this gives us an operational interpretation of $T_{\hat P_{Y|Q}}^n(\mathbf{x})$ as the set of output sequences arising from $\mathbf{x}$ under a given noise composition $\hat P$.
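These exponential estimates are easy to check numerically. Here is a minimal sketch (a binary type chosen for illustration) that compares the exact cardinality of a type class, a multinomial coefficient, against $e^{nH(Q)}$ and reports the resulting $\delta_n$.

```python
from math import comb, exp, log

def type_class_size(counts):
    """Exact |T_Q^n| for the type with the given letter counts (multinomial coefficient)."""
    size, remaining = 1, sum(counts)
    for c in counts:
        size *= comb(remaining, c)
        remaining -= c
    return size

def entropy(counts):
    """H(Q) in nats for the empirical distribution defined by the counts."""
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)

counts = (10, 20)                        # Q = (1/3, 2/3), n = 30
n = sum(counts)
exact = type_class_size(counts)
print(f"exact |T_Q^n| = {exact}")
print(f"e^(n H(Q))    = {exp(n * entropy(counts)):.3e}")
print(f"delta_n       = {entropy(counts) - log(exact) / n:.4f}")
```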

Now let's introduce the discrete memoryless channel defined by some transition matrix $P$; the following relation can be established:

$$P(\mathbf{y}|\mathbf{x}) = P(y_1|x_1)\cdots P(y_n|x_n) = \exp\Big\{n\Big[\sum_{k,j} Q(x_k)\hat P(y_j|x_k)\ln P(y_j|x_k)\Big]\Big\}.$$

Note that we write $\Pr[\cdot]$ for a generic probability. The left-hand side of the above equation is the probability of receiving a particular $\mathbf{y} \in T_{\hat P_{Y|Q}}^n(\mathbf{x})$ given that $\mathbf{x} \in T_{Q_X}^n$ is transmitted over a channel with transition matrix $P$. Thus,

$$\Pr[\mathbf{y} \in T_{\hat P_{Y|Q}}^n(\mathbf{x})\,|\,\mathbf{x}] = |T_{\hat P_{Y|Q}}^n(\mathbf{x})|\,P(\mathbf{y}|\mathbf{x}) = \exp\Big\{n\Big[\sum_{k,j} Q(x_k)\hat P(y_j|x_k)\ln\frac{P(y_j|x_k)}{\hat P(y_j|x_k)} - \delta_n\Big]\Big\}.$$

This can be equivalently written as:

$$\Pr[\mathbf{y} \in T_{\hat P_{Y|Q}}^n(\mathbf{x})\,|\,\mathbf{x}] = \exp\{-n[D(\hat P\|P|Q)+\delta_n]\}, \tag{3}$$

where $D(\hat P\|P|Q) = \sum_{k,j} Q(x_k)\hat P(y_j|x_k)\ln\frac{\hat P(y_j|x_k)}{P(y_j|x_k)}$ and we have dropped the subscripts on the probabilities for clarity.
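Equation (3) can be verified by brute force at a short block length. The sketch below (binary alphabets, channel and compositions all made up) enumerates every output sequence of a given conditional type and compares the exact exponent with $D(\hat P\|P|Q)$; at $n=9$ the remaining gap is the polynomial $\delta_n$ term, which is still sizeable.

```python
import itertools
import numpy as np

P = np.array([[0.9, 0.1],                 # channel transition matrix P(y|x)
              [0.2, 0.8]])
x = (0, 0, 0, 0, 0, 0, 1, 1, 1)           # input of type Q = (2/3, 1/3), n = 9
n = len(x)

P_hat = np.array([[2/3, 1/3],             # noise composition: a conditional type
                  [1/3, 2/3]])            # achievable given the counts (6, 3) of x

def cond_type(x, y):
    """Empirical conditional distribution of y given x (rows indexed by x)."""
    counts = np.zeros((2, 2))
    for xi, yi in zip(x, y):
        counts[xi, yi] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Exact probability that y falls in the conditional type class of P_hat given x
prob = sum(np.prod([P[xi, yi] for xi, yi in zip(x, y)])
           for y in itertools.product([0, 1], repeat=n)
           if np.allclose(cond_type(x, y), P_hat))

# Conditional divergence D(P_hat || P | Q), with Q the type of x
Q = np.array([x.count(0), x.count(1)]) / n
D = sum(Q[a] * P_hat[a, b] * np.log(P_hat[a, b] / P[a, b])
        for a in range(2) for b in range(2))

print(f"exact -1/n log Pr = {-np.log(prob) / n:.4f}")
print(f"D(P_hat||P|Q)     = {D:.4f}   (they agree up to delta_n)")
```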
The next step of the argument requires a new definition: define the output composition $\hat P_Y(y) \triangleq \sum_x Q_X(x)\hat P_{Y|X}(y|x)$; note that this is different from the joint type, as we additionally sum over $X$. Its type class $T_{\hat P_Y}^n$ is the output composition set. Assume we have some noise composition $\hat P$ for which $|T_{\hat P_{Y|Q}}^n(\mathbf{x})| \ge \frac{2}{|\mathcal{C}|}|T_{\hat P_Y}^n|$ (it will become clear shortly why this assumption is made). Since the output composition set has cardinality $|T_{\hat P_Y}^n|$ and there are a total of $|\mathcal{C}|$ codewords, the average message $m$ has $\frac{|T_{\hat P_Y}^n|}{|\mathcal{C}|}$ corresponding outputs; equivalently, there exists an $m$ for which at most $\frac{|T_{\hat P_Y}^n|}{|\mathcal{C}|}$ outputs are decoded into $m$. Now it becomes clear that the assumption $|T_{\hat P_{Y|Q}}^n(\mathbf{x}_m)| \ge \frac{2}{|\mathcal{C}|}|T_{\hat P_Y}^n|$ means there exists some $m$ such that at least half of $T_{\hat P_{Y|Q}}^n(\mathbf{x}_m)$ is not decoded into $m$. Recall that the error is defined as the maximum over all codewords; since for this $m$ at least half of its conditional type class results in a decoding error, we have:

$$P_e \ge \frac{1}{2}\Pr[\mathbf{y} \in T_{\hat P_{Y|Q}}^n(\mathbf{x}_m)\,|\,\mathbf{x}_m]. \tag{4}$$

The graphical understanding of this result is that if the output spheres $T_{\hat P_{Y|Q}}^n(\mathbf{x})$ are too large, they cannot be packed disjointly into the output composition set $T_{\hat P_Y}^n$, which forces decoding errors.

This is the reason why this lower bound on the error is called the sphere packing bound, and here we take a brief digression to explore this interpretation. The information-theoretic meaning of the packing can be understood using the following formula:

$$\frac{|T_{\hat P_Y}^n|}{|T_{\hat P_{Y|Q}}^n(\mathbf{x})|} = \exp\Big\{n\Big[\sum_{k,j} Q(x_k)\hat P(y_j|x_k)\ln\frac{\hat P(y_j|x_k)}{\sum_k Q(x_k)\hat P(y_j|x_k)} - \delta_n\Big]\Big\}.$$

Now, the term in the exponent should look familiar once we expand the mutual information between $X$ and $Y$, given that they are of composition $Q$ and $\hat P$:

$$I(Q_X,\hat P_{Y|X}) = \sum_k Q(x_k)\sum_j \hat P(y_j|x_k)\ln\frac{\hat P(y_j|x_k)}{\hat P_Y(y_j)},$$

from which it should be obvious that

$$\frac{|T_{\hat P_Y}^n|}{|T_{\hat P_{Y|Q}}^n(\mathbf{x})|} = \exp\{n[I(Q_X,\hat P_{Y|X})-\delta_n]\}.$$

Thus the assumption we made earlier, $|T_{\hat P_{Y|Q}}^n(\mathbf{x}_m)| \ge \frac{2}{|\mathcal{C}|}|T_{\hat P_Y}^n|$, is guaranteed whenever $R \ge I(Q_X,\hat P_{Y|X}) + \delta_n + \frac{\ln 2}{n}$.

Now back to the main argument: combining equations (3) and (4), we see that the error is bounded as:

$$P_e \ge \exp\{-n[E_{sp}(R-\delta_n,Q_X,P_{Y|X})+\delta_n]\},$$

where the sphere packing error exponent is

$$E_{sp}(R,Q_X,P_{Y|X}) = \min_{\hat P:\,I(Q_X,\hat P_{Y|X}) \le R} D(\hat P\|P|Q).$$

The condition $I(Q_X,\hat P_{Y|X}) \le R$ may seem counterintuitive, but note that $\hat P$ is not the channel transition matrix. Indeed, if $I(Q_X,P_{Y|X}) \le R$, then $E_{sp}(R,Q_X,P_{Y|X})$ is zero, because we can choose $\hat P = P$; since $R \le I(Q_X,P_{Y|X})$ is required for reliable communication at rate $R$, violating it means the error never decreases, i.e., the exponent is zero.
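Here is a minimal numerical sketch of the sphere packing exponent (a BSC with crossover 0.1 and uniform composition, both arbitrary choices): it grid-searches over binary noise compositions $\hat P$ satisfying $I(Q_X,\hat P_{Y|X}) \le R$ and minimizes $D(\hat P\|P|Q)$.

```python
import numpy as np

def mutual_info(Q, V):
    """I(Q, V) in nats for input distribution Q and channel matrix V[x, y]."""
    PY = Q @ V
    return sum(Q[x] * V[x, y] * np.log(V[x, y] / PY[y])
               for x in range(2) for y in range(2) if V[x, y] > 0)

def cond_div(V, P, Q):
    """Conditional divergence D(V || P | Q) in nats."""
    return sum(Q[x] * V[x, y] * np.log(V[x, y] / P[x, y])
               for x in range(2) for y in range(2) if V[x, y] > 0)

def E_sp(R, Q, P, grid=101):
    """Sphere packing exponent: min of D(V||P|Q) over channels V with I(Q, V) <= R."""
    best = np.inf
    for a in np.linspace(1e-6, 1 - 1e-6, grid):       # V(1|0) = a
        for b in np.linspace(1e-6, 1 - 1e-6, grid):   # V(0|1) = b
            V = np.array([[1 - a, a], [b, 1 - b]])
            if mutual_info(Q, V) <= R:
                best = min(best, cond_div(V, P, Q))
    return best

Q = np.array([0.5, 0.5])                  # input composition
P = np.array([[0.9, 0.1], [0.1, 0.9]])    # BSC(0.1) transition matrix
for R in (0.1, 0.2, 0.3):                 # rates in nats (capacity is about 0.368)
    print(f"R = {R:.2f}   E_sp = {E_sp(R, Q, P):.4f}")
```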

Upper Bound

Next, we try to establish the upper bound on the error for constant composition codes. We define a new set $T_{Q_X|P'}^n(\mathbf{y})$, which is the set of all input sequences of composition $Q$ that could have produced $\mathbf{y}$ under a noise composition $P'$ (i.e., $\mathbf{y}$ has conditional type $P'$ given the input). For some message $m$, codeword $\mathbf{x}_m$, and received output $\mathbf{y}$ whose noise composition with respect to $\mathbf{x}_m$ is $\hat P$, there is an error if there is some other codeword $\mathbf{x}_{m'}$ in a set $T_{Q_X|P'}^n(\mathbf{y})$ such that $P(\mathbf{y}|\mathbf{x}_{m'}) \ge P(\mathbf{y}|\mathbf{x}_m)$, which is to say

$$\exp\Big\{n\Big[\sum_{k,j} Q(x_k)P'(y_j|x_k)\ln P(y_j|x_k)\Big]\Big\} \ge \exp\Big\{n\Big[\sum_{k,j} Q(x_k)\hat P(y_j|x_k)\ln P(y_j|x_k)\Big]\Big\}. \tag{5}$$

Define $\mathcal{P}(\hat P)$ as the set of noise compositions $P'$ for which (5) is satisfied; the error probability is then

$$\Pr(\text{error}\,|\,m,\mathbf{x}_m,\mathbf{y}) \le \sum_{m' \ne m}\sum_{P' \in \mathcal{P}(\hat P)}\Pr[\mathbf{x}_{m'} \in T_{Q_X|P'}^n(\mathbf{y})].$$

Given that the codeword $\mathbf{x}_{m'}$ is selected uniformly at random from $T_{Q_X}^n$, we have

$$\Pr[\mathbf{x}_{m'} \in T_{Q_X|P'}^n(\mathbf{y})] = \frac{|T_{Q_X|P'}^n(\mathbf{y})|}{|T_{Q_X}^n|} = \frac{|T_{Q_X P'_{Y|X}}^n|}{|T_{Q_X}^n|\,|T_{P'_Y}^n|} \le \exp\{-nI(Q_X,P'_{Y|X})\}.$$

By the type counting lemma, there are fewer than $(n+1)^{|\mathcal{X}||\mathcal{Y}|}$ joint compositions $Q P'$, and there are $|\mathcal{C}|-1 = e^{nR}-1$ choices of $m' \ne m$, so we bound the error as:

$$\Pr(\text{error}\,|\,m,\mathbf{x}_m,\mathbf{y}) \le (e^{nR}-1)\,(n+1)^{|\mathcal{X}||\mathcal{Y}|}\max_{P' \in \mathcal{P}(\hat P)}\Pr[\mathbf{x}_{m'} \in T_{Q_X|P'}^n(\mathbf{y})] \le (n+1)^{|\mathcal{X}||\mathcal{Y}|}\,e^{-n[I(Q_X,P'_{Y|X})-R]}.$$

Define $\delta_n = \frac{|\mathcal{X}||\mathcal{Y}|}{n}\ln(n+1)$, which goes to zero as $n \to \infty$; we then have

$$\Pr(\text{error}\,|\,m,\mathbf{x}_m,\mathbf{y}) \le e^{-n[I(Q_X,P'_{Y|X})-R-\delta_n]}.$$

Taking the maximum inside and recalling that $\Pr[\mathbf{y} \in T_{\hat P_{Y|Q}}^n(\mathbf{x}_m)\,|\,\mathbf{x}_m] \le \exp\{-nD(\hat P\|P|Q)\}$, we have $P_{e,m} \le \sum_{\hat P}\Pr(\text{error}\,|\,m,\mathbf{x}_m,\mathbf{y} \in T_{\hat P_{Y|Q}}^n(\mathbf{x}_m))\exp\{-nD(\hat P\|P|Q)\}$, and finally we have

$$P_e \le \exp\{-n[E_r(R,Q_X,P_{Y|X})-\delta_n]\},$$

where

$$E_r(R,Q_X,P_{Y|X}) = \min_{\hat P}\Big(D(\hat P\|P|Q)+\big[I(Q_X,\hat P_{Y|X})-R\big]^+\Big),$$

is the random coding error exponent, which provides an upper bound on the error of constant composition codes. Once again, an upper bound on the error is a lower bound on the error exponent, since the exponent carries a negative sign. Hence we have the relation

$$E_r(R,Q_X,P_{Y|X}) \le E \le E_{sp}(R,Q_X,P_{Y|X}),$$

where $E$ denotes the true error exponent of the best constant composition code of rate $R$ and composition $Q$.
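To see the two exponents side by side, here is a self-contained sketch (same made-up BSC as above) that evaluates $E_r$ and $E_{sp}$ by grid search over noise compositions and confirms $E_r \le E_{sp}$ at a few rates.

```python
import numpy as np

Q = np.array([0.5, 0.5])                  # input composition
P = np.array([[0.9, 0.1], [0.1, 0.9]])    # BSC(0.1) transition matrix

def mutual_info(Q, V):
    PY = Q @ V
    return sum(Q[x] * V[x, y] * np.log(V[x, y] / PY[y])
               for x in range(2) for y in range(2) if V[x, y] > 0)

def cond_div(V, P, Q):
    return sum(Q[x] * V[x, y] * np.log(V[x, y] / P[x, y])
               for x in range(2) for y in range(2) if V[x, y] > 0)

def channels(grid=101):
    """All binary channels on a grid, parameterized by the two crossover probabilities."""
    for a in np.linspace(1e-6, 1 - 1e-6, grid):
        for b in np.linspace(1e-6, 1 - 1e-6, grid):
            yield np.array([[1 - a, a], [b, 1 - b]])

def E_r(R):
    """Random coding exponent: min over V of D(V||P|Q) + [I(Q,V) - R]^+."""
    return min(cond_div(V, P, Q) + max(mutual_info(Q, V) - R, 0.0) for V in channels())

def E_sp(R):
    """Sphere packing exponent: min over V with I(Q,V) <= R of D(V||P|Q)."""
    return min((cond_div(V, P, Q) for V in channels() if mutual_info(Q, V) <= R),
               default=np.inf)

for R in (0.05, 0.15, 0.25):
    er, esp = E_r(R), E_sp(R)
    print(f"R = {R:.2f}   E_r = {er:.4f}   E_sp = {esp:.4f}   E_r <= E_sp: {er <= esp + 1e-9}")
```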

Conclusion

This result is useful in many information-theoretic proofs, and the error exponent of constant composition codes is extensively discussed in the literature. More recently, it has proven useful in joint sensing and communication, as the sensing capacity is also heavily influenced by the type of the codeword.



