Because of the AEP [link] , a typical input phrase of length ${l}_{0}$ satisfies $Pr\left({x}_{1}^{{l}_{0}}\right)>{2}^{-{l}_{0}(H+\epsilon)}$ .
Recall that the interval length is ${l}_{0}=\frac{log\left({n}_{w}\right)}{H+2\epsilon}$ , and so the probability that an interval cannot be found in the history is roughly ${\left(1-{2}^{-{l}_{0}(H+\epsilon)}\right)}^{{n}_{w}}\le exp\left(-{n}_{w}^{\epsilon /(H+2\epsilon)}\right)$ .
For a long enough interval, this probability goes to zero.
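This decay can be checked numerically. The sketch below treats the $n_w$ candidate positions in the window as roughly independent (an assumption of this sketch, not a claim from the text), so the no-match probability is bounded by $(1-2^{-l_0(H+\epsilon)})^{n_w}$; the entropy rate $H$ and slack $\epsilon$ are illustrative values.

```python
import math

# Numerical sketch of the no-match probability bound. Assumption: matches
# at the n_w window positions behave roughly independently, so
#   Pr(no match in window) <= (1 - 2^{-l_0 (H + eps)})^{n_w},
# with l_0 = log(n_w) / (H + 2*eps) as in the text.
H, eps = 1.0, 0.1  # illustrative entropy rate (bits/symbol) and slack

for nw in [2 ** 10, 2 ** 16, 2 ** 24]:
    l0 = math.log2(nw) / (H + 2 * eps)
    p_match = 2 ** (-l0 * (H + eps))          # typical-phrase probability
    p_no_match = (1 - p_match) ** nw          # bound on missing every position
    print(f"n_w = 2^{int(math.log2(nw)):2d}:  Pr(no match) <= {p_no_match:.3e}")
```

The bound shrinks as the window grows, in line with the claim that for a long enough interval the probability goes to zero.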
There are many Ziv-Lempel style parsing algorithms [link] , [link] , [link] , and each of the variants has different details, but the key idea is to find the longest match in a window of length ${n}_{w}$ . The length of the match is $L$ , where we remind the reader that $L\approx \frac{log\left({n}_{w}\right)}{H}$ .
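The core matching step can be sketched as follows; this is a minimal illustration of the idea, not any specific published variant (real implementations also let the match extend into the lookahead and use faster search structures).

```python
# Minimal sketch of the core Ziv-Lempel step: find the longest prefix of
# the upcoming input that already appears in the history window.
def longest_match(history: str, lookahead: str) -> tuple:
    """Return (offset, length) of the longest prefix of `lookahead`
    occurring in `history`; length 0 if even the first symbol is new."""
    best_off, best_len = 0, 0
    for off in range(len(history)):
        l = 0
        while (l < len(lookahead) and off + l < len(history)
               and history[off + l] == lookahead[l]):
            l += 1
        if l > best_len:
            best_off, best_len = off, l
    return best_off, best_len

print(longest_match("abracadabra", "abrak"))  # -> (0, 4): matches "abra"
```

The encoder then emits the pair (offset, length), which is the source of the $log\left({n}_{w}\right)$ and $log\left(L\right)$ costs discussed next.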
Now, encoding the location of a match of length $L$ in the window requires $log\left({n}_{w}\right)$ bits, and so the per-symbol compression ratio is $\frac{log\left({n}_{w}\right)}{L}$ , which in the limit of large ${n}_{w}$ approaches the entropy rate $H$ .
However, the encoding of the match must also describe its length $L$ , and often the symbol that follows the match. These require $log\left(L\right)\approx log\left(log\left({n}_{w}\right)\right)$ additional bits, and the normalized (per-symbol) cost is $\frac{log\left(log\left({n}_{w}\right)\right)}{L}\approx \frac{H\phantom{\rule{0.16em}{0ex}}log\left(log\left({n}_{w}\right)\right)}{log\left({n}_{w}\right)}$ .
Therefore, the redundancy of Ziv-Lempel style compression algorithms is proportional to $O\left(\frac{log\left(log\left(n\right)\right)}{log\left(n\right)}\right)$ , which is much greater than the $O\left(\frac{log\left(n\right)}{n}\right)$ that we have seen for parametric sources. The fundamental reason why the redundancy is greater is that the class of non-parametric sources is much richer. Detailed redundancy analyses appear in a series of papers by Savari (cf. [link] ).
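The gap between the two redundancy orders is easy to see numerically; the values of $n$ below are purely illustrative.

```python
import math

# Illustrative comparison of the two redundancy orders:
#   Ziv-Lempel style: O(log(log n) / log n)
#   parametric sources: O(log(n) / n)
for n in [2 ** 10, 2 ** 20, 2 ** 30]:
    lz = math.log2(math.log2(n)) / math.log2(n)   # decays very slowly
    param = math.log2(n) / n                      # decays nearly linearly
    print(f"n = 2^{int(math.log2(n)):2d}:  LZ ~ {lz:.4f},  parametric ~ {param:.2e}")
```

Even at $n={2}^{30}$ the Ziv-Lempel redundancy term is still on the order of $0.1$ bits per symbol, while the parametric term is negligible.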
The parsing schemes that we have seen can also be adapted to lossy compression. Let us describe several approaches along these lines.
Fixed length: The first scheme, due to Gupta et al. [link] , constructs a codebook of size $\approx {2}^{LR\left(D\right)}$ codewords, where $L$ is the length of the phrase being matched and $R\left(D\right)$ is the rate distortion function. The algorithm cannot search for perfect matches of the phrase, because this is lossy compression. Instead, it seeks the codeword that matches our input phrase most closely. It turns out that for large $L$ the expected distortion of the lossy match will be approximately $D$ per symbol.
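The fixed-length idea can be sketched as follows. This is an illustrative toy, not the construction of Gupta et al.: the random binary codebook, the rate value, and the function names are all assumptions of the sketch.

```python
import random

# Hedged sketch of fixed-length lossy phrase coding: draw ~2^{L*R}
# random binary codewords of length L, and encode a phrase by the index
# of the codeword with the smallest Hamming distortion.
random.seed(0)
L, R = 12, 0.5                                   # phrase length, rate (bits/symbol)
codebook = [[random.randint(0, 1) for _ in range(L)]
            for _ in range(2 ** round(L * R))]   # ~2^{LR} codewords

def encode(phrase):
    """Index of the codeword closest to `phrase` in Hamming distance."""
    return min(range(len(codebook)),
               key=lambda i: sum(a != b for a, b in zip(codebook[i], phrase)))

phrase = [random.randint(0, 1) for _ in range(L)]
idx = encode(phrase)
dist = sum(a != b for a, b in zip(codebook[idx], phrase)) / L
print(f"codeword index {idx}, per-symbol distortion {dist:.3f}")
```

For large $L$ and a codebook drawn from the right distribution, the per-symbol distortion of the best match concentrates around $D$; the toy above only shows the search mechanics.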
Variable length: Another approach, due to Gioran and Kontoyiannis [link] , constructs a single long database string, and searches for the longest match whose distortion w.r.t. the input is approximately $D$ ; the location and length of the approximate match are encoded. Because the database is of length $\approx {2}^{LR\left(D\right)}$ , encoding the location requires $\approx LR\left(D\right)$ bits, and the $D$ -match (a match with distortion $D$ w.r.t. the input string) is typically of length $\approx L$ , giving a per-symbol rate of $\approx R\left(D\right)$ bits.
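A toy version of the $D$-match search might look like the following. This is an illustration of the idea only, not the exact construction of Gioran and Kontoyiannis: the database here is random bits, and the greedy scan (which stops as soon as the running distortion budget is exceeded) is a simplification.

```python
import random

# Hedged sketch of variable-length lossy coding: scan a long database
# string for the longest substring that reproduces a prefix of the input
# with per-symbol Hamming distortion at most D, then encode (offset, length).
random.seed(1)
db = [random.randint(0, 1) for _ in range(4096)]   # stand-in database string

def longest_d_match(db, x, D):
    """Return (offset, length) of the longest prefix of `x` that some
    substring of `db` reproduces with per-symbol distortion <= D."""
    best = (0, 0)
    for off in range(len(db)):
        errs, l = 0, 0
        while l < len(x) and off + l < len(db):
            errs += db[off + l] != x[l]
            if errs > D * (l + 1):   # greedy: stop when budget is exceeded
                break
            l += 1
        if l > best[1]:
            best = (off, l)
    return best

x = [random.randint(0, 1) for _ in range(64)]
off, length = longest_d_match(db, x, D=0.25)
print(f"D-match at offset {off}, length {length}")
```

Encoding the pair (offset, length) is exactly the $\approx LR\left(D\right)$-bit description discussed above, plus the $\approx log\left(L\right)$ bits for the length.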
An advantage of the latter scheme by Gioran and Kontoyiannis [link] is reduced memory use. The database is a string of length $\approx {2}^{LR\left(D\right)}$ , instead of a codebook comprised of $\approx {2}^{LR\left(D\right)}$ codewords, each of length $L$ . On the other hand, the Gupta et al. algorithm [link] has better $RD$ performance, because it does not need to spend $\approx log\left(L\right)$ bits per phrase to describe its length. An improved algorithm, dubbed the hybrid algorithm by Gioran and Kontoyiannis, constructs a single database and performs fixed length coding for the best match of length $L$ in the database. Therefore, it combines the memory usage of a single database approach with the $RD$ performance of fixed length coding.