
Because of the AEP  [link] ,

$$\Pr\left\{ x_1^{l_0} : \left| \frac{-\log\left(\Pr\left(x_1^{l_0}\right)\right)}{l_0} - H \right| > \epsilon \right\} \rightarrow 0.$$

Therefore, for a typical input, $\Pr(x_1^{l_0}) > 2^{-l_0(H+\epsilon)}$.

Recall that the interval length is $l_0 = \frac{\log(n_w)}{H + 2\epsilon}$, and so the probability that an interval cannot be found in the history is

$$\frac{2^{-l_0(H+2\epsilon)}}{2^{-l_0(H+\epsilon)}} = 2^{-l_0 \epsilon} \rightarrow 0.$$

For a long enough interval, this probability goes to zero.
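The AEP statement above can also be checked numerically. The following simulation is an illustrative sketch, not part of the original text: it draws i.i.d. Bernoulli sequences and estimates the fraction whose empirical rate $-\log_2(\Pr(x_1^l))/l$ deviates from $H$ by more than $\epsilon$; the parameter choices $p = 0.2$ and $\epsilon = 0.05$ are assumptions made for the example.

```python
import math
import random

random.seed(0)

p = 0.2                                              # Bernoulli(p) source; illustrative choice
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # entropy rate in bits per symbol
eps = 0.05

def empirical_rate(l):
    """Draw one length-l sequence and return -log2(Pr(x_1^l)) / l."""
    ones = sum(random.random() < p for _ in range(l))
    log_prob = ones * math.log2(p) + (l - ones) * math.log2(1 - p)
    return -log_prob / l

fracs = {}
for l in (100, 1000, 5000):
    trials = 500
    fracs[l] = sum(abs(empirical_rate(l) - H) > eps for _ in range(trials)) / trials

print(fracs)   # the fraction of atypical sequences shrinks as l grows
```

As the AEP predicts, the estimated probability of the atypical set decays toward zero as the block length grows.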

Redundancy of parsing schemes

There are many Ziv-Lempel style parsing algorithms  [link] , [link] , [link] , and each of the variants has different details, but the key idea is to find the longest match in a window of length $n_w$. The length of the match is $L$, where we remind the reader that $L \approx \frac{\log(n_w)}{H}$.

Now, encoding the location of the match in the window requires $\log(n_w)$ bits, and so the per-symbol compression ratio is $\frac{\log(n_w)}{L}$, which in the limit of large $n_w$ approaches the entropy rate $H$.
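As a concrete illustration of the key idea, here is a minimal brute-force sketch of longest-match search in a window. It is a toy version under simplifying assumptions: the function name and example strings are hypothetical, and real LZ77-style coders use hash chains or suffix structures rather than a linear scan.

```python
def longest_match(history, lookahead):
    """Return (offset, length) of the longest prefix of `lookahead`
    that appears as a substring of `history` (the window of length n_w).
    Brute force, for illustration only."""
    best_off, best_len = 0, 0
    for off in range(len(history)):
        length = 0
        while (length < len(lookahead)
               and off + length < len(history)
               and history[off + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_off, best_len = off, length
    return best_off, best_len

print(longest_match("abracadabra", "abrac"))   # → (0, 5)
```

The coder would then emit the offset, the match length, and (in some variants) the symbol that follows the match.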

However, the encoding must also describe the match length $L$, and often the symbol that follows the match. These require approximately $\log(L) \approx \log(\log(n_w))$ bits, and the normalized (per-symbol) cost is

$$\frac{\log(\log(n_w))}{L} = O\left(\frac{\log(\log(n_w))}{\log(n_w)}\right).$$

Therefore, the redundancy of Ziv-Lempel style compression algorithms is $O\left(\frac{\log(\log(n))}{\log(n)}\right)$, which is much greater than the $O\left(\frac{\log(n)}{n}\right)$ that we have seen for parametric sources. The fundamental reason why the redundancy is greater is that the class of non-parametric sources is much richer. Detailed redundancy analyses appear in a series of papers by Savari (cf.  [link] ).
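A quick numerical comparison makes the gap between the two redundancy orders concrete; the sizes $n$ below are arbitrary illustrative choices.

```python
import math

def redundancies(n):
    """Return (Ziv-Lempel order log(log n)/log n, parametric order log(n)/n)."""
    zl = math.log2(math.log2(n)) / math.log2(n)
    parametric = math.log2(n) / n
    return zl, parametric

for n in (2**10, 2**20, 2**30):
    zl, par = redundancies(n)
    print(f"n=2^{int(math.log2(n))}: ZL ~ {zl:.4f}, parametric ~ {par:.2e}")
```

Even at $n = 2^{30}$, the Ziv-Lempel order is a sizable constant, while the parametric order is vanishingly small; both decay, but at very different speeds.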

Parsing for lossy compression

The parsing schemes that we have seen can also be adapted to lossy compression. Let us describe several approaches along these lines.

Fixed length: The first scheme, due to Gupta et al.  [link] , constructs a codebook of $2^{LR(D)}$ codewords, where $L$ is the length of the phrase being matched and $R(D)$ is the rate distortion function. The algorithm cannot search for perfect matches of the phrase, because this is lossy compression. Instead, it seeks the codeword that matches the input phrase most closely. It turns out that for large $L$ the expected distortion of the lossy match is approximately $D$ per symbol.
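A toy sketch of the fixed-length search under Hamming distortion follows. The random binary codebook, the rate value, and the function names are illustrative assumptions for the example, not the actual construction of Gupta et al.

```python
import random

random.seed(1)

def hamming(a, b):
    """Number of positions where the two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def lossy_encode(phrase, codebook):
    """Return the index of the codeword closest to `phrase` in Hamming
    distortion -- the nearest-codeword search of a fixed-length scheme."""
    return min(range(len(codebook)), key=lambda i: hamming(codebook[i], phrase))

L, rate = 12, 0.5     # illustrative phrase length and rate (stand-in for R(D))
codebook = [[random.randint(0, 1) for _ in range(L)] for _ in range(2 ** int(L * rate))]
phrase = [random.randint(0, 1) for _ in range(L)]
i = lossy_encode(phrase, codebook)
print(i, hamming(codebook[i], phrase) / L)   # index sent; per-symbol distortion of best match
```

Only the index $i$, which costs $LR$ bits here, needs to be transmitted; the decoder holds the same codebook and outputs codeword $i$.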

Variable length: Another approach, due to Gioran and Kontoyiannis  [link] , constructs a single long database string and searches for the longest match whose distortion with respect to the input is approximately $D$; the location and length of the approximate match are encoded. Seeing that the database is of length $2^{LR(D)}$, encoding the location requires $LR(D)$ bits, and the $D$-match (a match with distortion $D$ with respect to the input string) is typically of length $L$, giving a per-symbol rate of $R(D)$ bits.
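The $D$-match search can be sketched as a brute-force scan for the longest prefix of the input that matches some database substring within per-symbol Hamming distortion $D$. This is an illustrative sketch only; the scheme's actual database construction and encoding steps are not shown, and the function name and strings are hypothetical.

```python
def longest_D_match(database, phrase, D):
    """Return (position, length) of the longest prefix of `phrase` that
    matches a substring of `database` with per-symbol Hamming distortion
    at most D. Brute force, for illustration only."""
    best_pos, best_len = 0, 0
    for pos in range(len(database)):
        mismatches = 0
        for length in range(1, min(len(phrase), len(database) - pos) + 1):
            mismatches += database[pos + length - 1] != phrase[length - 1]
            # keep this candidate if it stays within the distortion budget
            if mismatches <= D * length and length > best_len:
                best_pos, best_len = pos, length
    return best_pos, best_len

print(longest_D_match("abcabd", "abd", 0))   # exact match at position 3: (3, 3)
```

With $D = 0$ this reduces to ordinary longest-match (lossless) parsing; larger $D$ admits longer, sloppier matches, trading distortion for rate.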

An advantage of the latter scheme by Gioran and Kontoyiannis  [link] is reduced memory use: the database is a single string of length $2^{LR(D)}$, instead of a codebook comprised of $2^{LR(D)}$ codewords, each of length $L$. On the other hand, the Gupta et al. algorithm  [link] has better rate-distortion performance, because it does not need to spend $\log(L)$ bits per phrase to describe its length. An improved algorithm, dubbed the hybrid algorithm by Gioran and Kontoyiannis, constructs a single database and performs fixed length coding for the best match of length $L$ in the database. Therefore, it combines the memory usage of a single database with the rate-distortion performance of fixed length coding.

Source:  OpenStax, Universal algorithms in signal processing and communications. OpenStax CNX. May 16, 2013 Download for free at http://cnx.org/content/col11524/1.1
