What is it about?

Summary form only given. We present a new compression method, called WLZW, which is a word-based modification of classic LZW. The algorithm is two-phase, it uses only one table for words and non-words (so called tokens), and a single data structure for the lexicon is usable as a text index. The length of words and non-words is restricted. This feature improves the compress ratio achieved. Tokens of unlimited length alternate, when they are read from the input stream. Because of restricted length of tokens alternating of tokens is corrupted, because some tokens are divided into several parts of same type. To save alternating of tokens two special tokens are created. They are empty word and empty non-word. They contain no character. Empty word is inserted between two non-words and empty non-word between two words. Alternating of tokens is saved for all sequences of tokens. The alternating of tokens is an important piece of information. With this knowledge the kind of the next token can be predicted. One selected (so-called victim) non-word can be deleted from input stream. An algorithm to search the victim is also presented. In the decompression phase, a deleted victim is recognized as an error in alternating of words and non-words in sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The Canterbury corpus, a large set, was used as a standard for publication results. The compression ratio achieved is fairly good, on average 25%-22%. Decompression is very fast. Moreover, the algorithm enables evaluation of database queries in given text. This supports the idea of leaving data in the compressed state as long as possible, and to decompress it when it is necessary.

Featured Image

Read the Original

This page is a summary of: Word-based compression methods for large text documents, January 1999, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/dcc.1999.785680.
You can read the full text:

Read

Contributors

The following have contributed to this page