Prentice Hall, 1990. — 337 p.
Compression means making things smaller by applying pressure. Text compression is not about physically squashing text, but about finding ways to represent it in fewer bits or bytes. It is necessary that the coded form can be decompressed to reconstitute the original text (otherwise, compression is trivial!), and it is usually important that the original is recreated exactly, not approximately. This differentiates text compression from many other kinds of data reduction, such as voice or picture coding, where some degradation of quality may be tolerable if a more compact representation is thereby achieved.
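The requirement that decompression reconstitute the original exactly can be stated as decompress(compress(x)) == x for every input x. A minimal illustration in Python, using the standard zlib library (chosen here purely for convenience, not one of the methods described in this book):

```python
import zlib

# Any lossless scheme must reconstitute the original exactly.
text = b"to be or not to be, that is the question " * 8
compressed = zlib.compress(text)

assert zlib.decompress(compressed) == text   # exact round trip
print(len(text), "->", len(compressed))      # repetitive text shrinks markedly
```

A lossy coder for speech or pictures would be free to return something merely close to the original; for text, the assertion above must hold bit for bit.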
Even though technology has made enormous strides in computer storage capacity, interest in text compression has never been greater; cynics might want to relate this to Parkinson's law. People often wonder whether compression will remain popular as personal computers and workstations acquire disk storage of many megabytes, perhaps gigabytes. Evidence so far indicates that users are always interested in achieving a two- or threefold increase in capacity at no cost. To draw a rather extreme analogy: no matter how much you earned, you would probably hesitate to turn up your nose at the chance of tripling your salary!
Compression also enables full use to be made of limited transmission bandwidth. Despite impressive advances in wideband communication technology (satellites, microwaves, fiber optics), many of us communicate daily with computers over telephone channels. Long-haul networks also often transmit over the lowly telephone network. Text compression can triple effective bandwidth, reducing communication costs and end-user frustration. But what of the future? Will telephone companies lay voice/data lines right to the subscriber's doorstep and eliminate the need for compression in user-computer communication? Will cable provide domestic consumers with wideband interactive communication? Maybe, but the future seems a long time coming. And like the salary increase, threefold speed-ups might still be hard to resist.
This book introduces the subject of text compression, laying particular stress on adaptive methods: those that adjust dynamically to the characteristics of a message as it is transmitted. Of course, many of the techniques described can also compress non-textual data, even speech and pictures. However, we are concerned exclusively with exact compression methods, and these are often inappropriate for signals that are fundamentally analog in nature. Restricting attention specifically to text not only addresses a very important and widespread compression problem, but also frees us from having to consider how to incorporate constraints appropriate to different domains, and allows us to concentrate on specific techniques for constructing models adaptively.
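The idea of a model that adjusts as the message is transmitted can be sketched in a few lines. The function below (an illustrative sketch, not a specific method from this book) tracks symbol counts as an order-0 adaptive model would, updating after each symbol so that an encoder and decoder maintaining the same counts stay synchronized without ever transmitting the model; the smoothing scheme is an assumption for the sketch:

```python
from collections import Counter
import math

def adaptive_code_length(message):
    """Total code length, in bits, that an order-0 adaptive model assigns
    to a message. Uses add-one (Laplace) smoothing over the message's own
    alphabet; an arithmetic coder could approach this length in practice."""
    alphabet = set(message)
    counts = Counter()
    bits = 0.0
    for symbol in message:
        total = sum(counts.values()) + len(alphabet)  # add-one smoothing
        p = (counts[symbol] + 1) / total              # current estimate
        bits += -math.log2(p)                         # cost of this symbol
        counts[symbol] += 1                           # adapt after coding
    return bits

print(round(adaptive_code_length("abracadabra"), 2))
```

Early symbols are expensive because the model knows nothing; as the counts accumulate, frequent symbols become cheap. That is the essence of adaptation.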
Text compression provides an unparalleled practical example of information theory in action. Indeed, the compressibility of English was thoroughly investigated by Shannon around the time of the birth of information theory. Since then, it has become a fascinating playground for anyone interested in methods of adaptation and prediction. Two of us (JGC and IHW) became involved with compression in the early 1980s precisely because we were seeking ways to assess methods of adaptation, or "learning," objectively. The trouble with any adaptive system is that, on any particular task it may learn, its performance will inevitably be inferior to that of a special-purpose machine designed specifically for that task. Of course, the advantage of adaptation is flexibility, but that is very hard to quantify. Text compression offers an ideal test bed for adaptive methods, for it is easy to set up realistic scenarios in which adaptive schemes outperform fixed ones (almost any real-life text compression situation will do). Then you can stop worrying about the benefits of adaptive vis-à-vis fixed models, and get on with the work of designing suitable algorithms for adaptation.
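Such a scenario is easy to set up. The sketch below (an illustrative assumption, not an experiment from this book) pits a fixed uniform-probability model against an order-0 adaptive one on a skewed message; the adaptive model pays more for early symbols, but wins overall once it has learned the statistics:

```python
import math
from collections import Counter

def fixed_bits(message, alphabet_size):
    # Fixed uniform model: every symbol costs log2(|alphabet|) bits.
    return len(message) * math.log2(alphabet_size)

def adaptive_bits(message, alphabet):
    # Order-0 adaptive model with add-one smoothing, updated as it codes.
    counts, bits = Counter(), 0.0
    for s in message:
        total = sum(counts.values()) + len(alphabet)
        bits += -math.log2((counts[s] + 1) / total)
        counts[s] += 1
    return bits

msg = "aaabaaacaaabaaad" * 8        # heavily skewed toward 'a'
print(fixed_bits(msg, 4))            # fixed cost: 2 bits per symbol
print(round(adaptive_bits(msg, "abcd"), 1))  # adaptive cost: well below that
```

Because the adaptive model converges toward the true symbol statistics, its total code length approaches the message's entropy, whereas the fixed model is stuck at two bits per symbol regardless of the input.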
Inevitably this book is rather theoretical in places (after all, compression is information theory at work), and this may put off the practically minded reader. We hope it doesn't, for the principal messages can be plainly understood without mathematical crutches. We have tried to confine the heavier and more detailed analyses to appendices that follow some of the chapters. But there is some substantial theoretical work toward the end of Chapter 2, in much of Chapter 3, at the end of Chapter 7, and at the beginning of Chapter 9. These parts can safely be skipped by the reader who wants to get on with the main arguments, although they make important contributions for those who seek deeper understanding. To make life easier for those who choose to skim, there is a glossary, including a summary of notation, at the end of the book to serve as a reference for the meaning of technical terms. To make life easier for those who are eager to apply compression to a particular problem at hand, a note for the practitioner at the end of Chapter 1 indicates where the most significant practical methods are described.
Why Compress Text?
Information and Models
Adaptive Models
From Probabilities To Bits
Context Modeling
State-Based Modeling
Dictionary Techniques
Choosing Your Weapon
The Wider View
A: Variable-Length Representation of the Integers
B: The Compression Corpus